The successful completion of the ‘Human Genome Project’ (HGP) and the release of its data as an ‘Open Source Resource’ are arguably among the most important advances in biotechnology and medicinal research in recent times. Over the last decade, a number of databases have been reported that cover different aspects of genomic data, including single nucleotide polymorphisms (SNPs), gene expression, protein-protein interactions and more. In this mini-review, we provide a brief outline of the resources developed after the HGP data entered the public domain.
1. Human genome project
The Human Genome Project (HGP) was successfully completed by the International Human Genome Sequencing Consortium in 2003. One of the major principles adopted by the consortium was the public availability of the data generated. Hence, the human genome sequence is freely available in the Genome database maintained by the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/guide/genomes-maps/). In the first draft, nearly 30,000 to 40,000 protein-coding genes were predicted in the human genome; however, the number was later reduced to fewer than 21,000 genes.1 The results also highlighted the identification of more than 1.4 million single nucleotide polymorphisms (SNPs). The completion of this project has also led to the development of several new initiatives. Some of the important initiatives, their availability, and their history are presented in this short review.
2. Derivatives of human genome project
The HGP not only opened the floodgates for the analysis of genomic data from different perspectives but also led to new initiatives aimed at a better understanding of the structure and functioning of the genome. The HGP was soon followed by the International HapMap Project2, the ENCyclopedia of DNA Elements (ENCODE)3, etc. Additionally, ENSEMBL4 and the UCSC Genome Browser5 were published to handle the data generated by the HGP and to analyze it for useful inferences.6
2.1 International HapMap project
The project was established in 2002 with the aim of determining the common patterns of sequence variation throughout the human genome.2 It has been updated frequently over the years; a second version of the HapMap database, which includes more than three million SNPs in four geographically distinct populations, is now available. The SNP density of this version is about one per kilobase of sequence, and it is estimated to contain 25-35 percent of the anticipated 9-10 million SNPs in the human genome.7
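The density and coverage figures quoted for the second HapMap release can be checked with a short back-of-the-envelope calculation. The sketch below is only illustrative: the genome length of roughly three billion bases is an assumption that does not appear in the text above, the SNP count uses the lower bound of "more than three million", and the function names are hypothetical.

```python
# Rough consistency check of the HapMap figures quoted above.
# Assumption (not stated in the text): the human genome is ~3 billion bases.
GENOME_SIZE_BP = 3_000_000_000
# Lower bound of "more than three million SNPs" in the second HapMap release.
HAPMAP_SNPS = 3_000_000
# Anticipated number of SNPs in the human genome, as quoted (9-10 million).
ANTICIPATED_SNPS = (9_000_000, 10_000_000)


def snps_per_kilobase(snp_count: int, genome_bp: int) -> float:
    """Average number of SNPs per 1,000 bases of sequence."""
    return snp_count / (genome_bp / 1_000)


def coverage_range(snp_count: int, anticipated: tuple) -> tuple:
    """Lowest and highest fraction of the anticipated SNPs that are catalogued."""
    low_total, high_total = anticipated
    return snp_count / high_total, snp_count / low_total


density = snps_per_kilobase(HAPMAP_SNPS, GENOME_SIZE_BP)
low, high = coverage_range(HAPMAP_SNPS, ANTICIPATED_SNPS)

print(f"SNP density: {density:.2f} per kilobase")
print(f"Share of anticipated SNPs: {low:.0%} to {high:.0%}")
```

Under these assumptions the density comes out at about one SNP per kilobase and the share of anticipated SNPs at roughly 30 to 33 percent, which falls inside the 25-35 percent range quoted above.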
2.2 ENSEMBL and UCSC genome browser
These databases provide stable, automated annotations of the human genome sequence. ENSEMBL was also a major contributor to the analysis published with the first draft of the HGP. The UCSC Genome Browser has a design similar to that of ENSEMBL and was primarily built to support the data generated by the HGP. Since their inception, the aim of such databases has been to present the details of the sequence in a systematic manner, while also providing a framework for data analysis.
2.3 ENCODE project
The ENCyclopedia of DNA Elements (ENCODE) Project3 aims to identify all functional elements in the human genome, including transcribed regions, transcription factor binding regions and chromatin structures.
The project began in 2004 as a pilot with a target sequence of 30 megabases.3 The pilot was completed in 2007 and reported the functional elements in this 1% of the genome.8 At present, the ENCODE database contains functional characterization data for about 80 percent of the human genome9 and can be freely accessed at http://genome.ucsc.edu/encode/.
2.4 The personal genomes project
The concept of and need for the Personal Genomes Project were presented in 2005 by George Church, who conceived this project as the natural successor of the HGP.10 The project targets the creation of a scientific platform for the integration of human genomic, trait and comprehensive environmental data. Such integrated datasets can be deemed essential for the development of functional genomics, providing holistic insights into the underlying mechanisms of human health and disease.11 The dataset can be accessed through the website http://www.personalgenomes.org.
2.5 The Gene Ontology project
The Gene Ontology project was launched with the aim of developing a platform for the structured representation of the functions of genes and their products in an organism.12 Genes and gene products are categorized by the biological process in which they are involved, their molecular function, and the cellular component to which they belong.12, 13 A number of tools are available on the web server that allow the user to extract ontologies for a list of genes.12 Such a project is useful because information about the function of a protein in one organism can lead to essential inferences about its role in other organisms. The data can be accessed at: http://www.geneontology.org/. Several other similar resources were published as derivatives of the HGP, but an extensive discussion of them is beyond the scope of this mini-review.
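The three-way categorization used by the Gene Ontology, and the idea of extracting ontologies for a list of genes, can be sketched as a small lookup. This is a minimal illustration and not the interface of any Gene Ontology tool: the function name `ontologies_for` and the hand-picked annotation table are hypothetical, and the aspect names follow the Gene Ontology's own labels (biological process, molecular function, cellular component).

```python
from collections import defaultdict

# The three Gene Ontology aspects under which terms are organized.
ASPECTS = ("biological_process", "molecular_function", "cellular_component")

# Small illustrative annotation table: (gene, GO identifier, term name, aspect).
# Hand-picked for this sketch; a real analysis would load a full annotation file.
ANNOTATIONS = [
    ("TP53", "GO:0006915", "apoptotic process", "biological_process"),
    ("TP53", "GO:0003677", "DNA binding", "molecular_function"),
    ("TP53", "GO:0005634", "nucleus", "cellular_component"),
    ("BRCA1", "GO:0006281", "DNA repair", "biological_process"),
    ("BRCA1", "GO:0005634", "nucleus", "cellular_component"),
]


def ontologies_for(genes, annotations):
    """Group the term names annotated to each requested gene by GO aspect."""
    wanted = set(genes)
    grouped = {gene: defaultdict(list) for gene in genes}
    for gene, _go_id, term, aspect in annotations:
        if gene in wanted:
            grouped[gene][aspect].append(term)
    # Return plain dicts with every aspect present, even when it has no terms.
    return {
        gene: {aspect: by_aspect.get(aspect, []) for aspect in ASPECTS}
        for gene, by_aspect in grouped.items()
    }


for gene, by_aspect in ontologies_for(["TP53", "BRCA1"], ANNOTATIONS).items():
    for aspect, terms in by_aspect.items():
        print(gene, aspect, terms)
```

Keeping every aspect as a key, even when empty, makes it easy to see which categories lack annotation for a given gene in the table supplied.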
The post-genomic era has witnessed the development of many useful tools and databases for the analysis and storage of different forms of genomic data. The advent of high-throughput techniques such as microarrays has enabled the scientific community to develop new methodologies for gathering information and analyzing the data these techniques generate. We have broadly categorized such resources, on the basis of data availability and functionality, into three distinct areas.
3.1 Protein-protein interaction databases
The significance of such resources for genomic analysis was clearly outlined in the initial sequencing and analysis reports of the HGP.6, 14 Various cellular processes are governed by molecular interactions between different entities, mostly proteins.15, 16 The development of two-hybrid systems17 and tandem affinity purification18 techniques has helped researchers to record large numbers of interactions in a single experiment. However, such techniques are also prone to high error rates19, which calls for specialized computational tools and methodologies.20 The development of submission guidelines for such resources by the International Molecular Exchange consortium21, especially the ‘Minimum Information’ required for reporting a molecular interaction,22 has greatly influenced the quality of information available in such databases.23 These resources may be divided into experimentally derived and computationally derived interaction resources. The experimentally derived resources include IntAct23, DIP24, BioGRID19, MINT15, HPRD25 and MIPS.26 The computationally derived databases include HomoMINT27, OPHID28, PIPs29, STRING30 and PrePPI.31 Details of the availability and source of these resources are given in Table 1.