Human Genome

By 1981, 579 human genes had been mapped, and mapping by in situ hybridization had become a standard method. Marvin Carruthers and Leroy Hood made a huge leap in bioinformatics when they invented a method for automated DNA sequencing. In 1988, the Human Genome Organization (HUGO) was founded, an international organization of scientists involved in the Human Genome Project.

In 1990, the Human Genome Project was started. By 1991, a total of 1879 human genes had been mapped. In 1993, Genethon, a human genome research center in France, produced a physical map of the human genome. In 1995, the first complete genome sequence of a free-living organism, the bacterium Haemophilus influenzae, was published. In 1996, Genethon published the final version of the Human Genetic Map, which concluded the first phase of the Human Genome Project.

Origin of Biological and Molecular databases

The first molecular biology and biological databases were constructed a few years after the first protein sequences became available. The first protein sequence reported was that of bovine insulin in 1956, consisting of 51 residues. Nearly a decade later, the first nucleic acid sequence was reported, that of yeast alanine tRNA with 77 bases. Just a year later, Dayhoff gathered all the available sequence data to create the first bioinformatic database. The Protein Data Bank followed in 1972 with a collection of ten X-ray crystallographic protein structures, and the SWISSPROT protein sequence database began in 1987. A huge variety of divergent data resources of different types and sizes is now available, either in the public domain or, more recently, from commercial third parties. All of the original databases were organized in a very simple way, with data entries stored in flat files, either one file per entry or a single large text file. Later on, lookup indexes were added to allow convenient keyword searching of header information.
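The flat-file layout with a keyword lookup index described above can be illustrated with a minimal Python sketch. The file format here (a ">" header line, a sequence line, and "//" as the entry separator), the entry identifiers, and the toy sequences are all assumptions made for illustration; they do not reproduce the format or contents of any particular historical database.

```python
from collections import defaultdict

# Hypothetical flat file: one large text file holding several entries.
# Headers, identifiers, and sequences are invented placeholders.
FLAT_FILE = """\
>E1 toy insulin peptide
GIVEQCCAS
//
>E2 toy transfer RNA fragment
GGGCGUGUA
//
>E3 toy insulin receptor fragment
MKTAYIAKQ
//
"""


def parse_entries(text):
    """Split the flat file into {entry_id: (header, sequence)}."""
    entries = {}
    for block in text.split("//"):
        lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
        if not lines:
            continue  # skip the empty chunk after the last separator
        header = lines[0].lstrip(">")
        entry_id = header.split()[0]
        entries[entry_id] = (header, "".join(lines[1:]))
    return entries


def build_index(entries):
    """Map each lowercase header word to the set of entries containing it."""
    index = defaultdict(set)
    for entry_id, (header, _sequence) in entries.items():
        for word in header.lower().split()[1:]:  # skip the identifier itself
            index[word].add(entry_id)
    return index


def search(index, keyword):
    """Return the sorted entry ids whose header contains the keyword."""
    return sorted(index.get(keyword.lower(), set()))


entries = parse_entries(FLAT_FILE)
index = build_index(entries)
print(search(index, "insulin"))
```

Because the index is built once and then consulted by keyword, a search does not need to rescan the whole text file, which is the convenience the lookup indexes provided.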

Origin of tools

After the formation of the databases, tools became available to search sequence databases, at first in a very simple way, looking for keyword matches and short sequence words, and later with more sophisticated pattern-matching and alignment-based methods. The rapid but less rigorous BLAST algorithm has been the mainstay of sequence database searching since its introduction in 1990, complemented by the more rigorous and slower FASTA and Smith-Waterman algorithms. Suites of analysis algorithms, written by leading academic researchers at Stanford, CA, Cambridge, UK, and Madison, WI for their in-house projects, became more widely available for basic sequence analysis. These algorithms were typically single-function black boxes that took input and produced output in the form of formatted files. UNIX-style commands were used to operate the algorithms, with some suites having hundreds of possible commands, each taking different command options and input formats.
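To make the idea of a rigorous alignment-based search concrete, the following is a minimal Python sketch of Smith-Waterman local alignment. The scoring values (match 2, mismatch -1, linear gap penalty -2) are illustrative assumptions; production tools typically use substitution matrices and affine gap penalties, so this is a simplified sketch and not the implementation used by any particular suite.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Scoring parameters are assumed defaults chosen for illustration only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor of 0 is what makes the alignment local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("TTACGTT", "GGACGGG")
print(score, top, bottom)
```

The full dynamic-programming table is filled for every query and database sequence pair, which is why this approach is slower than heuristic searches that examine only promising regions.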

Since these early efforts, significant advances have been made in automating the collection of sequence information. Rapid innovation in biochemistry and instrumentation has brought us to the point where the entire genomic sequences of at least 20 organisms, mainly microbial pathogens, are known, and projects to elucidate at least 100 more prokaryotic and eukaryotic genomes are currently under way. Groups are now even competing to finish the sequence of the entire human genome. With new technologies we can directly examine the changes in expression levels of both mRNA and proteins in living cells, either in a disease state or following an external challenge. We can go on to identify patterns of response in cells that lead us to an understanding of the mechanism of action of an agent on a tissue. The volume of data arising from projects of this nature is unprecedented in the pharmaceutical industry, and will have a profound effect on the ways in which data are used and experiments performed in drug discovery and development projects.

This is true not least because, with much of the interesting data in the hands of commercial genomics companies, pharmaceutical companies are unable to get exclusive access to many gene sequences or their expression profiles. The competition between co-licensees of a genomic database is effectively a race to establish a mechanistic role or other utility for a gene in a disease state in order to secure a patent position on that gene. Much of this work is carried out with informatics tools. Despite the huge progress in sequencing and expression analysis technologies, and the correspondingly greater volume of data held in public, private, and commercial databases, the tools used for storage, retrieval, analysis, and dissemination of data in bioinformatics are still very similar to the original systems gathered together by researchers 15-20 years ago.