Bioinformatics is usually defined as the use of mathematical, statistical, and computer science methods for solving biological problems and for analyzing data collected by biological and medical science. Bioinformatics includes the collection, storage, analysis, and interpretation of DNA and protein sequence information, as well as the collection and analysis of patient statistics, data from tissue specimens, and results of clinical trials.
The overall purpose of bioinformatics is to provide new insights into biological processes and identify unifying biological principles, using data from genomics, proteomics, metabolomics, and microbiomics. Genomics is the determination and analysis of the nucleic acid (DNA and RNA) sequences that are inherited by all cells and organisms and that direct biological processes. Genomics includes both the DNA sequences of genes that code for the amino acid sequences of proteins and the non-protein-coding DNA sequences that have a variety of other functions. Proteomics is the determination and analysis of amino acid sequences and modifications of proteins in cells, tissues, and organisms. Metabolomics is the identification and analysis of all of the molecules involved in the metabolism of a cell, tissue, or organism. Microbiomics is the identification and analysis of all of the microorganisms in an ecological niche, such as the human gut, the deep-sea floor, or a soil sample. In recent decades, the amounts of this biological data have become so immense that the most sophisticated computers and software are required to collect, sort, and analyze it in meaningful ways.
The many uses of bioinformatics include:
To a significant extent, bioinformatics has moved biology from a purely laboratory science to an information technology (IT)-based science. This process of analyzing and interpreting biological data is also referred to as computational biology. Scientists generally consider bioinformatics to be synonymous with computational molecular biology. The driving force behind the science of bioinformatics has been the development of high-throughput DNA sequencing—the rapid sequencing of entire genomes by sequencing small DNA fragments and using very powerful computer systems to assemble the small sequences in the correct order to obtain the entire genomic sequences of cells and organisms.
Biological data is particularly well-suited to IT methodologies, because most large biological molecules, or biomolecules, are polymers: chains of simpler molecules strung together in a meaningful order. The nucleic acids, DNA and RNA, are chains of nucleotide bases designated by the letters A, G, C, and T (in DNA) or U (in RNA). Proteins are chains of amino acids, with each amino acid coded for by three of the nucleotide letters in a specific order. Thus, biomolecules are equivalent to strings of information or data.
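The idea that a gene is a string that can be read three letters at a time can be illustrated with a short Python sketch. It assumes the standard genetic code and a single reading frame that starts at the first letter. The names `CODON_TABLE` and `translate` are illustrative and do not come from any particular software package.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence):
    """Translate a DNA or RNA string into one-letter amino acid codes.

    Reading starts at the first base and stops at the first stop codon.
    Codons containing unrecognized letters are reported as "X".
    """
    dna = sequence.upper().replace("U", "T")  # treat RNA like DNA
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTGA"))
```

Here the nine letters are read as three codons: ATG (methionine), GCC (alanine), and TGA (stop).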
The term bioinformatics was first used in the 1980s to refer to the analysis of biological sequence data. However, scientists have been analyzing DNA and protein sequences and developing databases and algorithms (step-by-step computational procedures) to analyze those sequences since the 1960s. Before the development of computer-driven sequencing methods, sequences were determined and analyzed manually or with the use of personal computers. These methods are referred to as pre-genomic or classical bioinformatics.
Before modern bioinformatics, locating and studying genes involved producing mutations that had quantifiable effects in an organism (in vivo) or isolating DNA encoding a gene and studying it in a test tube (in vitro). Analyzing genes with bioinformatics using a computer is referred to as “in silico.”
Bioinformatics came into its own with the Human Genome Project (HGP). Initiated in 1990, the HGP was an international effort to sequence the entire human genome (three billion bases of DNA) and map all of the genes that code for proteins. Over the course of the HGP, every aspect of DNA sequencing was vastly improved, primarily as a result of advances in bioinformatics. It took the HGP about four years to sequence the first one billion bases and just four months to sequence the second billion. In the single month of January 2003, 1.5 billion bases were sequenced. The HGP was completed in April of 2003, years ahead of schedule. Simultaneously, the genomes of many other organisms were completely sequenced. As sequencing speed increased, the cost fell, from $1 per base in 1990 to 10¢ per base in April of 2003.
One of the first steps in locating genes is to look for DNA sequences that encode the amino acids of a protein in a single stretch called an open reading frame (ORF). Computer programs are used to identify ORFs. One of the biggest surprises of the HGP was that the human genome contained only about 21,000 genes, about one-fifth of the predicted number and only about 25% more than are present in very simple organisms. The HGP revealed that many genes are cut and spliced or differentially regulated to produce multiple proteins. Bioinformatics is essential for predicting these multiple gene products.
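A much-simplified version of ORF identification can be sketched in Python. The sketch assumes that an ORF begins with the start codon ATG and ends at the first in-frame stop codon. It scans only the three reading frames of one strand and ignores splicing, so it is far less thorough than real gene-finding programs. The function name `find_orfs` and the `min_codons` parameter are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each ORF on the given strand.

    Positions are 0-based, and `end` is exclusive and includes the stop
    codon. `min_codons` is the minimum number of codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):  # three possible reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # resume searching after the stop codon
    return orfs


for start, end, sequence in find_orfs("CCATGAAATTTTAGGG"):
    print(start, end, sequence)
```

In this example the only ORF lies in the third reading frame, beginning at the ATG two bases into the sequence.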
With the introduction of next-generation sequencing, bioinformatics has advanced DNA sequencing to an industrial scale, in which entire human genomes are sequenced in a matter of days. Sequencing continues to evolve rapidly, presenting many new challenges for bioinformatics. Since very short fragments of DNA are sequenced individually, bioinformatics is required to assemble these short overlapping sequences in the correct order. This is a particular problem with higher organisms, because their genomes contain many identical repeated sequences. Furthermore, bioinformatics must account for sequencing errors and the many differences in DNA sequences between individuals of a species. Thus, correct sequence assembly is a central focus of bioinformatics.
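The assembly problem can be illustrated with a toy greedy approach in Python: repeatedly merge the two fragments that share the longest overlap. This is a minimal sketch for error-free reads from a single strand. Production assemblers use more elaborate graph-based methods and must cope with the repeats, sequencing errors, and individual variation described above, all of which can mislead a greedy merge. The function names and the `min_len` parameter are illustrative.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads by repeatedly joining the pair with the longest overlap.

    Returns a list of contigs; more than one contig remains when some
    reads share no overlap of at least `min_len` bases.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing left to merge
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads


print(greedy_assemble(["ATGGCC", "GCCTTA", "TTAGGA"]))
```

The three six-base fragments overlap by three bases each and are joined into a single twelve-base sequence.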
Most often, rather than sequencing entire genomes (all of the DNA in a cell or organism), only specific regions of interest are sequenced. Bioinformatics enables the sequencing of individual genes or only the 1–2% of the genome that contains genes that code for proteins. Copy number variations—the number of copies of a particular gene or sequence present in an individual's genome—are very important for some human diseases and also for studying individual responses to various drugs—a field known as pharmacogenomics. Specific single base changes in DNA sequences—called single nucleotide polymorphisms or SNPs—are also very important for some diseases and responses to drugs, as well as for identifying individuals. Thus far, SNPs have been used to account for 30–50% of individual variations in drug responses. SNPs are also the genetic variations used by forensic scientists for DNA fingerprinting.
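At its simplest, finding single base changes means comparing an individual's sequence with a reference sequence position by position. The Python sketch below assumes that the two sequences have already been aligned to the same length, with "-" marking any gaps. It reports every single-base difference, whereas deciding whether a difference is a true SNP also requires data on how common it is in a population. The function name `find_single_base_differences` is illustrative.

```python
def find_single_base_differences(reference, sample):
    """List (position, reference_base, sample_base) for each mismatch.

    Positions are 1-based. Columns containing a gap ("-") are skipped,
    because they represent insertions or deletions rather than
    substitutions.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be pre-aligned to the same length")
    differences = []
    pairs = zip(reference.upper(), sample.upper())
    for position, (ref_base, sample_base) in enumerate(pairs, start=1):
        if ref_base != sample_base and "-" not in (ref_base, sample_base):
            differences.append((position, ref_base, sample_base))
    return differences


print(find_single_base_differences("ATGCCGTA", "ATGACGTA"))
```

The two example sequences differ only at the fourth position, where the reference has C and the sample has A.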
Bioinformatics is even more important for analyzing and interpreting the massive amounts of sequence data. By August of 2005, the three largest public-access databases of DNA and RNA sequences were storing 100 billion bases from 165,000 different organisms; yet, the functions of almost half of the genes identified by the HGP had not yet been determined.
Bioinformatics has become an essential component of the daily work of many, if not most, biological and medical laboratories. The many applications of bioinformatics include:
Bioinformatics also has many applications in clinical medicine, particularly in the identification of genes and gene mutations that are involved in various types of cancer. Clinical proteomics involves identifying protein biomarkers in blood or tissues for diagnosis of diseases and measuring responses to treatment. Research or medical informatics is the management of experimental or patient clinical data.
Bioinformatics has applications to many fields other than molecular biology. Structural biologists use bioinformatics to analyze the huge amount of complex data that comes from x-ray studies of protein crystals, from nuclear magnetic resonance studies, and from electron microscopy. Neuroscientists use bioinformatics to study neural networks—connections between brain cells and nerves throughout the body. Computer scientists use bioinformatics in the development of artificial intelligence.
There are many bioinformatics tools for working with databases. Genome mapping programs localize genes and sequences on chromosomes. The BLAST program—the Basic Local Alignment Search Tool—finds DNA sequences that are similar. The CLUSTAL program is used to compare “clusters” of sequences, aligning them according to their similarities and differences. CLUSTAL alignments are used to compare genes between individuals within a population, to study evolutionary relationships among species, and to identify mutations associated with disease.
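The notion of local similarity that underlies such searches can be illustrated with the Smith-Waterman dynamic programming method, sketched below in Python. This is not how BLAST itself is implemented: BLAST uses faster heuristics to approximate this kind of exhaustive comparison across very large databases. The scoring values (2 for a match, -1 for a mismatch, -2 for a gap) are arbitrary assumptions chosen for illustration, and the function name is hypothetical.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between two sequences."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair_score = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                             # start a new local alignment
                previous[j - 1] + pair_score,  # align a[i-1] with b[j-1]
                previous[j] + gap,             # gap in b
                current[j - 1] + gap,          # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


print(local_alignment_score("AAAGATTACAAAA", "CCGATTACCC"))
```

The two example sequences share the six-letter stretch GATTAC, so the best local score is six matches at 2 points each. A database search would compute such scores against many stored sequences and report the highest-scoring ones.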
DNA sequence data is overwhelming the capacity of bioinformatics to store, share, and especially analyze it. As of 2012, worldwide capacity for DNA sequencing was 13 quadrillion bases per year. Within a few years, millions of human genomes are projected to have been sequenced, in addition to RNA and protein sequences and huge amounts of data on other molecules in cells. A single laboratory generated 60 billion bases of microbial sequence (the equivalent of 20 human genomes) from just two surface seawater samples. The sequencing was completed in weeks, but it took almost two years to analyze the data. As of 2012, the Human Microbiome Project, which is sequencing the microbial composition of the human gastrointestinal tract, had generated about one million times as much data as a single human genome. The need for analysis has spurred the formation of bioinformatics companies and a huge demand for scientists trained in bioinformatics.
Although bioinformatics has not yet reached its goal of true personalized medicine, in which sequencing individual genomes can predict, prevent, diagnose, and treat human disease, there have been notable successes. Whole-genome sequencing has been used to guide the treatment and management of certain leukemias and some rare forms of cancer. Bioinformatics has been used to completely sequence the genomes of early humans and Neandertals. In 2012, scientists completely sequenced the genome of a Denisovan girl who lived in Siberia 50,000 years ago. This small population of proto-humans is known only from a single tiny finger bone and two teeth, but the genome revealed that Denisovans were closely related to Neandertals and interbred with ancestors of some living humans, and that the girl had brown hair, eyes, and skin. Also in 2012, an analysis comparing 364,470 distinct SNPs revealed that the New World was populated by three separate waves of migration from Siberia, beginning at least 15,000 years ago.
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK, CB10 1SD, 44 (0)1223 494 444, Fax: 44 (0)1223 494 468, http://www.ebi.ac.uk .
The EBI is a center for research and services in bioinformatics and part of the European Molecular Biology Laboratory. It manages databases of biological data including nucleic acid and protein sequences and macromolecular structures.
National Center for Biotechnology Information, National Library of Medicine Building 38A, Bethesda, MD, USA, 20894, 1(301) 496-2475, http://www.ncbi.nlm.nih.gov .
The NCBI is the national resource for molecular biology information and develops new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease. The NCBI creates automated systems for storing and analyzing molecular biology, biochemistry, and genetics data, facilitates the use of databases and software by the research and medical communities, coordinates efforts to gather biotechnology information nationally and internationally, and carries out research into advanced methods of computer-based information processing for analyzing the structure and function of biologically important molecules.
National Human Genome Research Institute, Communications and Public Liaison Branch, National Institutes of Health, Building 31, Room 4B09, 31 Center Drive, MSC 2152, 9000 Rockville Pike, Bethesda, MD, USA, 20892-2152, 1(301) 402-0911, Fax: 1(301) 402-2218, http://www.genome.gov .
The NHGRI led the National Institutes of Health's contribution to the International Human Genome Project, which had as its primary goal the sequencing of the human genome. This project was successfully completed in April of 2003. The NHGRI's mission has now been expanded to encompass a broad range of studies aimed at understanding the structure and function of the human genome and its role in health and disease.
Office of Cancer Clinical Proteomics Research, Center for Strategic Scientific Initiatives, Office of the Director, National Cancer Institute, 31 Center Drive, MS 2580, Bethesda, MD, USA, 20892-2580, 1(301) 451-8883, http://proteomics.cancer.gov .
The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a comprehensive, coordinated effort to understand the molecular basis of cancer through the application of quantitative proteomic technologies. The CPTAC Data Portal is the host for all the data produced by the consortium. As of 2012, the total exceeded 500 gigabytes of raw data in more than 800 files.
Margaret Alic, PhD