Bioinformatics is the use of computer software to study biological information and methods. The field produces diverse data, including genomic, proteomic, and metabolomic analyses; computational biology models; biodiversity measurements; and records and models of protein expression, regulation, and structure. The potential applications of such computational analysis are broad, but work focuses primarily on creating ways to effectively store, process, and manipulate large data sets, on deriving statistical or mathematical analyses from such data, and on creating and analyzing models of important molecular, physiological, and ecological systems. Bioinformatics has potential medical, agricultural, and biological applications, both commercial and academic, because the patterns derived from samples and modeling can be used to better understand, develop, and optimize treatments, products, and crops. Already common in health care systems for managing large amounts of patient data, bioinformatics is now also used, including for commercial purposes, in international collaborations focused on understanding disease states and normal physiology. This entry looks at the security and privacy issues specific to bioinformatics.
Gathering data through computerized modeling rather than traditional wet-bench biology saved companies money, and these savings led to a dramatic increase in the formation of bioinformatics companies beginning in the 1990s. However, the accompanying rapid increase in the types and sources of bioinformatics data meant that the data collected by these companies became a new source of security and privacy concerns for individuals and corporate entities trying to protect their interests.
One challenge intrinsic to data privacy and security in bioinformatics is that a large proportion of bioinformatics solutions have been developed with open source tools, such as Perl and Unix-based utilities. This was welcomed by groups concerned with the cost of obtaining software code from proprietary corporate databases and by developers (often academics, e.g., students and researchers) who shared a philosophical belief in the widespread sharing of data. During the dot-com decline of 2000, many companies preferred and encouraged the open source movement in bioinformatics because of its lower costs. Other supporters argue that open source software is more reliable and better developed because broad usage and diffuse expertise allow optimization. They hope that, in exchange for the tools needed to conduct research, researchers will freely contribute to ongoing projects. However, the ready availability of open source code can make it easier for attackers to compromise the information produced by these bioinformatics systems. It has also become difficult to define and protect intellectual property and commercial interests when data are universally available.
Besides hoping to maintain the integrity of the data itself, companies and researchers need to protect their commercial or academic interests in an increasingly competitive field. As such, it has become important to develop more secure methods to store and transfer data. Depending on the privacy needs of both the data and the data user, as well as the method of data sharing, multiple schemes have been proposed, some of which will be outlined briefly.
One proposed solution is to create a “trusted third party,” which serves either as a channel for transferring encrypted data while keeping inputs and queries secret or as a way to store data in a secure but accessible form. Notably, although such solutions increase the level of data protection as desired, the field’s intrinsic wish to maintain data accessibility and sharing is still evident. Larger and more complete data sets are understood to provide higher-quality analyses, so the data remain available, albeit in a secure form.
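The "secure but accessible" storage role of a trusted third party can be illustrated with a minimal Python sketch. It is not drawn from any of the proposed schemes: the class name `TrustedThirdParty`, its methods, and the `min_group_size` threshold are all hypothetical, and a real broker would also need encryption, authentication, and auditing. The sketch shows only the basic idea that contributors hand records to the broker, and researchers receive aggregate answers rather than the records themselves.

```python
from collections import Counter


class TrustedThirdParty:
    """Hypothetical broker that stores records and releases only aggregates."""

    def __init__(self, min_group_size=5):
        # Raw records stay private to the broker; no method returns them.
        self._records = []
        # Assumed policy: groups smaller than this are withheld, because
        # very small counts can point to specific individuals.
        self.min_group_size = min_group_size

    def submit(self, record):
        """Accept one record (a dict) from a data contributor."""
        self._records.append(dict(record))  # store a copy, not the caller's object

    def count_by(self, field):
        """Return how many stored records carry each value of `field`,
        omitting any group smaller than `min_group_size`."""
        counts = Counter(r[field] for r in self._records if field in r)
        return {value: n for value, n in counts.items()
                if n >= self.min_group_size}


if __name__ == "__main__":
    broker = TrustedThirdParty(min_group_size=5)
    for _ in range(6):
        broker.submit({"variant": "A"})
    for _ in range(2):
        broker.submit({"variant": "B"})
    # Only the group of six is large enough to be released.
    print(broker.count_by("variant"))
```

Suppressing small groups is one simple design choice among many; it trades some completeness of the answer for a lower risk of singling out a contributor.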
One especially pertinent example of privacy concerns, particularly in the health informatics subdiscipline of bioinformatics, is the increasing use of large-scale genomic databases. These data sets are used to study the association between genomic composition and molecular, tissue, and organ-level systems, a study that has proven essential to understanding the genetic predisposition to complicated medical disorders. This information, which allows researchers to make advances in the knowledge and treatment of disease, also brings fear of identification and discrimination for individuals who carry genetic information associated with medical stereotypes.
Questions of patient privacy are complicated by the nature of the data: Genomic information is by definition the ultimate identification tool, and it carries the further risk of implicating family members. To protect personal privacy, steps are taken to anonymize or pseudonymize the data when identity is not required (as it is in a health care setting). This is often done automatically on collection by assigning each genome a randomized ID, but further precautions must be taken with the genetic information itself to prevent potential reidentification of samples. Potential solutions include deleting or altering identifying sequences, adding extra “noise” sequences, or providing only short, nonidentifying sequences relevant to the researchers’ query.
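The protective steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a vetted de-identification procedure: the function names `pseudonymize`, `mask_regions`, and `excerpt` are hypothetical, sequences are treated as plain strings, and the regions to mask are assumed to be supplied by someone who already knows which positions are identifying. Masked positions are written as "N", the standard code for an unspecified nucleotide.

```python
import secrets


def pseudonymize(samples):
    """Assign each sample a random ID.

    `samples` maps a donor name to a sequence string. Returns a pair:
    the releasable data (random ID -> sequence) and the linking key
    (random ID -> donor), which must be stored separately and securely.
    """
    released, key = {}, {}
    for donor, sequence in samples.items():
        pid = secrets.token_hex(8)
        while pid in released:  # guard against an (unlikely) collision
            pid = secrets.token_hex(8)
        released[pid] = sequence
        key[pid] = donor
    return released, key


def mask_regions(sequence, regions, mask_char="N"):
    """Overwrite each half-open region [start, end) with `mask_char`."""
    chars = list(sequence)
    for start, end in regions:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


def excerpt(sequence, start, end):
    """Release only the short window [start, end) relevant to a query."""
    return sequence[start:end]


if __name__ == "__main__":
    released, key = pseudonymize({"donor-1": "ACGTACGT", "donor-2": "TTGACCAA"})
    for pid, seq in released.items():
        print(pid, mask_regions(seq, [(2, 5)]), excerpt(seq, 0, 4))
```

Keeping the linking key apart from the released data is what makes this pseudonymization rather than anonymization: whoever holds the key can still reconnect a sample to its donor.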
Sharing genomic data must be done in a secure way, and this is one example of the potential application of the trusted third party, which can function as an encryption system and a further layer of de-identification for the genomes collected. There are still limitations to protecting data in this way, because the set of genetic features that allow identification continually changes as science progresses. The shared material must accordingly be continually monitored for sequences that have become identifying.
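One way a third party could add a further layer of de-identification is to replace each collection-time ID with a keyed hash before the data are shared. The Python sketch below assumes this approach purely for illustration; the function name `second_layer_id` and the idea that the third party alone holds `third_party_key` are assumptions, not a description of any deployed system.

```python
import hashlib
import hmac
import secrets


def second_layer_id(first_layer_id: str, third_party_key: bytes) -> str:
    """Derive a shareable ID from a collection-time ID with HMAC-SHA256.

    Without `third_party_key`, a recipient cannot recompute the mapping
    from first-layer IDs to second-layer IDs.
    """
    digest = hmac.new(third_party_key,
                      first_layer_id.encode("utf-8"),
                      hashlib.sha256)
    return digest.hexdigest()


if __name__ == "__main__":
    # The key is generated and kept by the (hypothetical) third party.
    key = secrets.token_bytes(32)
    print(second_layer_id("sample-0001", key))
```

Because the derivation is deterministic for a given key, the same sample receives the same shared ID across releases, which keeps data sets linkable for research while hiding the original ID from recipients.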
This rather extreme example highlights both the difficulty and the necessity of protecting data, as well as the inevitable compromises to open access that are made to maintain the warranted level of privacy.
Annika C. Lee and Ian C. Clift
See also Cloud Computing; Privacy, Medical
Akgün, Mete, et al. “Privacy Preserving Processing of Genomic Data: A Survey.” Journal of Biomedical Informatics, v.56 (2015). doi:10.1016/j.jbi.2015.05.022
Claerhout, B. and G. J. E. DeMoor. “Privacy Protection for Clinical and Genomic Data: The Use of Privacy-Enhancing Techniques in Medicine.” International Journal of Medical Informatics, v.74/2–4 (2005).
Greco, Joseph F. “The Commercialization of Bioinformatics and the Threat of Open-Source Software.” Journal of Commercial Biotechnology, v.14 (2007). doi:10.1057/palgrave.jcb.3050051
Kamm, Liina, et al. “A New Way to Protect Privacy in Large-Scale Genome-Wide Association Studies.” Bioinformatics, v.29/7 (2013). doi:10.1093/bioinformatics/btt066
Kesh, Someswar and Wullianallur Raghupathi. “Critical Issues in Bioinformatics and Computing.” Perspectives in Health Information Management, v.1/9 (2004).
Martin-Sanchez, F., et al. “Synergy Between Medical Informatics and Bioinformatics: Facilitating Genomic Medicine for Future Health Care.” Journal of Biomedical Informatics, v.37 (2004).
Perl, H., et al. “Privacy/Performance Trade-Off in Private Search on Bio-Medical Data.” Future Generation Computer Systems, v.36 (2014). doi:10.1016/j.future.2013.12.006