One of the things I love about bioinformatics is how relatively new the field is. The term “bioinformatics” was first coined in the 1970s, in correspondence between Hesper and Hogeweg, to describe “the study of informatics processes in biotic systems” [1]. It has since become an invaluable part of a biologist’s skillset. As a multi-disciplinary field with a delicate balance of biology, computer science and statistics, it is well known as the way to deal with “big data”, particularly in genomics. But this isn’t how it started.
I recently stumbled across an article on Twitter by Jeff Gauthier et al. called “A brief history of bioinformatics” [2] and realised that, for such an emerging field, I don’t know a whole lot about how it came about. So I read the paper, did a little investigating and thought I’d share a pared-down version of it here. This is by no means comprehensive coverage of how all the tools and software we use today came about. It is more of an overview of what I feel are the most important events that shaped bioinformatics as a profession in clinical diagnostics today.
It’s the early 1950s. Not a lot is known about DNA, and controversy surrounds its role as the carrier molecule of genetic information. Avery, MacLeod and McCarty’s 1944 claims to this effect [3] are in dispute, and protein is hailed as the carrier of genetic information. Finally, in 1952, Hershey and Chase prove “beyond reasonable doubt” what we know today about DNA’s major role in genetics [4]. A year later, in 1953, Watson, Crick and Franklin make the groundbreaking discovery of the double-helix structure of DNA [5]. Even with this significant revelation, proteins are still far better understood than DNA, so this is where our story really begins…
Margaret Dayhoff (remember that name!) is renowned as the first ever bioinformatician. Recognising the potential for computational methods to solve biomedical problems, in 1962 she released the first de novo sequence assembler in collaboration with Robert Ledley [6]. The assembler, called COMPROTEIN, worked on primary protein structure and was written in the programming language Fortran on punch cards. Punch cards were not just a common way to program back then but the norm, which is mind-boggling to me now. Dayhoff’s work on handling protein sequence data also gave rise to the one-letter amino acid code we still use today. Not only did she do all this, but she shared it with the world, creating the first ever biological sequence database with Richard Eck: the Atlas of Protein Sequence and Structure [7].
Above: Basically Margaret Dayhoff is my hero and I think we should all have a “Day (h)off” once a year to celebrate just how much she has contributed to this field.
In 1963, the concept of paleogenetics was introduced by Emile Zuckerkandl and Linus Pauling as a functional application of the protein structures in the Atlas to assess evolution over time [8]. This is the foundation upon which the idea of conservation of protein structure was built. In 1970, Needleman and Wunsch took a step towards the multiple sequence alignment (MSA) we know (and love) today with the first pairwise protein sequence alignment method (the “Needleman-Wunsch global alignment algorithm”) [9]. Extending this to multiple sequences proved quite difficult, with several iterations of sequence alignment algorithms throughout the 1970s and 1980s before CLUSTAL was developed in 1988 (birthday party for the big 3-0, anyone?!).
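For anyone curious what a global alignment actually does, here is a minimal Python sketch of the dynamic programming idea behind Needleman-Wunsch. It is a common textbook formulation rather than a reproduction of the 1970 paper: the function name and the scoring values (match +1, mismatch -1, gap -1) are my own illustrative assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    Scoring values are illustrative assumptions, not taken from the 1970 paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a[:i] aligned against nothing but gaps
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for pairing a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "ACT"))
```

The whole table is filled in before the traceback, which is why the method is guaranteed to find a best-scoring alignment under the chosen scores, and also why it gets expensive quickly for long sequences.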
Meanwhile, over in DNA’s corner, Crick had been busy deciphering the genetic code. In 1968 the Journal of Molecular Biology published Crick’s work on all 64 codons comprising the “language of DNA” [10]. This was a major turning point in DNA analysis and initiated a real focus on DNA sequencing as a viable area of research. Enter Frederick Sanger, Allan Maxam and Walter Gilbert. These three men and their colleagues were responsible for the development of first-generation sequencing. Maxam-Gilbert sequencing [11], developed in 1976, was only popular for a short while before the much safer and simpler Sanger sequencing [12] was developed in 1977.
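The number 64 is simply every possible triplet of the four DNA bases (4 × 4 × 4). As a quick illustration of my own (not something from the paper), a few lines of Python can list them all:

```python
from itertools import product

# Every ordered triplet of the four DNA bases is one codon
bases = "ACGT"
codons = ["".join(triplet) for triplet in product(bases, repeat=3)]

print(len(codons))  # 4 ** 3 possible codons
print(codons[:4])   # the first few, in alphabetical order
```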
It was recognised that DNA sequences could be analysed more efficiently and rapidly by computers than by humans (thanks to Dayhoff and Eck!), and Roger Staden was the first to exploit this in 1979 by developing a collection of computer programs known as the Staden Package to do just that [13]. In 1981, Felsenstein developed the Maximum Likelihood method for building phylogenetic trees from nucleotide sequences [14]. This method succeeded the Parsimony method, which had been used to build phylogenetic trees from amino acid sequences, because of the additional information DNA could provide. Maximum Likelihood is still used in bioinformatics tools today and provided a basis for the Bayesian approaches to phylogenetics that were developed further down the line.
Until 1977, when the first consumer computers hit the market, computers were impractical, inaccessible and big. Their arrival was a turning point that I am going to dub “the baby boom of bioinformatics software”. The first of these was the Wisconsin Genetics Computer Group’s software suite in 1984, built to work with DNA, RNA or protein sequences [15]. DNASTAR [16] came in the same year and others followed. One thing I’ve noticed in bioinformatics is that members of the community are incredibly open to sharing, collaborating and helping each other. Predictably, this is not a recent movement: Richard Stallman pioneered the ethos of free software sharing in his GNU Manifesto in 1985 [17]. It comes as no surprise, then, that shortly afterwards the European Molecular Biology Laboratory and GenBank united their databases (1986), joined by the DNA Data Bank of Japan (1987). The partnership is now known as the International Nucleotide Sequence Database Collaboration.
So at this point we have computers, DNA sequences and bioinformatics software. There needed to be somewhere to share this knowledge: CABIOS (Computer Applications in the Biosciences), a journal established in 1985 and now known as Bioinformatics. Clearly, bioinformatics was growing rapidly and starting to take on the shape we see today. As more and more data was produced, more processing power and better ways to handle the data were needed. Say hello to new computers and new programming languages: Unix operating systems were widely used for technical and scientific applications, and scripting languages were being developed out of necessity. Perl was created in 1987, and from 1994 to the late 2000s it was the de facto bioinformatics programming language. Python was created slightly later, in 1989, and with the addition of bioinformatics libraries in 2000 it grew into a key programming language from the late 2000s to this day. Today there are over 250 different programming languages, although plenty are esoteric and only a small subset is widely used.
I think we’ve reached the point where people will start to know more about the things I’m talking about: the Human Genome Project, which was launched in 1990; the actual publication of the human genome in 2001 [18] (which required new bioinformatics software developed specifically to handle the unprecedented amount of data); NCBI and BLAST going online in 1994; and PubMed in 1997. This is really where bioinformatics started to become more accessible to those without an in-depth knowledge of the command line, and all this sharing gave rise to “big data”. So this is big data in terms of lots of sources. What about lots of data from one source? Yes… I think you know where I’m heading with this one: Next Generation Sequencing (NGS). Although Illumina is probably the most widely known sequencing platform company, the technology was first successfully provided commercially by 454 Life Sciences in 2005 [19]. I won’t go into the details of NGS as that’s not what this post is about, but it’s safe to say it was a breakthrough that propelled genomics and bioinformatics into the spotlight. The data produced by these methods was too big to process on regular computers, so it was stored and processed on high-powered servers with far more processing power and storage space than a standard desktop. Accessing this data requires command line skills, and manipulating it requires at least basic coding.
Cut to less than 10 years ago. 2012 marks the launch of the 100,000 Genomes Project, a major milestone on the way to where the UK is today in terms of bioinformatics resources. With more people needed who have the skills to handle and analyse the massive amounts of data being produced in both research and diagnostic services, the terms “bioinformatics” and “bioinformatician” are cropping up more and more frequently. For this reason it is important to try to share what bioinformatics is, which is one reason we started this blog. However, Gauthier et al. do recall a presentation by bioinformatician C. Titus Brown, in which he predicts that in the future there will be no biology without bioinformatics, so both fields will simply be known as biology [20].
Wow, that was a long one. Turns out a short history is still pretty long!
I really enjoyed the paper and learning how the things that we use on a daily basis now, particularly in a clinical diagnostic setting, were developed. It also made me appreciate how much legwork was done by people with a lot less technology than we have today; truly inspiring. I agree with the idea presented in the paper that, as cool as saying “I’m a bioinformatician” sounds, the term will soon be somewhat redundant, as all biologists will need to harness the power of computing for their work and research.
I hope it was informative and you enjoyed reading about it as much as I enjoyed writing it. If you feel like I’ve missed anything important I would love to know and you can tweet me or pop it down in the comments. Thanks again and until next time!