Martin Farach-Colton, Professor of Computer Science at Rutgers University, says:

The human genome isn’t that big.  At three billion basepairs, each of which takes two bits to represent even uncompressed, the entire genome can be represented in about 750MB, and in even less when compressed.  So why is genomic data considered to be Big Data?
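
To make the two-bits-per-base arithmetic concrete, here is a minimal Python sketch that packs bases four to a byte and computes the packed size of three billion bases. The names `pack`, `unpack`, and `packed_size_bytes` and the particular bit assignment are illustrative assumptions, not something taken from any genomics tool; real formats also have to handle ambiguous bases such as N.

```python
# Illustrative 2-bit encoding (an assumption, not a standard file format).
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, n_bases: int) -> str:
    """Recover the first n_bases bases from packed bytes."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n_bases)
    )


def packed_size_bytes(n_bases: int) -> int:
    """Bytes needed to store n_bases at two bits each (rounded up)."""
    return (n_bases + 3) // 4


seq = "GATTACA"
packed = pack(seq)
print(len(packed), unpack(packed, len(seq)))
print(packed_size_bytes(3_000_000_000) / 1e6, "MB")
```

Three billion bases at two bits each come to 750 million bytes, so a single genome fits comfortably in memory on an ordinary machine.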

Modern genomic analysis does not depend on the DNA sequence of a single individual.  Rather, many lines of research involve the comparative analysis of the DNA and other features of a population of individuals.  For example, the Philip Awadalla lab at the University of Montreal focuses on Medical and Population Genomics.  They consolidate genetic markers, such as so-called single-nucleotide polymorphisms (SNPs), with other data, such as gene-expression profiles.  They then correlate such data with the expression patterns of diseases.  In this way they seek to answer how genetics and the environment influence the frequency and severity of diseases in human populations.

The key word is populations.  One genome may be small, but get data on enough people and it adds up.  It’s pretty easy to bring the database infrastructure for the lab to its knees. Since researchers rely heavily on querying the data, a slow database can really get in the way of making research progress.

For example, researcher Thibault de Malliard, who oversees the lab’s data, points out that he adds hundreds of thousands of records to the lab’s MySQL database. But, as the database grew to 200 GB, its performance plummeted. And the lab had hopes of getting more than 1TB of data! 

Within the database, the bottleneck turned out to be the MyISAM storage engine.  De Malliard tried out Tokutek’s TokuDB database storage engine, which he had heard offered better performance on large data.  He set up two instances of MySQL, one with MyISAM and the other running TokuDB, then tested each with a 200 GB table containing two billion records, representing around 1500 samples with 1.3M positions, the lab’s current SNP set for CARTaGENE RNAseq. This was all performed on a CentOS 5 server with 48 GB of RAM and six Intel® Xeon® 2.27 GHz CPUs.

TokuDB won out over MyISAM for the following reasons:
  • Faster Inserts: Adding 1M records took 51 min for MyISAM, but 1 min for TokuDB. So inserting one sequencing batch with 48 samples and 1.5M positions would take about 2.5 days with MyISAM but about one hour with TokuDB. The MyISAM insert time will increase as the tables grow, while the TokuDB insert time stays the same.
  • Flexibility: With MyISAM, any change to the database structure locks the table being modified until the change completes. TokuDB allows the lab to keep using the database while altering it.
  • Compression: TokuDB compresses the data (using the default setting, normal). Less data goes through the network, and less data is written to storage.
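
The batch-insert estimate in the first point can be reproduced with a few lines of Python. This is a rough sketch under two assumptions that the figures above suggest but do not state outright: each (sample, position) pair is one record, and the measured per-million insert rate holds for the whole batch. The function name `batch_insert_minutes` is hypothetical.

```python
# Measured insert rates quoted above, in minutes per one million records.
MIN_PER_MILLION = {"MyISAM": 51, "TokuDB": 1}


def batch_insert_minutes(samples: int, positions: int, min_per_million: float) -> float:
    """Estimated insert time, assuming one record per (sample, position)
    pair and a constant insert rate (both are simplifying assumptions)."""
    records = samples * positions
    return records / 1_000_000 * min_per_million


for engine, rate in MIN_PER_MILLION.items():
    minutes = batch_insert_minutes(48, 1_500_000, rate)
    print(f"{engine}: {minutes / 60:.1f} hours ({minutes / 1440:.2f} days)")
```

A batch of 48 samples at 1.5M positions is 72 million records, which works out to roughly 61 hours (about 2.5 days) at the MyISAM rate and a little over one hour at the TokuDB rate.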

“Data management is very important for the genomic research lab. The researchers make a lot of queries, and they want their data at their fingertips. To find the rare record or line, which has not been seen in another DB in the world, can mean the discovery of a new mutation or a gene marker that causes a disease,” noted de Malliard. “With epidemiology data, we are searching to find some state for people who have an issue by comparing a subset of people vs. all the other people. TokuDB uniquely enables us to advance this research.”

Prof. Farach-Colton is an expert in algorithmics and information retrieval. He was an early employee at Google, where he worked on improving performance of the Web Crawl and where he developed the prototype for the AdSense system. Prof. Farach-Colton received his MD from The Johns Hopkins School of Medicine and his PhD in Computer Science from the University of Maryland.  Prof. Farach-Colton is a Professor of Computer Science at Rutgers University.