Genomics is 'big data beast'
7 Jul 2015
The field of genomics has been dubbed “the alpha beast in the big data forest” by a group of experts at Cold Spring Harbor Laboratory, US.
The group, made up of mathematicians and computing academics, said genomics is beginning to generate the most electronic bytes per year relative to all other fields.
Though YouTube currently generates the most data at roughly 100 petabytes per year, the group said genomics is growing far more rapidly, doubling the amount of data it produces every seven months.
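To give a sense of what a seven-month doubling time means, the short Python sketch below converts it into a yearly growth factor and into the time needed for a thousandfold increase. It assumes the doubling rate stays constant, which the group does not claim, and the function names are illustrative only.

```python
import math

DOUBLING_MONTHS = 7  # doubling period quoted by the group


def growth_factor(months, doubling_months=DOUBLING_MONTHS):
    """Factor by which yearly output grows over `months`, assuming steady doubling."""
    return 2 ** (months / doubling_months)


def months_to_grow(factor, doubling_months=DOUBLING_MONTHS):
    """Months needed for output to grow by `factor`, assuming steady doubling."""
    return doubling_months * math.log2(factor)


# Growth over one year, and time needed for a thousandfold increase.
print(f"Growth over 12 months: x{growth_factor(12):.2f}")
print(f"Months to grow 1000-fold: {months_to_grow(1000):.1f}")
```

Under that assumption, output more than triples each year and grows a thousandfold in a little under six years.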
Figuring out how to capture, store, process and interpret all the genome-encoded biological information, stripped down to symbolic and, by themselves, meaningless ones and zeros, is the first step in a “grand-challenge” problem, the experts said.
“For a very long time, people have used the adjective ‘astronomical’ to talk about things that are really, truly huge,” said Michael Schatz, an associate professor at the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory, and a co-author of the research.
“But in pointing out the incredible pace of growth of data-generation in the biological sciences, my colleagues and I are suggesting we may need to start calling truly immense things ‘genomical’ in the years just ahead,” he said.
‘Four-headed beast’
Biological data that is the “raw material” of genomics is highly distributed, and Schatz described it as a “four-headed beast” because of the fundamental problems of data acquisition, storage, distribution and analysis.
If all the human sequence data generated so far, about 250,000 sequences, were put in a single place, it would require around 25 petabytes of storage space. That is a manageable problem, Schatz said.
However, by 2025, Schatz expects as many as 1 billion people to have their full genomes sequenced - posing an exabyte-level storage problem.
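The jump from petabytes to exabytes can be checked with some rough arithmetic. The Python sketch below is a back-of-envelope illustration, not Schatz's own estimate: it assumes each future genome takes the same average storage as today's 250,000 sequences, and it uses decimal units (1 petabyte = 10^15 bytes, 1 exabyte = 10^18 bytes).

```python
# Decimal storage units (an assumption; the article does not say which convention it uses).
GB = 10**9
PB = 10**15
EB = 10**18

# Figures quoted in the article.
current_sequences = 250_000
current_storage_bytes = 25 * PB
projected_genomes = 1_000_000_000

# Average storage per sequence today.
bytes_per_sequence = current_storage_bytes // current_sequences

# Hypothetical projection: same average footprint for every future genome.
projected_bytes = bytes_per_sequence * projected_genomes

print(f"Average per sequence today: {bytes_per_sequence // GB} GB")
print(f"One billion genomes at that rate: {projected_bytes // EB} EB")
```

On those assumptions, each sequence averages about 100 gigabytes, and a billion genomes would need on the order of 100 exabytes, 4,000 times today's total.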
“The point of sequencing a billion genomes is not really to make a billion separate lists saying, ‘If you have these variants, you have the following risks.’ Of course, individuals will want to look at the list of DNA variants they possess. But the real power of having 1 billion human genomes comes from ways of comparing them and combining layers of analysis,” Schatz said.
“Genomics is a game-changing science in so many ways,” he said.
“My colleagues and I are saying that it’s important to think about the future so that we are ready for it.”