One byte at a time
EAST LANSING, Mich.—The joke is probably as old as when humans discovered pachyderms: How do you eat an entire elephant? One bite at a time. The challenge is much the same with massive amounts of genomic data these days, but researchers at Michigan State University (MSU) have developed a new computational technique, featured in the current issue of the Proceedings of the National Academy of Sciences (PNAS), that they say relieves the logjam created by "big data."
The work came out of research on microbial communities living in the soil or the ocean—which tend to be quite complicated—and the realization that while it is relatively easy to collect massive amounts of data on microbes, it may take days to transmit the resulting files to other researchers and months to analyze them thereafter.
So C. Titus Brown, an MSU assistant professor in bioinformatics, and his colleagues have demonstrated a general technique that can be applied to most microbial communities using small computers rather than supercomputers.
To thoroughly examine a gram of soil, for example, Brown says, you need to generate about 50 terabases of genomic sequencing, which is something like 1,000 times more data than was generated for the initial Human Genome Project, and it would take at least 50 laptops just to store that much data, much less do anything with it.
MSU describes the analysis of such DNA with traditional computing methods as being comparable to trying to eat a large piece of pizza in a single bite. The new method employs a filter that "folds" the metaphorical pizza up compactly using a special data structure, enabling computers to nibble at slices of the data and eventually digest the entire sequence. Reportedly, this technique creates a 40-fold decrease in memory requirements.
"It was clear that splitting these big data sets up in some way was necessary for scaling, and—as we found out later—other people were already using the idea that the data sets consisted of many different organisms to improve the quality of the analysis," Brown says. "The trick was that loading the data in was the problem, so it didn't matter if you split them up after that—the challenge was loading them in the first place. So we noodled around with a couple of different concepts, and eventually came up with this approach of using a probabilistic graph representation that would let us load all the data in without necessarily representing it perfectly."
First, though, they had to invent the data structure, and the PNAS paper—titled "Scaling metagenome sequence assembly with probabilistic de Bruijn graphs"—is ostensibly about the data structure, he notes, although it does actually demonstrate that the approach works as well.
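For readers curious how a graph can be loaded "without necessarily representing it perfectly," the following is a minimal Python sketch of one common way to build a probabilistic de Bruijn graph: the k-mers (length-k substrings) of each read are stored in a Bloom filter, and edges are never stored at all but inferred by testing which of the four possible neighboring k-mers are present. It is an illustration under stated assumptions, not the MSU group's released software; the class names, method names, default sizes and the choice of hash function are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash functions.

    Membership tests can give false positives but never false negatives.
    """

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive independent hash values by salting BLAKE2b with a seed
        # (an illustrative choice, not taken from the paper).
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode("ascii"),
                digest_size=8,
                salt=seed.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(item)
        )


class ProbabilisticDeBruijnGraph:
    """Hypothetical sketch: nodes are k-mers held in a Bloom filter."""

    def __init__(self, k, num_bits=1 << 20, num_hashes=4):
        self.k = k
        self.kmers = BloomFilter(num_bits, num_hashes)

    def add_read(self, read):
        # Insert every k-mer of the read; memory use does not grow.
        for i in range(len(read) - self.k + 1):
            self.kmers.add(read[i:i + self.k])

    def successors(self, kmer):
        # Edges are implicit: try each possible next base and keep the
        # k-mers the filter reports as present (possibly a false positive).
        suffix = kmer[1:]
        return [suffix + base for base in "ACGT" if suffix + base in self.kmers]


if __name__ == "__main__":
    graph = ProbabilisticDeBruijnGraph(k=4)
    graph.add_read("ACGTAC")
    # The result includes "CGTA", the k-mer that follows "ACGT" in the read.
    print(graph.successors("ACGT"))
```

The design trade-off in this sketch is that memory is fixed up front by the size of the bit array rather than by the number of distinct k-mers, at the cost of occasional spurious nodes when the filter reports a k-mer that was never inserted.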
"We started the basic work just over two years ago, and it took us about six months to have something that basically worked. We ran into a big stumbling block that had to do with real-world data, though, and that took us another six months to figure out; the solution is not represented in this paper, it's actually in a paper that we're just about to submit," Brown explains. "In this paper, we start the journey and show that it works on some data sets. The next paper will show how to get it to work under a specific kind of erroneous data and the third paper will actually use it 'in anger' on data sets that can't actually be tackled any other way, as far as we know."
In terms of scaling, he notes, it's not clear that biomedical fields working with microbial ecology and the metagenomics of human-associated communities—such as the Human Microbiome Project or MetaHIT—will need this kind of technique.
"The microbial communities that live in and on humans seem to be much less complex than communities in the soil or in the water and so you just don't need as much data to look at them. That's not to say that extra efficiency isn't nice, just that it doesn't seem to be a roadblock to progress," Brown says. "However, an important component of our follow-on work is on validating the quality of our and others' approaches, though, because that's central to what we've been doing. I think there will be significant positive repercussions from this on things like studying the evolution of drug resistance in human-associated microbes, and studying pathogenicity and virulence of enteric pathogens as members of the community, but it's a bit too soon to point at something right now. This is definitely more fundamental research than applied research."
Brown and Jim Tiedje, a distinguished professor of microbiology and molecular genetics at MSU, have made the complete source code and the ancillary software available to the public to encourage extension and improvement, and will continue their own line of research as well. At least one research group, Brown notes, has already modified the technique and taken it in a new direction to build a better genome assembler.