One byte at a time

EAST LANSING, Mich. - The joke is probably as old as when humans discovered pachyderms: How do you eat an entire elephant? One bite at a time. The challenge is much the same with massive amounts of genomic data these days, but researchers at Michigan State University (MSU) have developed a new computational technique, featured in the current issue of the Proceedings of the National Academy of Sciences (PNAS), that they say relieves the logjam created by "big data."

The work came out of research on microbial communities living in the soil or the ocean, which tend to be quite complicated, and the realization that while it is relatively easy to collect massive amounts of data on microbes, it may take days to transmit the resulting files to other researchers and months to analyze them thereafter.

So, C. Titus Brown, an MSU assistant professor in bioinformatics, and his colleagues have demonstrated a general technique that can be applied to most microbial communities using small computers rather than supercomputers.

To thoroughly examine a gram of soil, for example, Brown says, you need to generate about 50 terabases of genomic sequencing, which is something like 1,000 times more data than was generated for the initial Human Genome Project, and it would take at least 50 laptops just to store that much data, much less do anything with it.
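The storage figures above can be checked with a back-of-the-envelope calculation. The sketch below is only illustrative: the one-byte-per-base encoding and the 1 TB laptop drive are assumptions chosen here, not figures from Brown or the paper, and the helper names are hypothetical.

```python
TERABASE = 10**12  # bases in one terabase

# Assumption (not from the article): a laptop drive holds 1 TB.
LAPTOP_DRIVE_BYTES = 10**12


def storage_bytes(bases: int, bits_per_base: int = 8) -> int:
    """Bytes needed to hold `bases` nucleotides at a fixed-width encoding."""
    return (bases * bits_per_base + 7) // 8  # integer ceiling division


def drives_needed(total_bytes: int, drive_bytes: int) -> int:
    """Number of drives of size `drive_bytes` needed to hold `total_bytes`."""
    return (total_bytes + drive_bytes - 1) // drive_bytes


soil_bases = 50 * TERABASE                            # the article's figure for a gram of soil
plain = storage_bytes(soil_bases)                     # one byte per base (assumed plain text)
packed = storage_bytes(soil_bases, bits_per_base=2)   # A/C/G/T packed into two bits each
laptops = drives_needed(plain, LAPTOP_DRIVE_BYTES)
implied_hgp_bases = soil_bases // 1000                # "1,000 times more data"

print(f"plain text: {plain / 1e12:.1f} TB on {laptops} laptops")
print(f"2-bit packed: {packed / 1e12:.1f} TB")
print(f"implied Human Genome Project output: {implied_hgp_bases / 1e9:.0f} gigabases")
```

Under these assumptions the raw sequence alone fills 50 one-terabyte drives, which matches the laptop count quoted above. Even packing each base into two bits leaves 12.5 TB, before any quality scores or read names are stored.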

MSU describes the analysis of such DNA with traditional computing methods as being comparable to trying to eat a large piece of pizza in a single bite. The new method employs a filter that "folds" the metaphorical pizza up compactly using a special data structure, enabling computers to nibble at slices of the data and eventually digest the entire sequence. Reportedly, this technique creates a 40-fold decrease in memory requirements.

"It was clear that splitting thesebig data sets up in some way was necessary for scaling, and—as wefound out later—other people were already using the idea that thedata sets consisted of many different organisms to improve thequality of the analysis," Brown says. "The trick was that loadingthe data in was the problem, so it didn't matter if you split themup after that—the challenge was loading them in the first place. Sowe noodled around with a couple of different concepts, and eventuallycame up with this approach of using a probabilistic graphrepresentation that would let us load all the data in withoutnecessarily representing it perfectly."

First, though, they had to invent the data structure, and the PNAS paper, titled "Scaling metagenome sequence assembly with probabilistic de Bruijn graphs," is ostensibly about the data structure, he notes, although it does actually demonstrate that the approach works as well.

"We started the basic work just overtwo years ago, and it took us about six months to have something thatbasically worked. We ran into a big stumbling block that had to dowith real-world data, though, and that took us another six months tofigure out; the solution is not represented in this paper, it'sactually in a paper that we're just about to submit," Brownexplains. "In this paper, we start the journey and show that itworks on some data sets. The next paper will show how to get it towork under a specific kind of erroneous data and the third paper willactually use it 'in anger' on data sets that can't actually betackled any other way, as far as we know."

In terms of scaling, he notes, it's not clear that biomedical fields working with microbial ecology and the metagenomics of human-associated communities, such as the Human Microbiome Project or MetaHIT, will need this kind of technique.

"The microbial communities that livein and on humans seem to be much less complex than communities in thesoil or in the water and so you just don't need as much data tolook at them. That's not to say that extra efficiency isn't nice,just that it doesn't seem to be a roadblock to progress," Brownsays. "However, an important component of our follow-on work is onvalidating the quality of our and others' approaches, though,because that's central to what we've been doing. I think therewill be significant positive repercussions from this on things likestudying the evolution of drug resistance in human-associatedmicrobes, and studying pathogenicity and virulence of entericpathogens as members of the community, but it's a bit too soon topoint at something right now. This is definitely more fundamentalresearch than applied research."

Brown and Jim Tiedje, the university's distinguished professor of microbiology and molecular genetics, made the complete source code and the ancillary software available to the public to encourage extension and improvement, and will continue their own line of research as well. At least one research group, Brown notes, has already modified the technique and taken it in a new direction to build a better genome assembler.