CAMBRIDGE, Mass.—Researchers from the MassachusettsInstitute of Technology (MIT) and Harvard University have described a newalgorithm that offers a significant reduction in the amount of time required tolocate a particular gene sequence in a genome database. Not only that, but thealgorithm offers greater speedup the more genomes it's searching. Thedevelopment has the potential to offer some relief, given that genomesequencing is becoming a staple in research, with a full-genome sequencing nowable to be completed in a couple of weeks. Though the rate at which genomes canbe sequenced has been doubling roughly every four months since 2002, thecomputing power for such sequencing has only been doubling roughly every 18months.
"You have all this data, and clearly, if you want to storeit, what people would naturally do is compress it," Bonnie Berger, a professorof applied math and computer science at MIT and senior author on the paper,said in a press release. "The problem is that eventually you have to look atit, so you have to decompress it to look at it. But our insight is that if youcompress the data in the right way, then you can do your analysis directly onthe compressed data. And that increases the speed while maintaining theaccuracy of the analyses."
The algorithm takes advantage of the fact that there isoften extensive overlap in the genomes of closely related species, and evensome overlap in distantly related ones. Berger, Michael Baym, Ph.D., and Po-RuLoh, her former and current grad students respectively, developed a way tomathematically represent different species' genomes so that overlapping data isonly stored once. This allows searches of multiple genomes to focus on thedifferences between the genomes and save time.
When the team experimented on a database of 36 yeastgenomes, they compared their new algorithm to BLAST, the Basic Local AlignmentSearch Tool, which is one of the most commonly used genomic-search algorithmsin biology. The experiment saw the new algorithm performing twice as fast asBLAST in a search of 10 genomes, and four times as fast in a search of all 36genomes.
"If I want to run a computation on my genome, it takes acertain amount of time," said Baym in a press release. "If I then want to runthe same computation on your genome, the fact that we're so similar means thatI've already done most of the work."
The new algorithm is expected to have potential in a varietyof applications in which researchers are looking to identify similar genomicsequences, such as the identification of microbes. It could aid researchers indetermining the causes of infections, to characterize "microbiomes," tocharacterize microbes in fertile or infertile soil and possibly even in thestudy of physical evidence in forensics.
Berger, Baym and Loh are now working to extend theirtechnique to also work with information on proteins and RNA sequences.
SOURCE:MIT press release