EVENTS | VIEW CALENDAR
All about the algorithms
CAMBRIDGE, Mass.óResearchers from the Massachusetts Institute of Technology (MIT) and Harvard University have described a new algorithm that offers a significant reduction in the amount of time required to locate a particular gene sequence in a genome database. Not only that, but the algorithm offers greater speedup the more genomes it's searching. The development has the potential to offer some relief, given that genome sequencing is becoming a staple in research, with a full-genome sequencing now able to be completed in a couple of weeks. Though the rate at which genomes can be sequenced has been doubling roughly every four months since 2002, the computing power for such sequencing has only been doubling roughly every 18 months.
"You have all this data, and clearly, if you want to store it, what people would naturally do is compress it," Bonnie Berger, a professor of applied math and computer science at MIT and senior author on the paper, said in a press release. "The problem is that eventually you have to look at it, so you have to decompress it to look at it. But our insight is that if you compress the data in the right way, then you can do your analysis directly on the compressed data. And that increases the speed while maintaining the accuracy of the analyses."
The algorithm takes advantage of the fact that there is often extensive overlap in the genomes of closely related species, and even some overlap in distantly related ones. Berger, Michael Baym, Ph.D., and Po-Ru Loh, her former and current grad students respectively, developed a way to mathematically represent different species' genomes so that overlapping data is only stored once. This allows searches of multiple genomes to focus on the differences between the genomes and save time.
When the team experimented on a database of 36 yeast genomes, they compared their new algorithm to BLAST, the Basic Local Alignment Search Tool, which is one of the most commonly used genomic-search algorithms in biology. The experiment saw the new algorithm performing twice as fast as BLAST in a search of 10 genomes, and four times as fast in a search of all 36 genomes.
"If I want to run a computation on my genome, it takes a certain amount of time," said Baym in a press release. "If I then want to run the same computation on your genome, the fact that we're so similar means that I've already done most of the work."
The new algorithm is expected to have potential in a variety of applications in which researchers are looking to identify similar genomic sequences, such as the identification of microbes. It could aid researchers in determining the causes of infections, to characterize "microbiomes," to characterize microbes in fertile or infertile soil and possibly even in the study of physical evidence in forensics.
Berger, Baym and Loh are now working to extend their technique to also work with information on proteins and RNA sequences.
SOURCE: MIT press release