Texas A&M and IBM in ‘race against time’ to speed drug discovery for tuberculosis

Researchers at Texas A&M University have found that analyzing data on specific strains of TB often takes longer than it takes for the strains to mutate again, and IBM technology is helping to clear that bottleneck

Amy Swinderman
COLLEGE STATION, Texas—In their quest to sequence the DNA of tuberculosis (TB) strains, researchers at Texas A&M University have run into an unfortunate bottleneck: analyzing data on specific strains often takes longer than it takes for the strains to mutate again. That race against time is getting a boost, thanks to supercomputing technology developed by the university and computational giant IBM that enables the researchers to sequence the DNA of a specific strain in mere hours instead of days.

This accelerated research has the potential to save the lives of the nearly two million people who die of TB every year, according to the World Health Organization, which also estimates that one-third of the human population could be carrying a latent infection. According to the researchers, their work is needed to prevent drug-resistant strains from becoming the dominant form of TB—but first, they needed to speed up their analysis time to get ahead of possible strain mutations.

The researchers had been using Illumina's Genome Analysis Pipeline to analyze the genome of mutated strains, but hit the bottleneck in the very first step of their analysis. Because the Illumina software was designed to run on a single box, the scientists could not perform the analysis on a cluster. This meant it took about nine hours to move the nearly 253,000 images from a typical experiment, about two terabytes of data, across campus.
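As a back-of-the-envelope check of that bottleneck, a sketch assuming the article's round figures of two terabytes, 253,000 images and nine hours, the effective transfer rate and per-image size work out roughly as follows:

```python
# Rough throughput implied by the figures quoted in the article:
# ~2 TB of image data, ~253,000 images, moved in ~9 hours.
DATA_BYTES = 2e12        # 2 terabytes (decimal), an assumption
NUM_IMAGES = 253_000
HOURS = 9

transfer_rate_mb_s = DATA_BYTES / (HOURS * 3600) / 1e6
per_image_mb = DATA_BYTES / NUM_IMAGES / 1e6

print(f"effective rate: {transfer_rate_mb_s:.1f} MB/s")   # roughly 62 MB/s
print(f"per image:      {per_image_mb:.1f} MB")           # roughly 8 MB
```

At roughly 62 MB/s sustained for nine hours, the data movement alone, before any computation, dominates the experiment.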

That is when the researchers tapped the expertise of Dr. Raffaele Montuoro, a computational scientist at Texas A&M's Supercomputing Facility, and requested access to the facility's 832-core "Hydra" cluster, an IBM Power system running AIX.

"The researchers' pipeline has to do three things: analyze all pictures, establish the sequence and then statistical analysis. The bottleneck was in the first step, the raw analysis of data," Montuoro explains. "At the most, you could analyze eight pictures at a time."

Montuoro and his colleagues developed a parallel version of Illumina's Genome Analysis Pipeline, called pGAP, to run on the cluster, which has cut the analysis time by more than a factor of four. pGAP allows data to flow almost automatically from an Illumina Genome Analyzer IIx to the Hydra cluster, where the large datasets created by DNA sequencing can be processed quickly in parallel.
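The internals of pGAP are not described here, but the general pattern, many independent image tiles fanned out across worker processes instead of eight at a time on one box, can be sketched with Python's standard multiprocessing module (the function, directory and file names below are hypothetical):

```python
from multiprocessing import Pool
from pathlib import Path

def analyze_tile(path):
    """Placeholder for the per-image analysis step; in the real
    pipeline this would extract base intensities from one tile."""
    return path.name, path.stat().st_size  # stand-in result

if __name__ == "__main__":
    # Hypothetical run-folder layout; each tile is independent,
    # so the whole set can be mapped across a pool of workers.
    tiles = sorted(Path("run_folder").glob("*.tif"))
    with Pool(processes=8) as pool:
        results = pool.map(analyze_tile, tiles)
```

Because the tiles share no state, scaling up is mostly a matter of raising the worker count, which is what makes the per-CPU speedup figures below meaningful.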

With pGAP, Texas A&M researchers can now control how fast they want to process their raw datasets by selecting the appropriate number of central processing units (CPUs), Montuoro says. Currently, the speedup is 4.3 times on 128 CPUs, rising to 5.2 times on 256 CPUs and higher still on more CPUs.
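Those diminishing returns are the shape Amdahl's law predicts for a workload with a non-parallel component. As an illustrative calculation only, not a claim about pGAP's actual internals, inverting Amdahl's law, S = 1 / ((1 - p) + p/n), gives the parallel fraction p implied by each measured speedup:

```python
def implied_parallel_fraction(speedup, n_cpus):
    """Invert Amdahl's law S = 1 / ((1 - p) + p/n) to solve for p,
    the fraction of the work that runs in parallel."""
    return (1 - 1 / speedup) / (1 - 1 / n_cpus)

# Speedup figures quoted in the article:
print(implied_parallel_fraction(4.3, 128))   # about 0.77
print(implied_parallel_fraction(5.2, 256))   # about 0.81
```

The two measurements imply somewhat different parallel fractions, so a fixed serial step is not the whole story (I/O and data movement likely matter too), but the broad pattern of large early gains and smaller incremental ones matches the reported numbers.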

The Texas A&M Supercomputing Facility recently increased its processing muscle even further with the addition of a 2,592-core IBM iDataPlex, a highly scalable system that can lower power, cooling and space requirements. Known as "Eos," the new cluster totals 27.14 teraflops. While the Hydra cluster has a total of 832 CPUs, the new IBM iDataPlex offers 2,592 CPUs ready for the job, Montuoro notes.

"The researchers found a particular strain of TB that is very drug resistant and mutates rapidly," Montuoro says. "They needed to perform their analysis before the mutation in a race against time. One way to do that is to speed up the sequencing cycle. In a typical experiment, you have to wait until the analysis is done to know if you are producing meaningful data. If you have the tools to start analyzing the data before the whole sequencing cycle is over, and find out if something went wrong, that can save money and time. The faster you are able to sequence the DNA, the more time you have to work on the next step, which would be trying to find where the mutation is."

The technology is so powerful that it could be applied in other labs across the Texas A&M campus, and could be especially helpful in the study of hospital-borne illnesses and infections, Montuoro adds.

"The present challenge is that this sort of bacteria grows in an isolated environment, and is exposed to a lot of drugs, making it resistant," he says. "If can sequence these strains as fast as we now can, we can try to apply these tools the same way people are applying them to TB. pGAP is the beginning of an effort that the Texas A&M Supercomputing Facility is leading to provide advanced computational resources and expertise to the life science community at TAMU. Our software is part of high-throughput 'physical' pipelines for genome analysis, which will constantly exchange data between sequencers and high-performance computing systems, allowing researchers of any expertise to process their data quickly and efficiently."

Janis E. Landry-Lane, program director at IBM Worldwide Deep Computing, notes that IBM supplies the middleware, the parallel environment that enabled Texas A&M to create pGAP.

"This is part of the IBM HPC software stack," Landry-Lane says. "Texas A&M writes to the interface that we provide. The extremely fast hardware interconnect that allows the nodes to communicate is called the High Performance Switch. So, it is a combination of hardware and software."

IBM has been involved in creating many other hardware/software environments used to assemble and align sequence data for customers, Landry-Lane adds.

"We have used different technologies depending on the customer requirements," she says.

