Take a really good look
Mining data may be the easy part; visualizing it for proper analysis and development is the big challenge now
Even within the ranks of the drug discovery and development community, it was easy to get caught up in the idea that the mapping of the human genome and continuing advances in high-throughput screening would yield a bursting harvest of promising leads. Instead, it might be better to say that we now have much bigger haystacks with better-quality needles hidden inside them.
But unlike the public at large, the research community had a better idea of the hurdles that lay between having a great deal of data and being able to use that data effectively. Visualization of the data is one of the key challenges facing researchers.
Acquisition tools that allow more and deeper mining of data keep improving by leaps and bounds. The visualization tools that aid in making the data manageable and amenable to analysis are also improving, notes David Lowis, senior director of product management for Tripos, but they aren't keeping pace.
Notable improvements have occurred and continue to occur with visualization tools, adds Mark Bayliss, vice president and chief technical officer of Advanced Chemistry Development Inc. (ACD/Labs). So even though they are likely to continue to lag behind acquisition tools for quite some time, there is hope on the horizon: the promise of tools that will put more analysis power into the hands of not just the bioinformatics experts, but also the research community at large.
The pixel problem
"We're not short on information, that much is obvious," Bayliss says. "There are so many automated high-throughput technologies that we're drowning in information. Whether we're talking toxicology or peptides or genomics, there is a great deal of data being generated. Spotfire, which has one of the top visualization suites out there, recently gave a presentation in which they asked, 'How do we get to the point of visualizing not just potentially millions but perhaps billions of data elements?' I don't think we're at the point of needing that yet, but with genomics and proteomics work, we're getting closer and closer, and it takes time to adopt new technologies, so we need to think about such things."
The problem isn't just the need for better storage or data management or even computers with tons of processing power. A more immediate problem is squeezing a huge volume of data into a space that cannot handle it all, notes Jim Robinson, a senior software engineer with the Broad Institute of MIT and Harvard. The average researcher has only a few thousand pixels across his or her computer monitor, but drilling deeply into some data and looking at multiple parameters, particularly with the genome, can easily overwhelm the system.
Making such data manageable has typically required a fair amount of bioinformatics experience, Robinson notes. Most researchers, as skilled and knowledgeable as they are, do not count informatics expertise among their skills. So, only those who have the expertise and the time can do the major work needed beforehand to break the data into small enough chunks for a browser to handle.
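To make the pixel problem concrete: when a track holds millions of points but the window is only a thousand or so pixels wide, one common preprocessing step, and a plausible stand-in for the chunking Robinson describes, is to collapse each pixel column to a few summary numbers before anything is drawn. The sketch below is illustrative only; the function name, the binning choice and the simulated signal are assumptions, not the Broad's code.

```python
import numpy as np

def downsample_to_pixels(positions, values, view_start, view_end, screen_width):
    """Collapse a dense 1-D signal to one (min, max, mean) summary per pixel column."""
    # Keep only the points inside the visible window.
    mask = (positions >= view_start) & (positions < view_end)
    pos, val = positions[mask], values[mask]

    # Assign each remaining point to the pixel column it would be drawn in.
    bin_width = (view_end - view_start) / screen_width
    cols = ((pos - view_start) // bin_width).astype(int)

    summary = []
    for c in range(screen_width):          # a real tool would vectorize this loop
        in_col = val[cols == c]
        if in_col.size:
            summary.append((c, in_col.min(), in_col.max(), in_col.mean()))
    return summary                          # at most screen_width rows, however big the input

# Example: a simulated 5-million-point track reduced to at most 1,200 pixel columns.
pos = np.arange(5_000_000)
sig = np.random.default_rng(0).normal(size=pos.size)
print(len(downsample_to_pixels(pos, sig, 1_000_000, 1_200_000, screen_width=1_200)))
```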
"Our epigenetics group was using next-gen technologies to make very high-resolution chromatin measurements, but with so many base pairs across the genome, they lacked a tool that could visualize that much data without a lot of preprocessing," he recalls. "And we had another project involving the Cancer Genome Atlas that is trying to map out all the mutations and do a cancer map of the genome—and that involves expression, copy number variation, methylation and all sorts of other data."
In trying to give both groups a responsive tool, Robinson and his colleagues found inspiration in Google Maps, which allows people to zoom from a view of the whole world down to focused, high-resolution views at street level. The Integrative Genomics Viewer (IGV) that the Broad came up with (see story page 26) doesn't work quite as smoothly as Google Maps and it doesn't use the exact same technology, but it uses a similar "trick," Robinson says.
Google Maps doesn't load up the whole area you want to view all at once. It gives access to enormous amounts of data, but only provides what you need when you need it, and this could be a key, he says, in making data visualization more accessible to a broader range of researchers.
"As with Google Maps, we want to let researchers get down the street level, where the addresses are the various combinations of ACTG in the genome," Robinson explains. "So, the trick we used was to divide the universe we were looking at into resolution levels, and at each level, only the amount of data that you can actually see on your monitor at that level is loaded into your view. If you can only see so much at one time, there is no reason to give your tool more data than that. So we just serve up the little pieces as you need them."
Too many flavors?
Many different kinds of visualization techniques exist, Tripos' Lowis points out, but he hasn't seen any particularly large changes in the use of visualization tools, nor does he foresee the emergence any time soon of a breakthrough technique. But what he does see, and is very glad to be witnessing, he says, is the movement of more tools into the wider drug discovery and development community in a more accessible format.
"What is happening, more so than changes in visualization techniques, is the spread of dedicated high-tech software that is used only by a few into more generalized software applications," Lowis says. "The more exotic capabilities are becoming more available to the general population, particularly as people create more access tools. You see this on the chemistry side with Spotfire, for example, which is probably the most heavily used visualization tool. It has a wealth of capabilities both for the expert users as well as those who simply want more basic functions."
Everyone knows about histograms, for example, but important visualization tools such as similarity maps and the like generally appeal to a smaller and more focused group of experts.
"And then there are radar plots, which have always been touted in chemistry groups as an interesting way to visualize all the different climates you need to look at in optimization cycles," Lowis adds. "You've seen those more in home-grown software but not as much in commercial software, but that is beginning to change and radar plots are now seeing wider use."
Deploying a new kind of visualization into a research program that is already juggling several different technologies, each with its own learning curve, is difficult, notes ACD/Labs' Bayliss.
As more commercially available software suites come out with a wider array of visualization tools built into them, though, researchers can become exposed to—and gain interest in learning about—techniques they might not have considered yet, Bayliss and Lowis note.
As Lowis points out, researchers may not always use a new technique to its fullest capability, but having access to it might still allow for some insights they might not otherwise have gotten.
Still, it takes time for new techniques to become standard practice in science, Bayliss points out, because there is both a comfort level and a certain degree of efficiency in continuing to do things "the way you've always done them," he says. "It's not just the technological barriers, but the human ones that we have to overcome."
Speed and standards
One factor that might lower that human resistance is for companies to do what Robinson has been doing at the Broad Institute: make visualization tools that give researchers what they need, fast.
"The visualization tools have to be intuitive, but above all, they need to be fast," Bayliss says. "Regardless of how far you drill down and whether you are looking at thousands, millions or billions of data elements, it has to be real-time. Also, it has to be a drag-and-drop style in most cases, where you can pull new data elements in or pull them out as you look for a certain granularity of information."
Also critical are common format standards. Many vendors provide wonderful technological solutions for data visualization and analysis, Bayliss says, but what is lacking are "normalized formats."
With multiple types of equipment and software being used in any given discovery or development operation, researchers need to know that they can put in 10 different types of inputs or data and know that they will be treated in similar ways across all the different technologies, he says.
"From sample prep to data handling, a tremendous amount of detailed thought needs to go into making sure everything is consistent and reproducible," he says. "There is a lot of investment in this area and there are any number of groups working on it different areas of research, but it will take time to get to a point where the data are ready to go into a variety of basic visualization tools from different vendors, much less the more complex visualization approaches that more people are getting interested in."
Does Microsoft 'sea' the future?
At Microsoft Live Labs, one of the ongoing projects is Seadragon, which comes out of the acquisition of a company by the same name in early 2006. The Seadragon technology allows people to quickly view large images from their computers or other electronic devices, so that, for example, a person could zoom in close on a single word from hundreds of displayed pages of a novel or find a single small road on a vast map of the world.
In demos and reports about the Seadragon project, the technology is said to allow speedy navigation regardless of the size or number of objects, performance that depends only on the ratio of bandwidth to pixels on the screen, smooth transitions, and "near perfect" and rapid scaling for screens of any resolution.
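That "bandwidth to pixels" claim is a property of tile pyramids in general: whatever the source image size, the viewer fetches only the tiles needed at one zoom level to cover the screen. The arithmetic below is a generic deep-zoom calculation under assumed 256-pixel tiles, not Seadragon's own code.

```python
import math

def tiles_for_viewport(image_w, image_h, screen_w, screen_h, tile=256):
    """Pick the first pyramid level at which the whole image fits the screen
    and count how many tiles must be fetched to draw it."""
    # Level 0 is full resolution; each higher level halves both dimensions.
    scale = max(image_w / screen_w, image_h / screen_h, 1.0)
    level = math.ceil(math.log2(scale))
    level_w = math.ceil(image_w / 2 ** level)
    level_h = math.ceil(image_h / 2 ** level)
    return level, math.ceil(level_w / tile) * math.ceil(level_h / tile)

# A 100,000 x 100,000-pixel scan and a 4,000 x 3,000 snapshot need a similar,
# small number of tiles to fill a 1600 x 1200 screen: load tracks the screen,
# not the source image.
print(tiles_for_viewport(100_000, 100_000, 1600, 1200))   # (7, 16)
print(tiles_for_viewport(4_000, 3_000, 1600, 1200))       # (2, 12)
```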
Blaise Agüera y Arcas, one of the developers working on Seadragon, says the technology is really about "doing away with the limits on screen real estate." This is something that Mark Bayliss, vice president and chief technical officer of ACD/Labs, says may be a critical technology as visualization tools move from the need to handle potentially millions of data elements to maybe even needing to handle billions in the not-too-distant future.
So far, the buzz and the demos have focused more on the ability of Seadragon to handle multitudes of high-resolution photographic or scanned images, but Bayliss suspects the technology would work well for the data mining and data analysis needs faced by drug discovery and development researchers.
The idea that the Seadragon technology might one day be adapted to pharma isn't so far-fetched, given that Microsoft has already set itself up as a major presence in the industry with the formation of the BioIT Alliance, a group of organizations spanning the pharmaceutical, biotech, hardware and software industries that is exploring new ways to share complex biomedical data and collaborate among multi-disciplinary teams to speed the pace of discovery in the life sciences. DDN