IBM makes massive research data available to public
Partnership with NIH will connect researchers to information that would normally take months--or even years--for most companies or organizations to analyze and collect
NEW YORK--Researchers across multiple disciplines now have access to massive amounts of curated patent, scientific content and molecular data courtesy of new cloud-based analytics software from IBM and a public-private partnership with the U.S. National Institutes of Health (NIH). The partnership will connect researchers to information that would normally take months--or even years--for most companies or organizations to analyze and collect.
Called the IBM Strategic IP Insight Platform, or SIIP, the software is a combination of data and analytics delivered via the IBM SmartCloud. Having been in development at IBM for six years to support IBM's own intellectual property strategy, the platform has been further developed in collaboration with several top pharmaceutical companies to aggregate and search millions of patents and chemical symbols--and now, thanks to a public-private partnership between IBM and the NIH, this data will be made publicly available to the research community.
This announcement was made Dec. 8 at a forum called "U.S. Competitiveness: The Next 100 Years Forum with IBM" at Hunter College's Roosevelt House Public Policy Institute, where IBM convened experts from private enterprise, government and academia to explore the growing importance of public-private partnerships in industry.
"At this one-time forum, we discussed how innovation was created in the past, and how private and public organizations need to get together to drive new initiatives and fuel the economy," says Dr. Ying Chen, an IBM researcher. "Making this information available through the NIH is a goodwill gesture to the scientific research community."
IBM, along with several partners in Big Pharma, has been working for six years to create the text-mining capabilities that will allow the company to automatically extract chemical names and biological entities from diverse research content like patents, scientific literature and abstracts.
"We think of our partners as being subject matter experts," says Chen. "As we developed the technology, we needed chemists to tell us if an extracted name was right. They really helped us enhance this technology on an ongoing basis and shape it to where it is today. This is why we have a lot of consistency and quality in the speed of our annotation."
SIIP represents a new, cloud-driven method for curating and analyzing massive amounts of patents, scientific content and molecular data. Using techniques such as automated image analysis and enhanced optical recognition of chemical images and symbols, it extracts information from patents and literature upon publication, a task that usually takes weeks, months or even longer to complete manually.
"This is the kind of data that requires significant domain expertise to extract and derive. In the past, there have been companies that hired chemists to read documents and manually mark them up and fill out datasets, then sell that information to researchers and corporations. That is an extremely expensive process, one that is not very accessible to most people, and the information obtained can be limited," Chen points out. "In addition, human beings make mistakes. So for example, you can read a document, mark up a chemical name and miss a few characters. Then, when you enter it into a database, it becomes a different chemical compound, and you may not be able to map different chemical names to the right chemical structure.
"We can take large clusters of text and extract chemical names very quickly," she continues. "We can identify published patents and literally hours later get the chemical names into our database--and make all of this information available to the public."
In addition to chemical name extraction, SIIP also provides "normalization of a chemical name," Chen says.
"Each chemical name or single compound can have a variety of names that can map to it. We also have some normalization and mapping of that name to its unique chemical compound and all of its different associated names," she explains. "In addition, we have connected these chemical compounds with patents that mention them, so the end user has the ability to understand who has what patent portfolio for that particular compound."
For those engaged in drug discovery, having access to this data is expected to increase the quality of patent filings and help researchers identify collaboration partners, acquisition targets and new business opportunities. However, IBM expects multiple industries--such as oil/gas and agriculture--to benefit from having immediate access to this kind of information.
"We're so excited about working with this volume of data, and at the same time, we're excited about providing this new commercial product. It's a big win-win for both the general public and the research community, and for IBM and our clients," says Chen.
IBM will contribute the data to the National Center for Biotechnology Information (NCBI), which is part of the National Library of Medicine (NLM), and the Computer-Aided Drug Design (CADD) Group of the NIH's National Cancer Institute (NCI). It will be incorporated in the NCBI's PubChem, a public resource for the scientific community that serves as an aggregator for scientific results, as well as in NCI CADD Group services such as the Chemical Structure Lookup Service and the Chemical Identifier Resolver.
The NIH will make the content available on PubChem at http://pubchem.ncbi.nlm.nih.gov.