MALDEN, The Netherlands—Researchers at the Selected Organic Reactions Database (SORD) and Advanced Chemistry Development (ACD/Labs) announced a collaboration to harvest the contents of academic theses and dissertations worldwide. In the process, they hope to open a whole history of chemical reaction data to the global scientific community.
"Worldwide, there are tens of thousands of theses detailing millions of syntheses," explains Dr. Antony Williams, ACD/Labs CSO. "This work represents the collective efforts of thousands of man-years of innovation, intellect and advances in science."
"We did a full analysis of one of the biggest compound databases with 9 million entries and arrived at less than 500,000 compounds that you could begin to describe as 'medicinally interesting'," says SORD CSO Dr. Dick Wife. "Then you make a simple calculation about how many compounds have been prepared in universities over the last 40 years and you come up with a figure around 50 million. So, 40 million compounds for one reason or another never got published."
To achieve their goals, SORD scientists are relying on infrastructure support from ACD/Labs, which has an extensive portfolio of tools for handling and sharing chemical data. The company is also likely to add value to the data that SORD scientists catalogue.
"Since ACD/Labs develops algorithms for structure-based prediction of physicochemical properties, for nomenclature generation and for spectral prediction, it is likely that ACD/Labs will extend the data content of the SORD database to provide access to a selection of these properties," Williams adds.
SORD is being developed at a time when numerous groups are expanding the repertoire of chemical data repositories. For example, aside from commercial repositories like Chemical Abstracts Service and Elsevier MDL's CrossFire Beilstein, the NIH recently established PubChem, with the goal of providing information about the biological activities of small-molecule compounds.
According to Wife, SORD is different from, yet complementary to, these efforts. Rather than just being digital, he says, the information in SORD is electronic and can therefore be searched with modern data-mining techniques.
Initially, according to Wife, the repository is being filled with historical data of particular interest to pharmaceutical companies, including reactions that offer good yield, no metal contamination, easy isolation or separation, and little or no waste. He expects the project to surpass one million records within the first five years.
The longer-term goal, however, is to have a system that becomes relatively self-sustaining, Williams suggests.
"Our hope is that the value the database delivers to the scientific community will catalyze an interest in contributing data to the system on an ongoing basis and not necessarily await the publication of a specific thesis," he says.