Mining pharma’s data

SAN DIEGO—Pharmaceutical companies will collaborate with researchers at the University of California, San Diego to release previously unpublished proprietary data for drug discovery through a new $3.7-million effort funded by the National Institutes of Health.

The project, which is led by UC San Diego principal investigators Drs. Rommie Amaro, Victoria Feher and Michael K. Gilson, includes a major subcontract to Rutgers University, directed by Dr. Stephen K. Burley of the Research Collaboratory for Structural Bioinformatics Protein Data Bank.

The data provide atomic details of drug mechanisms and will be used to improve computer-aided drug-design methods with the aim of accelerating drug discovery.

“One of the challenges in medical research is the paucity of real-world data available to academic researchers and other interested parties to develop new and improved methods for computer-aided drug discovery,” says Amaro, associate professor of chemistry and biochemistry at UC San Diego. “Pharmaceutical companies generate lots of data in-house as they conduct drug research, but often have difficulty sharing these datasets, due to legal and technical barriers.”

“This project is all about helping companies release the high-quality data they have generated, which has incredible value to researchers working to improve methods of computer-aided drug discovery,” she continues. “Companies want to help, because everyone stands to benefit from the ability to develop new medications more quickly and inexpensively.”

The new Drug Design Data Resource (D3R) will span UC San Diego’s Skaggs School of Pharmacy and Pharmaceutical Sciences, Department of Chemistry and Biochemistry, Center for Research in Biological Systems, Center for Drug Discovery Innovation (CDDI) and the San Diego Supercomputer Center (SDSC). The D3R is being administered through the Center for Research in Biological Systems (CRBS), which is based at the UC San Diego Qualcomm Institute.

D3R researchers will act as “data brokers,” collaborating with scientists and attorneys in the pharmaceutical industry to identify, evaluate, release and enhance useful industrial datasets. The data will then be made available to the drug discovery research community in a manner designed to maximize value, longevity and impact. The benefit to the scientific and medical communities will be the enhancement of platform computational technologies, which can then be used to discover drugs for any number of diseases, including cancer, Alzheimer’s disease and diabetes.

A dataset will ideally comprise approximately 50 or more compound structures provided in SMILES (simplified molecular-input line-entry system) or SDF format, Feher notes, along with the associated Kd, Ka, Ki or IC50 assay values, and at least five target co-crystal structures in PDB format. “There are likely to be cases where crystal structures will be further refined by Dr. Stephen Burley’s group at Rutgers,” Feher adds, “in which case, crystallographic structure factors for electron density mapping may also be provided by the company.”
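As a rough illustration of those criteria, the following Python sketch checks whether a candidate dataset meets them. It is not D3R's actual intake procedure: the class names, field names and the `check_dataset` helper are hypothetical, and the hard cutoffs of 50 compounds and five co-crystal structures simply mirror the approximate figures Feher cites.

```python
from dataclasses import dataclass, field

# Assumed cutoffs, taken from the approximate figures quoted in the article.
MIN_COMPOUNDS = 50
MIN_COCRYSTALS = 5
AFFINITY_TYPES = {"Kd", "Ka", "Ki", "IC50"}
STRUCTURE_FORMATS = {"smiles", "sdf"}


@dataclass
class Compound:
    """One compound record (hypothetical layout)."""
    structure: str         # SMILES string or SDF record text
    structure_format: str  # "smiles" or "sdf"
    affinity_type: str     # one of AFFINITY_TYPES
    affinity_value: float  # measured assay value


@dataclass
class Dataset:
    """A candidate dataset: compounds plus co-crystal structure files."""
    compounds: list = field(default_factory=list)
    cocrystal_pdb_files: list = field(default_factory=list)


def check_dataset(ds):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if len(ds.compounds) < MIN_COMPOUNDS:
        problems.append(
            f"only {len(ds.compounds)} compounds; about {MIN_COMPOUNDS} expected"
        )
    if len(ds.cocrystal_pdb_files) < MIN_COCRYSTALS:
        problems.append(
            f"only {len(ds.cocrystal_pdb_files)} co-crystal structures; "
            f"at least {MIN_COCRYSTALS} expected"
        )
    for index, compound in enumerate(ds.compounds):
        if compound.structure_format.lower() not in STRUCTURE_FORMATS:
            problems.append(f"compound {index}: unsupported structure format")
        if compound.affinity_type not in AFFINITY_TYPES:
            problems.append(f"compound {index}: unknown affinity type")
    return problems
```

A dataset with 50 well-formed compound records and five PDB files would yield an empty list, while a smaller or incomplete one would yield one message per shortfall.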

The UC San Diego team is developing a publicly available website, drugdiscoverydata.org, that will serve as a portal for searching and downloading the collected datasets and will link to related resources such as PDB, PubChem, BindingDB, MOAD and ChEMBL. The site will also provide information, participation instructions and challenge dataset downloads.

Multiple industrial partners are currently being recruited to the project. Gilson, a professor of pharmacy and co-director of the CDDI, notes that the D3R will work closely with pharmaceutical companies to publicly release data for use by researchers developing new software to speed the discovery of new medicines for a variety of diseases.

“Negotiations have begun, and we have found several companies [which] are very enthusiastic about the project,” Feher states. “Three companies expressed their support when we initiated our grant submission, and we are looking to them for our initial datasets. Members of the computational community, whether in pharmaceutical companies or academia, recognize the value in having these datasets publicly available.”

“The drug companies, for example, might provide the structure of a protein and 50 molecules created during a drug discovery project, as well as how well those drugs and drug candidates bind to their protein targets,” adds Gilson. “Researchers worldwide will use these data as benchmarks to test their methods and improve their accuracy.”
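To make the benchmarking idea concrete, here is a minimal Python sketch of how a researcher might score a method's predicted binding affinities against measured values. The article does not say which metrics D3R will use; root-mean-square error on a negative-log scale is an assumption chosen for illustration, and the function names are hypothetical.

```python
import math


def to_p_value(molar_affinity):
    """Convert a molar affinity (e.g. an IC50 in mol/L) to its negative log10."""
    if molar_affinity <= 0:
        raise ValueError("affinity must be a positive molar concentration")
    return -math.log10(molar_affinity)


def rmse(predicted, measured):
    """Root-mean-square error between two equal-length sequences of values."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def score_predictions(predicted_molar, measured_molar):
    """Score predictions against measurements on the negative-log scale."""
    predicted = [to_p_value(value) for value in predicted_molar]
    measured = [to_p_value(value) for value in measured_molar]
    return rmse(predicted, measured)
```

A lower score means the predictions track the experimental values more closely; in practice, a benchmark would likely report several complementary measures rather than this one alone.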

Feher, a project scientist in the UC San Diego Department of Chemistry and Biochemistry and lead for discovery resources at the CDDI, says these types of real-world challenges represent a powerful way to test and improve computational methods and thereby speed drug discovery and reduce costs.

“Unfortunately, drug discovery is still, to a large extent, trial and error,” says Feher, who spent a decade working for the pharmaceutical industry before coming to UC San Diego. “What computational chemists globally are trying to do is to make faster, more accurate, more predictive programs to speed up the process. Part of our mission is to engage the community in these challenges to test newly developed predictive algorithms. We will have annual meetings wherein performance of various methods will be evaluated and discussed. Outcomes will guide methodology improvements, which can then be tested again in the next D3R community challenge.”

Adds Gilson: “There’s a sense that, although computational drug discovery is already useful, it hasn’t fulfilled its potential, and that the calculations are not as accurate as they could be. We want to help the research community objectively identify the strengths and weaknesses of existing methods, so the results can be fed back into a process of continuous improvement and thus advance the field.”