BURLINGAME, Calif.—In late 2018, Collaborative Drug Discovery Inc. (CDD) announced that they had won a competitive, peer-reviewed Phase 1 SBIR grant from the National Institutes of Health’s (NIH) National Center for Advancing Translational Sciences (NCATS), entitled “Digital representation of chemical mixtures to aid drug discovery and formulation.” CDD proposes to develop a suite of software modules to enable scientists to unambiguously represent chemical mixtures in standard machine-readable formats.
The International Chemical Identifier (InChI) has emerged over the past decade as the predominant international standard to concisely represent chemical structures as encoded text strings that computers can quickly index, sort, search and compare. However, in practice, chemicals are typically formulated as mixtures. Even if a mixture consists of one principal ingredient that is well represented by an InChI identifier, the mixture may also contain solvents, adjuncts, cofactors, impurities, etc. that also need to be properly recorded.
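To make the point about indexing concrete, here is a minimal sketch that treats InChI strings as plain dictionary keys. The three identifiers are the standard InChI strings for water, methanol and ethanol. The names `INCHI_BY_NAME`, `name_by_inchi` and `lookup` are illustrative assumptions and are not part of any InChI toolkit.

```python
from typing import Optional

# Standard InChI strings for three common substances.
INCHI_BY_NAME = {
    "water": "InChI=1S/H2O/h1H2",
    "methanol": "InChI=1S/CH4O/c1-2/h2H,1H3",
    "ethanol": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
}

# Reverse index: because an InChI is ordinary text, it can serve directly
# as a hash key for fast lookup.
name_by_inchi = {inchi: name for name, inchi in INCHI_BY_NAME.items()}


def lookup(inchi: str) -> Optional[str]:
    """Return the stored name for an identifier, or None if it is unknown."""
    return name_by_inchi.get(inchi)


# Plain string sorting and equality comparison work as well.
sorted_ids = sorted(name_by_inchi)
```

Each key describes exactly one compound, which is why a mixture of, say, ethanol and water has no single InChI of its own.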
A working committee of the International Union of Pure and Applied Chemistry (IUPAC) is close to formalizing “Mixtures InChI” (MInChI), which will extend InChI to become the first standard to encompass mixtures. MInChI will effectively index mixtures in the same way that InChI indexes individual compounds.
InChI provides a concise, canonical summary of a compound rather than a complete structural representation. Other formats, such as the Molfile, hold the full structure information, which can then be converted into an InChI identifier. Similarly, for MInChI to gain traction, new data structures and methods are needed to hold the full description of a mixture, from which MInChI codes can be generated, while retaining the additional descriptive details that InChI identifiers intentionally discard.
CDD proposes to develop this data structure—tentatively called “Mixfile” in analogy to Molfile—along with associated conversion routines and a visual editor essential to implement the MInChI standard. Dr. Alex Clark, a CDD scientist, is a member of the IUPAC/MInChI working committee which is encouraging CDD to address this critical prerequisite to adoption of MInChI. CDD has committed to distribute these key infrastructural elements as open formats and open-source software.
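As a rough illustration of what such a data structure might need to hold, the sketch below models a mixture as a tree of components that can be serialized to JSON. It is not the Mixfile format, whose schema is not described here. The `Component` class, its field names and the example quantity are all hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Component:
    """One node in a hypothetical mixture tree (not the actual Mixfile schema)."""
    name: str
    inchi: Optional[str] = None        # identifier, when the substance has one
    quantity: Optional[float] = None   # amount, interpreted together with `units`
    units: Optional[str] = None
    contents: List["Component"] = field(default_factory=list)


def leaf_inchis(node: Component) -> List[str]:
    """Collect the InChI strings of every identified substance in the tree."""
    found = [node.inchi] if node.inchi else []
    for child in node.contents:
        found.extend(leaf_inchis(child))
    return found


# Illustrative mixture: the quantity is an example value, and the water
# amount is deliberately left unspecified to show that details may be partial.
mixture = Component(
    name="ethanol solution",
    contents=[
        Component("ethanol", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", 70.0, "% v/v"),
        Component("water", "InChI=1S/H2O/h1H2"),
    ],
)

# Round-trip through JSON to show the structure is machine-readable.
serialized = json.dumps(asdict(mixture), indent=2)
restored = json.loads(serialized)
```

The tree keeps names, amounts and units alongside the identifiers, which is the kind of descriptive detail a compact index string would leave out.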
The company also recently won another Phase 1 SBIR grant from NIH NCATS, entitled “Novel Deep Learning Strategy to Better Predict Pharmacological Properties of Candidate Drugs and Focus Discovery Efforts.” CDD is planning to develop a novel approach based on deep-learning neural networks to encode molecules into chemically rich vectors.
CDD is aiming to first apply this representation to build more powerful computational models that can more accurately predict properties—such as bioactivity, ADME/Tox and pharmacokinetics—across libraries of molecular structures. The ultimate goal with this work is to leverage this representation to generate novel compounds with better combinations of properties.
When building a model to predict a pharmacological property of a series of molecules, computational chemists start by selecting what they believe to be the relevant chemical features. They then assemble vectors of molecular descriptors (or fingerprints) that characterize these features, use the vectors to represent the molecules, and perform a regression analysis over them. This approach reduces dimensional complexity and makes the model tractable, but it also throws away much important structural information about the molecules.
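The descriptor-and-regression workflow described above can be sketched in a few lines. This is a toy under stated assumptions: the descriptor is simply the count of C, H and O atoms, and the target is molar mass (g/mol), used as a stand-in because no assay data is given here. All function names are hypothetical.

```python
def descriptor(formula):
    """Map an atom-count dict to a fixed-length vector of (C, H, O) counts."""
    return [formula.get("C", 0), formula.get("H", 0), formula.get("O", 0)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_least_squares(X, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


water = {"H": 2, "O": 1}
methane = {"C": 1, "H": 4}
methanol = {"C": 1, "H": 4, "O": 1}
ethanol = {"C": 2, "H": 6, "O": 1}
dimethyl_ether = {"C": 2, "H": 6, "O": 1}   # a different structure from ethanol
propanol = {"C": 3, "H": 8, "O": 1}

X = [descriptor(f) for f in (water, methane, methanol, ethanol)]
y = [18.015, 16.043, 32.042, 46.069]        # molar masses in g/mol
weights = fit_least_squares(X, y)
propanol_estimate = predict(weights, descriptor(propanol))

# Two distinct molecules collapse onto the same descriptor vector.
collision = descriptor(ethanol) == descriptor(dimethyl_ether)
```

The last line shows the trade-off named in the text: ethanol and dimethyl ether are different molecules, yet this descriptor cannot tell them apart, so any model built on it will predict the same value for both.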
In recent years, many groups have tried to work around this limitation by applying deep-learning techniques, but these efforts have improved the prediction accuracy of computational chemistry models in only a handful of cases, where large sets of assay data are available to train the models. CDD’s approach is to apply deep learning to the more focused problem of encoding the features of molecules into chemically rich vectors. The company plans to do this by coupling the encoder to a complementary decoder, creating an autoencoder, then training both neural networks jointly so that the output of the decoder reproduces the input to the encoder as closely as possible.
Self-training the autoencoder doesn’t require any assay data, but instead relies on feeding it molecular structures (which are conveniently curated in the millions). The chemically rich vector is a narrow layer wedged between the encoder and the decoder to create an information bottleneck, forcing it to become rich in chemical structure information. CDD’s approach is to extract this chemically rich vector and repurpose it as a substitute for conventional molecular descriptors to improve existing predictive models of any type.
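The bottleneck idea can be illustrated with a deliberately tiny example. The sketch below is a linear autoencoder in plain Python, a simplification swapped in for the deep networks described above, and it is not CDD's architecture. The 6-bit input vectors are hypothetical stand-ins for encoded molecules, and the sizes, learning rate and epoch count are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

N_IN, N_CODE = 6, 2  # input width and bottleneck width (arbitrary)

# Hypothetical 6-bit "structure" vectors standing in for encoded molecules.
DATA = [
    [1, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 1],
]


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


enc = rand_matrix(N_CODE, N_IN)  # encoder weights: input -> bottleneck
dec = rand_matrix(N_IN, N_CODE)  # decoder weights: bottleneck -> output


def encode(x):
    return matvec(enc, x)


def reconstruct(x):
    return matvec(dec, encode(x))


def mean_loss():
    """Mean squared reconstruction error over the data set."""
    total = sum(
        sum((o - t) ** 2 for o, t in zip(reconstruct(x), x)) for x in DATA
    )
    return total / len(DATA)


def train(epochs=500, lr=0.05):
    """Full-batch gradient descent on half the squared reconstruction error."""
    for _ in range(epochs):
        g_enc = [[0.0] * N_IN for _ in range(N_CODE)]
        g_dec = [[0.0] * N_CODE for _ in range(N_IN)]
        for x in DATA:
            z = encode(x)
            err = [o - t for o, t in zip(matvec(dec, z), x)]
            # Error propagated back to the bottleneck: dec^T applied to err.
            dz = [sum(dec[i][j] * err[i] for i in range(N_IN)) for j in range(N_CODE)]
            for i in range(N_IN):
                for j in range(N_CODE):
                    g_dec[i][j] += err[i] * z[j]
            for j in range(N_CODE):
                for k in range(N_IN):
                    g_enc[j][k] += dz[j] * x[k]
        for i in range(N_IN):
            for j in range(N_CODE):
                dec[i][j] -= lr * g_dec[i][j] / len(DATA)
        for j in range(N_CODE):
            for k in range(N_IN):
                enc[j][k] -= lr * g_enc[j][k] / len(DATA)


loss_before = mean_loss()
train()
loss_after = mean_loss()

# The bottleneck activations are the compact vector that would be reused
# as model input in place of hand-picked descriptors.
code_vector = encode(DATA[0])
```

Note that training uses only the input vectors themselves as targets, with no labels or assay values. After training, `encode` alone yields the two-number vector for each input, and the decoder can be set aside.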
The chemically rich vectors will make existing models tractable without sacrificing structural detail. The computationally intense training can be performed once, then applied to diverse problems. CDD says their preliminary tests support the hypothesis that the richer structural information preserved in these novel vectors will significantly enhance the performance and ease of use of predictive regression models.
CDD notes that Phase 1 has two specific aims. The first is to re-implement the chemical autoencoder strategy with a new architecture that accepts a natural representation of the molecular graph as input, and to show a substantial improvement in performance compared with the current architecture, which is based on SMILES strings and their obscure grammar. The second is to exploit the chemically rich vectors to develop about four predictive models for diverse pharmacological properties of general interest, and to compare the performance of those models with the best published benchmarks.