CAMBRIDGE, Mass.—Massachusetts Institute of Technology (MIT) researchers have developed a cryptographic system that could help neural networks identify promising drug candidates in massive pharmacological datasets, while keeping the data private. Secure computation done at such a massive scale could enable broad pooling of sensitive pharmacological data for predictive drug discovery.
Datasets of drug-target interactions (DTI), which show whether candidate compounds act on target proteins, are critical in helping researchers develop new medications. Models can be trained to crunch datasets of known DTIs and then, using that information, find novel drug candidates.
In recent years, pharmaceutical firms, universities and other entities have become open to pooling pharmacological data into larger databases that can greatly improve training of these models. Due to intellectual property matters and other privacy concerns, however, these datasets remain limited in scope. Cryptographic methods for securing the data are so computationally intensive that they don’t scale well to datasets beyond, say, tens of thousands of DTIs, which is relatively small.
In a paper published recently in Science, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) described a neural network securely trained and tested on a dataset of more than a million DTIs. The network leverages modern cryptographic tools and optimization techniques to keep the input data private, while running quickly and efficiently at scale.
The team’s experiments show the network runs faster and predicts more accurately than existing approaches; it can process massive datasets in days, whereas other cryptographic frameworks would take months. Moreover, the network identified several novel interactions, including one between the leukemia drug imatinib and the enzyme ErbB4, mutations of which have been associated with cancer, that could have clinical significance.
“People realize they need to pool their data to greatly accelerate the drug discovery process and enable us, together, to make scientific advances in solving important human diseases, such as cancer or diabetes. But they don’t have good ways of doing it,” said corresponding author Bonnie Berger, the Simons Professor of Mathematics and a principal investigator at CSAIL. “With this work, we provide a way for these entities to efficiently pool and analyze their data at a very large scale.”
The researchers pitted their network against several state-of-the-art plaintext (unencrypted) models on a portion of known DTIs from DrugBank, a popular dataset containing about 2,000 DTIs. In addition to keeping the data private, the researchers’ network outperformed all of those models in prediction accuracy. Only two of the baseline models could reasonably scale to the much larger STITCH dataset, and on it the researchers’ model achieved nearly double their accuracy.
The researchers also tested drug-target pairs with no listed interactions in STITCH, and found several clinically established drug interactions that weren’t listed in the database but should be. In the paper, the researchers list the strongest predictions, including one between droloxifene and an estrogen receptor, a pairing that reached Phase 3 clinical trials as a treatment for breast cancer, and one between seocalcitol and a vitamin D receptor, to treat other cancers. The co-first authors of the paper, Brian Hie and Hyunghoon Cho (both graduate students in electrical engineering and computer science and researchers in CSAIL’s Computation and Biology group), independently validated the highest-scoring novel interactions via contract research organizations.
The work could be “revolutionizing” for predictive drug discovery, notes Artemis Hatzigeorgiou, a professor of bioinformatics at the University of Thessaly in Greece. “Having entered the era of big data in pharmacogenetics, it is possible for the first time to retrieve a dataset of this unprecedented big size from patient data. Similar to the learning procedure of a human brain, artificial neural networks need a critical mass of data in order to provide confident decisions. Now [this] is possible the use of millions of data to train an artificial neural network toward the identification of unknown drug-target interactions. Under such conditions, it is not a surprise that this trained model outperforms all existing methods on drug discovery.”
Next, the researchers are working with partners to establish their collaborative pipeline in a real-world setting. “We are interested in putting together an environment for secure computation, so we can run our secure protocol with real data,” Cho says.
Edited from a story by Rob Matheson of the MIT News Office