CAMBRIDGE, U.K.—UK Research and Innovation (UKRI) has awarded £45 million to EMBL’s European Bioinformatics Institute (EMBL-EBI) to enhance the institute’s technical and building infrastructure. The funding, from the UKRI’s Strategic Priorities Fund, will support EMBL-EBI’s existing and emerging data resources, including areas of major interest such as genomics and bioimaging.
EMBL-EBI is a global leader in the storage, analysis and dissemination of large biological datasets. It hosts numerous centralized data resources, which are critically important for academic and commercial life science research. The volume and diversity of life science research data are growing rapidly, partly due to the rise of new technologies such as single-cell sequencing and cryo-electron microscopy. Open, freely available research data are an important driver of new discoveries, but scientific data sharing requires robust data resources.
“EMBL-EBI websites receive over 38 million requests for data or analysis every day,” said Ewan Birney, director of EMBL-EBI. “The demand for our data resources has risen dramatically in the last decade and we expect this trend to continue, so we need to be ready for when it happens.”
To meet the increasing demand for open-access data resources, EMBL-EBI plans to use the UKRI funding to expand the institute’s technical infrastructure by setting up new data storage and scaling up existing data resources. The funding will also be used to provide secure shared analysis platforms for major research collaborations.
“EMBL-EBI collaborates with academia, industry and governments to develop databases, tools and software that make life science research data available to all,” added Melanie Welham, executive chair of the Biotechnology and Biological Sciences Research Council (BBSRC), part of UKRI. “We hope that this funding will enable them to scale up the amazing work they are already doing within the life science community and beyond.”
Separately from the UKRI funding, EMBL-EBI reports that it has combined its knowledge of bacterial genetics and web search algorithms to build a DNA search engine for microbial data. The search engine, described in a paper published in Nature Biotechnology, could enable researchers and public health agencies to use genome sequencing data to monitor the spread of antibiotic resistance genes.
The search engine, called Bitsliced Genomic Signature Index (BIGSI), serves a purpose similar to that of internet search engines like Google. EMBL-EBI pointed out that the amount of sequenced microbial DNA is doubling every two years and that, until now, there has been no practical way to search it. By making this vast amount of data discoverable, the search engine could allow researchers to learn more about bacteria and viruses.
“This search engine complements other existing tools and offers a solution that can scale to the vast amounts of data we’re now generating,” noted Phelim Bradley, a bioinformatician at EMBL-EBI. “This means that the search will continue to work as the amount of data keeps growing. In fact, this was one of the biggest challenges we had to overcome. We were able to develop a search engine that can be used by anybody with an internet connection.”
Conventional search engines use natural language processing to search through billions of websites, and take advantage of the fact that human language is relatively unchanging. But microbial DNA shows the imprint of billions of years of evolution, so each new microbial genome can contain new “language” that has never been seen before. The key to making BIGSI work was finding a way to build a search index that could cope with the diversity of microbial DNA.
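The article does not describe BIGSI's internals, so the following is only a toy illustration of what a bit-sliced signature index can look like, not EMBL-EBI's implementation. It assumes each dataset is summarized as a Bloom filter over its k-mers (short DNA substrings) and that the filters are stored row by row, so that a query reads a handful of rows and intersects them. The class name, function names and the parameter values K, M and H are all hypothetical.

```python
import hashlib

K = 5     # k-mer length (toy value, an assumption)
M = 1024  # bits per Bloom filter (toy value, an assumption)
H = 3     # hash functions per k-mer (toy value, an assumption)


def kmers(seq, k=K):
    """Return the set of all length-k substrings of a DNA sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bit_positions(kmer, m=M, h=H):
    """Map a k-mer to h Bloom-filter bit positions using seeded SHA-256."""
    positions = []
    for seed in range(h):
        digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
        positions.append(int.from_bytes(digest[:8], "big") % m)
    return positions


class ToyBitslicedIndex:
    """One Bloom filter per dataset, stored as rows of a bit matrix.

    rows[r] is an integer whose bit j is 1 if dataset j set Bloom bit r.
    """

    def __init__(self, m=M):
        self.rows = [0] * m
        self.names = []

    def add(self, name, seq):
        column = len(self.names)
        self.names.append(name)
        for kmer in kmers(seq):
            for r in bit_positions(kmer, len(self.rows)):
                self.rows[r] |= 1 << column

    def search(self, query):
        """Return datasets that may contain every k-mer of the query."""
        query_kmers = kmers(query)
        if not query_kmers:
            return []  # query shorter than K: nothing to look up
        hits = (1 << len(self.names)) - 1  # start with all datasets
        for kmer in query_kmers:
            for r in bit_positions(kmer, len(self.rows)):
                hits &= self.rows[r]  # keep datasets with this bit set
        return [n for j, n in enumerate(self.names) if (hits >> j) & 1]


index = ToyBitslicedIndex()
index.add("sample_a", "ATGGCGTACGTTAGCCGATAGGCT")
index.add("sample_b", "TTTTCCCCAAAAGGGGTTTTCCCC")
result = index.search("GTACGTTAGC")  # substring of sample_a
```

Because Bloom filters can report false positives but never false negatives, `result` in this sketch is guaranteed to include `sample_a`; a real system would still need to verify candidate hits. The appeal of the row-wise layout is that a query touches only a few rows regardless of how many datasets (columns) have been indexed.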
“We were motivated by the problem of managing infectious diseases and antibiotic resistance,” explained Zamin Iqbal, research group leader at EMBL-EBI. “We know that bacteria can become resistant to antibiotics either through mutations or with the help of plasmids. We also know that we can use mutations in bacterial DNA as a historical record of bacterial ancestry. This allows us to infer, to some extent, how bacteria might spread across a hospital ward, a country or the world. BIGSI helps us study all of these things at massive scale. For the first time, it allows scientists to ask questions such as ‘has this outbreak strain been seen before?’ or ‘has this drug resistance gene spread to a new species?’ Making genomics data searchable at this point is essential, and it will allow us to learn a huge amount about biology, evolution, the spread of disease, and much more.”
EMBL-EBI also recently launched the Common Infrastructure for National Cohorts in Europe, Canada and Africa (CINECA), an international project that it leads. A virtual cohort of data from 1.4 million individuals will be made accessible to approved researchers around the world through CINECA’s federated cloud-based network. Registered researchers will be able to analyze population-scale genomic and biomolecular data. Comprising 18 partner organizations across three continents, CINECA has data from 11 cohorts selected to provide a diverse representation of rare disease studies, common disease studies and longitudinal national cohorts.
Federated international sharing of human data presents ethical and technical challenges. To protect patient privacy, access to the federated data cohorts will follow the established structure used by the Global Alliance for Genomics and Health, under which researchers must formally apply for data access on an individual basis.
“By enabling access to genetic data from diverse human populations, CINECA will support the development of treatments tailored to each individual patient’s genetic profile, the ultimate goal of personalized medicine,” said Thomas Keane, team leader at EMBL-EBI. “Clinicians need to be able to compare a patient’s genome to a large set of healthy people and sick people, in order to understand the underlying genetics of the patient. And by large, we mean hundreds of thousands or even millions of other people.”
A key aim of CINECA is to develop tools that allow for rapid data discovery, secure access and authorization within the cloud. Such tools will enable researchers to quickly discover data relevant to ongoing research projects, without duplicating studies. This raises the potential for new discoveries about the causes of rare and common diseases, such as cancer and diabetes.