Beyond algorithms: building stronger AI foundations for drug discovery

Artificial intelligence is reshaping how scientists study disease and design drugs — but its success depends on one foundation: high-quality, well-governed data.

Modern drug discovery produces more data than ever, from genomics and chemical screening to imaging and patient datasets. The challenge is no longer access, but interpretation. Artificial intelligence (AI) is reshaping that challenge, helping researchers navigate chemical and biological complexity and accelerate decision-making — revealing hidden patterns, predicting molecular behavior, and advancing new drug modalities.

Yet alongside this progress come new challenges. The use of AI today prompts questions about trust, transparency, and the balance between human and algorithmic intuition (1). Moreover, the success of AI hinges on the quality of the data that feeds it, and existing datasets are often fragmented, biased, or incomplete. Too often, setbacks reveal weaknesses in the very data foundations on which AI depends. Strengthening those foundations will be essential to drive meaningful progress in drug discovery.

How scientists use AI — and why success remains elusive

AI now powers multiple stages of the drug discovery pipeline — from mining biomedical literature and predicting drug–target interactions to automating image-based screening and managing clinical trials (2). Deep learning, natural language processing, and computer vision are accelerating analyses once constrained by manual effort. Models like AlphaFold have revolutionized protein structure prediction, while large language models extract insights from millions of papers to uncover disease mechanisms and identify repurposing opportunities (3,4).

As these capabilities expand, organizations are rapidly moving toward broader adoption. A 2025 survey of more than 200 experts across pharma, biotech, software, services, academia, and non-profits shows that over three-quarters of life-science laboratories plan to integrate AI within the next two years, and that AI remains their top investment priority (5).

Yet major barriers still limit large-scale implementation in laboratory environments. Experts in the survey point to low-quality, poorly curated datasets and data that fail to meet FAIR standards (findability, accessibility, interoperability, and reusability) (5). For instance, available datasets often contain missing or biased information: failed results are underreported, formats are inconsistent, and labeling biological data for efficacy and safety analysis remains difficult (2,6). There are also concerns over the privacy and security of sensitive data, shortages of personnel with the right skill sets, and gaps in governance and regulation (5). Beyond these constraints, developing high-quality machine learning (ML) models also demands substantial computational resources, in terms of both specialized hardware and energy consumption.

AI’s promise in life sciences is undeniable, but realizing it demands more than powerful algorithms. It requires strategic leadership: investing in clean, well-governed data; fostering collaboration across research, clinical, and digital teams; and cultivating data fluency across the organization.

Building AI on solid ground

Future efforts to address these challenges should therefore focus on a common foundation for success in AI: data. ML models require large, diverse, and high-quality datasets to reach their full predictive potential, yet no single organization has enough data to achieve this. Addressing data scarcity will require leveraging data from multiple organizations and establishing common data standards (2). A promising path lies in collaborative platforms that use frameworks such as federated learning, which trains models directly across local data sources without exchanging raw data, and data mesh, a decentralized architecture in which each domain team manages its data as a product. By enabling organizations to contribute insights without sharing underlying data, these approaches enhance privacy, scalability, and regulatory compliance while helping overcome data limitations.
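To make the federated learning idea concrete, the short Python sketch below simulates federated averaging across three organizations. It is a minimal, self-contained illustration using only NumPy; the simulated sites, the linear model, and names such as local_update and global_w are invented for this example and are not part of any platform or study cited in this article. The key property to notice is that each site trains on its own private data and only model weights, never raw records, travel to the coordinating server.

# Minimal federated-averaging sketch (illustrative only; all data and
# names here are hypothetical).
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One site's training pass: gradient descent on a linear model.
    Raw data (X, y) never leaves the site; only weights are returned."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # mean-squared-error gradient
        w -= lr * grad
    return w

# Three "organizations", each holding private data with a shared schema.
true_w = np.array([2.0, -1.0, 0.5])
sites = []
for _ in range(3):
    X = rng.normal(size=(200, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=200)
    sites.append((X, y))

# Server loop: broadcast weights, collect local updates, average them.
global_w = np.zeros(3)
for _ in range(20):
    local_ws = [local_update(global_w, X, y) for X, y in sites]
    global_w = np.mean(local_ws, axis=0)  # federated averaging step

print("learned weights:", np.round(global_w, 2))  # approaches [2., -1., 0.5]

Real deployments add secure aggregation, weighting by site sample counts, and differential-privacy safeguards, but the data-never-moves principle is exactly this simple loop.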
Equally important is extracting greater value from existing data. Without clean, well-prepared datasets, even the most advanced AI models can mislead rather than inform. Experimental results, imaging files, genomic profiles, and metadata are often scattered, inconsistent, or incomplete, complicating analysis. Converting raw data to AI-ready formats requires data governance, cleaning, management, and analysis to reduce algorithmic bias and improve transparency and reproducibility (7). The push toward FAIR data principles is increasingly recognized as essential to realizing AI’s potential in life sciences by improving discovery, integration, and reuse by humans and machines (8).

A critical enabler in this process is the use of ontologies: human-created, machine-readable frameworks that define types of entities and their relationships. By standardizing terminology and encoding domain knowledge, ontologies make data structured, consistent, and interpretable across disciplines.
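As a toy illustration of how an ontology-backed vocabulary turns inconsistent annotations into AI-ready data, the Python sketch below maps free-text assay labels from different teams onto canonical terms. The ONTOLOGY mapping and harmonize function are invented for this example; a real pipeline would draw on curated resources such as ChEBI or the Gene Ontology, and this is not a description of how any specific commercial platform works internally.

# Toy ontology-based harmonization sketch (hypothetical vocabulary).
# Controlled vocabulary: canonical term -> known synonyms/variants.
ONTOLOGY = {
    "IC50": {"ic-50", "ic 50", "half maximal inhibitory concentration"},
    "cell viability": {"viability", "cell-viability", "pct viable"},
    "binding affinity": {"kd", "affinity", "dissociation constant"},
}

def harmonize(label: str) -> str:
    """Map a raw, inconsistently formatted label to its canonical term."""
    key = label.strip().lower()
    for canonical, synonyms in ONTOLOGY.items():
        if key == canonical.lower() or key in synonyms:
            return canonical
    return "UNMAPPED:" + label  # flag for manual curation, never guess

# Records from two teams using different conventions for the same readouts.
records = [
    {"assay": "IC-50", "value": 12.5},
    {"assay": "pct viable", "value": 87.0},
    {"assay": "Kd", "value": 3.2},
]

for r in records:
    r["assay"] = harmonize(r["assay"])

print(records)
# -> [{'assay': 'IC50', 'value': 12.5},
#     {'assay': 'cell viability', 'value': 87.0},
#     {'assay': 'binding affinity', 'value': 3.2}]

Flagging unmapped labels rather than silently guessing is the design choice that keeps downstream models auditable.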
Solutions like Revvity’s Signals One™ use built-in FAIR data structures and ontology support to capture, harmonize, and structure data from different disciplines and teams into a unified analytical framework. By ensuring that data are clean, annotated, and interoperable, such platforms allow downstream AI tools to operate efficiently and reliably, reducing noise and enhancing predictive power.

The human in the loop

AI does not replace scientific expertise — it amplifies it. While models can generate hypotheses and accelerate decision-making, interpretation and validation remain firmly in human hands. Drug candidates predicted by AI still require experimental confirmation, and scientists must decide how and when to apply different models based on project needs. Proper training in AI use is therefore critical to ensure informed, context-specific decisions (9). At the same time, reproducibility and data integrity remain vital to ground models in truth (9).

Ultimately, what AI truly transforms is the pace and precision of discovery. By integrating structured data, advanced analytics, and collaborative workflows, solutions like Signals One help turn experiments into insight faster — empowering researchers to advance therapies with greater precision and confidence.

[Image: Stronger, well-governed data foundations, including approaches like federated learning, are key to building trustworthy AI in drug discovery. Credit: iStock.com/Visual Generation]

REFERENCES
1. He, J., Hua, C., Wang, Y. & Zheng, Z. Collaborative intelligence in sequential experiments: a human-in-the-loop framework for drug discovery. Preprint at https://doi.org/10.48550/ARXIV.2405.03942 (2024).
2. Zhang, K. et al. Artificial intelligence in drug development. Nat Med 31, 45–59 (2025).
3. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
4. Liu, X. et al. Application of artificial intelligence large language models in drug target discovery. Front Pharmacol 16, 1597351 (2025).
5. Pistoia Alliance & Open Pharma Research. 2025 The Evolution of Labs Report, the three-year overview. (2025).
6. Bender, A. & Cortes-Ciriano, I. Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 2: a discussion of chemical and biological data. Drug Discov Today 26, 1040–1052 (2021).
7. Kidwai-Khan, F. et al. A roadmap to artificial intelligence (AI): methods for designing and building AI ready data to promote fairness. J Biomed Inform 154, 104654 (2024).
8. Wilkinson, M.D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016).
9. Hasselgren, C. & Oprea, T.I. Artificial intelligence for drug discovery: are we there yet? Annu Rev Pharmacol Toxicol 64, 527–550 (2024).