Biotech 3.0: Data as the Catalyst for the Next Revolution in Drug Discovery

The small-molecule field lacks the vast datasets available to other sectors, and much of the data that does exist is proprietary or incomplete. As a result, current ML methods often fail to generalise beyond the narrow scope of their training data, leading to suboptimal outcomes in real-world drug discovery projects. Deane et al. discuss the performance of ML models over time and find little or no improvement on CASF-2016 or the MoleculeNet HIV dataset, and only incremental improvement on USPTO-50k; there has been no ‘AlphaFold 2 moment’.

This discrepancy underscores the importance of focusing not just on algorithms but also on the quality and quantity of the available data. For ML to reach its full potential in drug discovery, the datasets it relies on must be reliable, expansive, and representative of the complexity of small-molecule interactions. Without this foundation, even the best algorithms cannot deliver the breakthrough improvements the industry seeks.

Automated Laboratories: Industrial Data Factories

Scaling for Success: Addressing Data Quantity and Quality

A Data-Centric Future for Drug Discovery