Machine learning-powered drug discovery: Now and Tomorrow
Part 1: Application, progress, and ecosystem dynamics
A month ago, I had a chat with a senior data scientist from big tech about ML for drug discovery. I was told that biotech companies do not have the right datasets and are not using ML tools correctly. More recently, a well-respected biotech investor friend of mine told me that ‘ML isn’t really making a difference for drug discovery’. I found this quite interesting: these two conversations neatly capture the skepticism that ML drug discovery receives from both the tech and bio sides. There is a lot of noise and over-promising around ML drug discovery. However, rejecting a technology trend because of the hype around it is like saying ‘web3 is all scams’. It feels like déjà vu. In the early 2000s, when the human genome had just been sequenced, people were rushing to generate more sequencing data and more knockouts. As the dust settled, the value stayed: now we have Illumina, 10x Genomics, and more. I therefore decided to write this article to share my perspective on the ML drug discovery space. In the first part, I will talk about ML applications across the drug discovery value chain, the key players, and the dynamics of the ecosystem. In the second part, I’ll dive deep into the platforms, business models, and emerging trends in AI-first biotech companies.
I
The human body is a highly complex machine, consisting of over 20,000 genes and 25,000 proteins that work in coordination. A mutation in a single gene can cause errors in a network of proteins that lead to disease. Our understanding of the human body only scratches the surface of this biological complexity. Looking for the right medicine for a disease is like searching for a needle in a haystack. Drug development is a frustratingly expensive, time-consuming process with a high failure rate: it takes 10-15 years and over $1 billion to bring a drug to the market, and only 5% of early drug programs successfully reach approval.
Machine learning (ML) has been making inroads into drug discovery over the past decade. Vertically, ML has been applied to every step of the drug discovery funnel, driving new efficiencies and insights across the whole value chain. Horizontally, ML has been adopted by multiple key players, including biopharmas, big tech, AI-first biotech companies, and a variety of service providers. In this post, I will give an overview of where we stand now with ML-powered drug discovery.
ML technologies have been applied across the whole drug discovery value chain
Target identification
Drug discovery starts with the identification of disease-driving genes and protein targets. It takes discovery biologists years of experience and extensive literature review to suggest potential genes and pathways that may lead to disease phenotypes. ML tools, such as natural language processing and knowledge graphs, excel at mining the literature and databases and at elucidating the complex relationships between compounds, genes, and diseases. Target validation through functional genomics experiments also uses ML to interpret and integrate disparate readouts such as cell morphology, omics profiles, and even animal pathogenesis. To prevent expensive late-stage failures, generating genetic evidence for target identification will become an increasingly important strategic priority. These ML tools have become an important component of the drug discovery platform for biotech and pharma companies.
Drug screening
Once the drug targets are identified, a large library of compounds (10^9-10^10 scale) goes through multiple rounds of screening and optimization. With the help of ML tools, much of this process can be done virtually. Deep learning algorithms can learn from chemistry and physics principles and make predictions on drug-target binding, quantitative structure-property relationships (QSPR), quantitative structure-activity relationships (QSAR), and other pharmacological features. With these predictions, ML models can generate a shortlist of compounds with desirable features for further synthesis, testing, and optimization in the lab.
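To make the shortlisting step concrete, here is a minimal sketch of how predicted scores could be turned into a shortlist. It assumes a model has already produced a binding score and a solubility estimate for each compound. The names `Prediction`, `shortlist`, `binding_score`, and `solubility`, the threshold, and the toy values are all hypothetical and only illustrate the idea; they do not describe any particular company's pipeline.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """Hypothetical model output for one compound (illustrative fields only)."""
    compound_id: str
    binding_score: float  # assumed convention: higher means stronger predicted binding
    solubility: float     # assumed predicted property on an arbitrary 0-1 scale


def shortlist(predictions, k, min_solubility):
    """Keep compounds that pass the property filter, then return the k best binders."""
    eligible = (p for p in predictions if p.solubility >= min_solubility)
    return heapq.nlargest(k, eligible, key=lambda p: p.binding_score)


# Made-up toy values, not real predictions.
toy_predictions = [
    Prediction("cmpd-A", binding_score=0.91, solubility=0.2),
    Prediction("cmpd-B", binding_score=0.85, solubility=0.7),
    Prediction("cmpd-C", binding_score=0.40, solubility=0.9),
    Prediction("cmpd-D", binding_score=0.77, solubility=0.6),
]

top = shortlist(toy_predictions, k=2, min_solubility=0.5)
print([p.compound_id for p in top])  # ['cmpd-B', 'cmpd-D']
```

In this toy case the best predicted binder, cmpd-A, is dropped because it fails the solubility filter, which is the point of screening on several predicted properties at once. Using `heapq.nlargest` avoids sorting the whole library when only a short list is needed.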
Drug design and lead optimization
In the molecular design space, generative modelling has been used to design small-molecule drugs and biologics and to predict their physicochemical properties. Predicting the pharmacokinetic profiles of candidate molecules with ML models can reduce the attrition rate in preclinical and clinical development. So far, generative modelling seems capable only of exploring chemical space adjacent to what is already known. With more high-quality training datasets and better ML algorithms, generative modelling could, over the next five years, become an important part of the computational chemistry toolkit, enabling companies to explore novel chemical space and broaden the pool of drug candidates.
Preclinical and clinical development
In preclinical and clinical development, ML has been applied to biomarker identification, digital health data flow management, and patient stratification for clinical trials. These precision medicine tools are combined to improve the quality and accuracy of clinical trials.
ML drives better timeline and capital efficiency in drug development
A decade after ML was first introduced to drug discovery, we are starting to see some early results. We can’t help but wonder whether the drugs designed by ML are necessarily better.
A study of 24 AI-native drug discovery companies showed that ML-derived drug pipelines have expanded exponentially since 2010, with an annual growth rate of 36%. Based on disclosed data, as of August 2022 there were 23 ML-derived drug-like molecules in clinical trials and over 150 in the preclinical stage.
Judging from the target selection, the companies are making safe bets: most of the drugs were designed for well-defined targets and mechanisms of action (MoAs) that have been proven to play critical roles in disease progression, across oncology, neurobiology, autoimmune, and rare diseases. These are disease areas with huge unmet needs and regulatory advantages. Working with known targets mitigates the target selection risk and allows the ML platform to focus on developing new compounds. The first ML-designed drug targeting a novel pathway is probably Insilico’s ISM001-055, which entered phase I in February 2022 as a treatment for a progressive lung disease. It will be very interesting to look into this new pathway once it is unveiled.
We do not yet have enough clinical trial data to evaluate the success rate of ML-derived drug candidates. Based on a few disclosed programs, there is a mix of hits and misses. DSP1181, an agonist of the serotonin 5-HT1A receptor designed by Exscientia, which made headlines in early 2020 as the ‘first ML-designed drug entering clinical trials’, has now been discontinued. Nimbus Therapeutics, on the other hand, just announced promising safety and PK/PD data from a phase I trial of its allosteric TYK2 inhibitor, NDI-034858, which has now advanced to phase IIb. Recursion is advancing two repurposed drugs to phase II, while cutting a third one from the pipeline after phase I. In the next five years, as more ML-designed drugs enter clinical trials, we will have more data to judge whether ML has helped us select better drug candidates. Keep in mind that even if ML doubles the early compound success rate from 5% to 10%, that still leaves a 90% failure rate. Some ML-derived programs will still fail, but we are hoping for more to succeed. What really matters is how companies learn from the failures and successes and adjust their portfolios and strategies in choosing clinical assets.
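The arithmetic behind that caveat can be sketched in a few lines. The 5% and 10% rates come from the text above; the portfolio size of 100 programs, the function names, and the assumption that programs succeed or fail independently are illustrative simplifications, not data.

```python
def expected_outcomes(n_programs, success_pct):
    """Expected approvals and failures for a portfolio at a given success rate (in %)."""
    approvals = n_programs * success_pct / 100
    return approvals, n_programs - approvals


def prob_at_least_one_success(n_programs, success_pct):
    """Chance that at least one program succeeds, ASSUMING independent programs."""
    return 1 - (1 - success_pct / 100) ** n_programs


# Hypothetical portfolio of 100 early programs.
print(expected_outcomes(100, 5))   # (5.0, 95.0)
print(expected_outcomes(100, 10))  # (10.0, 90.0)

# For a small hypothetical portfolio of 5 programs, compare the two rates.
print(prob_at_least_one_success(5, 5))
print(prob_at_least_one_success(5, 10))
```

Doubling the success rate doubles the expected number of approvals, yet the large majority of programs still fail, which is why individual discontinued programs say little about whether ML is working.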
Another advantage that ML promises is time efficiency, and the initial data suggest that ML has shortened the drug development timeline. Drug repurposing is a good example. In February 2020, when COVID-19 started to spread globally, BenevolentAI used its ML platform and took less than two weeks to identify baricitinib, an arthritis drug developed by Eli Lilly, as a potential blocker of SARS-CoV-2 entry. It took only 25 months from publication of the repurposing proposal to the final FDA approval of baricitinib for hospitalized COVID-19 patients.
On the de novo drug discovery side, multiple clinical-stage compounds completed the entire discovery and preclinical journey in less than 4 years. These initial data points compare favorably with the historical industry standard of 5-6 years. Exscientia’s public data suggest that its preclinical phase has even been shortened to an average of less than 12 months.
By moving faster, companies can save some development costs. More importantly, the sooner a drug gets approved, the longer it can be sold before its patent expires. Thus, a faster timeline gives drug developers and their investors an opportunity for better capital efficiency and a higher net present value (NPV).
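A rough sketch shows why a shorter timeline raises NPV. It assumes a fixed 20-year patent clock that starts when development begins, constant annual sales once the drug launches, and a constant discount rate. The function name `npv_of_sales` and every number here are hypothetical and chosen only for illustration, and development costs are deliberately left out.

```python
def npv_of_sales(dev_years, patent_years=20, annual_sales=1.0, discount_rate=0.10):
    """Discounted value of sales made between launch and patent expiry.

    Sales are assumed to arrive at the end of each year after launch,
    i.e. in years dev_years + 1 through patent_years.
    """
    return sum(
        annual_sales / (1 + discount_rate) ** year
        for year in range(dev_years + 1, patent_years + 1)
    )


slow = npv_of_sales(dev_years=12)  # 8 years on the market under these assumptions
fast = npv_of_sales(dev_years=10)  # 10 years on the market under these assumptions
print(f"slow timeline: {slow:.2f}, fast timeline: {fast:.2f}")
```

Under these assumptions the two extra years of sales are also the least discounted ones, because they arrive earliest, so cutting development time adds more value than the raw count of extra years suggests.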
Key players in the ML drug discovery ecosystem
Biopharmas are the big brothers of the drug discovery industry. They have the infrastructure, talent, and experience in developing drugs, running clinical trials, and bringing drugs to the market. Most of them participate in the ML drug discovery ecosystem by partnering with AI-native startups or big tech. With this strategy, biopharmas get access to the proprietary ML algorithms and databases that the partnering companies generate, without deploying capital to build ML infrastructure and teams. By 2021, over 40 biopharmas had disclosed partnerships with ML platform-building companies. A few biopharmas, such as GSK, decided to develop in-house ML capabilities to stay more active on the ML side. Others opt to form joint ventures or Centres of Excellence to develop new ML technologies.
Big tech companies, such as Google, IBM, and Microsoft, entered the ecosystem by leveraging their strong ML capabilities and engineering teams to support drug discovery. They simply offer ML tools and do not specialise in any particular drug discovery segment. For example, Microsoft announced a strategic alliance with Novartis to apply its algorithms to Novartis’s large datasets and help Novartis identify and develop therapeutics. DeepMind, a Google offshoot, developed a program called AlphaFold that can predict a protein’s 3D structure from its amino acid sequence, a gargantuan leap for structural biology and drug development.
AI-native biotech companies started to emerge during the past 10 years. They are built around cutting-edge ML platforms devoted to improving one or more drug discovery processes. The majority of well-funded startups in the ML drug discovery space operate as drug development companies: they are biotech companies that integrate ML as a core component of their in-house drug discovery process. They typically have a few internal drug development programs and collaborate with biopharmas to develop certain drugs.
Service providers are typically startups that offer their ML platforms to other biotech and pharma companies. These companies usually do not have in-house drug development programs. Their services range from SaaS platforms that provide web tools and cloud solutions for data analysis and management, to wet lab facilities that can handle à la carte, high-content experiments, to end-to-end ML drug discovery services. This is a rapidly growing sector of the ML drug discovery space.
The application of ML in drug discovery is still in its infancy, and the field is evolving rapidly. AI-native biotech startups, where the most active innovation happens, are the main drivers of this continuous evolution. As the first generation of AI-native biotech companies matures, patterns are starting to emerge. These companies show some defining characteristics that are fundamentally different from those of typical asset-centric biotech companies. In the second part of this article, I will unpack these features. I will also talk about some emerging trends among the second-generation AI drug discovery companies.
Stay tuned.