Part one: what can scientists do with LLMs today?

Recently there has been a flurry of announcements from AI-led biotechs about the potential of Large Language Models (LLMs) in early drug discovery. In the first of a three-part series, Dr Raminderpal Singh explores what LLMs are, how early-stage biotechs can take advantage of them, and what challenges they present.

Large Language Model (LLM)

LLMs are a type of deep learning model. The underlying techniques have been around since the 1990s, but it wasn’t until the 2010s that their use became more popular. Such models underpin AI assistants like ChatGPT, and related language technology sits behind voice assistants like Apple’s Siri. Because of their ability to process vast sets of data, there have been attempts to use them to accelerate progress across a range of industries. Early drug discovery is no exception, and many AI-led biotechs, including Recursion¹ and Insilico Medicine², have released LLM-related announcements.

How do LLMs work?

According to ChatGPT³, an LLM is a type of artificial intelligence (AI) model designed to understand and generate human language. These models are based on deep learning architectures, particularly transformers, which enable them to process and generate text with a high degree of coherence and relevance. LLMs, like GPT-4, are trained on vast amounts of text data from the internet, books, articles and other sources. This training allows them to learn the complexities of language, including grammar, context and even some level of reasoning and common sense. That last point is a large claim, so scientists are right to be wary.

It is important to note that:

  • Some LLM providers learn from user questions and iterations on the system. Essentially, they may use and reuse your prompts, which is partly why some LLMs are cheap to access.
  • LLMs are very expensive to build and train.
  • LLMs draw on existing knowledge, so they are not great speculative tools (ie, when looking for insights in ‘dark spaces’).
  • ChatGPT (which includes several add-ons) can be very useful for scientists. However, to make it effective, you must iterate on the prompts and questions, which can be frustrating.

How to reduce LLM hallucinations

The topic of LLM hallucinations, where a model produces plausible but false answers, also needs to be addressed. Hallucinations have made the news⁴ several times. Scientists who have experimented with LLMs will know how frustratingly unreal the answers can sometimes be. Hallucinations cannot be avoided when using non-curated knowledge. Where possible, biotechs should curate their paper cohorts and upload them as the sole source for the analysis. Even then, hallucination checks should be built into the workflow.
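As an illustration only, one simple form of hallucination check is sketched below in Python. It assumes the curated papers are held as a dictionary of short IDs and text, and that the model has been asked to cite a paper ID in square brackets after each claim. The function names, the ID format and the prompt wording are hypothetical, not part of any particular tool. The check flags citations that point outside the curated set and sentences that carry no citation at all; it does not verify that a cited paper actually supports the claim, so a human review step is still needed.

```python
import re

# Matches citations of the assumed form [P1], [paper-12], etc.
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def build_prompt(question: str, papers: dict[str, str]) -> str:
    """Assemble a prompt that restricts the model to the curated papers."""
    sources = "\n\n".join(f"[{pid}] {text}" for pid, text in papers.items())
    return (
        "Answer using ONLY the sources below. Cite a source ID in square "
        "brackets after every claim. If the sources do not answer the "
        "question, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


def check_answer(answer: str, papers: dict[str, str]) -> dict[str, list[str]]:
    """Flag citations outside the curated set and sentences with no citation."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    unknown = sorted({c for c in CITATION.findall(answer) if c not in papers})
    uncited = [s for s in sentences if not CITATION.search(s)]
    return {"unknown_citations": unknown, "uncited_sentences": uncited}


if __name__ == "__main__":
    # Made-up placeholder data, purely to show the shape of the inputs.
    papers = {"P1": "Placeholder abstract one.", "P2": "Placeholder abstract two."}
    answer = "Claim one [P1]. Claim two with no source. Claim three [P9]."
    print(check_answer(answer, papers))
```

Anything the check flags would go back to a scientist for review rather than being discarded automatically.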

How to use LLMs

Users of LLMs can be divided into two types:

  1. those who use what is available in a simple, and sometimes private, way.
  2. those who build and train their own models.

This series of articles will focus on the former type, mainly because the building of new models is currently too expensive for most.  

The simple use of LLMs (point 1 above) can be done in two ways. The first is to go online and use an interactive system like ChatGPT. The second is to use software that includes wrappers around an LLM. The latter approach will be discussed in part three of this series, published on Monday 15 July.
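To make the idea of a wrapper concrete, here is a minimal Python sketch. It is not taken from any specific product: the class name, the prompt template and the stub function standing in for a real LLM are all assumptions made for illustration. The point is only that a wrapper fixes the framing of each request and keeps a record of what was asked, while the underlying model can be swapped.

```python
from typing import Callable


class LiteratureAssistant:
    """Hypothetical wrapper that adds a fixed template around any LLM callable."""

    TEMPLATE = (
        "You are assisting an early drug discovery team.\n"
        "Task: {task}\n"
        "Question: {question}\n"
        "Answer concisely and state your uncertainty."
    )

    def __init__(self, llm: Callable[[str], str]):
        # `llm` is any function that takes a prompt and returns text.
        self.llm = llm
        self.history: list[tuple[str, str]] = []

    def ask(self, task: str, question: str) -> str:
        """Fill the template, call the model, and log the exchange."""
        prompt = self.TEMPLATE.format(task=task, question=question)
        answer = self.llm(prompt)
        self.history.append((prompt, answer))
        return answer


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs without network access."""
    return f"(stub answer to a {len(prompt)}-character prompt)"


if __name__ == "__main__":
    assistant = LiteratureAssistant(fake_llm)
    print(assistant.ask("summarise", "What is known about target X?"))
```

In practice, `fake_llm` would be replaced by a call to whichever model the team has access to, and the logged history gives a starting point for auditing and iterating on prompts.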

LLMs have their limitations and are not best-in-class across all modelling domains. For example, AlphaFold 3 uses Diffusion Models (DMs), as opposed to LLMs, to get better results.⁵,⁶ Both DMs and LLMs come under the popular term Generative AI (GenAI). In addition, there is the term Foundation Model, which is similar to, but more general-purpose than, application-specific LLMs.⁷ As the GenAI field advances there will be more variants, and GenAI will be discussed more broadly in a later article.

In the next article in this series, published Monday 15 July, we will walk through a real science example using ChatGPT and provide the source information for you to replicate the example.

 

References

  1. Proffit A. The LOWE Down on Recursion’s New LLM Orchestration Work Engine from JP Morgan. BioIT World [Internet]. 2024 January 9 [cited 2024 June]. Available from: https://www.bio-itworld.com/news/2024/01/09/the-lowe-down-on-recursion-s-new-llm-orchestration-work-engine-from-jp-morgan 
  2. Insilico Medicine. Insilico and NVIDIA unveil new LLM transformer for solving biological and chemical tasks. News Medical & Life Sciences [Internet]. 2024 May 20 [cited 2024 June]. Available from: https://www.news-medical.net/news/20240520/Insilico-and-NVIDIA-unveil-new-LLM-transformer-for-solving-biological-and-chemical-tasks.aspx 
  3. ChatGPT. Available from: https://chatgpt.com/ 
  4. Merken S. New York lawyers sanctioned for using fake ChatGPT cases in legal brief. Reuters [Internet]. 2023 June 26 [cited 2024 June]. Available from: https://www.reuters.com/legal/new-york-lawyers-sanctioned-using-fake-chatgpt-cases-legal-brief-2023-06-22/
  5. Devansh. Will Diffusion Models Be The Next Frontier of Deep Learning. Medium [Internet]. 2024 May 12 [cited 2024 June]. Available from: https://medium.datadriveninvestor.com/will-diffusion-models-be-the-next-frontier-of-deep-learning-7172bea88581
  6. Sora Creators. What is the difference between a diffusion model and LLM in simple terms. Sora Creators [Internet]. Available from: https://soracreators.ai/blog/What-is-the-difference-between-a-diffusion-model-and-LLM-in-simple-terms
  7. Novita AI. Foundational Model vs. LLM: Understanding the Differences. Medium [Internet]. 2024 May 13 [cited 2024 June]. Available from: https://medium.com/@marketing_novita.ai/foundational-model-vs-llm-understanding-the-differences-820a4428dbc3  

About the author

Dr Raminderpal Singh

Dr Raminderpal Singh is a recognised key opinion leader in the techbio industry. He has over 30 years of global experience leading and advising teams on building computational modelling systems that are both cost-efficient and have significant IP value. His passion is to help early to mid-stage life sciences companies achieve novel biological breakthroughs through the effective use of computational modelling.

Raminderpal is currently leading the HitchhikersAI.org open-source community, accelerating the adoption of AI technologies in early drug discovery. He is also CEO and co-Founder of Incubate Bio – a techbio providing a service to life sciences companies who are looking to accelerate their research and lower their wet lab costs through in silico modelling. 

Raminderpal has extensive experience building businesses in both Europe and the US. As a business executive at IBM Research in New York, Dr Singh led the go-to-market for IBM Watson Genomics Analytics. He was also Vice President and Head of the Microbiome Division at Eagle Genomics Ltd, in Cambridge. Raminderpal earned his PhD in semiconductor modelling in 1997. He has published several papers and two books and has twelve issued patents. In 2003, he was selected by EE Times as one of the top 13 most influential people in the semiconductor industry.

For more: http://raminderpalsingh.com ; http://hitchhikersAI.org ; http://incubate.bio