Scientific workflow for hypothesis testing in drug discovery: part 3 of 3


Posted: 14 February 2025 | Dr Raminderpal Singh (Hitchhikers AI and 20/15 Visioneers), Nina Truter

Drug discovery scientists develop and test complex hypotheses using data and expertise, and build workflows to support this. In this third and final article, Dr Raminderpal Singh and Nina Truter summarise the tools used in the scientific workflow – and include key considerations.


Throughout the workflow described in Figure 1, different tools can play a critical role in facilitating each stage of analysis. From hypothesis generation to data cleaning and interpretation, the appropriate use of tools can significantly improve the efficiency and accuracy of the research process.

1. Data visualisation and hypothesis generation tools

Diagramming tools such as Miro are valuable for mapping out hypotheses. Miro allows researchers to create a visual representation of the relationships between proteins, genes or pathways, helping to clarify the expected interactions within the biological system being studied. This kind of diagram is particularly useful during the hypothesis generation phase, when researchers are still exploring the relationships between different biological components.

ChatGPT is useful for brainstorming and generating new research ideas, and can also be used to explore possible pathways or protein interactions by inputting key terms or genes. It should be used cautiously, however: although it can suggest new pathways or hypotheses to investigate, it should not replace rigorous literature review or empirical evidence.

2. Data cleaning and descriptive analytics tools

Excel remains one of the most commonly used tools for data cleaning and descriptive analytics in many research settings. Researchers use Excel for tasks such as sorting data, identifying outliers and generating basic plots. For larger datasets, however, Excel is limited in both scalability and the complexity of analysis it can support.

R and Python provide robust solutions for handling large datasets and performing advanced statistical analyses. In Python, libraries such as pandas (data manipulation) and Matplotlib (visualisation) cover much of the routine work, while SciPy and statsmodels offer tools for hypothesis testing, regression analysis and other statistical procedures that go beyond Excel’s capabilities.

ChatGPT and Claude.ai can empower scientists with no coding experience by writing custom code for specific analyses and executing it. Again, this is not a replacement for rigorous analysis by data scientists; however, where data scientists are not available, it enables exploration of the data beyond the capabilities of Excel.
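As a rough illustration of the cleaning and testing steps described above, the following minimal sketch uses pandas and SciPy. The column names, sample identifiers and signal values are invented for illustration, and the 1.5 x IQR outlier rule and Welch's t-test are assumptions rather than methods prescribed by the workflow.

```python
import pandas as pd
from scipy import stats

# Hypothetical assay readouts; every value here is made up for illustration.
raw = pd.DataFrame(
    {
        "sample_id": ["c1", "c2", "c3", "c4", "c5", "c6",
                      "t1", "t2", "t3", "t4", "t5", "t5", "t6"],
        "group": ["control"] * 6 + ["treated"] * 7,
        "signal": [1.02, 0.98, 1.05, 0.95, 1.00, 9.99,
                   1.40, 1.35, 1.45, 1.38, 1.42, 1.42, None],
    }
)

# Data cleaning: remove exact duplicate rows and rows with no measurement.
clean = raw.drop_duplicates().dropna(subset=["signal"])


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Apply the outlier rule within each group, then recombine the kept rows.
kept = [grp[~iqr_outliers(grp["signal"])] for _, grp in clean.groupby("group")]
filtered = pd.concat(kept)

# Descriptive analytics: count, mean and standard deviation per group.
summary = filtered.groupby("group")["signal"].agg(["count", "mean", "std"])
print(summary)

# Hypothesis test: Welch's t-test (does not assume equal variances).
control = filtered.loc[filtered["group"] == "control", "signal"]
treated = filtered.loc[filtered["group"] == "treated", "signal"]
result = stats.ttest_ind(treated, control, equal_var=False)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3g}")
```

Whether an extreme value is an artefact or a real biological signal is a scientific judgement; a rule like this only flags candidates for review.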

Another powerful workflow tool is the KEGG Pathway database, which helps researchers map out how proteins and genes interact within known biological pathways. This is especially useful during the hypothesis testing phase, as it enables researchers to visualise how their proteins of interest fit into broader biological processes. The KEGG Pathway database provides insights into metabolic pathways, genetic interactions and disease mechanisms, which are crucial for understanding how a dataset can inform our understanding of complex biological phenomena such as signal transduction, cell proliferation, or immune responses.
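For readers who want to go beyond the KEGG web pages, pathway membership can also be looked up programmatically. The sketch below assumes the KEGG REST "link" operation and its two-column, tab-separated output; the helper names are hypothetical, and the sample text is illustrative rather than a real download.

```python
from collections import defaultdict
from urllib.request import urlopen

KEGG_BASE = "https://rest.kegg.jp"


def kegg_link_url(target_db: str, entries: list[str]) -> str:
    """Build a KEGG REST 'link' URL, e.g. from gene entries to pathways."""
    return f"{KEGG_BASE}/link/{target_db}/{'+'.join(entries)}"


def parse_link_table(text: str) -> dict[str, set[str]]:
    """Group 'gene<TAB>pathway' lines by pathway identifier."""
    genes_by_pathway = defaultdict(set)
    for line in text.strip().splitlines():
        gene, pathway = line.split("\t")
        genes_by_pathway[pathway].add(gene)
    return dict(genes_by_pathway)


def shared_pathways(genes_by_pathway: dict[str, set[str]], min_genes: int = 2) -> list[str]:
    """Pathways that contain at least `min_genes` of the genes of interest."""
    return sorted(p for p, genes in genes_by_pathway.items() if len(genes) >= min_genes)


def fetch(url: str) -> str:
    """Download the response body as text (requires network access)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


# Illustrative text in the assumed two-column format (not a real download).
sample = (
    "hsa:7157\tpath:hsa04115\n"
    "hsa:7157\tpath:hsa04110\n"
    "hsa:4193\tpath:hsa04115\n"
)

url = kegg_link_url("pathway", ["hsa:7157", "hsa:4193"])
print(url)
print(shared_pathways(parse_link_table(sample)))
```

In practice, `fetch(url)` would supply the text passed to `parse_link_table`; the parsing step is kept separate so it can be checked without a network connection.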

Interaction and pathway databases, such as STRING and Reactome, are additional tools that can be used to understand protein-protein interactions and their involvement in cellular processes. These tools are essential for interpreting the results of data analysis, particularly when the dataset reveals unexpected or novel interactions between proteins that require further investigation.
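A STRING lookup can be scripted in a similar way. This sketch assumes the STRING API's "network" method with tab-separated output and the column names `preferredName_A`, `preferredName_B` and `score`; the function names, the 0.7 cut-off and the sample rows are hypothetical and chosen only to show the pattern.

```python
import csv
import io
from urllib.parse import urlencode

STRING_API = "https://string-db.org/api"


def string_network_url(identifiers, species=9606, required_score=700):
    """Build a STRING 'network' request URL (species 9606 = human)."""
    params = {
        "identifiers": "\r".join(identifiers),  # carriage return separates IDs
        "species": species,
        "required_score": required_score,  # assumed scale: 0 to 1000
    }
    return f"{STRING_API}/tsv/network?{urlencode(params)}"


def strong_interactions(tsv_text, min_score=0.7):
    """Keep interactions whose combined score is at least `min_score`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        (row["preferredName_A"], row["preferredName_B"], float(row["score"]))
        for row in reader
        if float(row["score"]) >= min_score
    ]


# Made-up rows with placeholder protein names, in the assumed column layout.
sample_tsv = (
    "preferredName_A\tpreferredName_B\tscore\n"
    "PROT_A\tPROT_B\t0.95\n"
    "PROT_A\tPROT_C\t0.42\n"
    "PROT_B\tPROT_D\t0.71\n"
)

network_url = string_network_url(["TP53", "MDM2"])
print(network_url)
print(strong_interactions(sample_tsv))
```

Filtering on the confidence score before interpretation helps to separate well-supported interactions from weaker predicted associations that need further investigation.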

3. Network and interaction mapping tools

As biological datasets grow in complexity, graph-based tools have become essential for visualising and analysing protein-protein interactions and gene networks. Cytoscape, for example, is a widely used software tool for visualising molecular interaction networks and integrating these with gene expression profiles and other data. In research focused on drug discovery, understanding the interactions between multiple proteins or genes is crucial for identifying potential drug targets or understanding the mechanisms behind drug resistance.

Network-based approaches are also becoming more prevalent as researchers aim to represent complex biological data in more intuitive ways. By visualising data as networks or graphs, scientists can more easily identify hubs, bottlenecks or key players in biological processes, allowing them to focus their efforts on the most critical components of a system.
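To make the idea of hubs and bottlenecks concrete, here is a small sketch in plain Python. The edge list uses placeholder protein names rather than real interaction data, hubs are ranked by degree, and "bottleneck" is approximated as a node whose removal disconnects the network; this is a simplification, since network biology often defines bottlenecks by betweenness centrality instead.

```python
from collections import defaultdict

# Hypothetical undirected interactions between placeholder proteins.
edges = [
    ("PROT_A", "PROT_B"),
    ("PROT_A", "PROT_C"),
    ("PROT_A", "PROT_D"),
    ("PROT_B", "PROT_C"),
    ("PROT_D", "PROT_E"),
    ("PROT_E", "PROT_F"),
]


def build_adjacency(edge_list):
    """Map each node to the set of its neighbours."""
    adjacency = defaultdict(set)
    for a, b in edge_list:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


def rank_by_degree(adjacency):
    """Nodes sorted by number of neighbours, highest first (hubs at the top)."""
    degrees = ((node, len(neighbours)) for node, neighbours in adjacency.items())
    return sorted(degrees, key=lambda item: (-item[1], item[0]))


def reachable(adjacency, start, removed):
    """Nodes reachable from `start` when `removed` is taken out of the network."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


def cut_nodes(adjacency):
    """Nodes whose removal splits the network (assumes it starts connected)."""
    nodes = sorted(adjacency)
    result = []
    for node in nodes:
        others = [n for n in nodes if n != node]
        if others and len(reachable(adjacency, others[0], node)) < len(others):
            result.append(node)
    return result


adjacency = build_adjacency(edges)
ranking = rank_by_degree(adjacency)
bottlenecks = cut_nodes(adjacency)
print("Top hub:", ranking[0])
print("Cut nodes:", bottlenecks)
```

For networks of realistic size, a dedicated tool such as Cytoscape is a better fit; the point here is only to show what the terms mean computationally.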

4. Literature and data curation tools

Data curation is a key part of any workflow, particularly when working with large datasets or integrating data from multiple sources. Tools like GeneCards are useful for obtaining detailed information about genes and their functions. GeneCards offers comprehensive gene-related information, such as pathways, interactions and diseases associated with each gene. This information is invaluable when generating hypotheses or validating findings, since it provides a deeper understanding of how a particular gene or protein fits into the broader biological context.

In addition to GeneCards, tools like Mendeley and Zotero are beneficial for managing research papers and references, particularly for researchers who rely on literature reviews to support their hypotheses and analyses. Proper reference management ensures efficient source tracking and maintains the integrity of work.

5. AI and machine learning tools

As biological research datasets grow in size and complexity, the use of AI and machine learning tools becomes more critical. ChatGPT can function as a brainstorming tool for generating hypotheses or exploring possible pathways. While this tool is still relatively novel in the research community, it represents the growing intersection between AI and drug discovery. ChatGPT can assist by summarising literature, suggesting new angles of inquiry, or even helping to explore large datasets in ways that would be too time-consuming for manual review.

Other machine learning tools, such as TensorFlow or PyTorch, can be used to analyse large datasets and identify patterns that may not be immediately apparent through traditional methods. These tools allow researchers to build predictive models, classify data, or identify novel associations between variables. In drug discovery, machine learning models have been used to predict drug efficacy, optimise compound structures and even simulate biological systems.
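As a hedged illustration of what building a predictive model involves, the sketch below trains a small PyTorch classifier on synthetic data. The "descriptors" and "active/inactive" labels are randomly generated stand-ins, not real compound data, and the network size, learning rate and number of training steps are arbitrary choices.

```python
import torch
from torch import nn

torch.manual_seed(0)  # make the synthetic data and initial weights repeatable

# Synthetic stand-in data: 200 "compounds", 8 numeric descriptors each.
n_samples, n_features = 200, 8
X = torch.randn(n_samples, n_features)
hidden_rule = torch.randn(n_features)
y = (X @ hidden_rule > 0).float()  # 1 = "active", 0 = "inactive"

# Hold out the last 50 samples to estimate performance on unseen data.
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

# A small feed-forward network that outputs one logit per sample.
model = nn.Sequential(nn.Linear(n_features, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

losses = []
for step in range(200):
    optimizer.zero_grad()
    logits = model(X_train).squeeze(1)
    loss = loss_fn(logits, y_train)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Evaluate on the held-out samples; a logit above 0 means "active".
with torch.no_grad():
    predictions = (model(X_test).squeeze(1) > 0).float()
    accuracy = (predictions == y_test).float().mean().item()

print(f"training loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print(f"held-out accuracy: {accuracy:.2f}")
```

Evaluating on held-out samples, as in the last step, is the minimum safeguard against a model that has simply memorised its training data.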

Figure 1: An illustration to demonstrate a high-level workflow for early drug discovery.

Summary of tools and databases used in the workflow:

KEGG Pathway database – the KEGG (Kyoto Encyclopedia of Genes and Genomes) Pathway database provides information on molecular interaction and reaction networks for various biological pathways. https://www.kegg.jp/kegg/pathway.html
STRING database – a database of known and predicted protein-protein interactions, integrating both physical and functional associations. https://string-db.org
Reactome – an open-source, curated pathway database that provides insights into biological processes and molecular interactions. https://reactome.org
GeneCards – a comprehensive database that provides detailed information on all known and predicted human genes, including functions, pathways and related diseases. https://www.genecards.org
Cytoscape – a software platform for visualising molecular interaction networks and integrating these networks with gene expression profiles and other data. https://cytoscape.org
Mendeley – a reference manager and academic social network that helps researchers organise research papers, collaborate online and discover the latest scientific research. https://www.mendeley.com
Zotero – a free, easy-to-use tool to help researchers collect, organise, cite and share research. https://www.zotero.org
TensorFlow – an open-source platform for machine learning, commonly used for deep learning applications and large dataset analysis. https://www.tensorflow.org
PyTorch – an open-source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing. https://pytorch.org

About the author

Dr Raminderpal Singh

Dr Raminderpal Singh is a recognised visionary in the implementation of AI across technology and science-focused industries. He has over 30 years of global experience leading and advising teams, helping early to mid-stage companies achieve breakthroughs through the effective use of computational modelling. Raminderpal is currently the Global Head of AI and GenAI Practice at 20/15 Visioneers. He founded and leads the HitchhikersAI.org open-source community, and is a co-founder of Incubate Bio – a techbio that helps life sciences companies accelerate their research and lower their wet lab costs through in silico modelling.

Raminderpal has extensive experience building businesses in both Europe and the US. As a business executive at IBM Research in New York, Dr Singh led the go-to-market for IBM Watson Genomics Analytics. He was also Vice President and Head of the Microbiome Division at Eagle Genomics Ltd in Cambridge. Raminderpal earned his PhD in semiconductor modelling in 1997. He has published several papers and two books and has twelve issued patents. In 2003, he was selected by EE Times as one of the top 13 most influential people in the semiconductor industry.

For more: http://raminderpalsingh.com; http://20visioneers15.com; http://hitchhikersAI.org; http://incubate.bio

Nina Truter

Nina Truter is a translational scientist with a deep focus on understanding mechanisms of action in drug development and leveraging disparate datasets in biotech. Based in South Africa, she has worked extensively with international biotech companies, specialising in therapeutic development for ageing-related diseases and complex conditions such as glioblastoma and autosomal dominant polycystic kidney disease (ADPKD).

Her recent work includes consulting for UK-based biotech firms and leading initiatives in HitchhikersAI.org to advance the translation of AI and data science into practical biotech solutions such as identifying combination therapy opportunities and enhancing patient selection. In her work, she uses a systems approach to integrate insights from diverse datasets across in vitro, in vivo and human models to answer critical scientific questions, and translates biological mechanisms into models that are used by advanced analytical methods such as Pearlian causal inference.

For more: https://njtruter.wixsite.com/ninatruter

