Part two: the impact of poor data quality

In this four-part series, Dr Raminderpal Singh will discuss the challenges surrounding limited data quality, and some pragmatic solutions. In this second article, he discusses the problems that occur when using data of poor quality.

In the first article in this series, published on Wednesday 14th August, we discussed the role that data quality plays in the effectiveness of data analyses by Machine Learning (ML) and AI. The characteristics that define data quality were listed as:

  • Completeness
  • Consistency
  • Lack of bias
  • Accuracy

To understand the impact of these characteristics, one needs to appreciate how sensitive algorithms are to them, as well as the importance of choosing the right algorithm for the analysis being planned. These are complex topics that require detailed explanation and will be discussed in future articles.

Below are some of the key problems that poor data quality creates:

  • Incomplete data can cause ML models to miss patterns or relationships. For example, if crucial bioactivity data for certain compounds is missing, the model may not fully account for the structure-activity relationships, leading to inaccurate predictions.
  • Inconsistencies, such as variations in how data is recorded (eg, different units of measurement or naming conventions), can confuse models and lead to erroneous predictions. For instance, if the same compound is labelled differently in different datasets, the model might treat it as several distinct entities, skewing the results.
  • Bias in data can lead to models that are not generalisable or that perform poorly on certain subsets of data. For instance, if your training data is biased toward a particular chemical scaffold or a specific set of biological targets, the model might be less effective in predicting the activity of compounds outside these categories.
  • Noise in data, which may arise from experimental errors, variability in biological assays, or inconsistent conditions, can obscure true signals and reduce the model’s ability to learn relevant patterns. This can result in a higher rate of false positives or negatives.
  • Duplicate records can distort the training process by giving undue weight to certain data points, leading to overfitting and a lack of generalisation in the model.
  • In cases where the data is highly imbalanced, such as having many more inactive compounds than active ones, models can become biased towards predicting the majority class. This might lead to poor performance in identifying active compounds.
  • Redundant features or data points can inflate the dimensionality of the data without adding new information, leading to overfitting and reduced model performance.
  • Scientific reproducibility is compromised when data quality is poor, as other researchers or systems may be unable to replicate the findings.
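
Several of the problems above can be detected with simple counts before any model is trained. The following is a minimal sketch in Python, not a definitive method: the records, the field names (`compound`, `ic50_nm`, `active`) and the helper functions `normalise_name` and `audit` are hypothetical and invented for illustration. It checks for missing values, duplicate records, inconsistent naming and class imbalance.

```python
from collections import Counter

# Hypothetical toy records: field names and values are invented for illustration.
records = [
    {"compound": "Aspirin", "ic50_nm": 1200.0, "active": 0},
    {"compound": "aspirin ", "ic50_nm": 1200.0, "active": 0},  # same compound, different label
    {"compound": "CMPD-002", "ic50_nm": None, "active": 0},    # missing bioactivity value
    {"compound": "CMPD-003", "ic50_nm": 35.0, "active": 1},
    {"compound": "CMPD-004", "ic50_nm": 8700.0, "active": 0},
]

FIELDS = ("compound", "ic50_nm", "active")


def normalise_name(name):
    """Strip whitespace and lower-case a label so trivial variants match."""
    return name.strip().lower()


def audit(rows, fields=FIELDS, name_key="compound", label_key="active"):
    """Count missing values, duplicate rows, label variants and class balance."""
    # Completeness: how many rows lack a value for each field.
    missing = {f: sum(1 for r in rows if r.get(f) is None) for f in fields}

    # Duplicates: rows that are identical once the name is normalised.
    def row_key(r):
        return tuple(
            normalise_name(r[f]) if f == name_key and r.get(f) is not None else r.get(f)
            for f in fields
        )

    key_counts = Counter(row_key(r) for r in rows)
    duplicates = sum(count - 1 for count in key_counts.values())

    # Consistency: raw labels that collapse to the same normalised name.
    raw_names = {r[name_key] for r in rows if r.get(name_key) is not None}
    name_variants = len(raw_names) - len({normalise_name(n) for n in raw_names})

    # Imbalance: share of the rarest class among labelled rows.
    class_counts = Counter(r[label_key] for r in rows if r.get(label_key) is not None)
    total = sum(class_counts.values())
    minority_share = min(class_counts.values()) / total if total else None

    return {
        "missing": missing,
        "duplicates": duplicates,
        "name_variants": name_variants,
        "minority_share": minority_share,
    }


report = audit(records)
for check, result in report.items():
    print(f"{check}: {result}")
```

On the toy records above, the report flags one missing IC50 value, one duplicate row, one naming variant and a minority-class share of 0.2. Checks like these do not measure noise or bias in the underlying experiments, which usually need domain knowledge to assess.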

In the next article in this series, which will be published on Friday 13th September, we will discuss pragmatic guidelines to help support better data quality and to discover whether your data quality is compromised.

About the author

Dr Raminderpal Singh

Dr Raminderpal Singh is a recognised visionary in the implementation of AI across technology and science-focused industries. He has over 30 years of global experience leading and advising teams, helping early to mid-stage companies achieve breakthroughs through the effective use of computational modelling.

Raminderpal is currently the Global Head of AI and GenAI Practice at 20/15 Visioneers. He founded and leads the HitchhikersAI.org open-source community and is a co-founder of Incubate Bio, a techbio providing a service to life sciences companies that are looking to accelerate their research and lower their wet lab costs through in silico modelling.

Raminderpal has extensive experience building businesses in both Europe and the US. As a business executive at IBM Research in New York, Dr Singh led the go-to-market for IBM Watson Genomics Analytics. He was also Vice President and Head of the Microbiome Division at Eagle Genomics Ltd, in Cambridge. Raminderpal earned his PhD in semiconductor modelling in 1997. He has published several papers and two books and has twelve issued patents. In 2003, he was selected by EE Times as one of the top 13 most influential people in the semiconductor industry.

For more: http://raminderpalsingh.com; http://20visioneers15.com; http://hitchhikersAI.org; http://incubate.bio 
