Part one: an introduction to data quality

In this four-part series, Dr Singh will discuss the challenges surrounding limited data quality, and some pragmatic solutions. This first article sets out the key attributes that define data quality and explains why they create a need for data scientists.

Early drug discovery involves conducting experiments, generating data, and testing scientific hypotheses. The success of scientists in early drug discovery depends heavily on the quality of the data generated. Two questions must therefore be considered: are we generating the right data from the right experiments, and are we applying the best methods for data analysis and interpretation?

Data quality (sometimes called data integrity) is a frequently used term. It matters at every step of a workflow, from data acquisition to data analysis, because poor data quality can severely compromise analyses that rely on machine learning (ML) and AI. The characteristics that define data quality are as follows:

Completeness
Consistency
Lack of bias
Accuracy

Data quality is a large topic across many industries, where data engineers specialise in “data ops” and build advanced test pipelines to detect issues automatically. In early drug discovery, the data pipelines tend to be more limited (with genomics data being the exception). Here, data quality problems often come down to:

a measurement issue from a lab machine - eg missing data in a csv file
a poorly designed experiment - eg too much variability in the data
poor data from non-controlled environments - eg observational data from human studies
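
The first two issues in the list above can be checked with very little code. The following is a minimal sketch, not a definitive pipeline: the file layout, the column names (`compound`, `signal`) and the values are all invented for illustration, and the coefficient of variation is only one of several reasonable ways to summarise variability between replicates.

```python
import csv
import io
import statistics

# Hypothetical instrument export: column names and values are invented for illustration.
SAMPLE_CSV = """well,compound,replicate,signal
A1,cmpd_1,1,0.98
A2,cmpd_1,2,1.02
A3,cmpd_1,3,
B1,cmpd_2,1,0.40
B2,cmpd_2,2,1.60
B3,cmpd_2,3,1.00
"""


def summarise_quality(csv_text, group_col="compound", value_col="signal"):
    """Count missing values and compute the coefficient of variation per group."""
    values_by_group = {}
    missing_by_group = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row[group_col]
        raw = (row[value_col] or "").strip()
        values_by_group.setdefault(key, [])
        missing_by_group.setdefault(key, 0)
        if raw == "":
            # An empty cell is treated as a missing measurement.
            missing_by_group[key] += 1
        else:
            values_by_group[key].append(float(raw))

    report = {}
    for key, values in values_by_group.items():
        cv = None
        # The sample standard deviation needs at least two values,
        # and the ratio is undefined when the mean is zero.
        if len(values) >= 2 and statistics.mean(values) != 0:
            cv = statistics.stdev(values) / statistics.mean(values)
        report[key] = {"n": len(values), "missing": missing_by_group[key], "cv": cv}
    return report


report = summarise_quality(SAMPLE_CSV)
for compound, stats in report.items():
    print(compound, stats)
```

In this invented example, one compound has a missing replicate and the other has replicates that vary widely around their mean. What counts as “too much” variability depends on the assay, so any threshold applied to these numbers would be a scientific judgement rather than a software one.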

Data-generation costs continue to fall as technology (both hardware and AI) becomes cheaper and more effective. Over the last couple of decades, this has led to ready access to large amounts of data - creating the paradigm of “big data”.

In this new paradigm, data quality issues multiply. An error that was previously easy to detect in a small dataset, saved as a csv file for example, can be much harder to detect in a large, complex dataset that may need to sit in a structured database. This creates a need for data scientists. Although we are not quite at the point of requiring data ops, that need may arrive soon.

“Data science” and “data scientist” are terms used across industries, and they mean different things to different people - primarily defined by the job to be done. For early drug discovery, we should focus on two skills: data engineering and scientific data understanding.

Data engineering is the ability to describe, summarise, format and clean data.
Scientific data understanding is the ability to explain the scientific rationale behind the dataset labels and the implication of, for example, missing data.
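
As an illustration of how the two skills meet, consider missing data. A blank cell and a value reported as below a detection limit look similar to a cleaning script, but they have different scientific implications. The sketch below makes that distinction explicit. The encodings it recognises (`NA`, `N/A`, `NaN`, a leading `<`) and the status labels are assumptions chosen for this example; a real dataset would need its own conventions confirmed with the scientists who generated it.

```python
from typing import NamedTuple, Optional


class Measurement(NamedTuple):
    value: Optional[float]  # numeric value, if one was recorded
    status: str             # "measured", "below_limit" or "not_recorded"


def parse_measurement(raw: str) -> Measurement:
    """Turn one raw cell into a value plus an explicit status label."""
    text = raw.strip()
    if text == "" or text.upper() in {"NA", "N/A", "NAN"}:
        # Nothing was recorded: the reason is unknown and needs follow-up.
        return Measurement(None, "not_recorded")
    if text.startswith("<"):
        # Assumed convention: "<0.1" means below a detection limit of 0.1.
        return Measurement(float(text[1:]), "below_limit")
    return Measurement(float(text), "measured")


# Hypothetical raw cells, as they might appear in an exported file.
raw_cells = ["1.25", " 0.80 ", "", "NA", "<0.1"]
parsed = [parse_measurement(cell) for cell in raw_cells]

# Summarise how many cells fall into each category.
counts = {}
for measurement in parsed:
    counts[measurement.status] = counts.get(measurement.status, 0) + 1
print(counts)
```

Keeping the status alongside the value means later analysis can treat the two kinds of missing data differently, rather than silently dropping or imputing both.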

A data scientist in early drug discovery needs to do both of the above competently, which requires a mix of software, science, and ML skills. The challenge for hiring managers is that academic courses tend to teach these pillars separately. The multi-disciplinary skills described above are often learnt in industry, which makes good data scientists difficult to find.

In the next article in this series, which will be published Monday 26 August, we will discuss the problems that can occur when data quality is compromised.

Furthermore, there are several industry and academic efforts to accelerate the sharing and use of data across the life sciences industry, which we will also discuss in a future article.