Part three: 15 pragmatic guidelines to handle data quality issues
Posted: 13 September 2024 | Dr Raminderpal Singh (Hitchhikers AI and 20/15 Visioneers)
In this four-part series, Dr Raminderpal Singh will discuss the challenges surrounding limited data quality, and some pragmatic solutions. In this third article, he discusses pragmatic guidelines to help support better data quality.
In the first article in this series, published Wednesday 14 August, we discussed the role that data quality plays in the effectiveness of data analyses by Machine Learning (ML) and AI. In part two, we looked at the impact that poor data quality can have on ML models. Here, we outline 15 pragmatic guidelines to ensure better data quality in early drug discovery and to identify potential issues with compromised data.
- Data collection and entry
Standardisation is a key consideration: implement standard operating procedures (SOPs) for data collection and entry, including consistent use of units, naming conventions and data formats. The team involved in data collection should be trained on the importance of data quality and the specific protocols to follow in order to minimise errors.
- Data validation
Automated validation scripts should check for common issues such as missing values, duplicates, outliers and inconsistencies in units or formats. Furthermore, manual reviews of a subset of the data should be performed periodically to catch issues that automated checks might miss.
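The automated checks above can be sketched as a small validation routine. This is a minimal illustration, not a production validator: the column names (`compound`, `ic50`, `unit`), the expected-unit set and the IQR outlier rule are all assumptions chosen for the example.

```python
import pandas as pd

def validate_assay_data(df, expected_units={"uM"}):
    """Run basic automated checks; return a dict of issue counts (illustrative)."""
    issues = {}
    issues["missing_values"] = int(df.isna().sum().sum())
    issues["duplicate_rows"] = int(df.duplicated().sum())
    # Flag unexpected units (assumes a 'unit' column exists)
    issues["bad_units"] = int((~df["unit"].isin(expected_units)).sum())
    # Simple outlier screen on the measured value using the 1.5*IQR rule
    q1, q3 = df["ic50"].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (df["ic50"] < q1 - 1.5 * iqr) | (df["ic50"] > q3 + 1.5 * iqr)
    issues["outliers"] = int(mask.sum())
    return issues

df = pd.DataFrame({
    "compound": ["A", "B", "B", "C", "D"],
    "ic50": [0.5, 0.7, 0.7, None, 120.0],
    "unit": ["uM", "uM", "uM", "uM", "nM"],
})
report = validate_assay_data(df)
```

A script like this can run on every new data drop, with the resulting counts feeding the periodic manual review.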
- Data cleaning
A strategy should be developed for handling missing data, such as deciding when to use imputation, exclude data points, or flag datasets for further investigation. Also, methods should be implemented to identify and investigate outliers, determining whether they represent true variability or errors.
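A missing-data strategy of the kind described can be encoded as an explicit decision rule. The 20% missingness threshold and median imputation below are illustrative choices, not recommendations for any particular assay.

```python
from statistics import median

def impute_or_flag(values, max_missing_frac=0.2):
    """Impute missing readings with the median when few are missing;
    otherwise flag the series for manual investigation (thresholds illustrative)."""
    missing_frac = sum(v is None for v in values) / len(values)
    if missing_frac > max_missing_frac:
        return None, "flag_for_review"
    med = median(v for v in values if v is not None)
    return [med if v is None else v for v in values], "imputed"

filled, status = impute_or_flag([1.0, 2.0, None, 4.0, 3.0])
```

Making the rule explicit in code means the same decision is applied consistently, rather than case by case.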
- Data integration
It is essential to ensure that data from different sources or experiments are harmonised before integration. This includes reconciling different naming conventions, units and formats. Consistency and correctness can then be checked by cross-referencing the integrated datasets against their source records.
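Harmonisation before integration can be as simple as a shared normalisation function applied to every incoming record. The field names and the unit-conversion table below are assumptions for the sake of the sketch.

```python
# Hypothetical conversion table: harmonise all concentrations to micromolar
TO_MICROMOLAR = {"uM": 1.0, "nM": 0.001, "mM": 1000.0}

def harmonise_record(record):
    """Normalise naming and units before integration (field names are assumed)."""
    return {
        "compound": record["compound"].strip().upper(),
        "ic50_uM": record["value"] * TO_MICROMOLAR[record["unit"]],
    }

rec = harmonise_record({"compound": " abc-123 ", "value": 500.0, "unit": "nM"})
```

Routing every source through one such function guarantees that downstream code only ever sees one naming convention and one unit.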
- Data documentation
Researchers should maintain detailed metadata for each dataset, including information about the origin, collection method, and any preprocessing steps. This helps in tracking data provenance and understanding the context. Also, version control systems for datasets should be used to track changes and ensure that any modifications are well documented and reversible.
- Data monitoring
Data should be monitored continuously throughout the data lifecycle, recording quality metrics such as completeness, accuracy and consistency. Moreover, automated alerts can be set up to notify relevant personnel if data quality metrics fall below predefined thresholds.
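A threshold-based alert of the kind described can be sketched in a few lines. The metric names and threshold values here are illustrative placeholders; in practice they would be set per project.

```python
# Illustrative thresholds; real values would be agreed per project
THRESHOLDS = {"completeness": 0.95, "accuracy": 0.90, "consistency": 0.98}

def check_thresholds(metrics, thresholds=THRESHOLDS):
    """Return the metrics that fall below their thresholds, for alerting."""
    return {k: v for k, v in metrics.items() if v < thresholds.get(k, 0.0)}

alerts = check_thresholds({"completeness": 0.97, "accuracy": 0.85, "consistency": 0.99})
```

The returned dict can then be passed to whatever notification channel the team uses (email, Slack, a dashboard).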
- Data auditing
Conduct regular data audits to assess the overall quality of your datasets. This involves checking for adherence to data quality standards and identifying any systemic issues. Also, maintain audit trails that log all data processing steps, transformations, and any changes made to the data to ensure traceability and accountability.
- Bias and variability checks
Regularly assess your datasets for potential biases, such as over-representation of certain chemical scaffolds or biological targets. Use statistical techniques to quantify bias, and take corrective action where needed. Analyse the variability in your data, particularly in biological assays, to understand the level of noise and its impact on model performance.
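Both checks can be quantified simply: scaffold over-representation as each scaffold's share of the dataset, and assay noise as the coefficient of variation (CV) of replicate readings. The scaffold names below are invented for illustration.

```python
from collections import Counter
from statistics import mean, stdev

def scaffold_shares(scaffolds):
    """Fraction of the dataset contributed by each scaffold (to spot over-representation)."""
    counts = Counter(scaffolds)
    total = len(scaffolds)
    return {s: c / total for s, c in counts.items()}

def coefficient_of_variation(replicates):
    """CV of replicate assay readings; a common summary of assay noise."""
    return stdev(replicates) / mean(replicates)

shares = scaffold_shares(["pyridine"] * 3 + ["indole"])
cv = coefficient_of_variation([8.0, 10.0, 12.0])
```

A scaffold claiming a large share of the data, or a CV well above the assay's historical norm, is a prompt for corrective action.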
- Data redundancy and duplication checks
Implement robust mechanisms for detecting and removing duplicate records, to prevent skewing of the data. Techniques such as correlation analysis can then be used to identify and eliminate redundant features that do not contribute new information.
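The correlation-analysis step can be sketched with pandas. The feature names and the 0.95 correlation threshold are illustrative; the rule of keeping the first of each highly correlated pair is one simple convention among several.

```python
import pandas as pd

df = pd.DataFrame({
    "feat_a": [1.0, 2.0, 3.0, 4.0],
    "feat_b": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with feat_a
    "feat_c": [1.0, 0.0, 1.0, 0.0],
})

def redundant_features(df, threshold=0.95):
    """Return features whose absolute pairwise correlation exceeds the threshold."""
    corr = df.corr().abs()
    redundant = set()
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > threshold:
                redundant.add(b)  # keep the first of each correlated pair
    return redundant

drop = redundant_features(df)
```

Duplicate records, by contrast, are handled earlier in the pipeline (e.g. `df.drop_duplicates()` for exact duplicates) before any feature analysis.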
- Data imbalance handling
Continuously monitor the balance of different classes in your data (eg, active versus inactive compounds). Address imbalances through methods such as over-sampling, under-sampling, or synthetic data generation. If data imbalance is unavoidable, consider using model algorithms that are better suited to handle imbalanced data.
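Random over-sampling, the simplest of the rebalancing methods mentioned, can be sketched as follows. This is a naive illustration; dedicated libraries such as imbalanced-learn offer more principled resampling and synthetic-data methods.

```python
import random

def oversample_minority(records, label_key="active"):
    """Naively over-sample the minority class until the classes are balanced."""
    pos = [r for r in records if r[label_key]]
    neg = [r for r in records if not r[label_key]]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    rng = random.Random(0)  # seeded so the resampling is reproducible
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return records + extra

data = [{"active": True}] * 2 + [{"active": False}] * 8
balanced = oversample_minority(data)
```

Note that resampling should be applied only to training data, never to the held-out evaluation set, to avoid inflating performance estimates.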
- Data security and access control
Security measures should be implemented to protect data from corruption, loss or unauthorised access, ensuring that data integrity is maintained. In addition, access to data should be restricted based on roles and responsibilities, to prevent unauthorised modifications or data entry errors.
- Communication and collaboration
Foster collaboration between data scientists, domain experts and IT professionals to ensure that data quality requirements are clearly understood and addressed. A feedback loop should be established, whereby issues identified by data scientists or model results are communicated back to the experimental team to refine data collection processes.
- Use of quality control samples
Include quality control samples (eg, known standards or replicates) in experimental runs to monitor and ensure consistency in assay performance. Regularly analyse the results of quality control samples to identify any drifts or deviations in experimental conditions that could impact data quality.
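Drift in quality control samples can be detected with a simple Shewhart-style control check: flag any QC reading that falls outside a band around the baseline runs. The baseline size and the three-standard-deviation limit below are conventional but illustrative choices.

```python
from statistics import mean, stdev

def qc_drift_flags(control_values, n_baseline=5, z=3.0):
    """Flag QC readings beyond z standard deviations of the baseline runs
    (limits are illustrative; real control limits are assay-specific)."""
    baseline = control_values[:n_baseline]
    m, s = mean(baseline), stdev(baseline)
    return [abs(v - m) > z * s for v in control_values[n_baseline:]]

flags = qc_drift_flags([10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 12.5])
```

A flagged reading indicates a possible shift in experimental conditions and should trigger investigation before the accompanying assay data are accepted.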
- Data quality metrics and reporting
Define specific data quality metrics such as accuracy, completeness, consistency, timeliness, and uniqueness. Use these metrics to evaluate and report on the quality of your data regularly.
When reporting data quality metrics to stakeholders, ensure transparency to facilitate continuous improvement.
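Two of the metrics listed, completeness and uniqueness, can be computed directly from the records. The definitions below (fraction of required fields filled; fraction of distinct records) are illustrative, and the field names are invented for the example.

```python
def quality_metrics(records, required_fields):
    """Compute simple completeness and uniqueness scores (definitions illustrative)."""
    n = len(records)
    filled = sum(
        1 for r in records for f in required_fields if r.get(f) is not None
    )
    completeness = filled / (n * len(required_fields))
    unique = len({tuple(sorted(r.items())) for r in records})
    uniqueness = unique / n
    return {"completeness": completeness, "uniqueness": uniqueness}

metrics = quality_metrics(
    [{"id": 1, "value": 5}, {"id": 2, "value": None}, {"id": 1, "value": 5}],
    required_fields=["id", "value"],
)
```

Tracking such scores over time, rather than as one-off snapshots, is what makes the regular reporting to stakeholders meaningful.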
- Continuous improvement
When data quality issues are identified, perform a root cause analysis to understand the underlying reasons and implement corrective actions. You should treat data quality improvement as an iterative process, continuously refining and enhancing your strategies as new challenges and technologies emerge.
In the next article, which will be published Tuesday 24 September, we will present views from an industry veteran on the topic of data quality.
About the author
Dr Raminderpal Singh
Dr Raminderpal Singh is a recognised visionary in the implementation of AI across technology and science-focused industries. He has over 30 years of global experience leading and advising teams, helping early to mid-stage companies achieve breakthroughs through the effective use of computational modelling.
Raminderpal is currently the Global Head of AI and GenAI Practice at 20/15 Visioneers. He also founded and leads the HitchhikersAI.org open-source community. He is also a co-founder of Incubate Bio – a techbio providing a service to life sciences companies that are looking to accelerate their research and lower their wet lab costs through in silico modelling.
Raminderpal has extensive experience building businesses in both Europe and the US. As a business executive at IBM Research in New York, Dr Singh led the go-to-market for IBM Watson Genomics Analytics. He was also Vice President and Head of the Microbiome Division at Eagle Genomics Ltd, in Cambridge. Raminderpal earned his PhD in semiconductor modelling in 1997. He has published several papers and two books and has twelve issued patents. In 2003, he was selected by EE Times as one of the top 13 most influential people in the semiconductor industry.
For more: http://raminderpalsingh.com; http://20visioneers15.com; http://hitchhikersAI.org; http://incubate.bio