Cleanlab's open-source library is the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Updated Sep 5, 2025 - Python
Refine high-quality datasets and visual AI models
A Doctor for your data
fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data operation costs, all with unmatched scalability.
Scalable data preprocessing and curation toolkit for LLMs
[ICLR 2025] Official implementation of the paper "Improving Data Efficiency via Curating LLM-Driven Rating Systems"
Metamapper is a data discovery and documentation platform for improving how teams understand and interact with their data.
A library for detecting problematic data segments in structured and unstructured data with a few lines of code.
Learn2Clean: Optimizing the Sequence of Tasks for Data Preparation and Cleaning
A tool for downloading from public image boards (which allow scraping) / preview your images & tags / edit your images & tags. Additional tabs for downloading other desired code repositories as well as S.O.T.A. diffusion and auto-tag/caption models for your purposes. Custom datasets can be added!
🧼🔎 A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates and label errors (NeurIPS'24).
Client interface to Cleanlab Studio
Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation (EMNLP 2023)
Curation of BIDS (CuBIDS): A sanity-preserving software package for processing BIDS datasets.
AqSolDB: A curated aqueous solubility dataset containing 9,982 unique compounds.
Rebalancing chemical reactions
Data Cleaning and Data Profiling Library for Python
Reaction data exploration: a map of reagents with regions of similar reagent purpose.
tranSMART Arborist ETL toolkit
HISDAC-ES: Creating historical settlement data for Spain (1900-2020) based on cadastral building footprint data