Linter on the model / data #572
Replies: 5 comments
-
What should a data linter do in general? I can think of many sensible things to check in practice when there is domain knowledge, but I wonder how far one can get linting a dataset on its own. Linting a dataset together with a model is probably easier, though.
-
From a quick look at the links in the description, I noticed sanity checks of the "is your datetime actually encoded as a datetime?" kind. The list doesn't have to be exhaustive, but there are a few checks we can start with, for example "this feature is constant" or "this feature is almost certainly always missing".
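To make the kinds of checks above concrete, here is a minimal from-scratch sketch of such generic, domain-agnostic data lints. It is not the API of any package discussed in this thread; the dataset representation (a plain dict of column name to list of values, with None meaning "missing") and all function names are invented for illustration.

```python
# Illustrative generic data lints: constant features, almost-always-missing
# features, and datetimes stored as strings.
from datetime import datetime

def lint_columns(data, missing_threshold=0.95):
    """Return (column, message) pairs for suspicious columns."""
    warnings = []
    for name, values in data.items():
        non_missing = [v for v in values if v is not None]
        # "this feature is constant"
        if len(set(non_missing)) <= 1:
            warnings.append((name, "constant (or empty) feature"))
        # "this feature is almost certainly always missing"
        missing_frac = 1 - len(non_missing) / len(values) if values else 0
        if missing_frac >= missing_threshold:
            warnings.append((name, "almost always missing"))
        # "is your datetime encoded as datetime?" -- string columns whose
        # values all parse as ISO dates probably should have been datetimes
        if non_missing and all(isinstance(v, str) for v in non_missing):
            try:
                for v in non_missing:
                    datetime.fromisoformat(v)
                warnings.append((name, "looks like a datetime stored as string"))
            except ValueError:
                pass
    return warnings

data = {
    "id": [1, 2, 3, 4],
    "flag": [0, 0, 0, 0],                # constant
    "note": [None, None, "x", "y"],      # some missing values, below threshold
    "legacy": [None, None, None, None],  # entirely missing
    "when": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
}
for column, message in lint_columns(data):
    print(f"{column}: {message}")
```

Real implementations would of course work on DataFrames / arrays and use statistical heuristics rather than exact checks, but the shape of the problem is the same: a list of cheap, generic predicates over columns.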
-
I had not seen this issue, but I agree it is a good starting point. I would add the following repo: https://github.com/ydataai/ydata-profiling. It only does data profiling, but it is useful for seeing what kinds of checks they implement. We also have nowadays
-
A summary of each package:

- ydata-profiling — Purpose: generate profiling reports on data in one line of code.
- skrub's TableReport — Purpose: generate a quick, self-contained report on a table in HTML.
- dslinter — Purpose: PyLint plugin for linting data science and machine learning code. Supports many libraries, including sklearn. Based on PyLint (and astroid for AST management).
- mllint — Purpose: evaluate the technical (not data science) quality of Python ML projects.
- data-linter — Purpose: identify potential issues (lints) in your ML training data. Unlike the other data linters, it does not seem to be based on publicly available research.
- Cleanlab Datalab — Purpose: automatically check for common types of real-world issues in your dataset.
- data_linter — Purpose: automatic validation of data as part of a data engineering pipeline, delivered as a Docker image. Oriented towards popular pandas checks; has some parquet checks.
-
Each tool has its own take on the analysis domain; they are all quite different. The business-backed projects (ydata-profiling, Datalab) seem to take a more holistic approach: they likely started from a small core and grew as customers requested features. In that regard, ydata-profiling differentiates itself with features around sensitive data, while Datalab takes a more general approach meant to help with automatic labelling. In other words, there is an underlying reason why these linters exist, and I believe it is driven by the market each company is addressing. Open-source projects, on the other hand, seem to focus either on coding best practices or on helpers to quickly explore data. The next steps would be:
-
A first step to help users do better modeling is to help them apply linters to their data and model.
In many cases this would require access to the user's code; in some cases it is possible with access to only the data and the model.
We should have a way to incorporate those linters, whether external or developed by us, using whatever information we have about the user's pipeline / code / data.
Some existing work:
This issue is to figure out how we can get to a starting point: apply some linters to our artifacts and/or user code, and have a baseline / infrastructure for writing more in-depth linters.
cc @glemaitre @ogrisel
@ogrisel also had other existing examples in mind I think.
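The "baseline / infra" idea above could be sketched as a registry of lint functions that each receive whatever artifacts are available (model, data, user code, ...) and yield findings. This is purely hypothetical; all names here (Finding, register_lint, run_lints, the example lint) are invented for illustration and are not an existing API.

```python
# Hypothetical sketch: pluggable lints over whatever artifacts we have.
from dataclasses import dataclass

@dataclass
class Finding:
    lint: str
    message: str

_LINTS = []

def register_lint(func):
    """Decorator that adds a check function to the global registry."""
    _LINTS.append(func)
    return func

@register_lint
def unfitted_model(model=None, **artifacts):
    # Example of a model-only lint. It relies on the scikit-learn convention
    # that fitted estimators expose learned attributes ending in "_".
    if model is None:
        return
    fitted_attrs = [k for k in vars(model)
                    if k.endswith("_") and not k.startswith("_")]
    if not fitted_attrs:
        yield Finding("unfitted_model", "model does not look fitted")

def run_lints(**artifacts):
    """Run every registered lint against the supplied artifacts."""
    findings = []
    for lint in _LINTS:
        findings.extend(lint(**artifacts))
    return findings

class NotFitted:  # stand-in for an estimator with no fitted attributes
    pass

print(run_lints(model=NotFitted()))
```

Because each lint declares which artifacts it needs via keyword arguments, the same runner can serve data-only, model-only, and combined checks, and external linters could be wrapped behind the same interface.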