Linter on the model / data #572
Replies: 5 comments
-
What should a data linter do in general? I can think of many sensible things to check in practice when there is domain knowledge, but I wonder how far one can get linting a dataset on its own. Linting a dataset together with a model is probably easier, though.
-
From a quick look at the links in the description, I noticed sanity checks of the "is your datetime actually encoded as a datetime?" kind. The list doesn't have to be exhaustive, but there are a few checks we can start with, for example "this feature is constant" or "this feature is almost certainly always missing".
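To make the kinds of checks above concrete, here is a minimal from-scratch sketch of such generic, domain-agnostic data lints. It is not the API of any package discussed in this thread; the dataset representation (a plain dict of column name to list of values, with None meaning "missing") and all function names are invented for illustration.

```python
# Illustrative generic data lints: constant features, almost-always-missing
# features, and datetimes stored as strings.
from datetime import datetime

def lint_columns(data, missing_threshold=0.95):
    """Return (column, message) pairs for suspicious columns."""
    warnings = []
    for name, values in data.items():
        non_missing = [v for v in values if v is not None]
        # "this feature is constant"
        if len(set(non_missing)) <= 1:
            warnings.append((name, "constant (or empty) feature"))
        # "this feature is almost certainly always missing"
        missing_frac = 1 - len(non_missing) / len(values) if values else 0
        if missing_frac >= missing_threshold:
            warnings.append((name, "almost always missing"))
        # "is your datetime encoded as datetime?" -- string columns whose
        # values all parse as ISO dates probably should have been datetimes
        if non_missing and all(isinstance(v, str) for v in non_missing):
            try:
                for v in non_missing:
                    datetime.fromisoformat(v)
                warnings.append((name, "looks like a datetime stored as string"))
            except ValueError:
                pass
    return warnings

data = {
    "id": [1, 2, 3, 4],
    "flag": [0, 0, 0, 0],                # constant
    "note": [None, None, "x", "y"],      # some missing values, below threshold
    "legacy": [None, None, None, None],  # entirely missing
    "when": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
}
for column, message in lint_columns(data):
    print(f"{column}: {message}")
```

Real implementations would of course work on DataFrames / arrays and use statistical heuristics rather than exact checks, but the shape of the problem is the same: a list of cheap, generic predicates over columns.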
-
I had not seen this issue, but I agree it is a good starting point. I would add the following repo: https://github.com/ydataai/ydata-profiling. It only does data profiling, but it is useful for seeing what kinds of checks they implement. We also have nowadays
-
A summary of each package:

- ydata-profiling — Purpose: generate profiling reports on data in one line of code.
- skrub's TableReport — Purpose: generate a quick, self-contained report on a table in HTML.
- dslinter — Purpose: PyLint plugin for linting data science and machine learning code. Supports many libraries, including sklearn. Based on PyLint (and astroid for AST management).
- mllint — Purpose: evaluate the technical (not data science) quality of Python ML projects.
- data-linter — Purpose: identify potential issues (lints) in your ML training data. Unlike the other data linters, it does not seem to be based on publicly available research.
- Cleanlab Datalab — Purpose: automatically check for common types of real-world issues in your dataset.
- data_linter — Purpose: automatic validation of data as part of a data engineering pipeline, delivered as a Docker image. Oriented towards popular pandas checks; has some parquet checks.
-
Each tool has its own take on the analysis domain; they are all quite different. The business-backed projects (ydata-profiling, Datalab) seem to take a more holistic approach: they likely started from a small core and grew as customers requested features. In that regard, ydata-profiling differentiates itself with features around sensitive data, while Datalab takes a more general approach meant to help with automatic labelling. In other words, there is an underlying reason why these linters exist, and I believe it is driven by the market each company is addressing. Open-source projects, on the other hand, seem to focus either on coding best practices or on helpers to quickly explore data. The next steps would be:
-
A first step to help users do better modeling is to help them apply linters to their data and model.
In many cases this would require access to the user's code; in some cases it is possible with access to only the data and the model.
We should have a way to incorporate those linters, whether external or developed by us, using whatever information we have about the user's pipeline / code / data.
Some existing work:
This issue is to figure out how we can get to a starting point: apply some linters to our artifacts and/or user code, and have a baseline / infrastructure for writing more in-depth linters.
cc @glemaitre @ogrisel
@ogrisel also had other existing examples in mind I think.
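The "baseline / infra" idea above could be sketched as a registry of lint functions that each receive whatever artifacts are available (model, data, user code, ...) and yield findings. This is purely hypothetical; all names here (Finding, register_lint, run_lints, the example lint) are invented for illustration and are not an existing API.

```python
# Hypothetical sketch: pluggable lints over whatever artifacts we have.
from dataclasses import dataclass

@dataclass
class Finding:
    lint: str
    message: str

_LINTS = []

def register_lint(func):
    """Decorator that adds a check function to the global registry."""
    _LINTS.append(func)
    return func

@register_lint
def unfitted_model(model=None, **artifacts):
    # Example of a model-only lint. It relies on the scikit-learn convention
    # that fitted estimators expose learned attributes ending in "_".
    if model is None:
        return
    fitted_attrs = [k for k in vars(model)
                    if k.endswith("_") and not k.startswith("_")]
    if not fitted_attrs:
        yield Finding("unfitted_model", "model does not look fitted")

def run_lints(**artifacts):
    """Run every registered lint against the supplied artifacts."""
    findings = []
    for lint in _LINTS:
        findings.extend(lint(**artifacts))
    return findings

class NotFitted:  # stand-in for an estimator with no fitted attributes
    pass

print(run_lints(model=NotFitted()))
```

Because each lint declares which artifacts it needs via keyword arguments, the same runner can serve data-only, model-only, and combined checks, and external linters could be wrapped behind the same interface.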