# DataLint

**Automated data validation for ML teams.** Find data quality issues before they break your models.

DataLint learns from clean datasets to automatically validate new data and prevent ML training failures, catching the data quality issues that cause an estimated 60% of ML project failures before they ever reach training.
## Features

| Feature | Description |
|---|---|
| Zero Configuration | Works out of the box with sensible defaults |
| ML-Focused | Optimized specifically for model training data quality |
| Learn from Data | Automatically generates validation rules from clean datasets |
| Schema Drift Detection | Catches when production data differs from training data |
| CI/CD Ready | JSON output for integration with automated pipelines |
## Installation

```bash
pip install datalint
```

Requirements: Python 3.8+
## Quick Start

```bash
datalint validate mydata.csv
```

Output:

```text
Loaded dataset: 150 rows x 5 columns
missing_values: No missing values found
data_types: Data types appear consistent
outliers: Outlier levels appear normal
correlations: Found 1 highly correlated feature pairs
constant_columns: Found 1 columns with constant values

Summary: 3 passed, 1 warnings, 1 failed
```

> **Tip:** Address failed checks before training ML models.
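Each line of that report comes from a per-column statistic. As an illustration of the kind of computation involved, here is a minimal pandas sketch of a missing-values check; the function name and the 30% threshold are assumptions for the example, not DataLint's actual code:

```python
# Illustrative sketch of a missing-values check; DataLint's real
# implementation may differ. The 30% threshold is an assumption.
import pandas as pd

def check_missing_values(df: pd.DataFrame, threshold: float = 0.30) -> dict:
    """Flag columns whose fraction of null values exceeds `threshold`."""
    null_fraction = df.isna().mean()  # per-column fraction of nulls
    flagged = null_fraction[null_fraction > threshold]
    return {
        "status": "failed" if not flagged.empty else "passed",
        "issues": {col: round(frac, 2) for col, frac in flagged.items()},
    }

df = pd.DataFrame({
    "age": [25, None, None, 40, None, 31, None],
    "city": ["NYC", "LA", "SF", "NYC", "LA", "SF", "NYC"],
})
result = check_missing_values(df)  # flags 'age' (4 of 7 values missing)
```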
## Learning Mode

```bash
# Create a validation profile from your training data
datalint profile training_data.csv --learn

# Validate new data against the learned profile
datalint profile new_data.csv --profile training_data_profile.json
```

For CI/CD pipelines, emit machine-readable results:

```bash
datalint validate data.csv --format json --output results.json
```

## Validation Checks

DataLint performs five core validation checks:
### missing_values

Identifies columns with excessive null values that will crash or degrade ML models.

```text
# Example: 43% missing values in 'age' column
# Recommendation: Impute or remove before training
```

### data_types

Detects mixed types (e.g., numbers and text in the same column) that cause parsing errors.

```text
# Example: price column has [10.99, 25.50, 'FREE', 15.00]
# Recommendation: Convert to consistent type
```

### outliers

Uses the IQR (Interquartile Range) method to find statistical anomalies that can dominate model training.

```text
# Example: salary column has values [50k, 55k, 48k, 5M]
# Recommendation: Investigate or cap extreme values
```

### correlations

Finds feature pairs with >95% correlation that provide redundant information.

```text
# Example: height_cm and height_inches are 100% correlated
# Recommendation: Remove one redundant feature
```

### constant_columns

Detects columns with zero variance that provide no predictive information.

```text
# Example: 'country' column is 'USA' for all rows
# Recommendation: Remove before training
```

## Comparison

| Feature | DataLint | Great Expectations | Pandera | Deequ |
|---|---|---|---|---|
| Zero config | Yes | No (YAML required) | No (schema required) | No |
| Auto-learn rules | Yes | No | No | Partial |
| ML-focused | Yes | General | General | General |
| Setup time | 5 minutes | Hours/Days | Hours | Hours |
| Pricing | Free | Free | Free | Free (AWS) |
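The `outliers` check described above relies on the IQR (interquartile range) method. A minimal pandas sketch of that statistic, using the conventional 1.5x IQR fences (DataLint's exact multiplier is not documented here):

```python
# Illustrative IQR outlier detection, as used by the `outliers` check.
# The 1.5x multiplier is the conventional default, assumed for this sketch.
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

salaries = pd.Series([50_000, 55_000, 48_000, 5_000_000])
outliers = iqr_outliers(salaries)  # flags the 5M salary
```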
## Project Structure

```text
datalint/
├── cli.py              # Command-line interface
├── engine/
│   ├── validators.py   # Core validation checks
│   ├── learner.py      # Rule learning from clean data
│   └── profiler.py     # Statistical profiling
└── utils/
    ├── io.py           # File loading (CSV, Excel, Parquet)
    └── reporting.py    # Output formatter (text, JSON, HTML)
```
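The checks live in `engine/validators.py`. Based on the class names in the diagrams that follow (`BaseValidator`, `ValidationResult`, `ValidationRunner`), a custom validator plugged into the runner could look roughly like this; it is a sketch of the documented interface, not the shipped implementation:

```python
# Sketch of the BaseValidator / ValidationRunner contract from the class
# diagrams. Names mirror the diagrams; real signatures may differ.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

import pandas as pd

@dataclass
class ValidationResult:
    name: str
    status: str          # 'passed' | 'warning' | 'failed'
    message: str = ""
    issues: list = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return self.status == "passed"

class BaseValidator(ABC):
    name: str

    @abstractmethod
    def validate(self, df: pd.DataFrame) -> ValidationResult: ...

class ConstantColumnValidator(BaseValidator):
    """Example concrete validator: flags zero-variance columns."""
    name = "constant_columns"

    def validate(self, df: pd.DataFrame) -> ValidationResult:
        constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
        status = "failed" if constant else "passed"
        return ValidationResult(self.name, status, issues=constant)

class ValidationRunner:
    def __init__(self, validators=None):
        self.validators = list(validators or [])

    def add_validator(self, validator: BaseValidator) -> None:
        self.validators.append(validator)

    def run(self, df: pd.DataFrame) -> list:
        return [v.validate(df) for v in self.validators]

df = pd.DataFrame({"country": ["USA"] * 3, "age": [21, 34, 45]})
runner = ValidationRunner([ConstantColumnValidator()])
results = runner.run(df)  # 'country' is constant, so the check fails
```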
## Architecture

### Class Diagram

Shows the class hierarchy and relationships.

```mermaid
classDiagram
    class BaseValidator {
        <<abstract>>
        +String name*
        +ValidationResult validate(DataFrame df)*
        +String __repr__()
    }
    class Formatter {
        <<abstract>>
        +String format(List~ValidationResult~ results)*
    }
    class ValidationResult {
        +String name
        +String status
        +String message
        +List issues
        +List recommendations
        +Dict details
        +Boolean passed
        +Dict to_dict()
    }
    class ValidationRunner {
        -List~BaseValidator~ validators
        +ValidationRunner(List~BaseValidator~ validators)
        +void add_validator(BaseValidator validator)
        +List~ValidationResult~ run(DataFrame df)
        +Dict~String,ValidationResult~ run_dict(DataFrame df)
    }
    class ConcreteValidator {
        +String name
        +ValidationResult validate(DataFrame df)
    }
    class ConcreteFormatter {
        +String format(List~ValidationResult~ results)
    }
    BaseValidator <|.. ConcreteValidator : implements
    Formatter <|.. ConcreteFormatter : implements
    ValidationRunner --> BaseValidator : uses
    BaseValidator --> ValidationResult : returns
    ConcreteValidator --> ValidationResult : returns
```
### Interface Contracts

Shows key interfaces and abstraction contracts.

```mermaid
classDiagram
    class BaseValidator {
        <<abstract>>
        +name: str*
        +validate(df: DataFrame) ValidationResult*
    }
    class Formatter {
        <<abstract>>
        +format(results: List~ValidationResult~) str*
    }
    class ValidationResult {
        +name: str
        +status: str
        +message: str
        +issues: List
        +recommendations: List
        +details: Dict
        +passed: bool
        +to_dict() Dict
    }
    class ValidationRunner {
        -validators: List~BaseValidator~
        +__init__(validators=None)
        +add_validator(validator: BaseValidator)
        +run(df: DataFrame) List~ValidationResult~
        +run_dict(df: DataFrame) Dict~str,ValidationResult~
    }
    BaseValidator <|.. ConcreteValidator : implements
    Formatter <|.. ConcreteFormatter : implements
    ValidationRunner --> BaseValidator : uses
    BaseValidator --> ValidationResult : returns
```

The `status` field takes one of three literal values: `'passed'`, `'warning'`, or `'failed'`.
### Component Diagram

Illustrates high-level software components.

```mermaid
graph TD
    CLI[Command Line Interface]
    ENG[Core Validation Engine]
    UTI[Utility Functions]
    CLI --> ENG
    CLI --> UTI
    ENG --> UTI
```
### Deployment Diagram

Shows how the system is deployed.

```mermaid
graph TD
    subgraph Local[Local Machine]
        Python[Python Environment]
        DataLint[DataLint Package]
    end
    Data[Data Files]
    Reports[Output Reports]
    DataLint --> Data
    DataLint --> Reports
    Python --> DataLint
```
### Sequence Diagram

Displays the validation workflow sequence.

```mermaid
sequenceDiagram
    participant U as User
    participant C as CLI
    participant V as ValidationRunner
    participant B as BaseValidator
    participant D as DataFrame
    U->>C: datalint validate file.csv
    C->>V: run(df)
    loop for each validator
        V->>B: validate(df)
        B->>D: analyze data
        D-->>B: return analysis
        B-->>V: ValidationResult
    end
    V-->>C: results list
    C-->>U: formatted output
```
### Validation Pipeline

Shows the validation pipeline activities.

```mermaid
flowchart TD
    Start([Start])
    Run[User runs datalint validate]
    Parse[Parse command line arguments]
    Load[Load data file]
    Check{File loaded successfully?}
    Init[Initialize ValidationRunner]
    Validate[Run all validators]
    CheckResult{Validation passed?}
    Success[Generate success report]
    Fail[Generate failure report]
    Recomm[Show recommendations]
    Error[Show error message]
    Exit([Exit])
    Start --> Run
    Run --> Parse
    Parse --> Load
    Load --> Check
    Check -->|Yes| Init
    Init --> Validate
    Validate --> CheckResult
    CheckResult -->|Yes| Success
    CheckResult -->|No| Fail
    Fail --> Recomm
    Success --> Exit
    Recomm --> Exit
    Check -->|No| Error
    Error --> Exit
```
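In a CI pipeline, the JSON output (`--format json`) can gate the build on this pass/fail outcome. A sketch, assuming `results.json` contains a list of `ValidationResult.to_dict()`-style entries; verify the real schema before relying on it:

```python
# Hypothetical CI gate over DataLint's JSON output. The schema below
# (a list of ValidationResult-style dicts) is an assumption based on
# ValidationResult.to_dict(); check the actual output format first.
import json

sample = """[
  {"name": "missing_values", "status": "passed"},
  {"name": "outliers", "status": "warning"},
  {"name": "constant_columns", "status": "failed"}
]"""  # stand-in for the contents of results.json

results = json.loads(sample)
failed = [r["name"] for r in results if r["status"] == "failed"]
exit_code = 1 if failed else 0  # a non-zero exit code fails the CI job
```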
### Use Case Diagram

Illustrates user interactions with the system.

```mermaid
flowchart LR
    DS([Data Scientist])
    MLE([ML Engineer])
    DevOps([DevOps Engineer])
    UC1[Validate Dataset]
    UC2[Learn from Clean Data]
    UC3[Profile Data Quality]
    UC4[Generate Reports]
    UC5[CI/CD Integration]
    DS --> UC1
    DS --> UC2
    MLE --> UC3
    DevOps --> UC5
    UC1 --> UC4
    UC2 --> UC4
    UC3 --> UC4
```
## Roadmap

- Phase 1: Core validation engine with CLI
- Phase 2: Learning system (auto-generate rules from clean data)
- Phase 3: HTML reports + GitHub Actions integration
- Phase 4: Web dashboard + team collaboration
## Contributing

DataLint is in active development. We welcome contributions:
- Bug Reports: Open an issue with reproduction steps
- Feature Requests: Describe your use case
- Pull Requests: See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines
- Feedback: Share your experience using DataLint
## License

MIT License - see [LICENSE](LICENSE) for details.
DataLint - Because good models start with good data.
