GitHub - mikulskibartosz/check-engine: Data validation library for PySpark 3.0.0

Summary

The goal of this project is to implement a data validation library for PySpark. The library should detect the incorrect structure of the data, unexpected values in columns, and anomalies in the data.

How to install

pip install checkengine==0.2.0

How to use

from checkengine.validate_df import ValidateSparkDataFrame

result = ValidateSparkDataFrame(spark_session, spark_data_frame) \
        .is_not_null("column_name") \
        .are_not_null(["column_name_2", "column_name_3"]) \
        .is_min("numeric_column", 10) \
        .is_max("numeric_column", 20) \
        .is_unique("column_name") \
        .are_unique(["column_name_2", "column_name_3"]) \
        .is_between("numeric_column_2", 10, 15) \
        .has_length_between("text_column", 0, 10) \
        .mean_column_value("numeric_column", 10, 20) \
        .median_column_value("numeric_column", 5, 15) \
        .text_matches_regex("text_column", "^[a-z]{3,10}$") \
        .one_of("text_column", ["value_a", "value_b"]) \
        .one_of("numeric_column", [123, 456]) \
        .execute()

result.correct_data #rows that passed the validation
result.erroneous_data #rows rejected during the validation
results.errors a summary of validation errors (three fields: column_name, constraint_name, number_of_errors)

How to build

Install the Poetry build tool.
Run the following commands:

cd check-engine-lib
poetry build

How to test locally

Run all tests

cd check-engine-lib
poetry run pytest tests/

Run a single test file

cd check-engine-lib
poetry run pytest tests/test_between_integer.py

Run a single test method

cd check-engine-lib
poetry run pytest tests/test_between_integer.py -k 'test_should_return_df_without_changes_if_all_are_between'

How to test in Docker

docker build -t check-engine-test check-engine-lib/. && docker run check-engine-test

Name		Name	Last commit message	Last commit date
Latest commit History 72 Commits
.github		.github
check-engine-lib		check-engine-lib
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Summary

How to install

How to use

How to build

How to test locally

Run all tests

Run a single test file

Run a single test method

How to test in Docker

About

Releases 1

Packages

Languages

License

mikulskibartosz/check-engine

Folders and files

Latest commit

History

Repository files navigation

Summary

How to install

How to use

How to build

How to test locally

Run all tests

Run a single test file

Run a single test method

How to test in Docker

About

Topics

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages