Enhancement: drop invalid rows on validate with new param #1189
Merged
cosmicBboy
merged 38 commits into
unionai-oss:main
from
kykyi:feature/drop-invalid-rows
Jun 23, 2023
72b3eed Basic ArraySchema default for str series (kykyi)
40f851f Add parameterised test cases for various data types (kykyi)
e18aa6c Ensure column has a default (kykyi)
de6e211 Add some tests asserting Column.default works as expected (kykyi)
a9c8a40 Add tests asserting default causes an error when there is a dtype mis… (kykyi)
2e210fa Remove inplace=True hardcoding, add default as kwarg across various c… (kykyi)
b626cb8 Simplify Column tests to avoid using DataFrameSchema (kykyi)
212fdff Add test to raise error if inplace is False and default is non null (kykyi)
096afbb any -> Any (kykyi)
5acb3dd clean up PR (cosmicBboy)
8b709de remove codecov (cosmicBboy)
e66abbd xfail pyspark tests (cosmicBboy)
91e6250 Merge branch 'unionai-oss:main' into main (kykyi)
c2b6e6e Merge branch 'unionai-oss:main' into main (kykyi)
5905b19 Simplify drop_invalid into a kwarg for schema.validate(). (kykyi)
f86f279 Update docstrings (kykyi)
5efc041 Add a couple more test cases (kykyi)
87bce7c Re-raise error on drop_invalid false, move some logic into a private … (kykyi)
fa24980 Add drop_invalid for SeriesSchema (kykyi)
039fd1c Add drop_invalid to MultiIndex (kykyi)
7686b07 Small changes to fix mypy (kykyi)
478fc5e More mypy fixes (kykyi)
bf80ef2 Move run_checks_and_handle_errors into it's own method with core chec… (kykyi)
1458f6b Remove try/catch (kykyi)
b5de710 Move drop_logic into it's own method for array.py and container.py (kykyi)
5935b32 drop_invalid -> drop_invalid_data (kykyi)
0b2f6fb Remove main() block from test_schemas.py (kykyi)
3180d31 Fix typo (kykyi)
95c4413 Add test for ColumnBackend (kykyi)
2140396 Move drop_invalid from validation to schema init (kykyi)
0a304e9 Stylistic changes (kykyi)
39072ff Remove incorrect rescue logic in ColumnBackend (kykyi)
94394f9 Add draft docs (kykyi)
1f14cca Add functionality for drop_invalid on DataFrameModel schemas (kykyi)
abc0324 Standardise tests (kykyi)
75b3cc7 Update docs for DataFrameModel (kykyi)
9dfba4e Add docstrings (kykyi)
e721458 rename of `drop_invalid_rows`, exception handling, update docs (cosmicBboy)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,99 @@ | ||
.. currentmodule:: pandera | ||
|
||
.. _drop_invalid_rows: | ||
|
||
Dropping Invalid Rows | ||
===================== | ||
|
||
*New in version 0.16.0* | ||
|
||
To use the validation step to remove invalid data, pass | ||
``drop_invalid_rows=True`` to the schema object on creation. When ``schema.validate()`` | ||
runs and a data-level check fails, the rows that caused the failure are removed from the | ||
returned dataframe. | ||
|
||
``drop_invalid_rows`` prevents data-level schema errors from being raised and instead | ||
removes the rows that cause the failure. | ||
|
||
This functionality is available on ``DataFrameSchema``, ``SeriesSchema``, ``Column``, | ||
as well as ``DataFrameModel`` schemas. | ||
|
||
Dropping invalid rows with :class:`~pandera.api.pandas.container.DataFrameSchema`: | ||
|
||
.. testcode:: drop_invalid_rows_data_frame_schema | ||
|
||
    import pandas as pd | ||
    import pandera as pa | ||
|
||
    from pandera import Check, Column, DataFrameSchema | ||
|
||
    # Integer data, so only the data-level check (>= 3) can fail | ||
    df = pd.DataFrame({"counter": [1, 2, 3]}) | ||
    schema = DataFrameSchema( | ||
        {"counter": Column(int, checks=[Check(lambda x: x >= 3)])}, | ||
        drop_invalid_rows=True, | ||
    ) | ||
|
||
    # Rows failing the check are dropped instead of raising an error | ||
    schema.validate(df, lazy=True) | ||
|
||
Dropping invalid rows with :class:`~pandera.api.pandas.array.SeriesSchema`: | ||
|
||
.. testcode:: drop_invalid_rows_series_schema | ||
|
||
    import pandas as pd | ||
    import pandera as pa | ||
|
||
    from pandera import Check, SeriesSchema | ||
|
||
    # Integer data, so only the data-level check (>= 3) can fail | ||
    series = pd.Series([1, 2, 3]) | ||
    schema = SeriesSchema( | ||
        int, | ||
        checks=[Check(lambda x: x >= 3)], | ||
        drop_invalid_rows=True, | ||
    ) | ||
|
||
    schema.validate(series, lazy=True) | ||
|
||
Dropping invalid rows with :class:`~pandera.api.pandas.components.Column`: | ||
|
||
.. testcode:: drop_invalid_rows_column | ||
|
||
    import pandas as pd | ||
    import pandera as pa | ||
|
||
    from pandera import Check, Column | ||
|
||
    # Integer data, so only the data-level check (>= 3) can fail | ||
    df = pd.DataFrame({"counter": [1, 2, 3]}) | ||
    schema = Column( | ||
        int, | ||
        name="counter", | ||
        drop_invalid_rows=True, | ||
        checks=[Check(lambda x: x >= 3)], | ||
    ) | ||
|
||
    schema.validate(df, lazy=True) | ||
|
||
Dropping invalid rows with :class:`~pandera.api.pandas.model.DataFrameModel`: | ||
|
||
.. testcode:: drop_invalid_rows_data_frame_model | ||
|
||
    import pandas as pd | ||
    import pandera as pa | ||
|
||
    from pandera import Check, DataFrameModel, Field | ||
|
||
    class MySchema(DataFrameModel): | ||
        counter: int = Field(in_range={"min_value": 3, "max_value": 5}) | ||
|
||
        class Config: | ||
            drop_invalid_rows = True | ||
|
||
|
||
    MySchema.validate( | ||
        pd.DataFrame({"counter": [1, 2, 3, 4, 5, 6]}), lazy=True | ||
    ) | ||
|
||
.. note:: | ||
    To use ``drop_invalid_rows=True``, ``lazy=True`` must be passed to | ||
    ``schema.validate()``. :ref:`lazy_validation` collects all schema errors | ||
    so they can be raised together, which means all invalid rows can be dropped together. | ||
    This provides a clear API for ensuring the validated dataframe contains only valid data. |
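The reason lazy validation is required can be illustrated with a small, library-free sketch. It assumes rows are plain dictionaries and checks are per-value predicates; `validate_and_drop` and its arguments are hypothetical names chosen for illustration, not part of pandera, and the code is not how pandera implements the feature. The point is only that every check runs over every row before anything is removed, so all failing rows are known and can be dropped in one step.

```python
from typing import Callable, Dict, List, Set

Row = Dict[str, int]


def validate_and_drop(
    rows: List[Row], checks: Dict[str, Callable[[int], bool]]
) -> List[Row]:
    """Hypothetical helper: collect all failing row indices, then drop them together."""
    failed: Set[int] = set()
    # "Lazy" pass: do not stop at the first failure, record every failing index.
    for column, check in checks.items():
        for idx, row in enumerate(rows):
            if not check(row[column]):
                failed.add(idx)
    # Single drop step once every check has been evaluated.
    return [row for idx, row in enumerate(rows) if idx not in failed]


# Same data and inclusive 3..5 range as the DataFrameModel example above.
rows = [{"counter": c} for c in [1, 2, 3, 4, 5, 6]]
checks = {"counter": lambda v: 3 <= v <= 5}

valid = validate_and_drop(rows, checks)
print(valid)
```

An eager validator that stopped at the first failing value would only know about one bad row, which is why the two options are tied together in this sketch.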
@cosmicBboy this method could be moved into the parent class to remove the duplication, but I'm not sure this would be the right move. They are quite different implementations, and don't want to abstract it to the parent for some vain DRYness 😅
edit: I will move `drop_invalid_data` into the parent though