Support native PySpark.sql on Pandera #1213

Merged Jun 9, 2023 (306 commits)
8a5fdcb
fixing check flow
NeerajMalhotra-QB Mar 29, 2023
e35fdd9
setting apply fn
NeerajMalhotra-QB Mar 29, 2023
74bddc5
add sub sample functionality
NeerajMalhotra-QB Mar 29, 2023
3e7d39f
adjusting test case against common attributes
NeerajMalhotra-QB Mar 29, 2023
2c04cd8
need apply for column level check
NeerajMalhotra-QB Mar 29, 2023
da613ed
adding builtin checks for pyspark
NeerajMalhotra-QB Mar 29, 2023
d14faac
adding checks for pyspark df
NeerajMalhotra-QB Mar 29, 2023
336455b
getting check registered
NeerajMalhotra-QB Mar 29, 2023
c9b3f01
fixing a bug in error handling for schema check
NeerajMalhotra-QB Mar 30, 2023
2bced40
check_name validation fixed
NeerajMalhotra-QB Mar 30, 2023
95dfeae
implementing dtype checks for pyspark
NeerajMalhotra-QB Mar 30, 2023
2902956
updating error msg
NeerajMalhotra-QB Mar 30, 2023
63f0c3c
fixing dtype reason_code
NeerajMalhotra-QB Mar 30, 2023
bbcece6
updating builtin checks for pyspark
NeerajMalhotra-QB Mar 30, 2023
027e060
registration
NeerajMalhotra-QB Mar 30, 2023
c5d58bb
Merge pull request #11 from NeerajMalhotra-QB/pyspark_schema
NeerajMalhotra-QB Mar 30, 2023
99446e9
Implementation of checks import and spark columns information check
jaskaransinghsidana Mar 31, 2023
337d874
Merge pull request #12 from NeerajMalhotra-QB/feature_pyspark_backend
jaskaransinghsidana Mar 31, 2023
a644b7e
enhancing __call__, checks classes and builtin_checks
NeerajMalhotra-QB Mar 31, 2023
79eeb9f
Merge pull request #13 from NeerajMalhotra-QB/pyspark_builtin_checks
NeerajMalhotra-QB Mar 31, 2023
bd86409
delete junk files
NeerajMalhotra-QB Mar 31, 2023
60c1313
Merge branch 'pyspark_builtin_checks' into develop
NeerajMalhotra-QB Mar 31, 2023
54f8b32
Changes to fix the implementation of checks. Changed Apply function to …
jaskaransinghsidana Apr 3, 2023
f41f518
Merge pull request #14 from NeerajMalhotra-QB/feature_pyspark_backend
jaskaransinghsidana Apr 3, 2023
d5b5304
extending pyspark checks
NeerajMalhotra-QB Apr 3, 2023
288eed7
Merge pull request #15 from NeerajMalhotra-QB/ps_err
NeerajMalhotra-QB Apr 3, 2023
993c968
Fixed builtin check bug and added test for supported builtin checks f…
jaskaransinghsidana Apr 4, 2023
b17e150
Merge pull request #16 from NeerajMalhotra-QB/feature_pyspark_backend
jaskaransinghsidana Apr 4, 2023
b291aa0
add todos
NeerajMalhotra-QB Apr 5, 2023
a7c9f99
by default validate all checks
NeerajMalhotra-QB Apr 5, 2023
4e554db
Merge pull request #17 from NeerajMalhotra-QB/ps_lazy_checks
NeerajMalhotra-QB Apr 5, 2023
cee1ce6
fixing issue with sqlctx
NeerajMalhotra-QB Apr 5, 2023
0fd59cb
Merge pull request #18 from NeerajMalhotra-QB/ps_fix_err_func
NeerajMalhotra-QB Apr 5, 2023
811d153
add dtypes pytests
NeerajMalhotra-QB Apr 5, 2023
f9a34da
setting up schema
NeerajMalhotra-QB Apr 5, 2023
9c419e8
add negative and positive tests
NeerajMalhotra-QB Apr 5, 2023
12f9c6f
add fixtures and refactor tests
NeerajMalhotra-QB Apr 5, 2023
c806e3d
generalize spark_df func
NeerajMalhotra-QB Apr 5, 2023
149d0cc
refactor to use conftest
NeerajMalhotra-QB Apr 5, 2023
e7a99b7
use conftest
NeerajMalhotra-QB Apr 5, 2023
e06ba3c
Merge pull request #19 from NeerajMalhotra-QB/ps_dtyp_checks
NeerajMalhotra-QB Apr 5, 2023
dc7cb7a
Merge branch 'develop' into ps_check_ref
NeerajMalhotra-QB Apr 5, 2023
460e7b8
Merge pull request #20 from NeerajMalhotra-QB/ps_check_ref
NeerajMalhotra-QB Apr 5, 2023
dde0e50
add support for decimal dtype and fixing other types
NeerajMalhotra-QB Apr 5, 2023
9f9a7e1
Merge pull request #21 from NeerajMalhotra-QB/ps_dtypes
NeerajMalhotra-QB Apr 5, 2023
dded3c7
Added new Datatypes support for pyspark, test cases for dtypes pyspar…
jaskaransinghsidana Apr 7, 2023
da7c2dd
Merge pull request #22 from NeerajMalhotra-QB/feature_pyspark_backend
jaskaransinghsidana Apr 7, 2023
89b00b5
Merge branch 'develop' into ps_err_summarization
NeerajMalhotra-QB Apr 7, 2023
5027364
refactor ArraySchema
NeerajMalhotra-QB Apr 7, 2023
5223367
Merge pull request #23 from NeerajMalhotra-QB/column_cls
NeerajMalhotra-QB Apr 7, 2023
1c5802d
Merge branch 'develop' into ps_err_summarization
NeerajMalhotra-QB Apr 7, 2023
9b1f7af
rename array to column.py
NeerajMalhotra-QB Apr 7, 2023
ba42f69
Merge pull request #24 from NeerajMalhotra-QB/ps_array
NeerajMalhotra-QB Apr 7, 2023
0394f24
Merge branch 'develop' into ps_err_summarization
NeerajMalhotra-QB Apr 7, 2023
11eddc3
1) Changes in test cases to look for summarised error raise instead o…
jaskaransinghsidana Apr 10, 2023
00b92d0
Merge pull request #25 from NeerajMalhotra-QB/feature_pyspark_backend
jaskaransinghsidana Apr 10, 2023
a6a871d
add neg test
NeerajMalhotra-QB Apr 10, 2023
7fe40bf
add custom ErrorHandler
NeerajMalhotra-QB Apr 10, 2023
168468e
Added functionality to DayTimeIntervalType datatype to accept parameters
jaskaransinghsidana Apr 11, 2023
337f921
Added functionality to DayTimeIntervalType datatype to accept parameters
jaskaransinghsidana Apr 11, 2023
2e33969
return summarized error report
NeerajMalhotra-QB Apr 12, 2023
37ebdbd
replace dataframe to dict for return obj
NeerajMalhotra-QB Apr 12, 2023
1354ecf
Changed checks input datatype to custom named tuple from the existing…
jaskaransinghsidana Apr 12, 2023
e293295
Merge remote-tracking branch 'origin/feature_pyspark_backend' into fe…
jaskaransinghsidana Apr 12, 2023
7f3d21a
refactor
NeerajMalhotra-QB Apr 12, 2023
b69621f
Merge pull request #26 from NeerajMalhotra-QB/feature_pyspark_backend
jaskaransinghsidana Apr 12, 2023
3859c7c
introduce error categories
NeerajMalhotra-QB Apr 12, 2023
1d4fe55
rename error categories
NeerajMalhotra-QB Apr 12, 2023
b3890d5
fixing bug in schema.dtype.check
NeerajMalhotra-QB Apr 12, 2023
6656cad
fixing error category to be dynamic
NeerajMalhotra-QB Apr 12, 2023
5e133f4
Added checks for each datatype in test cases. Reduced the code redund…
jaskaransinghsidana Apr 13, 2023
d172f27
error_handler pass through
NeerajMalhotra-QB Apr 13, 2023
9ae55f1
Merge pull request #27 from NeerajMalhotra-QB/feature_pyspark_checks
jaskaransinghsidana Apr 13, 2023
3b7f133
add ErrorHandler to column api
NeerajMalhotra-QB Apr 13, 2023
1de9c33
removed SchemaErrors since we now aggregate in errorHandler
NeerajMalhotra-QB Apr 13, 2023
5ea21fb
fixing dict keys
NeerajMalhotra-QB Apr 13, 2023
97d5784
Added Decorator to raise TypeError in case of unexpected input type f…
jaskaransinghsidana Apr 17, 2023
3ca17e7
Merge pull request #28 from NeerajMalhotra-QB/feature_pyspark_checks
jaskaransinghsidana Apr 17, 2023
504b531
Merge branch 'develop' into ps_err_dict
NeerajMalhotra-QB Apr 17, 2023
0e30211
Merge pull request #29 from NeerajMalhotra-QB/ps_err_dict
NeerajMalhotra-QB Apr 17, 2023
d69e68d
replace validator with report_errors
NeerajMalhotra-QB Apr 18, 2023
839e8f8
Merge pull request #30 from NeerajMalhotra-QB/new_validator
NeerajMalhotra-QB Apr 18, 2023
6d7d431
cleaning debugs
NeerajMalhotra-QB Apr 18, 2023
48847ad
Merge pull request #31 from NeerajMalhotra-QB/remove_breaks
NeerajMalhotra-QB Apr 18, 2023
7110b85
Support DataModels and Field
NeerajMalhotra-QB Apr 24, 2023
4109321
Added Decorator to raise TypeError in case of unexpected input type f…
jaskaransinghsidana Apr 24, 2023
9470d5c
Fix to run using the class schema type
jaskaransinghsidana Apr 24, 2023
b11a8c5
use alias types
NeerajMalhotra-QB Apr 24, 2023
06e4e6e
clean up
NeerajMalhotra-QB Apr 24, 2023
de034f8
Merge pull request #32 from NeerajMalhotra-QB/fields
NeerajMalhotra-QB Apr 24, 2023
ba38028
add new typing for pyspark.sql
NeerajMalhotra-QB Apr 24, 2023
59cbd62
Merge pull request #33 from NeerajMalhotra-QB/pyspark-sql-typing
NeerajMalhotra-QB Apr 24, 2023
35e5dba
Added Decorator to raise TypeError in case of unexpected input type f…
jaskaransinghsidana Apr 24, 2023
09f8133
Added changes to support raising error for use of datatype not suppor…
jaskaransinghsidana Apr 26, 2023
7c872b9
Merge remote-tracking branch 'origin/feature_pyspark_checks' into fea…
jaskaransinghsidana Apr 26, 2023
25aa055
Merge pull request #34 from NeerajMalhotra-QB/feature_pyspark_checks
jaskaransinghsidana Apr 26, 2023
bb692de
support bare dtypes for DataFrameModel
NeerajMalhotra-QB Apr 26, 2023
7e7d2b9
Merge branch 'develop' into bare_dtypes
NeerajMalhotra-QB Apr 26, 2023
3fcbfc8
Merge pull request #35 from NeerajMalhotra-QB/bare_dtypes
NeerajMalhotra-QB Apr 26, 2023
726cf06
remove resolved TODOs and breakpoints
NeerajMalhotra-QB Apr 26, 2023
ecc14c4
Merge pull request #36 from NeerajMalhotra-QB/cleanup
NeerajMalhotra-QB Apr 26, 2023
1ff8881
change to bare types
NeerajMalhotra-QB Apr 26, 2023
19c60b1
use spark types instead of bare types
NeerajMalhotra-QB Apr 27, 2023
c2d6732
using SchemaErrorReason instead of hardcode in container
NeerajMalhotra-QB Apr 27, 2023
a343e50
fixing an issue with error reason codes
NeerajMalhotra-QB Apr 27, 2023
a24d85e
Merge pull request #37 from NeerajMalhotra-QB/error_reason
NeerajMalhotra-QB Apr 27, 2023
ab82752
Merge branch 'develop' into field_tests
NeerajMalhotra-QB Apr 27, 2023
b45ae3d
minor fix
NeerajMalhotra-QB Apr 27, 2023
c67131e
Merge pull request #38 from NeerajMalhotra-QB/field_tests
NeerajMalhotra-QB Apr 27, 2023
0995c07
fixing checks and errors in pyspark
NeerajMalhotra-QB Apr 27, 2023
438d650
Merge pull request #39 from NeerajMalhotra-QB/spark_checks
NeerajMalhotra-QB Apr 27, 2023
9497f3e
Changes include the following:
jaskaransinghsidana Apr 28, 2023
a677bd1
Merge pull request #40 from NeerajMalhotra-QB/feature_pyspark_checks
jaskaransinghsidana Apr 28, 2023
4fb35c4
enhancing dataframeschema and model classes
NeerajMalhotra-QB Apr 28, 2023
d07e046
Merge branch 'develop' into update_columns
NeerajMalhotra-QB Apr 28, 2023
59e6e0a
Merge pull request #41 from NeerajMalhotra-QB/update_columns
NeerajMalhotra-QB Apr 28, 2023
c9fb451
Changes to remove the pandas dependency
jaskaransinghsidana May 2, 2023
90c5635
Refactoring of the checks test functions
jaskaransinghsidana May 2, 2023
6baefc8
Fixing the test case breaking
jaskaransinghsidana May 2, 2023
370a480
Merge pull request #42 from NeerajMalhotra-QB/fix_develop_test_failure
jaskaransinghsidana May 2, 2023
e4e9ec9
Isort and Black formatting
jaskaransinghsidana May 2, 2023
958c23b
Container Test function failure
jaskaransinghsidana May 2, 2023
4af744a
Merge pull request #44 from NeerajMalhotra-QB/fix_develop_test_failure
jaskaransinghsidana May 2, 2023
b047f49
Isort and black linting
jaskaransinghsidana May 2, 2023
020b0f0
Changes to remove the pandas dependency
jaskaransinghsidana May 2, 2023
344d37a
Refactoring of the checks test functions
jaskaransinghsidana May 2, 2023
5288988
Isort and black linting
jaskaransinghsidana May 2, 2023
f530045
Added Changes to refactor the checks class. Fixes to some test cases …
jaskaransinghsidana May 2, 2023
729f395
Merge remote-tracking branch 'origin/feature_pyspark_checks' into fea…
jaskaransinghsidana May 2, 2023
a342072
Merge pull request #46 from NeerajMalhotra-QB/feature_pyspark_checks
jaskaransinghsidana May 2, 2023
4748192
Removing breakpoint
jaskaransinghsidana May 2, 2023
2a7cf9b
Merge pull request #47 from NeerajMalhotra-QB/feature_pyspark_checks
jaskaransinghsidana May 2, 2023
2443915
fixing raise error
NeerajMalhotra-QB May 2, 2023
279d586
adding metadata dict
NeerajMalhotra-QB May 2, 2023
6fcfadd
Removing the reference of pandas from docstrings
jaskaransinghsidana May 3, 2023
7bc454e
Merge pull request #48 from NeerajMalhotra-QB/fix_docstring_pyspark
jaskaransinghsidana May 3, 2023
fe1b5c3
Removing redundant code block in utils
jaskaransinghsidana May 3, 2023
25de40d
Merge pull request #49 from NeerajMalhotra-QB/fix_docstring_pyspark
jaskaransinghsidana May 3, 2023
b968930
Changes to return dataframe with errors property
jaskaransinghsidana May 3, 2023
0546663
add accessor for errorHandler
NeerajMalhotra-QB May 3, 2023
c09e633
support errors access on pyspark.sql
NeerajMalhotra-QB May 3, 2023
26f0635
Merge pull request #50 from NeerajMalhotra-QB/err_accessor
NeerajMalhotra-QB May 3, 2023
904782d
updating pyspark error tcs
NeerajMalhotra-QB May 3, 2023
ce39a08
fixing model test cases
NeerajMalhotra-QB May 3, 2023
ab01cc4
adjusting errors to use pandera.errors
NeerajMalhotra-QB May 3, 2023
439c94f
use accessor instead of dict
NeerajMalhotra-QB May 3, 2023
f550125
Merge pull request #51 from NeerajMalhotra-QB/align_tcs
NeerajMalhotra-QB May 3, 2023
011f048
adding accessor
NeerajMalhotra-QB May 3, 2023
7d4ebeb
revert to develop
NeerajMalhotra-QB May 3, 2023
5ebd971
Merge pull request #52 from NeerajMalhotra-QB/fix_docstring_pyspark
NeerajMalhotra-QB May 3, 2023
178d8d0
Removal of imports which are not needed and improved test case.
jaskaransinghsidana May 4, 2023
945353f
Merge pull request #53 from NeerajMalhotra-QB/fix_docstring_pyspark
jaskaransinghsidana May 4, 2023
39668b4
setting independent pyspark import
NeerajMalhotra-QB May 4, 2023
eebdb36
pyspark imports
NeerajMalhotra-QB May 4, 2023
dfcd15d
Merge pull request #54 from NeerajMalhotra-QB/pyspark_imports
NeerajMalhotra-QB May 4, 2023
29fc93e
revert comments
NeerajMalhotra-QB May 4, 2023
47a4d04
Merge branch 'develop' into meta_tags
NeerajMalhotra-QB May 4, 2023
3015ca8
store and retrieve metadata at schema levels
NeerajMalhotra-QB May 4, 2023
0935605
adding metadata support
NeerajMalhotra-QB May 8, 2023
d11037d
Merge pull request #55 from NeerajMalhotra-QB/meta_tags
NeerajMalhotra-QB May 8, 2023
7aebaf2
Added changes to support parameter based run.
jaskaransinghsidana May 9, 2023
f6f2446
Changing the default value in config
jaskaransinghsidana May 9, 2023
6cde57c
Merge pull request #56 from NeerajMalhotra-QB/feature_kill_switch
NeerajMalhotra-QB May 9, 2023
5b01e87
change to consistent interface
NeerajMalhotra-QB May 9, 2023
1269eab
Merge pull request #57 from NeerajMalhotra-QB/validate
NeerajMalhotra-QB May 9, 2023
b01075c
Changes to remove config yaml and introduce environment variables for…
jaskaransinghsidana May 10, 2023
ae86903
cleaning api/pyspark
NeerajMalhotra-QB May 10, 2023
6e7cb8a
backend and tests
NeerajMalhotra-QB May 10, 2023
2d706a0
Merge pull request #59 from NeerajMalhotra-QB/refactor
NeerajMalhotra-QB May 10, 2023
36d648b
adding setter on errors accessors for pyspark
NeerajMalhotra-QB May 10, 2023
9d7fc7d
Merge pull request #60 from NeerajMalhotra-QB/getter_for_error
NeerajMalhotra-QB May 10, 2023
dd98d4c
reformatting error dict
NeerajMalhotra-QB May 11, 2023
97ac681
Merge pull request #61 from NeerajMalhotra-QB/pretty_errs
NeerajMalhotra-QB May 11, 2023
2254c79
Changes to remove config yaml and introduce environment variables for…
jaskaransinghsidana May 10, 2023
e330ddb
Changes to rename the config object and call only in utils.py
jaskaransinghsidana May 11, 2023
e708989
Merge remote-tracking branch 'origin/feature_kill_switch' into featur…
jaskaransinghsidana May 11, 2023
211a48e
Fixing merge conflict issue
jaskaransinghsidana May 11, 2023
58db213
Updating the test cases to support new checks types
jaskaransinghsidana May 11, 2023
a595ed3
Added individualized test for each configuration type.
jaskaransinghsidana May 12, 2023
4aea5b8
Removing unnecessary prints
jaskaransinghsidana May 12, 2023
0cb3144
The changes include the following:
jaskaransinghsidana May 15, 2023
c6e1992
Fix reference with wrong key in test_pyspark_schema_data_checks
jaskaransinghsidana May 15, 2023
3892316
minor change
NeerajMalhotra-QB May 15, 2023
97697fb
Merge pull request #58 from NeerajMalhotra-QB/feature_kill_switch
NeerajMalhotra-QB May 15, 2023
ccae3ed
Added Support for docstring substitution method.
jaskaransinghsidana May 16, 2023
922a5f3
Removing an extra indent
jaskaransinghsidana May 16, 2023
1d04b1b
Removing commented docstring substitution from __new__ method
jaskaransinghsidana May 16, 2023
f3f435e
remove union
NeerajMalhotra-QB May 16, 2023
1945ac9
cleaning
NeerajMalhotra-QB May 16, 2023
aa4f498
Merge pull request #64 from NeerajMalhotra-QB/ps_topics
NeerajMalhotra-QB May 16, 2023
3339ee2
Feature to add metadata dictionary for pandas schema
jaskaransinghsidana May 17, 2023
5a0d39c
Added test to check the docstring substitution decorator
jaskaransinghsidana May 17, 2023
8189e4f
Added test to check the docstring substitution decorator
jaskaransinghsidana May 17, 2023
067a64d
Merge pull request #63 from NeerajMalhotra-QB/feature_doc_string
jaskaransinghsidana May 17, 2023
1c2e521
Feature to add metadata dictionary for pandas schema
jaskaransinghsidana May 17, 2023
32d0656
Merge remote-tracking branch 'origin/feature_pandas_metadata' into fe…
jaskaransinghsidana May 18, 2023
736c381
Changes to ensure only pandas run does not import pyspark dependencies
jaskaransinghsidana May 18, 2023
c24bf4d
Fix of imports for pandas and pyspark for separation
jaskaransinghsidana May 18, 2023
6b6b20e
Rename the function from pyspark to pandas
jaskaransinghsidana May 18, 2023
bea1957
black lint and isort
jaskaransinghsidana May 18, 2023
d73c48d
black lint and isort
jaskaransinghsidana May 18, 2023
9420195
Merge remote-tracking branch 'origin/feature_doc_string' into feature…
jaskaransinghsidana May 19, 2023
070e495
Fixes of pylint issue and suppression wherever necessary
jaskaransinghsidana May 19, 2023
ba8d659
Merge pull request #65 from NeerajMalhotra-QB/feature_pandas_metadata
NeerajMalhotra-QB May 19, 2023
94f0e9a
Merge pull request #66 from NeerajMalhotra-QB/fix_linting_issue
NeerajMalhotra-QB May 19, 2023
91cd3ea
Fixes of mypy failures and redone black linting post changes.
jaskaransinghsidana May 22, 2023
97fc7bf
Merge pull request #67 from NeerajMalhotra-QB/fix_linting_issue
NeerajMalhotra-QB May 22, 2023
06583f5
Added new test cases, removed redundant codes and black lint.
jaskaransinghsidana May 24, 2023
125a574
Fixed the doc strings, added functionality and test for custom checks
jaskaransinghsidana May 26, 2023
a5564c6
add rst for pyspark.sql
NeerajMalhotra-QB May 26, 2023
8a4fd57
removing rst
NeerajMalhotra-QB May 26, 2023
6e1439f
Renamed check name and Fixed pylint and mypy issues
jaskaransinghsidana May 29, 2023
51feffc
Merge pull request #68 from NeerajMalhotra-QB/fix_linting_issue
jaskaransinghsidana May 29, 2023
15c502a
add rst for pyspark.sql
NeerajMalhotra-QB May 26, 2023
32bb483
Merge remote-tracking branch 'origin/readme' into readme
jaskaransinghsidana May 29, 2023
2b3189c
Fixed the doc strings, added functionality and test for custom checks
jaskaransinghsidana May 26, 2023
f81c31f
removing rst
NeerajMalhotra-QB May 26, 2023
8ba4c2c
Renamed check name and Fixed pylint and mypy issues
jaskaransinghsidana May 29, 2023
6d438cc
Merge remote-tracking branch 'origin/feature_doc_string' into feature…
jaskaransinghsidana May 29, 2023
7b857db
add rst for pyspark.sql
NeerajMalhotra-QB May 26, 2023
12f1272
Merge remote-tracking branch 'origin/readme' into readme
jaskaransinghsidana May 30, 2023
fb9e724
Rename for environment variable key name
jaskaransinghsidana May 30, 2023
2e4774b
removing rst
jaskaransinghsidana May 30, 2023
1fd48ea
Black lint
jaskaransinghsidana May 30, 2023
ebe3b48
Merge pull request #69 from NeerajMalhotra-QB/feature_doc_string
NeerajMalhotra-QB May 30, 2023
1c6df03
Removed daytime interval type
jaskaransinghsidana May 31, 2023
1cb4eda
Merge pull request #70 from NeerajMalhotra-QB/removed_daytime_interval
jaskaransinghsidana Jun 2, 2023
dafe4f3
Merge remote-tracking branch 'upstream/dev' into NeerajMalhotra-QB:un…
NeerajMalhotra-QB Jun 2, 2023
165c77c
refactor
NeerajMalhotra-QB Jun 2, 2023
a095bb7
override pyspark patching of __class_getitem__
cosmicBboy Jun 2, 2023
ac59da2
fixing mypy error
NeerajMalhotra-QB Jun 5, 2023
9264dfe
lint fixes
NeerajMalhotra-QB Jun 5, 2023
10593a0
lint fixes
NeerajMalhotra-QB Jun 5, 2023
23067aa
fixing more lint and type issues
NeerajMalhotra-QB Jun 5, 2023
974b3c9
fixing mypy issues
NeerajMalhotra-QB Jun 5, 2023
e5ca49c
fixing doctest
NeerajMalhotra-QB Jun 5, 2023
c8a1bee
doctest
NeerajMalhotra-QB Jun 5, 2023
06462d4
fixing doctest
NeerajMalhotra-QB Jun 5, 2023
30acb4a
adding doctest:metadata for pandas container classes
NeerajMalhotra-QB Jun 5, 2023
ea4537c
doctest
NeerajMalhotra-QB Jun 5, 2023
8312c9a
fixing doctest
NeerajMalhotra-QB Jun 6, 2023
12de1b9
fixing rst
NeerajMalhotra-QB Jun 6, 2023
70e7c62
black formatting
NeerajMalhotra-QB Jun 6, 2023
5b215b9
fixing str repr for DataFrameSchema across rst
NeerajMalhotra-QB Jun 6, 2023
8e7d24f
add ps.DataFrame
NeerajMalhotra-QB Jun 6, 2023
be88a04
fixing tests
NeerajMalhotra-QB Jun 6, 2023
85a52b2
fix lint
cosmicBboy Jun 8, 2023
6cd7aa4
use full class name in pandas accessor
cosmicBboy Jun 8, 2023
23d43e5
use os.environ instead of parameters.yaml
NeerajMalhotra-QB Jun 8, 2023
30bad1b
simplify config
cosmicBboy Jun 9, 2023
76 changes: 46 additions & 30 deletions asv_bench/benchmarks/dataframe_schema.py
@@ -2,9 +2,20 @@
 import pandas as pd

 from pandera import (
-Column, DataFrameSchema, Bool, Category, Check,
-DateTime, Float, Int, Object, String, Timedelta,
-check_input, check_output)
+Column,
+DataFrameSchema,
+Bool,
+Category,
+Check,
+DateTime,
+Float,
+Int,
+Object,
+String,
+Timedelta,
+check_input,
+check_output,
+)


 class Validate:
@@ -14,41 +25,46 @@ class Validate:

 def setup(self):
 self.schema = DataFrameSchema(
-{
-"a": Column(Int),
-"b": Column(Float),
-"c": Column(String),
-"d": Column(Bool),
-"e": Column(Category),
-"f": Column(Object),
-"g": Column(DateTime),
-"i": Column(Timedelta),
-},
-)
+{
+"a": Column(Int),
+"b": Column(Float),
+"c": Column(String),
+"d": Column(Bool),
+"e": Column(Category),
+"f": Column(Object),
+"g": Column(DateTime),
+"i": Column(Timedelta),
+},
+)
 self.df = pd.DataFrame(
-{
-"a": [1, 2, 3],
-"b": [1.1, 2.5, 9.9],
-"c": ["z", "y", "x"],
-"d": [True, True, False],
-"e": pd.Series(["c2", "c1", "c3"], dtype="category"),
-"f": [(3,), (2,), (1,)],
-"g": [pd.Timestamp("2015-02-01"),
-pd.Timestamp("2015-02-02"),
-pd.Timestamp("2015-02-03")],
-"i": [pd.Timedelta(1, unit="D"),
-pd.Timedelta(5, unit="D"),
-pd.Timedelta(9, unit="D")]
-})
+{
+"a": [1, 2, 3],
+"b": [1.1, 2.5, 9.9],
+"c": ["z", "y", "x"],
+"d": [True, True, False],
+"e": pd.Series(["c2", "c1", "c3"], dtype="category"),
+"f": [(3,), (2,), (1,)],
+"g": [
+pd.Timestamp("2015-02-01"),
+pd.Timestamp("2015-02-02"),
+pd.Timestamp("2015-02-03"),
+],
+"i": [
+pd.Timedelta(1, unit="D"),
+pd.Timedelta(5, unit="D"),
+pd.Timedelta(9, unit="D"),
+],
+}
+)

 def time_df_schema(self):
 self.schema.validate(self.df)

 def mem_df_schema(self):
-self.schema.validate(self.df)
+self.schema.validate(self.df)

 def peakmem_df_schema(self):
-self.schema.validate(self.df)
+self.schema.validate(self.df)


 class Decorators:
42 changes: 27 additions & 15 deletions asv_bench/benchmarks/series_schema.py
@@ -2,8 +2,20 @@
 import pandas as pd

 from pandera import (
-Column, DataFrameSchema, SeriesSchema, Bool, Category, Check,
-DateTime, Float, Int, Object, String, Timedelta, String)
+Column,
+DataFrameSchema,
+SeriesSchema,
+Bool,
+Category,
+Check,
+DateTime,
+Float,
+Int,
+Object,
+String,
+Timedelta,
+String,
+)


 class Validate:
@@ -13,23 +25,23 @@ class Validate:

 def setup(self):
 self.schema = SeriesSchema(
-String,
-checks=[
-Check(lambda s: s.str.startswith("foo")),
-Check(lambda s: s.str.endswith("bar")),
-Check(lambda x: len(x) > 3, element_wise=True)
-],
-nullable=False,
-unique=False,
-name="my_series")
-self.series = pd.Series(["foobar", "foobar", "foobar"],
-name="my_series")
+String,
+checks=[
+Check(lambda s: s.str.startswith("foo")),
+Check(lambda s: s.str.endswith("bar")),
+Check(lambda x: len(x) > 3, element_wise=True),
+],
+nullable=False,
+unique=False,
+name="my_series",
+)
+self.series = pd.Series(["foobar", "foobar", "foobar"], name="my_series")

 def time_series_schema(self):
 self.schema.validate(self.series)

 def mem_series_schema(self):
-self.schema.validate(self.series)
+self.schema.validate(self.series)

 def peakmem_series_schema(self):
-self.schema.validate(self.series)
+self.schema.validate(self.series)
3 changes: 0 additions & 3 deletions conf/pyspark/parameters.yaml

This file was deleted.
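
The commit log above mentions replacing `parameters.yaml` with `os.environ` (commits b01075c and 23d43e5). As a rough illustration of that kind of change, the sketch below reads validation settings from environment variables. It is not the code from this PR: the variable names `EXAMPLE_VALIDATION_ENABLED` and `EXAMPLE_VALIDATION_DEPTH`, the `ValidationConfig` class, and the default values are all assumptions made up for the example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationConfig:
    """Hypothetical settings object; field names are illustrative only."""

    enabled: bool = True
    depth: str = "SCHEMA_AND_DATA"


def load_config(environ=None) -> ValidationConfig:
    """Build a config from environment variables instead of a YAML file.

    `environ` defaults to os.environ; passing a plain dict makes the
    function easy to exercise in tests.
    """
    env = os.environ if environ is None else environ
    # Hypothetical variable names, chosen for this sketch only.
    raw_enabled = env.get("EXAMPLE_VALIDATION_ENABLED", "True")
    raw_depth = env.get("EXAMPLE_VALIDATION_DEPTH", "SCHEMA_AND_DATA")
    enabled = raw_enabled.strip().lower() in {"1", "true", "yes"}
    return ValidationConfig(enabled=enabled, depth=raw_depth.strip().upper())


config = load_config({"EXAMPLE_VALIDATION_ENABLED": "false"})
```

Reading from the environment means the setting can be changed per job or per deployment without shipping a config file next to the package, which is presumably why the YAML file could be deleted.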

2 changes: 2 additions & 0 deletions docs/source/conf.py
@@ -188,6 +188,7 @@
 )
 copybutton_prompt_is_regexp = True

+
 # this is a workaround to filter out forward reference issue in
 # sphinx_autodoc_typehints
 class FilterPandasTypeAnnotationWarning(pylogging.Filter):
@@ -215,6 +216,7 @@ def filter(self, record: pylogging.LogRecord) -> bool:
 FilterPandasTypeAnnotationWarning()
 )

+
 # based on pandas/doc/source/conf.py
 def linkcode_resolve(domain, info):
 """Determine the URL corresponding to Python object."""
5 changes: 3 additions & 2 deletions docs/source/dataframe_models.rst
@@ -205,10 +205,11 @@ You can easily convert a :class:`~pandera.api.pandas.model.DataFrameModel` class
 coerce=False,
 dtype=None,
 index=None,
-strict=False
+strict=False,
 name=InputSchema,
 ordered=False,
-unique_column_names=False
+unique_column_names=False,
+metadata=None,
 )>

 You can also use the :meth:`~pandera.api.pandas.model.DataFrameModel.validate` method to
10 changes: 6 additions & 4 deletions docs/source/dataframe_schemas.rst
@@ -810,10 +810,11 @@ data pipeline:
 coerce=False,
 dtype=None,
 index=None,
-strict=True
+strict=True,
 name=None,
 ordered=False,
-unique_column_names=False
+unique_column_names=False,
+metadata=None,
 )>

 If during the course of a data pipeline one of your columns is moved into the
@@ -858,10 +859,11 @@ the pipeline output.
 name=None,
 ordered=True
 )>,
-strict=True
+strict=True,
 name=None,
 ordered=False,
-unique_column_names=False
+unique_column_names=False,
+metadata=None,
 )>

5 changes: 3 additions & 2 deletions docs/source/schema_inference.rst
@@ -44,10 +44,11 @@ is a simple example:
 coerce=True,
 dtype=None,
 index=<Schema Index(name=None, type=DataType(int64))>,
-strict=False
+strict=False,
 name=None,
 ordered=False,
-unique_column_names=False
+unique_column_names=False,
+metadata=None,
 )>

2 changes: 1 addition & 1 deletion pandera/__init__.py
@@ -12,6 +12,7 @@
 from pandera.api.pandas.components import Column, Index, MultiIndex
 from pandera.api.pandas.model import DataFrameModel, SchemaModel
 from pandera.api.pandas.model_components import Field, check, dataframe_check
+from pandera.decorators import check_input, check_io, check_output, check_types
 from pandera.dtypes import (
 Bool,
 Category,
@@ -62,7 +63,6 @@
 import pandera.backends.pandas

 from pandera.schema_inference.pandas import infer_schema
-from pandera.decorators import check_input, check_io, check_output, check_types
 from pandera.version import __version__

4 changes: 2 additions & 2 deletions pandera/accessors/pandas_accessor.py
@@ -43,7 +43,7 @@ class PanderaDataFrameAccessor(PanderaAccessor):
 def check_schema_type(schema):
 if not isinstance(schema, DataFrameSchema):
 raise TypeError(
-f"schema arg must be a DataFrameSchema, found {type(schema)}"
+f"schema arg must be a {DataFrameSchema}, found {type(schema)}"
 )

@@ -55,5 +55,5 @@ class PanderaSeriesAccessor(PanderaAccessor):
 def check_schema_type(schema):
 if not isinstance(schema, SeriesSchema):
 raise TypeError(
-f"schema arg must be a SeriesSchema, found {type(schema)}"
+f"schema arg must be a {SeriesSchema}, found {type(schema)}"
 )
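
The two hunks above switch the error message from a hard-coded class name to interpolating the class object itself, so the message carries the class's module-qualified repr (matching commit 6cd7aa4, "use full class name in pandas accessor"). A minimal sketch of the effect, using a stand-in `DataFrameSchema` class defined locally for illustration rather than pandera's real one:

```python
class DataFrameSchema:
    """Stand-in class for this example only; not pandera's DataFrameSchema."""


def check_schema_type(schema):
    """Raise TypeError unless `schema` is a DataFrameSchema instance."""
    if not isinstance(schema, DataFrameSchema):
        # Interpolating the class object calls repr() on it, which yields
        # "<class 'module.DataFrameSchema'>" instead of the bare name.
        raise TypeError(
            f"schema arg must be a {DataFrameSchema}, found {type(schema)}"
        )


try:
    check_schema_type({"a": int})  # a dict is not a schema
except TypeError as exc:
    message = str(exc)
```

With the full class path in the message, a reader can tell a pandas-backed `DataFrameSchema` apart from a same-named class in another backend, which becomes relevant once a PySpark `DataFrameSchema` exists alongside it.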
27 changes: 10 additions & 17 deletions pandera/accessors/pyspark_sql_accessor.py
@@ -1,22 +1,16 @@
-"""Custom accessor functionality for PySpark.Sql.
+"""Custom accessor functionality for PySpark.Sql. Register pyspark accessor for pandera schema metadata.
 """

 import warnings
 from functools import wraps
-from typing import Optional, Union

+from typing import Optional

 from pandera.api.pyspark.container import DataFrameSchema
 from pandera.api.pyspark.error_handler import ErrorHandler

-"""Register pyspark accessor for pandera schema metadata."""


-Schemas = Union[DataFrameSchema]
-Errors = Union[ErrorHandler]
+Schemas = DataFrameSchema # type: ignore
+Errors = ErrorHandler # type: ignore


-# Todo Refactor to create a seperate module for panderaAccessor
 class PanderaAccessor:
 """Pandera accessor for pyspark object."""

@@ -27,7 +21,7 @@ def __init__(self, pyspark_obj):
 self._errors: Optional[Errors] = None

 @staticmethod
-def check_schema_type(schema: Schemas):
+def check_schema_type(schema: Schemas): # type: ignore
 """Abstract method for checking the schema type."""
 raise NotImplementedError

@@ -38,18 +32,18 @@ def add_schema(self, schema):
 return self._pyspark_obj

 @property
-def schema(self) -> Optional[Schemas]:
+def schema(self) -> Optional[Schemas]: # type: ignore
 """Access schema metadata."""
 return self._schema

 @property
-def errors(self) -> Optional[Errors]:
-"""Access errors data."""
+def errors(self) -> Optional[Errors]: # type: ignore
+"""Access errors details."""
 return self._errors

 @errors.setter
-def errors(self, value: dict):
-"""Set errors data."""
+def errors(self, value: Optional[Errors]): # type: ignore
+"""Set errors details."""
 self._errors = value

@@ -133,4 +127,3 @@ def check_schema_type(schema):


 register_dataframe_accessor("pandera")(PanderaDataFrameAccessor)
-# register_series_accessor("pandera")(PanderaSeriesAccessor)
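
The body of `register_dataframe_accessor` is collapsed in this diff, so the following is only a sketch of how an accessor of this shape can be attached to a class, assuming a caching descriptor similar in spirit to pandas' extension accessors. `FakeDataFrame`, `CachedAccessor`, and `register_accessor` are hypothetical names for illustration, and the `PanderaAccessor` here is a simplified copy of the interface visible in the hunks above (`add_schema`, `schema`, `errors` with a setter), not the PR's implementation.

```python
class CachedAccessor:
    """Non-data descriptor that builds the accessor once per instance."""

    def __init__(self, name, accessor_cls):
        self._name = name
        self._accessor_cls = accessor_cls

    def __get__(self, obj, owner):
        if obj is None:
            # Accessed on the class itself: expose the accessor class.
            return self._accessor_cls
        accessor = self._accessor_cls(obj)
        # Store on the instance so later lookups bypass this descriptor
        # and the schema/errors state is kept between accesses.
        object.__setattr__(obj, self._name, accessor)
        return accessor


def register_accessor(target_cls, name):
    """Return a decorator that attaches accessor_cls to target_cls as `name`."""

    def decorator(accessor_cls):
        setattr(target_cls, name, CachedAccessor(name, accessor_cls))
        return accessor_cls

    return decorator


class FakeDataFrame:
    """Stand-in for a pyspark.sql DataFrame; hypothetical, for the sketch."""


class PanderaAccessor:
    """Simplified accessor holding a schema and an errors report."""

    def __init__(self, obj):
        self._obj = obj
        self._schema = None
        self._errors = None

    def add_schema(self, schema):
        self._schema = schema
        return self._obj

    @property
    def schema(self):
        return self._schema

    @property
    def errors(self):
        return self._errors

    @errors.setter
    def errors(self, value):
        self._errors = value


register_accessor(FakeDataFrame, "pandera")(PanderaAccessor)

df = FakeDataFrame()
df.pandera.add_schema("my-schema")
df.pandera.errors = {"SCHEMA": []}
```

Caching the accessor on the instance is what lets `df.pandera.errors` survive between attribute lookups; without it each `df.pandera` access would create a fresh accessor with no schema or errors attached.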