[FEAT] Support hive partitioned reads #3029
Conversation
CodSpeed Performance Report: Merging #3029 will not alter performance.
Force-pushed from ae2ce64 to 90b38c3, then from 90b38c3 to eed2461.
Nice work, key thing for me here would be to make sure that our hive reads are on par with other systems, easiest one to compare with would probably be pyarrow.
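For instance, a quick parity check could time a full materialization of the same hive-partitioned dataset through both readers. This is a rough sketch, not part of the PR: the dataset path, glob pattern, and the choice of `to_arrow()`/`to_table()` as materialization points are all assumptions.

```python
import time

import daft
import pyarrow.dataset as ds

path = "data/hive_table"  # hypothetical hive-partitioned parquet dataset

# Time a full scan through Daft's new hive_partitioning reader.
start = time.perf_counter()
daft.read_parquet(f"{path}/**/*.parquet", hive_partitioning=True).to_arrow()
print("daft:   ", time.perf_counter() - start)

# Time the same scan through pyarrow's hive-aware dataset reader.
start = time.perf_counter()
ds.dataset(path, format="parquet", partitioning="hive").to_table()
print("pyarrow:", time.perf_counter() - start)
```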
Reviewed line:
def check_file(public_storage_io_config, read_fn, uri):
A couple of suggestions for more robust test coverage:
- Write partitioned tables to a temp dir via the pyarrow write_dataset API, setting the partitioning to the partition columns and partitioning_flavor='hive'; read back using both daft and pyarrow and assert equality (see the sketch after this list).
- Write partitioned tables to a temp dir via daft, read back using daft, and assert equality.
- Test the most common datatypes for partitioning: strings, dates, timestamps, ints. Test for NULLs as well (nulls are written as __HIVE_DEFAULT_PARTITION__).
- Try to separate out tests and reduce the number of assertions per test, especially if the assertions test different functionalities, i.e. have one test that reads and another test that reads with pushdowns. This may be a personal preference, but I do believe smaller tests are easier to reason about and debug.
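A minimal sketch of the first suggestion, assuming pytest's tmp_path fixture; the test name and column names are hypothetical, and the comparison goes through to_pydict() to sidestep minor dtype differences between the two readers:

```python
import pyarrow as pa
import pyarrow.dataset as ds

import daft


def test_hive_read_matches_pyarrow(tmp_path):
    table = pa.table(
        {
            "value": [1.0, 2.0, 3.0],
            "year": [2023, 2023, 2024],
            "country": ["US", "CA", "US"],
        }
    )
    # Write a hive-partitioned dataset, e.g. year=2023/country=US/....
    ds.write_dataset(
        table,
        tmp_path,
        format="parquet",
        partitioning=["year", "country"],
        partitioning_flavor="hive",
    )
    expected = ds.dataset(tmp_path, format="parquet", partitioning="hive").to_table()
    actual = daft.read_parquet(
        f"{tmp_path}/**/*.parquet", hive_partitioning=True
    ).to_arrow()
    # Sort both sides and compare as plain Python values so that dtype
    # differences (e.g. int32 vs int64 partition keys) don't matter.
    assert actual.sort_by("value").to_pydict() == expected.sort_by("value").to_pydict()
```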
Thanks for the suggestions! Updated the tests accordingly.
Had to do some wonky stuff due to internal inconsistencies in timestamp handling and in CSV null handling.
Co-authored-by: Colin Ho <chiuhong@usc.edu>
Any update on this crucial PR?
Mostly looks good to me, just some small comments.
Though I am curious, do you know why the tests are taking a little long? See: https://github.com/Eventual-Inc/Daft/actions/runs/11560540882/job/32177612980?pr=3029
Co-authored-by: Colin Ho <chiuhong@usc.edu>
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:
@@ Coverage Diff @@
## main #3029 +/- ##
==========================================
+ Coverage 79.00% 79.09% +0.08%
==========================================
Files 634 635 +1
Lines 76943 77167 +224
==========================================
+ Hits 60789 61035 +246
+ Misses 16154 16132 -22
@colin-ho turns out the tests were taking a while because we were generating many partitions, which led to many scan tasks. I reduced the number of partitions, so the tests should be roughly 10x less expensive now.
Hell yeah!
src/daft-scan/src/glob.rs (outdated)
Suggested change:
- generated_fields = generated_fields
-     .non_distinct_union(&Schema::new(vec![partition_field.field.clone()])?);
+ generated_fields.fields.insert(partition_field.field.name.clone(), partition_field.field.clone());
Could you just append to the existing schema fields instead of needing to make a new schema and then doing non distinct union?
Yep totally possible, shifted some stuff around to do this. Thank you again for all your attention!
Co-authored-by: Colin Ho <chiuhong@usc.edu>
Adds support for reading hive-style partitioned tables via a new optional hive_partitioning parameter for the read_{csv, json, parquet} functions. This support includes:
- Interpreting __HIVE_DEFAULT_PARTITION__ partition values as null values (same behaviour as Hive).
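A minimal usage sketch of the new parameter; the directory layout, glob pattern, and column names are hypothetical:

```python
import daft

# Hypothetical hive-style layout:
#   data/year=2023/month=01/part-0.parquet
#   data/year=2023/month=02/part-0.parquet
df = daft.read_parquet("data/**/*.parquet", hive_partitioning=True)
# "year" and "month" show up as regular columns parsed from the file
# paths; rows under a __HIVE_DEFAULT_PARTITION__ directory come back null.
df.show()
```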