Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Support hive partitioned reads #3029

Merged

Conversation

desmondcheongzx
Copy link
Contributor

@desmondcheongzx desmondcheongzx commented Oct 11, 2024

Adds support for reading hive-style partitioned tables via a new optional hive_partitioning parameter for the read_{csv, json, parquet} functions.

This support includes:

  1. Partitioning pruning on hive partitions.
  2. Schema inference on hive partition values (which can overridden by user-provided schemas).
  3. Support for interpreting __HIVE_DEFAULT_PARTITIONS__ partition values as null values (same behaviour as Hive).

@github-actions github-actions bot added the enhancement New feature or request label Oct 11, 2024
Copy link

codspeed-hq bot commented Oct 11, 2024

CodSpeed Performance Report

Merging #3029 will not alter performance

Comparing desmondcheongzx:hive-partitioned-reads (2d191ef) with main (96c538b)

Summary

✅ 17 untouched benchmarks

@desmondcheongzx desmondcheongzx force-pushed the hive-partitioned-reads branch 6 times, most recently from ae2ce64 to 90b38c3 Compare October 15, 2024 09:49
Copy link
Contributor

@colin-ho colin-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, key thing for me here would be to make sure that our hive reads are on par with other systems, easiest one to compare with would probably be pyarrow.

src/daft-micropartition/src/micropartition.rs Outdated Show resolved Hide resolved
src/daft-plan/src/builder.rs Outdated Show resolved Hide resolved
src/daft-scan/src/glob.rs Outdated Show resolved Hide resolved
src/daft-scan/src/glob.rs Outdated Show resolved Hide resolved
src/daft-hive/src/lib.rs Outdated Show resolved Hide resolved
src/daft-hive/src/lib.rs Outdated Show resolved Hide resolved
)


def check_file(public_storage_io_config, read_fn, uri):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple suggestions for more robust test coverage.

  • Write partitioned tables to a temp dir via pyarrow write dataset api, set the partitioning to the partition columns and partitioning_flavor = 'HIVE', read_back using daft and pyarrow and assert equal.
  • Write partitioned tables to a temp dir via daft, read_back using daft and assert equal.
  • Test the most common datatypes for partitioning, strings, dates, timestamps, ints. Test for NULLs as well (nulls are HIVE_DEFAULT_PARTITION).
  • Try to separate out tests and reduce the number of assertions per test, especially if the assertions test different functionalities . i.e. you can have a test that reads, and another test that reads with pushdowns. This may be a personal preference, but I do believe smaller tests are easier to reason about and debug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestions! Updated the tests accordingly.

Had to do some wonky stuff due to internal inconsistencies with handling timestamps, and null handling with CSV.

daft/io/_csv.py Outdated Show resolved Hide resolved
src/daft-scan/src/lib.rs Outdated Show resolved Hide resolved
tests/io/iceberg/test_iceberg_writes.py Outdated Show resolved Hide resolved
desmondcheongzx and others added 5 commits October 16, 2024 16:59
Co-authored-by: Colin Ho <chiuhong@usc.edu>
Co-authored-by: Colin Ho <chiuhong@usc.edu>
Co-authored-by: Colin Ho <chiuhong@usc.edu>
Co-authored-by: Colin Ho <chiuhong@usc.edu>
@jaychia
Copy link
Contributor

jaychia commented Oct 23, 2024

Any update on this crucial PR?

Copy link
Contributor

@colin-ho colin-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good to me, just some small comments.

Though I am curious, do you know why the tests are taking a little long? See: https://github.com/Eventual-Inc/Daft/actions/runs/11560540882/job/32177612980?pr=3029

Screenshot 2024-10-31 at 10 40 07 PM

src/daft-micropartition/src/micropartition.rs Outdated Show resolved Hide resolved
src/daft-scan/src/glob.rs Outdated Show resolved Hide resolved
src/daft-scan/src/lib.rs Outdated Show resolved Hide resolved
tests/io/test_hive_style_partitions.py Outdated Show resolved Hide resolved
src/daft-plan/src/builder.rs Outdated Show resolved Hide resolved
src/daft-scan/src/lib.rs Outdated Show resolved Hide resolved
src/daft-scan/src/lib.rs Outdated Show resolved Hide resolved
src/daft-scan/src/lib.rs Outdated Show resolved Hide resolved
Copy link

codecov bot commented Nov 4, 2024

Codecov Report

Attention: Patch coverage is 96.57321% with 11 lines in your changes missing coverage. Please review.

Project coverage is 79.09%. Comparing base (3cef614) to head (32e1dd1).

Files with missing lines Patch % Lines
src/daft-plan/src/builder.rs 82.75% 5 Missing ⚠️
src/daft-scan/src/lib.rs 88.88% 3 Missing ⚠️
src/daft-scan/src/glob.rs 96.61% 2 Missing ⚠️
src/daft-scan/src/hive.rs 99.47% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3029      +/-   ##
==========================================
+ Coverage   79.00%   79.09%   +0.08%     
==========================================
  Files         634      635       +1     
  Lines       76943    77167     +224     
==========================================
+ Hits        60789    61035     +246     
+ Misses      16154    16132      -22     
Files with missing lines Coverage Δ
daft/io/_csv.py 95.65% <ø> (ø)
daft/io/_json.py 91.30% <ø> (ø)
daft/io/_parquet.py 86.20% <ø> (ø)
daft/io/common.py 85.00% <ø> (ø)
src/common/error/src/error.rs 84.84% <ø> (ø)
src/daft-micropartition/src/micropartition.rs 90.81% <100.00%> (ø)
src/daft-micropartition/src/ops/cast_to_schema.rs 100.00% <100.00%> (ø)
src/daft-scan/src/anonymous.rs 77.77% <100.00%> (+0.85%) ⬆️
src/daft-scan/src/python.rs 76.01% <100.00%> (+0.29%) ⬆️
src/daft-scan/src/scan_task_iters.rs 96.95% <100.00%> (ø)
... and 5 more

... and 9 files with indirect coverage changes

@desmondcheongzx
Copy link
Contributor Author

@colin-ho turns out the tests were taking awhile because we were generating many partitions which led to many scan tasks. I reduced the number of partitions so the tests should be roughly 10x less expensive now.

Copy link
Contributor

@colin-ho colin-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hell yeah!

src/daft-scan/src/glob.rs Outdated Show resolved Hide resolved
Comment on lines 199 to 200
generated_fields = generated_fields
.non_distinct_union(&Schema::new(vec![partition_field.field.clone()])?);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
generated_fields = generated_fields
.non_distinct_union(&Schema::new(vec![partition_field.field.clone()])?);
generated_fields.fields.insert(partition_field.field.name.clone(), partition_field.field.clone());

Could you just append to the existing schema fields instead of needing to make a new schema and then doing non distinct union?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep totally possible, shifted some stuff around to do this. Thank you again for all your attention!

desmondcheongzx and others added 2 commits November 5, 2024 00:06
Co-authored-by: Colin Ho <chiuhong@usc.edu>
@desmondcheongzx desmondcheongzx enabled auto-merge (squash) November 5, 2024 08:39
@desmondcheongzx desmondcheongzx merged commit c1d82c5 into Eventual-Inc:main Nov 5, 2024
40 checks passed
@desmondcheongzx desmondcheongzx deleted the hive-partitioned-reads branch November 5, 2024 09:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants