[FEAT] Support hive partitioned reads #3029

desmondcheongzx · 2024-10-11T01:40:25Z

Adds support for reading hive-style partitioned tables via a new optional hive_partitioning parameter for the read_{csv, json, parquet} functions.

This support includes:

Partition pruning on hive partitions.
Schema inference on hive partition values (which can be overridden by user-provided schemas).
Support for interpreting __HIVE_DEFAULT_PARTITION__ partition values as null values (same behaviour as Hive).
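
To make the three points above concrete, here is a minimal stdlib-only sketch of how `key=value` path segments could be turned into typed partition values. It is not the code from this PR: `parse_hive_partitions` and `infer_value` are hypothetical names, the int/float/string inference order is an assumption, and so is the choice to ignore the file name segment. The URL decoding of keys and values follows the commit messages in this PR.

```python
from urllib.parse import unquote

# Sentinel that Hive writes for a null partition value.
HIVE_NULL = "__HIVE_DEFAULT_PARTITION__"


def parse_hive_partitions(path):
    """Collect key=value directory segments from a path (hypothetical helper)."""
    partitions = {}
    # Drop the last segment: it is assumed to be the file name, not a partition.
    for segment in path.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if not sep or not key:
            continue  # not a key=value segment
        value = unquote(value)
        # Keys are URL-decoded as well as values.
        partitions[unquote(key)] = None if value == HIVE_NULL else value
    return partitions


def infer_value(raw):
    """Assumed inference order: int, then float, otherwise keep the string."""
    if raw is None:
        return None
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


path = "s3://bucket/table/year=2024/country=New%20Zealand/region=__HIVE_DEFAULT_PARTITION__/part-0.parquet"
raw = parse_hive_partitions(path)
typed = {key: infer_value(value) for key, value in raw.items()}
```

A user-provided schema would take precedence over `infer_value` for any column it names; that override is left out of the sketch.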

codspeed-hq · 2024-10-11T01:51:32Z

CodSpeed Performance Report

Merging #3029 will not alter performance

Comparing desmondcheongzx:hive-partitioned-reads (2d191ef) with main (96c538b)

Summary

✅ 17 untouched benchmarks

colin-ho

Nice work. The key thing for me here is making sure that our hive reads are on par with other systems; the easiest one to compare with would probably be pyarrow.

src/daft-micropartition/src/micropartition.rs

src/daft-plan/src/builder.rs

src/daft-scan/src/glob.rs

src/daft-hive/src/lib.rs

colin-ho · 2024-10-16T23:00:54Z

tests/io/test_hive_style_partitions.py

+    )
+
+
+def check_file(public_storage_io_config, read_fn, uri):


A couple of suggestions for more robust test coverage.

Write partitioned tables to a temp dir via the pyarrow write_dataset API, with partitioning set to the partition columns and partitioning_flavor = 'hive'; read back using daft and pyarrow and assert equal.

Write partitioned tables to a temp dir via daft, read back using daft and assert equal.

Test the most common datatypes for partitioning: strings, dates, timestamps, ints. Test for NULLs as well (nulls are written as __HIVE_DEFAULT_PARTITION__).

Try to separate out tests and reduce the number of assertions per test, especially if the assertions test different functionalities, e.g. you can have one test that reads and another test that reads with pushdowns. This may be a personal preference, but I do believe smaller tests are easier to reason about and debug.
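
The shape of the tests suggested above can be sketched with the standard library alone. This is an illustration, not the test file from this PR: it uses plain CSV files instead of pyarrow or daft, and `write_hive_csv`, `read_hive_csv` and the `keep` predicate are hypothetical stand-ins for a writer, a reader and a partition filter pushdown. All values are strings so that the CSV roundtrip compares cleanly.

```python
import csv
import os
import tempfile
from urllib.parse import quote, unquote

HIVE_NULL = "__HIVE_DEFAULT_PARTITION__"

ROWS = [
    {"id": "1", "day": "2024-01-01", "tag": "a/b"},
    {"id": "2", "day": "2024-01-02", "tag": None},
    {"id": "3", "day": "2024-01-02", "tag": "x y"},
]


def encode(value):
    """URL-encode a partition value; nulls become the Hive sentinel."""
    return HIVE_NULL if value is None else quote(str(value), safe="")


def write_hive_csv(root, rows, partition_cols):
    """Write one CSV file per distinct combination of partition values."""
    groups = {}
    for row in rows:
        groups.setdefault(tuple(row[c] for c in partition_cols), []).append(row)
    for key, group in groups.items():
        parts = [f"{quote(c, safe='')}={encode(v)}" for c, v in zip(partition_cols, key)]
        directory = os.path.join(root, *parts)
        os.makedirs(directory, exist_ok=True)
        data_cols = [c for c in group[0] if c not in partition_cols]
        with open(os.path.join(directory, "part-0.csv"), "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=data_cols)
            writer.writeheader()
            writer.writerows({c: row[c] for c in data_cols} for row in group)


def read_hive_csv(root, keep=lambda partitions: True):
    """Read rows back; `keep` rejects a directory before any of its files is opened."""
    rows, files_opened = [], 0
    for directory, _, files in os.walk(root):
        partitions = {}
        for segment in os.path.relpath(directory, root).split(os.sep):
            key, sep, value = segment.partition("=")
            if sep:
                value = unquote(value)
                partitions[unquote(key)] = None if value == HIVE_NULL else value
        if not files or not keep(partitions):
            continue
        for name in files:
            files_opened += 1
            with open(os.path.join(directory, name), newline="") as f:
                rows.extend({**record, **partitions} for record in csv.DictReader(f))
    return rows, files_opened


def test_roundtrip_reads_all_rows():
    with tempfile.TemporaryDirectory() as root:
        write_hive_csv(root, ROWS, ["day", "tag"])
        got, _ = read_hive_csv(root)
    assert sorted(got, key=lambda r: r["id"]) == ROWS


def test_pushdown_opens_only_matching_files():
    with tempfile.TemporaryDirectory() as root:
        write_hive_csv(root, ROWS, ["day", "tag"])
        got, opened = read_hive_csv(root, keep=lambda p: p["day"] == "2024-01-01")
    assert got == [ROWS[0]]
    assert opened == 1
```

Keeping the plain read and the pushdown read in separate test functions means a failure points at one behaviour only, which is the motivation given in the last suggestion.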

Thanks for the suggestions! Updated the tests accordingly.

Had to do some wonky stuff due to internal inconsistencies with handling timestamps, and null handling with CSV.

daft/io/_csv.py

src/daft-scan/src/lib.rs

tests/io/iceberg/test_iceberg_writes.py

Co-authored-by: Colin Ho <chiuhong@usc.edu>

jaychia · 2024-10-23T03:26:09Z

Any update on this crucial PR?

colin-ho

Mostly looks good to me, just some small comments.

Though I am curious, do you know why the tests are taking a little long? See: https://github.com/Eventual-Inc/Daft/actions/runs/11560540882/job/32177612980?pr=3029

src/daft-micropartition/src/micropartition.rs

src/daft-scan/src/glob.rs

src/daft-scan/src/lib.rs

tests/io/test_hive_style_partitions.py

src/daft-plan/src/builder.rs

src/daft-scan/src/lib.rs

Co-authored-by: Colin Ho <chiuhong@usc.edu>

codecov · 2024-11-04T22:33:16Z

Codecov Report

Attention: Patch coverage is 96.57321% with 11 lines in your changes missing coverage. Please review.

Project coverage is 79.09%. Comparing base (3cef614) to head (32e1dd1).

Files with missing lines	Patch %	Lines
src/daft-plan/src/builder.rs	82.75%	5 Missing ⚠️
src/daft-scan/src/lib.rs	88.88%	3 Missing ⚠️
src/daft-scan/src/glob.rs	96.61%	2 Missing ⚠️
src/daft-scan/src/hive.rs	99.47%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3029      +/-   ##
==========================================
+ Coverage   79.00%   79.09%   +0.08%     
==========================================
  Files         634      635       +1     
  Lines       76943    77167     +224     
==========================================
+ Hits        60789    61035     +246     
+ Misses      16154    16132      -22

Files with missing lines	Coverage Δ
daft/io/_csv.py	`95.65% <ø> (ø)`
daft/io/_json.py	`91.30% <ø> (ø)`
daft/io/_parquet.py	`86.20% <ø> (ø)`
daft/io/common.py	`85.00% <ø> (ø)`
src/common/error/src/error.rs	`84.84% <ø> (ø)`
src/daft-micropartition/src/micropartition.rs	`90.81% <100.00%> (ø)`
src/daft-micropartition/src/ops/cast_to_schema.rs	`100.00% <100.00%> (ø)`
src/daft-scan/src/anonymous.rs	`77.77% <100.00%> (+0.85%)`	⬆️
src/daft-scan/src/python.rs	`76.01% <100.00%> (+0.29%)`	⬆️
src/daft-scan/src/scan_task_iters.rs	`96.95% <100.00%> (ø)`
... and 5 more

... and 9 files with indirect coverage changes

desmondcheongzx · 2024-11-04T22:59:58Z

@colin-ho turns out the tests were taking a while because we were generating many partitions, which led to many scan tasks. I reduced the number of partitions so the tests should be roughly 10x less expensive now.
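
As a rough illustration of why the partition count matters, the number of leaf directories is bounded by the product of the distinct values per partition column. The sketch below assumes that each leaf directory holds at least one file and that each file leads to at least one scan task; the function name and the cardinalities are hypothetical and are not taken from the actual tests.

```python
from math import prod


def max_leaf_directories(distinct_values_per_column):
    """Upper bound on leaf directories: every combination of partition values occurs."""
    return prod(distinct_values_per_column)


# Hypothetical cardinalities: dropping a 10-value partition column
# cuts the bound by a factor of 10.
before = max_leaf_directories([10, 10, 10])
after = max_leaf_directories([10, 10])
```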

colin-ho

Hell yeah!

src/daft-scan/src/glob.rs

colin-ho · 2024-11-05T06:19:54Z

src/daft-scan/src/glob.rs

+            generated_fields = generated_fields
+                .non_distinct_union(&Schema::new(vec![partition_field.field.clone()])?);


Suggested change

-            generated_fields = generated_fields
-                .non_distinct_union(&Schema::new(vec![partition_field.field.clone()])?);
+            generated_fields.fields.insert(partition_field.field.name.clone(), partition_field.field.clone());

Could you just append to the existing schema fields instead of needing to make a new schema and then doing non distinct union?

Yep totally possible, shifted some stuff around to do this. Thank you again for all your attention!

Co-authored-by: Colin Ho <chiuhong@usc.edu>

github-actions bot added the enhancement New feature or request label Oct 11, 2024

desmondcheongzx force-pushed the hive-partitioned-reads branch 6 times, most recently from ae2ce64 to 90b38c3 Compare October 15, 2024 09:49

Implement hive partitioned reads

eed2461

desmondcheongzx force-pushed the hive-partitioned-reads branch from 90b38c3 to eed2461 Compare October 15, 2024 09:54

desmondcheongzx added 2 commits October 15, 2024 14:02

No partition fields no problems

fa30c24

Add tests

d61095c

desmondcheongzx requested a review from colin-ho October 15, 2024 23:04

colin-ho reviewed Oct 16, 2024

desmondcheongzx and others added 5 commits October 16, 2024 16:59

Update src/daft-scan/src/lib.rs

6ae400b

Co-authored-by: Colin Ho <chiuhong@usc.edu>

Update daft/io/_csv.py

bbc331d

Co-authored-by: Colin Ho <chiuhong@usc.edu>

Update src/daft-scan/src/glob.rs

ec3f798

Co-authored-by: Colin Ho <chiuhong@usc.edu>

Update src/daft-scan/src/glob.rs

4cefe87

Co-authored-by: Colin Ho <chiuhong@usc.edu>

Checkpoint address some review comments

549b8b8

desmondcheongzx added 9 commits October 23, 2024 17:39

Add url decoding; support __HIVE_DEFAULT_PARTITION__

5b20dc7

Keys should also be decoded

5245486

Make generated fields optional

935c35b

Fold file path column entirely into generated fields

b691ed3

Add unit tests for partitioning logic

06c9780

Finally fixed tests

9e9cdf7

Add daft roundtrip tests

8446076

Merge remote-tracking branch 'daft/main' into hive-partitioned-reads

4783d20

Disable roundtrip test for swordfish

32aed6d

desmondcheongzx requested a review from colin-ho October 29, 2024 16:02

colin-ho reviewed Nov 1, 2024

desmondcheongzx and others added 8 commits November 4, 2024 12:54

Merge remote-tracking branch 'daft/main' into hive-partitioned-reads

c88ade6

Update src/daft-micropartition/src/micropartition.rs

c428906

Co-authored-by: Colin Ho <chiuhong@usc.edu>

Remove swordfish exception

ecddf0a

Switch to schemaref

fce0d77

Reduce number of partitions used in tests

8c9d264

Elide extraneous clone

afa1f0a

Remove clone_source_field()

32e1dd1

Delay conversion of partition values into a 1D table

d4d75ee

desmondcheongzx added 2 commits November 4, 2024 14:54

Elide unnecessary table creation

658503e

Remove unneeded indexmap import

b78b9e0

desmondcheongzx requested a review from colin-ho November 4, 2024 23:00

colin-ho approved these changes Nov 5, 2024

desmondcheongzx and others added 2 commits November 5, 2024 00:06

Update src/daft-scan/src/glob.rs

18b9994

Co-authored-by: Colin Ho <chiuhong@usc.edu>

Address review comment

2d191ef

desmondcheongzx enabled auto-merge (squash) November 5, 2024 08:39

desmondcheongzx merged commit c1d82c5 into Eventual-Inc:main Nov 5, 2024
40 checks passed

desmondcheongzx deleted the hive-partitioned-reads branch November 5, 2024 09:32

colin-ho mentioned this pull request Nov 6, 2024

Implementing hive-style read #2957

Closed
