fix: correct UUIDType partition representation for BucketTransform #2003

Open · wants to merge 3 commits into main
Conversation

DinGo4DEV

Rationale for this change

Resolves #2002
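For context on the bug being fixed: Iceberg's bucket transform hashes the value's serialized bytes with 32-bit Murmur3 and takes the non-negative result modulo the bucket count, so the partition value for a bucketed UUID column should be an int, not a str. A minimal self-contained sketch of that behavior (the helper names are hypothetical, not PyIceberg's actual API):

```python
import uuid


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Minimal 32-bit Murmur3 (x86 variant); sufficient for 16-byte UUID input."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    n = len(data) & ~3
    for i in range(0, n, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    k = 0
    tail = data[n:]
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if tail:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def bucket_uuid(value: uuid.UUID, num_buckets: int) -> int:
    # Hash the 16-byte big-endian UUID representation; the partition value is
    # the non-negative hash modulo the bucket count -- an int, never a str.
    return (murmur3_x86_32(value.bytes) & 0x7FFFFFFF) % num_buckets
```

The issue title (#2002) describes the int bucket value being incorrectly converted to a str in PartitionKey; this PR keeps it an int.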

Are these changes tested?

Tested locally. Test cases should be added for this later.

Are there any user-facing changes?

@DinGo4DEV DinGo4DEV marked this pull request as ready for review May 14, 2025 08:51
Contributor

@Fokko Fokko left a comment


@DinGo4DEV Thanks for working on this!

It would be good to throw in a test as well. It can be a simple test in test_writes.py where you write to a bucketed UUID table. It also looks like the linter is not happy; could you run make lint as well?

@Fokko
Contributor

Fokko commented May 16, 2025

@DinGo4DEV Again, thanks for working on this. As part of this review, I dug a bit deeper into the issues, and it looks like we're missing the Parquet LogicalTypeAnnotation (apache/arrow#46469) which causes interoperability issues with other readers.

@DinGo4DEV
Author

@Fokko Thank you for taking the time to review. I appreciate your thoughtful feedback and the effort you put into this. To fully support the UUID type, it looks like we'll need to wait for a new Arrow release (> 20.0.0). In the meantime, I’ll continue working on the test cases for my commits.

@Fokko
Contributor

Fokko commented May 16, 2025

@DinGo4DEV Yes, please do. My biggest concern is that we produce Parquet files that will not be supported by other implementations because of the missing logical annotation. Arrow releases pretty often, so it can be resolved within a reasonable timespan.

@Fokko
Contributor

Fokko commented May 16, 2025

@DinGo4DEV Good news, it looks like this is fixed in the next release of Arrow: apache/arrow#45866

@DinGo4DEV
Author

@Fokko TBR: after running the test case, I found that the identity transform of UUID is not supported for writing, because the value is bytes. So I tried rewriting the Avro writer and other related components.

  • The UUID is still stored as bytes in Parquet
  • Changed the identity partition value to the hex representation ec9b663b-062f-4200-a130-8de19c21b800 instead of the byte string value b'\xec\x9bf;\x06/B\x00\xa10\x8d\xe1\x9c!\xb8\x00'
data/uuid_bucket=0/uuid_identity=ec9b663b-062f-4200-a130-8de19c21b800
  |- xxxxx.parquet
data/uuid_bucket=1/uuid_identity=5f473c64-dbeb-449b-bdfa-b6b4185b1bde
  |- xxxxx.parquet

I'm not sure whether that is correct and compatible with other integrations, as I haven't tried partitioning a UUID with the identity transform in other projects before.
However, if this PR is accepted, we will still need to rewrite other test cases related to UUID.
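The bytes-to-hex mapping described above can be reproduced with the standard library's uuid module (the partition path layout is copied from the listing above):

```python
import uuid

# The 16-byte UUID value as stored in Parquet.
raw = b"\xec\x9bf;\x06/B\x00\xa10\x8d\xe1\x9c!\xb8\x00"

# Human-readable identity partition value derived from the raw bytes.
hex_repr = str(uuid.UUID(bytes=raw))
# hex_repr == "ec9b663b-062f-4200-a130-8de19c21b800"

# The partition path then uses the hex form instead of the raw byte string.
path = f"data/uuid_bucket=0/uuid_identity={hex_repr}"
```

The mapping is lossless: uuid.UUID(hex_repr).bytes recovers the original 16 bytes.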

@DinGo4DEV
Author

A similar problem was also noticed in apache/iceberg#13087.

@DinGo4DEV DinGo4DEV force-pushed the uuid-partition-representation branch from 253c559 to a83c87e on June 7, 2025 04:54
@DinGo4DEV
Author

Squashed commits and updated test cases for the UUID writer.

Successfully merging this pull request may close these issues.

UUIDType with BucketTransform incorrectly converts int to str in PartitionKey