fix: correct UUIDType partition representation for BucketTransform #2003

Open · wants to merge 3 commits into main
Conversation

DinGo4DEV

Rationale for this change

Resolves #2002
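For context on the bug being fixed: Iceberg's bucket transform hashes the value's serialized bytes with 32-bit Murmur3 and takes the non-negative result modulo the bucket count, so the partition value for a bucketed UUID column should be an int, not a str. A minimal self-contained sketch of that behavior (the helper names are hypothetical, not PyIceberg's actual API):

```python
import uuid


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Minimal 32-bit Murmur3 (x86 variant); sufficient for 16-byte UUID input."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    n = len(data) & ~3
    for i in range(0, n, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    k = 0
    tail = data[n:]
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if tail:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def bucket_uuid(value: uuid.UUID, num_buckets: int) -> int:
    # Hash the 16-byte big-endian UUID representation; the partition value is
    # the non-negative hash modulo the bucket count -- an int, never a str.
    return (murmur3_x86_32(value.bytes) & 0x7FFFFFFF) % num_buckets
```

The issue title (#2002) describes the int bucket value being incorrectly converted to a str in PartitionKey; this PR keeps it an int.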

Are these changes tested?

Tested locally. Test cases should be added for this later.

Are there any user-facing changes?

@DinGo4DEV DinGo4DEV marked this pull request as ready for review May 14, 2025 08:51
Contributor

@Fokko Fokko left a comment


@DinGo4DEV Thanks for working on this!

It would be good to throw in a test as well. It can be a simple test in test_writes.py where you write to a bucketed UUID table. It also looks like the linter is not happy; could you run make lint as well?

@Fokko
Contributor

Fokko commented May 16, 2025

@DinGo4DEV Again, thanks for working on this. As part of this review, I dug a bit deeper into the issues, and it looks like we're missing the Parquet LogicalTypeAnnotation (apache/arrow#46469) which causes interoperability issues with other readers.

@DinGo4DEV
Author

@Fokko Thank you for taking the time to review. I appreciate your thoughtful feedback and the effort you put into this. To fully support the UUID type, it looks like we'll need to wait for a new Arrow release (> 20.0.0). In the meantime, I’ll continue working on the test cases for my commits.

@Fokko
Contributor

Fokko commented May 16, 2025

@DinGo4DEV Yes, please do. My biggest concern is that we produce Parquet files that will not be supported by other implementations because of the missing logical annotation. Arrow releases pretty often, so it can be resolved within a reasonable timespan.

@Fokko
Contributor

Fokko commented May 16, 2025

@DinGo4DEV Good news, it looks like this is fixed in the next release of Arrow: apache/arrow#45866

@DinGo4DEV
Author

@Fokko TBR: after running the test case, I found that the identity transform of UUID is not supported for writing, because the value is bytes. So I tried rewriting the Avro writer and other related components.

  • The UUID is still stored as bytes in Parquet
  • Changed the identity partition value to the hex representation ec9b663b-062f-4200-a130-8de19c21b800 instead of the byte string value b'\xec\x9bf;\x06/B\x00\xa10\x8d\xe1\x9c!\xb8\x00'
data/uuid_bucket=0/uuid_identity=ec9b663b-062f-4200-a130-8de19c21b800
  |- xxxxx.parquet
data/uuid_bucket=1/uuid_identity=5f473c64-dbeb-449b-bdfa-b6b4185b1bde
  |- xxxxx.parquet

I'm not sure whether that is correct and compatible with other integrations, as I haven't tried partitioning a UUID with the identity transform in other projects before.
However, if this PR is accepted, we will still need to rewrite other test cases related to UUID.
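The bytes-to-hex mapping described above can be reproduced with the standard library's uuid module (the partition path layout is copied from the listing above):

```python
import uuid

# The 16-byte UUID value as stored in Parquet.
raw = b"\xec\x9bf;\x06/B\x00\xa10\x8d\xe1\x9c!\xb8\x00"

# Human-readable identity partition value derived from the raw bytes.
hex_repr = str(uuid.UUID(bytes=raw))
# hex_repr == "ec9b663b-062f-4200-a130-8de19c21b800"

# The partition path then uses the hex form instead of the raw byte string.
path = f"data/uuid_bucket=0/uuid_identity={hex_repr}"
```

The mapping is lossless: uuid.UUID(hex_repr).bytes recovers the original 16 bytes.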

@DinGo4DEV
Author

A similar problem was also noticed in apache/iceberg#13087.

@DinGo4DEV DinGo4DEV force-pushed the uuid-partition-representation branch from 253c559 to a83c87e on June 7, 2025 04:54
@DinGo4DEV
Author

Squashed commits and updated test cases for the UUID writer.

Successfully merging this pull request may close these issues.

UUIDType with BucketTransform incorrectly converts int to str in PartitionKey