-
Notifications
You must be signed in to change notification settings - Fork 302
fix: correct UUIDType
partition representation for BucketTransform
#2003
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DinGo4DEV Thanks for working on this!
It would be good to throw in a test as well. It can be a simple test in test_writes.py
where you write to a bucket UUID table. It also looks like there linter is not happy, could you run make lint
as well?
@DinGo4DEV Again, thanks for working on this. As part of this review, I dug a bit deeper into the issues, and it looks like we're missing the Parquet LogicalTypeAnnotation (apache/arrow#46469) which causes interoperability issues with other readers. |
@Fokko Thank you for taking the time to review. I appreciate your thoughtful feedback and the effort you put into this. To fully support the UUID type, it looks like we'll need to wait for a new Arrow release (> 20.0.0). In the meantime, I’ll continue working on the test cases for my commits. |
@DinGo4DEV Yes, please do. My biggest concern is that we produce Parquet files that will not be supported by other implementations because of the missing logical annotation. Arrow releases pretty often, so it can be resolved within reasonable timespan. |
@DinGo4DEV Good news, it looks like this is fixed in the next release of Arrow: apache/arrow#45866 |
@Fokko TBR, After running the test case, I found that the identity transform of uuid is not supported for writing, because the value is bytes. So I tried rewrite the Avro writer and other related components.
data/uuid_bucket=0/uuid_identity=ec9b663b-062f-4200-a130-8de19c21b800
|- xxxxx.parquet
data/uuid_bucket=1/uuid_identity=5f473c64-dbeb-449b-bdfa-b6b4185b1bde
|- xxxxx.parquet Not sure is that correct and compatible with other integration as I haven't tried partition the uuid with identity before in other projects. |
They also noticed that kind of problem in (apache/iceberg#13087) |
253c559
to
a83c87e
Compare
Squash commits and update testcases for uuid writer |
Rationale for this change
Resolves #2002
Are these changes tested?
Tested Locally. Should add testcases for this later
Are there any user-facing changes?