Add example binary variant data and regeneration scripts #76
Conversation
Force-pushed from 9c1060c to 5d3d869
-- One row with a value from each type listed in
-- https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types
--
-- Spark Types: https://spark.apache.org/docs/latest/sql-ref-datatypes.html
Here is the Spark SQL script used to create the various examples.
I think this is ready for a look. I have spot-checked the actual binary values that came out (though I haven't manually checked all of them) and they look as expected. If this format is acceptable I will double-check all the values manually.
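For anyone else spot-checking bytes: per VariantEncoding.md, the first byte of a Variant value packs the basic type in its low 2 bits and the type-specific info in the high 6 bits. A minimal Python sketch of that header decode (the `decode_header` helper and the dict names are mine, not from the spec or this PR, and only a few primitive type IDs are listed):

```python
# Decode the header byte of a Variant value per VariantEncoding.md.
# Helper and constant names below are illustrative, not from the spec.

BASIC_TYPES = {0: "primitive", 1: "short_string", 2: "object", 3: "array"}

# A few primitive type IDs from the spec's encoding-types table.
PRIMITIVE_TYPES = {0: "null", 1: "true", 2: "false", 3: "int8", 7: "double",
                   14: "float", 15: "binary", 16: "string"}

def decode_header(header: int):
    """Split a value header byte into (basic_type, type_info)."""
    basic = BASIC_TYPES[header & 0b11]  # low 2 bits: basic type
    info = header >> 2                  # high 6 bits: type-specific info
    if basic == "primitive":
        return basic, PRIMITIVE_TYPES.get(info, f"primitive_{info}")
    return basic, info  # for short_string, info is the byte length

# The 'primitive_float' value above should start with (14 << 2) | 0 = 0x38
print(decode_header(0x38))  # ('primitive', 'float')
```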
🦗
Today at the Parquet sync @emkornfield said he might have some time to review this PR. If you don't have time, perhaps you can suggest some other people who might be able to review.
variant/data_dictionary.json
Outdated
null
],
"type": "if"
}
it might be nice to add a null at the top level here?
done in 8c989a8
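As a side note for anyone comparing bytes on that top-level null: per VariantEncoding.md, a Variant null (and likewise boolean true/false) is just a single header byte with basic type 0 (primitive) and a zero-length payload. A quick Python sketch (the helper name is mine, not from the spec):

```python
# Zero-payload primitives are encoded as a single header byte:
# (type_id << 2) | basic_type, with basic_type 0 = primitive.
# Helper name is illustrative, not from VariantEncoding.md.

def encode_primitive_header(type_id: int) -> bytes:
    """Header byte for a zero-payload primitive value."""
    primitive_basic_type = 0
    return bytes([(type_id << 2) | primitive_basic_type])

NULL, TRUE, FALSE = 0, 1, 2  # primitive type IDs from the spec's table
print(encode_primitive_header(NULL).hex(),   # 00
      encode_primitive_header(TRUE).hex(),   # 04
      encode_primitive_header(FALSE).hex())  # 08
```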
variant/regen.py
Outdated
INSERT INTO T VALUES ('primitive_float', 1234567890.1234::Float::Variant);
INSERT INTO T VALUES ('primitive_binary', X'31337deadbeefcafe'::Variant);
INSERT INTO T VALUES ('primitive_string', 'This string is longer than 64 bytes and therefore does not fit in a short_string and it also includes several non ascii characters such as 🐢, 💖, ♥️, 🎣 and 🤦!!'::Variant);
-- It is not clear how to create these types using Spark SQL
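On the 64-byte comment above: my reading of VariantEncoding.md is that a short string stores its byte length in the 6-bit type-info field of the header, so only strings up to 63 bytes fit; anything longer must fall back to the primitive string encoding. A Python sketch (the helper name is mine):

```python
# Short strings use basic type 1 and carry their length in the
# 6-bit type-info field, so the maximum is 63 bytes.
# Helper name is illustrative, not from the spec.

def encode_short_string(s: str) -> bytes:
    data = s.encode("utf-8")
    assert len(data) < 64, "longer strings need the primitive string encoding"
    short_string_basic_type = 1
    return bytes([(len(data) << 2) | short_string_basic_type]) + data

print(encode_short_string("hi").hex())  # 096869
```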
I think these were added after the Spark implementation, so we likely need either a PR to Spark, or maybe Rust can take the lead once it has them done.
Mostly looks good, it would be nice to at least de-dupe the .gitignore, I think everything else is probably optional.
Thank you @emkornfield -- I will address your comments shortly and manually review the binary values.
I manually reviewed the binary encodings for primitive types and they match. I am actually having trouble manually verifying the nested object metadata; I will continue to investigate. I did verify that using pyspark built from main as of today still generates the same variant binary values.
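In case it helps with verifying the nested object metadata: my reading of VariantEncoding.md is that the metadata blob is laid out as header, dictionary size, offsets, then the string bytes, where the header byte packs the version (low 4 bits), a sorted-strings flag (bit 4), and offset size minus one (bits 6-7). A minimal Python decoder sketch (all names are mine; only checked against tiny hand-built examples, not against the actual Spark output in this PR):

```python
# Decode a Variant metadata dictionary per my reading of VariantEncoding.md.
# Layout: <header><dictionary_size><offset 0..N><string bytes>.
# Names are illustrative, not from the spec or this PR.

def decode_metadata(buf: bytes):
    header = buf[0]
    version = header & 0x0F                   # low 4 bits, must be 1
    sorted_strings = bool((header >> 4) & 1)  # bit 4
    offset_size = ((header >> 6) & 0b11) + 1  # 1..4 bytes per offset
    assert version == 1, "unsupported metadata version"

    def read_uint(pos: int) -> int:
        return int.from_bytes(buf[pos:pos + offset_size], "little")

    dict_size = read_uint(1)
    offsets_start = 1 + offset_size
    offsets = [read_uint(offsets_start + i * offset_size)
               for i in range(dict_size + 1)]  # dict_size + 1 offsets
    bytes_start = offsets_start + (dict_size + 1) * offset_size
    keys = [buf[bytes_start + offsets[i]:bytes_start + offsets[i + 1]].decode()
            for i in range(dict_size)]
    return sorted_strings, keys

# Hand-built metadata: version 1, unsorted, 1-byte offsets, one key "a"
blob = bytes([0x01, 0x01, 0x00, 0x01]) + b"a"
print(decode_metadata(blob))  # (False, ['a'])
```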
INSERT INTO T VALUES ('primitive_string', 'This string is longer than 64 bytes and therefore does not fit in a short_string and it also includes several non ascii characters such as 🐢, 💖, ♥️, 🎣 and 🤦!!'::Variant);

-- https://github.com/apache/parquet-testing/issues/79
-- is not clear how to create the following types using Spark SQL
None of these types exist in Spark, so I don't think they have encoders for them in the Spark repo.
Co-authored-by: Russell Spitzer <russell.spitzer@GMAIL.COM>
Thank you for the review @RussellSpitzer
LGTM. Thank you @alamb for taking the initiative in driving this forward.
@alamb I noticed that:
I tried to follow the naming in the table from VariantEncoding.md which uses those terms
This is probably not right -- it is likely an artifact of how Spark wrote the Parquet file (probably with a Parquet null rather than a null in the object). I filed a ticket to track it:
Rationale
Per the Parquet mailing list and issue #75, it seems that Spark is currently the only open source implementation of Variant available. All the tests I could find in the Spark codebase check the code by round-tripping to JSON rather than using well known binary examples.
To facilitate implementations in other languages and systems (such as Rust in arrow-rs) we need binary artifacts to ensure compatibility.
Changes
This PR adds
If people are happy with this approach, I will complete the todo items below
Done:
Follow-on tickets