
Conversation


@alamb alamb commented Apr 14, 2025

Rationale

Per the Parquet mailing list and issue #75, it seems that Spark is currently the only open source implementation of Variant available. All the tests I could find in the Spark codebase exercise the code by round-tripping to JSON rather than by checking well-known binary examples.

To facilitate implementations in other languages and systems (such as Rust in arrow-rs), we need binary artifacts to ensure compatibility.

Changes

This PR adds

  1. Example binary variant data for primitive as well as short_string, object, and array types
  2. The script used to generate the data
  3. Documentation

If people are happy with this approach, I will complete the TODO items below.

Done:

  • Manually verify binary encodings

Follow-on tickets

@alamb alamb changed the title Add example binary variant data and regeneration scripts Example binary variant data and regeneration scripts Apr 16, 2025
@alamb alamb force-pushed the alamb/variant_examples branch from 9c1060c to 5d3d869 on April 16, 2025 14:39
@alamb alamb changed the title Example binary variant data and regeneration scripts Add example binary variant data and regeneration scripts Apr 16, 2025
@alamb alamb marked this pull request as ready for review April 16, 2025 14:42
@alamb alamb marked this pull request as draft April 16, 2025 14:43
-- One row with a value from each type listed in
-- https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types
--
-- Spark Types: https://spark.apache.org/docs/latest/sql-ref-datatypes.html
@alamb (author) commented on this diff:

Here is the Spark SQL script used to create the various examples.
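
For reference, a minimal pyspark sketch of how such a Spark SQL script can be driven and its output written out for inspection. This is illustrative only, not the PR's actual regen.py: the table name, the VARIANT column declaration, and the output path are assumptions; the INSERT pattern follows the statements shown in this diff.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("variant-examples").getOrCreate()

# Hypothetical table holding one row per Variant example (name is illustrative)
spark.sql("CREATE TABLE T (id STRING, var VARIANT) USING PARQUET")

# One INSERT per type, following the pattern in the script shown above
spark.sql(
    "INSERT INTO T VALUES ('primitive_float', 1234567890.1234::Float::Variant)"
)

# Write the table to Parquet so the binary Variant values can be inspected
spark.table("T").write.mode("overwrite").parquet("/tmp/variant_examples")
```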

@alamb alamb marked this pull request as ready for review April 16, 2025 14:55

alamb commented Apr 16, 2025

I think this is ready for a look. I have spot-checked the actual binary values that came out (though I haven't manually checked all of them) and they look as expected.

If this format is acceptable, I will double-check all the values manually.
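
As an illustration of the kind of spot check described above, here is a small sketch based on my reading of VariantEncoding.md (not code from this PR): the low 2 bits of the first value byte are the basic type, and for primitives the upper 6 bits are the primitive type ID.

```python
# Primitive type IDs per VariantEncoding.md (only the low IDs shown here)
PRIMITIVE_TYPE_NAMES = {
    0: "null", 1: "boolean_true", 2: "boolean_false",
    3: "int8", 4: "int16", 5: "int32", 6: "int64",
    7: "double", 8: "decimal4", 9: "decimal8", 10: "decimal16",
}

def describe_value_header(first_byte: int) -> str:
    basic_type = first_byte & 0x3   # 0=primitive, 1=short string, 2=object, 3=array
    type_info = first_byte >> 2     # primitive type ID when basic_type == 0
    if basic_type == 0:
        return f"primitive: {PRIMITIVE_TYPE_NAMES.get(type_info, type_info)}"
    return ["primitive", "short string", "object", "array"][basic_type]

# e.g. a primitive int8 value should start with byte 0x0C (type ID 3, basic type 0)
print(describe_value_header(0x0C))
```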


alamb commented Apr 28, 2025

🦗


alamb commented Apr 30, 2025

Today at the Parquet sync, @emkornfield said he might have some time to review this PR. If you don't have time, perhaps you could suggest some other people who might be able to review it.

null
],
"type": "if"
}


It might be nice to add a null at the top level here?

@alamb (author) replied:

Done in 8c989a8.

variant/regen.py Outdated
INSERT INTO T VALUES ('primitive_float', 1234567890.1234::Float::Variant);
INSERT INTO T VALUES ('primitive_binary', X'31337deadbeefcafe'::Variant);
INSERT INTO T VALUES ('primitive_string', 'This string is longer than 64 bytes and therefore does not fit in a short_string and it also includes several non ascii characters such as 🐢, 💖, ♥️, 🎣 and 🤦!!'::Variant);
-- It is not clear how to create these types using Spark SQL


I think these were added after the Spark implementation, so we likely need either a PR to Spark, or maybe Rust can take the lead once it has them done.

@emkornfield emkornfield left a comment


Mostly looks good. It would be nice to at least de-dupe the .gitignore; I think everything else is probably optional.


alamb commented Apr 30, 2025

Thank you @emkornfield -- I will address your comments shortly and manually review the binary values


alamb commented May 2, 2025

I manually reviewed the binary encodings for primitive types and they match VariantEncoding.md as far as I can tell.

I am actually having trouble manually verifying the nested object metadata; I will continue to investigate.

I did verify that pyspark built from main as of today still generates the same variant binary values.
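
A minimal sketch of decoding the Variant metadata header, which may help with the hand-verification described above. This is my interpretation of VariantEncoding.md, not code from this PR; the byte layout shown (version, sorted_strings, offset_size, dictionary size) is taken from the spec.

```python
def describe_metadata(buf: bytes) -> dict:
    header = buf[0]
    version = header & 0x0F                    # must be 1
    sorted_strings = (header >> 4) & 0x1
    offset_size = ((header >> 6) & 0x3) + 1    # offset_size_minus_one + 1, in bytes
    dict_size = int.from_bytes(buf[1:1 + offset_size], "little")
    return {
        "version": version,
        "sorted_strings": bool(sorted_strings),
        "offset_size": offset_size,
        "dictionary_size": dict_size,
    }

# e.g. metadata bytes b"\x01\x00\x00" decode to version 1 with an empty dictionary
print(describe_metadata(b"\x01\x00\x00"))
```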

@alamb alamb requested a review from emkornfield May 2, 2025 18:15
INSERT INTO T VALUES ('primitive_string', 'This string is longer than 64 bytes and therefore does not fit in a short_string and it also includes several non ascii characters such as 🐢, 💖, ♥️, 🎣 and 🤦!!'::Variant);

-- https://github.com/apache/parquet-testing/issues/79
-- It is not clear how to create the following types using Spark SQL

None of these types exist in Spark, so I don't think there are encoders for them in the Spark repo.

Co-authored-by: Russell Spitzer <russell.spitzer@GMAIL.COM>

alamb commented May 2, 2025

Thank you for the review, @RussellSpitzer.

@emkornfield

LGTM. Thank you @alamb for taking the initiative in driving this forward.


mapleFU commented May 12, 2025

@alamb I noticed that:

  1. decimal is named as {4|8|16} not {32|64|128}
  2. Null object metadata is empty, is this expected?


alamb commented May 12, 2025

@alamb I noticed that:

  1. decimal is named as {4|8|16} not {32|64|128}

I tried to follow the naming in the table from VariantEncoding.md, which uses those terms:

| Logical type | Physical type | Type ID | Equivalent Parquet type | Binary format |
|---|---|---|---|---|
| Exact Numeric | decimal4 | 8 | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Exact Numeric | decimal8 | 9 | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Exact Numeric | decimal16 | 10 | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
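
A small sketch of the decimal4 layout from the rows above, per my reading of VariantEncoding.md (not code from this PR): the header byte carries type ID 8 in its upper 6 bits and basic type 0 (primitive) in its lower 2 bits, followed by a 1-byte scale and the 4-byte little-endian unscaled value.

```python
def encode_decimal4(unscaled: int, scale: int) -> bytes:
    header = (8 << 2) | 0   # type ID 8 in upper 6 bits, primitive basic type in lower 2
    return bytes([header, scale]) + unscaled.to_bytes(4, "little", signed=True)

# e.g. 12.34 as DECIMAL(4, 2): unscaled value 1234, scale 2
print(encode_decimal4(1234, 2).hex())   # "2002d2040000"
```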
  2. Null object metadata is empty, is this expected?

This is probably not right -- it is likely an artifact of how Spark wrote the Parquet file (probably with a Parquet null rather than a null in the object). I filed a ticket to track it:
