
Conversation


@alamb alamb commented Apr 14, 2025

Rationale

Per the Parquet mailing list and issue #75, it seems that Spark is currently the only open source implementation of Variant available. All the tests I could find in the Spark codebase exercise the code by round-tripping to JSON rather than by checking well-known binary examples.

To facilitate implementations in other languages and systems (such as Rust in arrow-rs), we need binary artifacts to ensure compatibility.

Changes

This PR adds

  1. Example binary variant data for primitive as well as short_string, object, and array types
  2. The script used to generate the data
  3. Documentation

If people are happy with this approach, I will complete the TODO items below.

Done:

  • Manually verify binary encodings

Follow-on tickets

@alamb alamb changed the title Add example binary variant data and regeneration scripts Example binary variant data and regeneration scripts Apr 16, 2025
@alamb alamb force-pushed the alamb/variant_examples branch from 9c1060c to 5d3d869 on April 16, 2025 14:39
@alamb alamb changed the title Example binary variant data and regeneration scripts Add example binary variant data and regeneration scripts Apr 16, 2025
@alamb alamb marked this pull request as ready for review April 16, 2025 14:42
@alamb alamb marked this pull request as draft April 16, 2025 14:43
-- One row with a value from each type listed in
-- https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types
--
-- Spark Types: https://spark.apache.org/docs/latest/sql-ref-datatypes.html
@alamb (author) commented on this diff:

Here is the Spark SQL script used to create the various examples.
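
For reference, a minimal pyspark sketch of how such a Spark SQL script can be driven and its output written out for inspection. This is illustrative only, not the PR's actual regen.py: the table name, the VARIANT column declaration, and the output path are assumptions; the INSERT pattern follows the statements shown in this diff.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("variant-examples").getOrCreate()

# Hypothetical table holding one row per Variant example (name is illustrative)
spark.sql("CREATE TABLE T (id STRING, var VARIANT) USING PARQUET")

# One INSERT per type, following the pattern in the script shown above
spark.sql(
    "INSERT INTO T VALUES ('primitive_float', 1234567890.1234::Float::Variant)"
)

# Write the table to Parquet so the binary Variant values can be inspected
spark.table("T").write.mode("overwrite").parquet("/tmp/variant_examples")
```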

@alamb alamb marked this pull request as ready for review April 16, 2025 14:55

alamb commented Apr 16, 2025

I think this is ready for a look. I have spot-checked the actual binary values that came out (though I haven't manually checked all of them) and they look as expected.

If this format is acceptable, I will double-check all the values manually.
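
As an illustration of the kind of spot check described above, here is a small sketch based on my reading of VariantEncoding.md (not code from this PR): the low 2 bits of the first value byte are the basic type, and for primitives the upper 6 bits are the primitive type ID.

```python
# Primitive type IDs per VariantEncoding.md (only the low IDs shown here)
PRIMITIVE_TYPE_NAMES = {
    0: "null", 1: "boolean_true", 2: "boolean_false",
    3: "int8", 4: "int16", 5: "int32", 6: "int64",
    7: "double", 8: "decimal4", 9: "decimal8", 10: "decimal16",
}

def describe_value_header(first_byte: int) -> str:
    basic_type = first_byte & 0x3   # 0=primitive, 1=short string, 2=object, 3=array
    type_info = first_byte >> 2     # primitive type ID when basic_type == 0
    if basic_type == 0:
        return f"primitive: {PRIMITIVE_TYPE_NAMES.get(type_info, type_info)}"
    return ["primitive", "short string", "object", "array"][basic_type]

# e.g. a primitive int8 value should start with byte 0x0C (type ID 3, basic type 0)
print(describe_value_header(0x0C))
```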


alamb commented Apr 28, 2025

🦗


alamb commented Apr 30, 2025

Today at the Parquet sync, @emkornfield said he might have some time to review this PR. If you don't have time, perhaps you could suggest some other people who might be able to review it.

null
],
"type": "if"
}


It might be nice to add a null at the top level here?

@alamb (author) replied:

Done in 8c989a8.

variant/regen.py Outdated
INSERT INTO T VALUES ('primitive_float', 1234567890.1234::Float::Variant);
INSERT INTO T VALUES ('primitive_binary', X'31337deadbeefcafe'::Variant);
INSERT INTO T VALUES ('primitive_string', 'This string is longer than 64 bytes and therefore does not fit in a short_string and it also includes several non ascii characters such as 🐢, 💖, ♥️, 🎣 and 🤦!!'::Variant);
-- It is not clear how to create these types using Spark SQL


I think these were added after the Spark implementation, so we likely need either a PR to Spark, or maybe Rust can take the lead once it has them done.

@emkornfield emkornfield left a comment


Mostly looks good. It would be nice to at least de-dupe the .gitignore; I think everything else is probably optional.


alamb commented Apr 30, 2025

Thank you @emkornfield -- I will address your comments shortly and manually review the binary values


alamb commented May 2, 2025

I manually reviewed the binary encodings for primitive types and they match VariantEncoding.md as far as I can tell.

I am actually having trouble manually verifying the nested object metadata; I will continue to investigate.

I did verify that pyspark built from main as of today still generates the same variant binary values.
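
A minimal sketch of decoding the Variant metadata header, which may help with the hand-verification described above. This is my interpretation of VariantEncoding.md, not code from this PR; the byte layout shown (version, sorted_strings, offset_size, dictionary size) is taken from the spec.

```python
def describe_metadata(buf: bytes) -> dict:
    header = buf[0]
    version = header & 0x0F                    # must be 1
    sorted_strings = (header >> 4) & 0x1
    offset_size = ((header >> 6) & 0x3) + 1    # offset_size_minus_one + 1, in bytes
    dict_size = int.from_bytes(buf[1:1 + offset_size], "little")
    return {
        "version": version,
        "sorted_strings": bool(sorted_strings),
        "offset_size": offset_size,
        "dictionary_size": dict_size,
    }

# e.g. metadata bytes b"\x01\x00\x00" decode to version 1 with an empty dictionary
print(describe_metadata(b"\x01\x00\x00"))
```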

@alamb alamb requested a review from emkornfield May 2, 2025 18:15
INSERT INTO T VALUES ('primitive_string', 'This string is longer than 64 bytes and therefore does not fit in a short_string and it also includes several non ascii characters such as 🐢, 💖, ♥️, 🎣 and 🤦!!'::Variant);

-- https://github.com/apache/parquet-testing/issues/79
-- It is not clear how to create the following types using Spark SQL

None of these types exist in Spark, so I don't think there are encoders for them in the Spark repo.

Co-authored-by: Russell Spitzer <russell.spitzer@GMAIL.COM>

alamb commented May 2, 2025

Thank you for the review, @RussellSpitzer.

@emkornfield

LGTM. Thank you @alamb for taking the initiative in driving this forward.


mapleFU commented May 12, 2025

@alamb I noticed that:

  1. decimal is named as {4|8|16} not {32|64|128}
  2. Null object metadata is empty, is this expected?


alamb commented May 12, 2025

@alamb I noticed that:

  1. decimal is named as {4|8|16} not {32|64|128}

I tried to follow the naming in the table from VariantEncoding.md, which uses those terms:

| Logical type | Physical type | Type ID | Equivalent Parquet type | Binary format |
|---|---|---|---|---|
| Exact Numeric | decimal4 | 8 | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Exact Numeric | decimal8 | 9 | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Exact Numeric | decimal16 | 10 | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
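
A small sketch of the decimal4 layout from the rows above, per my reading of VariantEncoding.md (not code from this PR): the header byte carries type ID 8 in its upper 6 bits and basic type 0 (primitive) in its lower 2 bits, followed by a 1-byte scale and the 4-byte little-endian unscaled value.

```python
def encode_decimal4(unscaled: int, scale: int) -> bytes:
    header = (8 << 2) | 0   # type ID 8 in upper 6 bits, primitive basic type in lower 2
    return bytes([header, scale]) + unscaled.to_bytes(4, "little", signed=True)

# e.g. 12.34 as DECIMAL(4, 2): unscaled value 1234, scale 2
print(encode_decimal4(1234, 2).hex())   # "2002d2040000"
```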
  2. Null object metadata is empty, is this expected?

This is probably not right -- it is likely an artifact of how Spark wrote the Parquet file (probably with a Parquet null rather than a null in the object). I filed a ticket to track it:
