GH-46371: [C++][Parquet] Parquet Variant decoding tools #46372
base: main
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}
```cpp
/// \defgroup ValueAccessors
/// @{

// Note: Null doesn't need visitor.
```
I don't know whether we should just return an Arrow Scalar here; it would be easy to use, but inefficient.
cpp/src/parquet/variant.h
```cpp
int8_t getInt8() const;
int16_t getInt16() const;
int32_t getInt32() const;
int64_t getInt64() const;
```
Currently, `getInt64` only supports reading from an int64 value, which is too strict for integers. I think we could also allow `getInt64` to read the "smaller" types like int32, int16, and int8.
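For illustration, the widening could dispatch on the stored primitive type. A minimal sketch, where `typeId()`, `loadScalar<T>()`, and the enumerator names are assumed helpers rather than the actual API in this patch:

```cpp
// Sketch: accept any narrower integer when the caller asks for int64.
// typeId() and loadScalar<T>() are hypothetical helpers.
int64_t VariantValue::getInt64() const {
  switch (typeId()) {
    case VariantPrimitiveType::Int8:
      return loadScalar<int8_t>();   // implicit widening to int64_t
    case VariantPrimitiveType::Int16:
      return loadScalar<int16_t>();
    case VariantPrimitiveType::Int32:
      return loadScalar<int32_t>();
    case VariantPrimitiveType::Int64:
      return loadScalar<int64_t>();
    default:
      throw ParquetException("Variant value is not an integer");
  }
}
```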
```cpp
int32_t getInt32() const;
int64_t getInt64() const;
/// Include short_string optimization and primitive string type
std::string_view getString() const;
```
Currently I don't check UTF-8 validity here.
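If validation were added, Arrow's existing UTF-8 helper could do the heavy lifting. A sketch, assuming the string bytes have already been extracted (the wrapper function is illustrative):

```cpp
#include <string_view>
#include "arrow/util/utf8.h"      // arrow::util::ValidateUTF8
#include "parquet/exception.h"

// Reject string values that are not valid UTF-8 (binary values are exempt).
void CheckUtf8(std::string_view sv) {
  if (!arrow::util::ValidateUTF8(reinterpret_cast<const uint8_t*>(sv.data()),
                                 static_cast<int64_t>(sv.size()))) {
    throw parquet::ParquetException("Variant string is not valid UTF-8");
  }
}
```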
```cpp
std::string_view getString() const;
std::string_view getBinary() const;
float getFloat() const;
double getDouble() const;
```
Currently, `getDouble` only supports reading from a double value, which is too strict. Maybe we could also allow `getDouble` to read other types, such as float.
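The same dispatch trick as for the integer accessors could apply here; a sketch with the same hypothetical `typeId()`/`loadScalar<T>()` helpers:

```cpp
// Sketch: getDouble also accepts a float payload and widens it exactly.
double VariantValue::getDouble() const {
  switch (typeId()) {
    case VariantPrimitiveType::Float:
      return loadScalar<float>();   // float-to-double widening is exact
    case VariantPrimitiveType::Double:
      return loadScalar<double>();
    default:
      throw ParquetException("Variant value is not a floating-point number");
  }
}
```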
cpp/src/parquet/variant.cc
```cpp
}

// Check that the elements are incremental.
// TODO(mwish): Remove this or encapsulate this range check in a function.
```
Should we use an extra function here, like `Validate`, or just check them inline?
Force-pushed from fb59842 to da142a6
@emkornfield @wgtmac @pitrou @zeroshade This patch adds some basic variant decoding tools. Some thoughts: what should the interface for visiting a variant look like? The simplest way is a cast…
Force-pushed from da142a6 to 54681c4
```cpp
std::string_view metadata_;
uint32_t dictionary_size_{0};
```
I wonder if it's worth changing this to the layout below, which could shrink the struct from 24 bytes to 16:

```cpp
const uint8_t* metadata_ptr_;
uint32_t metadata_size_;
uint32_t dictionary_size_;
```
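To make the size claim concrete, a sketch one could verify with static_assert (the sizes assume a typical 64-bit LP64 ABI; struct names are illustrative):

```cpp
#include <cstdint>
#include <string_view>

struct Current {                  // 16B string_view + 4B size, padded to 24B
  std::string_view metadata_;
  uint32_t dictionary_size_{0};
};

struct Proposed {                 // 8B pointer + 4B + 4B = 16B, no padding
  const uint8_t* metadata_ptr_;
  uint32_t metadata_size_;
  uint32_t dictionary_size_;
};

static_assert(sizeof(Current) == 24, "string_view layout pads to 24 bytes");
static_assert(sizeof(Proposed) == 16, "pointer + two u32s pack into 16 bytes");
```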
```diff
@@ -27,6 +27,9 @@
 #include <arrow/testing/gtest_util.h>
 #include <arrow/util/base64.h>

+#include <boost/uuid/uuid.hpp>
```
We use Boost just for testing.
@xxubai I've added some hand-written binary format data for testing.
Please let me find the time to digest the Variant spec and review this. :)
Thanks! The read/write implementation will not be too complex without shredding. However, designing a proper interface is a bit of a challenge here.
```cpp
/// \brief Get the metadata id for a given key.
/// From the discussion in ML:
/// https://lists.apache.org/thread/b68tjmrjmy64mbv9dknpmqs28vnzjj96 if
/// !sorted_and_unique(), the metadata key is not guaranteed to be unique, so we use a
```
I checked the Iceberg implementation; it just returns the first matched string index (https://github.com/apache/iceberg/blob/1911c94ea605a3d3f10a1994b046f00a5e9fdceb/api/src/main/java/org/apache/iceberg/variants/SerializedMetadata.java#L88-L102).
Kindly asking: why do we need to return a vector here?
Assume a `{"a": {"a": 1}}` value here, and assume the metadata contains duplicates (the standard does not require uniqueness if not sorted), say `"a": 0, "a": 1`. If the inner object looks up `"a"` and we only return `field_id = 0`, it cannot get any info from the inner object, which requires `1` as its field id.
IMO, the Iceberg and Arrow implementations return the first matched variant value in an object, but parquet-java is different: it returns the latest pushed value.
This is a bit confusing to me; please correct me if I misunderstood.
Refer to: https://lists.apache.org/thread/b68tjmrjmy64mbv9dknpmqs28vnzjj96

> Keys may appear in nested objects, but cannot appear in the same object. So the first example, {"a": {"a": 1}} is allowed. The second example, {"a": 1, "a": 2} is not allowed.

The parquet-java tests prevent duplicate keys in the same object. But the spec also says:

> If sorted_strings is set to 1, strings in the dictionary must be unique and sorted in lexicographic order. If the value is set to 0, readers may not make any assumptions about string order or uniqueness.
Okay. So for an object variant, is it acceptable that different reader implementations return different values if there are duplicate keys in the dictionary?
I don't know, but I think the question is whether the writer promises not to produce values like this, and whether the reader should pay the effort to handle them. Otherwise it's a bug.

For example, with `{"a": {"a": 1}}`, reading `"a"` in the inner object should find `field_id: 1` in its field-id list. But if the metadata returns only `0` for `"a"`, the lookup returns "not exists" for the inner `"a"`. In this implementation the lookup should return `field_ids: [0, 1]` and find `1` in the inner object's field-id list, as sketched below.

I think in most scenarios we don't need to care about this, so a `SmallVector` is used here. And if `sorted_and_unique` is set, key lookup can be optimized.
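A rough, free-standing sketch of the lookup under discussion (simplified: `keys` stands in for the decoded metadata dictionary, and the function name is illustrative):

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// If the dictionary is not sorted_and_unique, the same key may occur more
// than once, so every matching id is returned; the caller probes each
// candidate against the object's field-id list.
std::vector<uint32_t> GetMetadataIds(const std::vector<std::string>& keys,
                                     std::string_view key) {
  std::vector<uint32_t> ids;
  for (uint32_t i = 0; i < keys.size(); ++i) {
    if (keys[i] == key) ids.push_back(i);
  }
  return ids;
}
// For {"a": {"a": 1}} with metadata ["a", "a"], this returns [0, 1]; the
// inner object then finds id 1 in its own field-id list even though id 0
// does not match.
```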
```cpp
if (offset_sz < kMinimalOffsetSizeBytes || offset_sz > kMaximumOffsetSizeBytes) {
  throw ParquetException("Invalid Variant metadata: invalid offset size: " +
                         std::to_string(offset_sz));
}
```
Isn't `offset_sz` already checked in `readLittleEndianU32`?

```cpp
ARROW_DCHECK_LE(size, 4);
ARROW_DCHECK_GE(size, 1);
```
DCHECK means debug check; it should be regarded as a precondition assertion rather than a runtime dynamic check.
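To illustrate the distinction with a sketch (not the patch's actual code; the body of `readLittleEndianU32` and the bounds constants are assumptions): a DCHECK documents a precondition internal callers must already guarantee and compiles away in release builds, while untrusted file input must be rejected at runtime in every build.

```cpp
#include <cstdint>
#include <string>
#include "arrow/util/logging.h"   // ARROW_DCHECK_GE / ARROW_DCHECK_LE
#include "parquet/exception.h"

// Internal helper: `size` is a caller-side precondition, so debug-only
// assertions document it; they disappear in release builds.
uint32_t readLittleEndianU32(const uint8_t* data, int size) {
  ARROW_DCHECK_GE(size, 1);
  ARROW_DCHECK_LE(size, 4);
  uint32_t v = 0;
  for (int i = 0; i < size; ++i) {
    v |= static_cast<uint32_t>(data[i]) << (8 * i);
  }
  return v;
}

// Untrusted bytes from the file: checked at runtime in all builds.
void CheckOffsetSize(uint8_t offset_sz) {
  if (offset_sz < 1 || offset_sz > 4) {
    throw parquet::ParquetException(
        "Invalid Variant metadata: invalid offset size: " +
        std::to_string(offset_sz));
  }
}
```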
"Invalid Variant metadata: offset out of range: " + | ||
std::to_string((dictionary_size_ + kHeaderSizeBytes) * offset_sz) + " > " + | ||
std::to_string(metadata_.size())); | ||
} |
In the metadata I delayed validating the offsets until the getter; the check could also go here, validating that all offsets are monotonic and that the final offset equals the metadata size boundary.
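A sketch of what that eager validation could look like (a free function over already-decoded offsets; illustrative, not the patch's code):

```cpp
#include <cstdint>
#include <vector>
#include "parquet/exception.h"

// Validate that offsets are monotonically non-decreasing and that the final
// offset lands exactly on the metadata size boundary.
void ValidateOffsets(const std::vector<uint32_t>& offsets, uint32_t boundary) {
  for (size_t i = 1; i < offsets.size(); ++i) {
    if (offsets[i] < offsets[i - 1]) {
      throw parquet::ParquetException(
          "Invalid Variant metadata: offsets are not monotonic");
    }
  }
  if (!offsets.empty() && offsets.back() != boundary) {
    throw parquet::ParquetException(
        "Invalid Variant metadata: final offset does not match boundary");
  }
}
```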
Since there seems to be a controversy about a good API for this, how about we start with something lower-level, such as a SAX-like parser (i.e. event-driven)? Then we can build up a higher-level API on top of it once we know which kind of API would be efficient. An example of an event-driven API is in RapidJSON; another is in nlohmann/json.
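For reference, the kind of event-driven handler being suggested could look roughly like this; a sketch modeled on RapidJSON's Handler concept, not a proposed Arrow API:

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical event sink: a decoder walks the variant binary and fires one
// callback per token; returning false aborts the walk early.
class VariantHandler {
 public:
  virtual ~VariantHandler() = default;
  virtual bool Null() = 0;
  virtual bool Bool(bool value) = 0;
  virtual bool Int64(int64_t value) = 0;
  virtual bool Double(double value) = 0;
  virtual bool String(std::string_view value) = 0;
  virtual bool StartObject() = 0;
  virtual bool Key(std::string_view key) = 0;
  virtual bool EndObject(uint32_t num_fields) = 0;
  virtual bool StartArray() = 0;
  virtual bool EndArray(uint32_t num_elements) = 0;
};
```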
```diff
@@ -300,8 +300,7 @@ VariantType VariantValue::getType() const {
   VariantBasicType basic_type = getBasicType();
   switch (basic_type) {
     case VariantBasicType::Primitive: {
-      auto primitive_type =
-          static_cast<VariantPrimitiveType>(value_[0] >> kValueHeaderBitShift);
+      auto primitive_type = static_cast<VariantPrimitiveType>(valueHeader());
```
Any reason not to push the `static_cast<VariantPrimitiveType>` call into the `valueHeader` method?
I'm a bit confused. `valueHeader` seems to just do `>> kValueHeaderBitShift`, which means the primitive type would be `valueHeader() & 0x3f`, right?
If it's not a primitive type, `value_header` might be the length of a short string, or be used as `field_offset_length`, etc.
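For context, a sketch of the header-byte layout being discussed (per the variant spec: the low 2 bits select the basic type, the upper 6 bits are the type-dependent value header; helper names here are illustrative):

```cpp
#include <cstdint>

constexpr uint8_t kBasicTypeMask = 0x03;  // low 2 bits
constexpr int kValueHeaderBitShift = 2;   // remaining 6 bits

uint8_t BasicType(uint8_t byte0) { return byte0 & kBasicTypeMask; }
uint8_t ValueHeader(uint8_t byte0) { return byte0 >> kValueHeaderBitShift; }
// For Primitive, the 6-bit header is the primitive type id; for a short
// string it is the string length; for objects/arrays it packs flags such as
// field_offset_length. Hence the cast stays at the call site.
```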
@pitrou A SAX-style parser means consuming the whole token stream and parsing it into a variant object. That's OK when we'd like to parse the whole object 🤔, however, it might be a bit expensive when we want to visit only one or a few columns. Currently I just implement a…
It does not. See https://rapidjson.org/md_doc_sax.html
Emm, I mean: for the writer it might be good, but for the reader, getting a key does not require visiting/parsing the whole binary with…
How would you avoid parsing the whole binary?
Emm, it's by the design of the Parquet variant format. In memory there is…
The variant spec provides sufficient metadata (a dictionary of all keys and an offset to any value) to jump to a key at an arbitrary nesting level without decoding irrelevant binary data. It is already a parsed binary. This is something that XML or JSON text cannot do.
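A sketch of why no full parse is needed: an object lookup touches only the header, the field-id list, and one offset. The struct below is an illustrative shape, not the patch's code; the real reader fetches these fields lazily from the binary.

```cpp
#include <cstdint>
#include <optional>

// Illustrative shape of a decoded object header.
struct ObjectView {
  const uint32_t* field_ids;      // ids into the metadata dictionary
  const uint32_t* field_offsets;  // byte offset of each field's value
  uint32_t num_fields;
  const uint8_t* values;          // base pointer of the value region
};

// Jump straight to one field: a search over ids plus a single offset read;
// sibling values are never decoded. (Linear scan for simplicity; binary
// search applies when the ids are sorted by key.)
std::optional<const uint8_t*> FindField(const ObjectView& obj,
                                        uint32_t field_id) {
  for (uint32_t i = 0; i < obj.num_fields; ++i) {
    if (obj.field_ids[i] == field_id) {
      return obj.values + obj.field_offsets[i];
    }
  }
  return std::nullopt;
}
```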
Thanks for the explanation @wgtmac. I retract my suggestion then.
Rationale for this change
This patch adds tools to decode the Parquet variant.
What changes are included in this PR?
Tools to decode the Parquet variant.
Are these changes tested?
Yes. I use the parquet-testing data; some problems with it are listed here: apache/parquet-testing#79
I can also add some hand-written tests after the interface is agreed upon.
Are there any user-facing changes?
Yes, this adds interfaces for decoding variants.