GH-46371: [C++][Parquet] Parquet Variant decoding tools #46372


Open
wants to merge 34 commits into
base: main

Conversation

mapleFU
Member

@mapleFU mapleFU commented May 9, 2025

Rationale for this change

This patch adds a tool to decode the Parquet Variant type.

What changes are included in this PR?

This patch adds a tool to decode the Parquet Variant type.

Are these changes tested?

Yes. I use parquet-testing. Some problems are listed here: apache/parquet-testing#79

I can also add some hand-written tests after the interface is agreed on.

Are there any user-facing changes?

Yes, this adds interfaces for decoding Variant values.


github-actions bot commented May 9, 2025

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}


@mapleFU mapleFU changed the title [C++][Parquet] Parquet Variant decoding tools GH-46371: [C++][Parquet] Parquet Variant decoding tools May 9, 2025

github-actions bot commented May 9, 2025

⚠️ GitHub issue #46371 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 13, 2025
/// \defgroup ValueAccessors
/// @{

// Note: Null doesn't need visitor.
Member Author

I don't know whether we should just return an Arrow Scalar; it would be easy to use but inefficient.

Comment on lines 156 to 160
int8_t getInt8() const;
int16_t getInt16() const;
int32_t getInt32() const;
int64_t getInt64() const;
Member Author

Currently, getInt64 only supports reading from an int64, which is too strict for integers. I think we could also allow getInt64 to read the "smaller types" like int32, int16, and int8.

int32_t getInt32() const;
int64_t getInt64() const;
/// Include short_string optimization and primitive string type
std::string_view getString() const;
Member Author

Currently I don't check UTF-8 here.

std::string_view getString() const;
std::string_view getBinary() const;
float getFloat() const;
double getDouble() const;
Member Author

Currently, getDouble only supports reading from a double, which is too strict. Maybe we could also allow getDouble to read other types (e.g. float).

}

// Check that the elements are monotonically increasing.
// TODO(mwish): Remove this or encapsulate this range check to function
Member Author

Should we use an extra function here like "Validate", or just check them inline here?

@mapleFU mapleFU marked this pull request as ready for review May 14, 2025 08:10
@mapleFU mapleFU requested a review from wgtmac as a code owner May 14, 2025 08:10
@mapleFU mapleFU force-pushed the variant-cpp-decoder-tools branch from fb59842 to da142a6 Compare May 14, 2025 08:10
@mapleFU
Member Author

mapleFU commented May 14, 2025

@emkornfield @wgtmac @pitrou @zeroshade

This patch adds some basic variant decoding tools. Some thoughts:

What should the interface for visiting a variant look like? The simplest way is to cast <metadata, value> pairs to a ptr<::arrow::Scalar>, but this is too slow and needs to read the whole data. We could also wrap a std::variant, but I think that is also slow and needs dynamic dispatch for the visitor. Here I just add a visitor for every type. And currently, getInt64 only supports reading int64. Any ideas are welcome.

@mapleFU mapleFU force-pushed the variant-cpp-decoder-tools branch from da142a6 to 54681c4 Compare May 14, 2025 08:20
Comment on lines +159 to +160
std::string_view metadata_;
uint32_t dictionary_size_{0};
Member Author

I wonder if it's worth changing this to the layout below, which would shrink the struct from 24B to 16B:

const uint8_t* metadata_ptr_;
const uint32_t metadata_size_;
uint32_t dictionary_size_;

@@ -27,6 +27,9 @@
#include <arrow/testing/gtest_util.h>
#include <arrow/util/base64.h>

#include <boost/uuid/uuid.hpp>
Member Author

Boost is used just for testing.

@mapleFU
Member Author

mapleFU commented May 19, 2025

Is it possible to manually add unit tests covering scenarios such as null values, UUID, etc.?

@xxubai I've added some hand-written binary data for testing.

@mapleFU mapleFU requested review from pitrou and emkornfield May 21, 2025 08:40
@pitrou
Member

pitrou commented May 21, 2025

Please let me find the time to digest the Variant spec and review this. :)

@mapleFU
Member Author

mapleFU commented May 21, 2025

Thanks! The read/write implementation will not be too complex without shredding. However, designing the proper interface is a bit of a challenge here.

/// \brief Get the metadata id for a given key.
/// From the discussion in ML:
/// https://lists.apache.org/thread/b68tjmrjmy64mbv9dknpmqs28vnzjj96 if
/// !sorted_and_unique(), the metadata key is not guaranteed to be unique, so we use a
@xxubai xxubai May 22, 2025

I checked the Iceberg implementation; it just returns the first matched string index (https://github.com/apache/iceberg/blob/1911c94ea605a3d3f10a1994b046f00a5e9fdceb/api/src/main/java/org/apache/iceberg/variants/SerializedMetadata.java#L88-L102).

Why do we need to return a vector here?

Member Author

Assume {"a": {"a": 1}} here, and the metadata is duplicated (the standard does not require it to be unique if !sorted).

The metadata is: "a": 0, "a": 1. Assume the inner object looks up "a"; if we only return field_id = 0, it cannot get any info from the inner object, which requires 1 as the field_id.

@xxubai xxubai May 22, 2025

IMO, the Iceberg and Arrow implementations return the first matched variant value in an object, but parquet-java is different: it returns the latest pushed value.

This is a bit confusing for me — please correct me if I misunderstood.

Member Author

Refer to: https://lists.apache.org/thread/b68tjmrjmy64mbv9dknpmqs28vnzjj96

Keys may appear in nested objects, but cannot appear in the same object. So the first example, {"a": {"a": 1}} is allowed. The second example, {"a": 1, "a": 2} is not allowed.

This parquet-java test prevents duplicate keys in the same object. But:

If sorted_strings is set to 1, strings in the dictionary must be unique and sorted in lexicographic order. If the value is set to 0, readers may not make any assumptions about string order or uniqueness.


Okay. So for an object variant, is it acceptable that different reader implementations return different values if there are duplicate keys in the dictionary?

Member Author

@mapleFU mapleFU May 22, 2025

I don't know, but I think it comes down to whether the writer promises not to produce values like this, and whether the reader should pay the effort to handle them. Otherwise it's a bug.

For example, with {"a": {"a": 1}}, reading "a" in the inner object should match field_id 1 in its field-id list. But if the metadata lookup only returns 0 for "a", the inner lookup returns "not exists". In this implementation it gets field_ids [0, 1] and finds 1 in the inner object's field-id list.

I think in most scenarios we don't need to care about this, so I use a SmallVector here. And if sorted_and_unique, the key lookup can be optimized.

if (offset_sz < kMinimalOffsetSizeBytes || offset_sz > kMaximumOffsetSizeBytes) {
throw ParquetException("Invalid Variant metadata: invalid offset size: " +
std::to_string(offset_sz));
}

offset_sz is already checked in readLittleEndianU32?

  ARROW_DCHECK_LE(size, 4);
  ARROW_DCHECK_GE(size, 1);

Member Author

DCHECK means debug check; it should be regarded as an internal precondition assertion rather than a runtime check of untrusted input.

"Invalid Variant metadata: offset out of range: " +
std::to_string((dictionary_size_ + kHeaderSizeBytes) * offset_sz) + " > " +
std::to_string(metadata_.size()));
}
Member Author

@mapleFU mapleFU May 23, 2025

In the metadata I delayed validating offsets until the getters; it could also be done here, to validate that all offsets are monotonic and that the final offset equals the metadata size boundary.

@pitrou
Member

pitrou commented May 27, 2025

Since there seems to be a controversy about a good API for this, how about we start with something lower-level, such as a SAX-like parser (i.e. event-driven)? Then we can build up a higher-level API on top of it once we know which kind of API would be efficient?

An example of an event-driven API is in RapidJSON; another is in nlohmann/json (of course, we should use more modern C++ features).

@@ -300,8 +300,7 @@ VariantType VariantValue::getType() const {
VariantBasicType basic_type = getBasicType();
switch (basic_type) {
case VariantBasicType::Primitive: {
auto primitive_type =
static_cast<VariantPrimitiveType>(value_[0] >> kValueHeaderBitShift);
auto primitive_type = static_cast<VariantPrimitiveType>(valueHeader());
Member

any reason not to push the static_cast<VariantPrimitiveType> call into the valueHeader method?

Member

I'm a bit confused. valueHeader seems to be just doing >> kValueHeaderBitShift, which means that the primitive type would be valueHeader() & 0x3f, right?

Member Author

If it's not a primitive type, value_header might be the length of a short string, or be used as field_offset_length, etc.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels May 27, 2025
@mapleFU
Member Author

mapleFU commented May 28, 2025

Since there seems to be a controversy about a good API for this, how about we start with something lower-level, such as a SAX-like parser (i.e. event-driven)? Then we can build up a higher-level API on top of it once we know which kind of API would be efficient?

@pitrou A SAX-style parser means consuming the whole token stream and parsing it into a variant object. It's OK when we'd like to parse the whole object 🤔; however, maybe it would be a bit expensive when we want to visit only one or a few columns? Currently I just implement a Variant object for this.

@pitrou
Member

pitrou commented May 28, 2025

@pitrou SAX style parser means consume the whole token and parsing it to a variant object.

It does not. See https://rapidjson.org/md_doc_sax.html

@mapleFU
Member Author

mapleFU commented May 28, 2025

Emm, I mean: for a writer it might be good, but for a reader, getting a key shouldn't require visiting/parsing the whole binary with a Key("...") API?

@pitrou
Member

pitrou commented May 28, 2025

How would you avoid parsing the whole binary?

@mapleFU
Member Author

mapleFU commented May 28, 2025

Emm, it's by design of the Parquet Variant format. In memory there is a VariantObject; VariantObject::Get(key) searches for the key in the metadata, finds the field_id, and then searches for that field_id in the object variant. These operations work directly on the variant binary format.

@wgtmac
Member

wgtmac commented May 29, 2025

How would you avoid parsing the whole binary?

The variant spec provides sufficient metadata (a dictionary of all keys and an offset to any value) to jump to a key at an arbitrary nesting level without decoding irrelevant binary data. It is already a parsed binary. This is something that XML or JSON text cannot do.

@pitrou
Member

pitrou commented May 29, 2025

Thanks for the explanation @wgtmac . I retract my suggestion then.
