GH-46371: [C++][Parquet] Parquet Variant decoding tools #46372
base: main
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}
```cpp
/// \defgroup ValueAccessors
/// @{

// Note: Null doesn't need visitor.
```
I don't know whether we should just return an Arrow Scalar here; it would be easy to use, but inefficient.
cpp/src/parquet/variant.h
```cpp
int8_t getInt8() const;
int16_t getInt16() const;
int32_t getInt32() const;
int64_t getInt64() const;
```
Currently, `getInt64` only supports reading from an int64 value, which is too strict for integers. I think we could also allow `getInt64` to read the "smaller" types like int32, int16, and int8.
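For illustration, the widening could dispatch on the stored primitive type. A minimal sketch, where `typeId()`, `loadScalar<T>()`, and the enumerator names are assumed helpers rather than the actual API in this patch:

```cpp
// Sketch: accept any narrower integer when the caller asks for int64.
// typeId() and loadScalar<T>() are hypothetical helpers.
int64_t VariantValue::getInt64() const {
  switch (typeId()) {
    case VariantPrimitiveType::Int8:
      return loadScalar<int8_t>();   // implicit widening to int64_t
    case VariantPrimitiveType::Int16:
      return loadScalar<int16_t>();
    case VariantPrimitiveType::Int32:
      return loadScalar<int32_t>();
    case VariantPrimitiveType::Int64:
      return loadScalar<int64_t>();
    default:
      throw ParquetException("Variant value is not an integer");
  }
}
```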
```cpp
int32_t getInt32() const;
int64_t getInt64() const;
/// Include short_string optimization and primitive string type
std::string_view getString() const;
```
Currently I don't check UTF-8 validity here.
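If validation were added, Arrow's existing UTF-8 helper could do the heavy lifting. A sketch, assuming the string bytes have already been extracted (the wrapper function is illustrative):

```cpp
#include <string_view>
#include "arrow/util/utf8.h"      // arrow::util::ValidateUTF8
#include "parquet/exception.h"

// Reject string values that are not valid UTF-8 (binary values are exempt).
void CheckUtf8(std::string_view sv) {
  if (!arrow::util::ValidateUTF8(reinterpret_cast<const uint8_t*>(sv.data()),
                                 static_cast<int64_t>(sv.size()))) {
    throw parquet::ParquetException("Variant string is not valid UTF-8");
  }
}
```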
```cpp
std::string_view getString() const;
std::string_view getBinary() const;
float getFloat() const;
double getDouble() const;
```
Currently, `getDouble` only supports reading from a double value, which is too strict. Maybe we could also allow `getDouble` to read other types, such as float.
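The same dispatch trick as for the integer accessors could apply here; a sketch with the same hypothetical `typeId()`/`loadScalar<T>()` helpers:

```cpp
// Sketch: getDouble also accepts a float payload and widens it exactly.
double VariantValue::getDouble() const {
  switch (typeId()) {
    case VariantPrimitiveType::Float:
      return loadScalar<float>();   // float-to-double widening is exact
    case VariantPrimitiveType::Double:
      return loadScalar<double>();
    default:
      throw ParquetException("Variant value is not a floating-point number");
  }
}
```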
cpp/src/parquet/variant.cc
```cpp
}

// Check that the elements are incremental.
// TODO(mwish): Remove this or encapsulate this range check in a function.
```
Should we use an extra function here, like `Validate`, or just check them inline?
Force-pushed from fb59842 to da142a6
@emkornfield @wgtmac @pitrou @zeroshade This patch adds some basic variant decoding tools. Some thoughts: what should the interface for visiting a variant look like? The simplest way is a cast…
Force-pushed from da142a6 to 54681c4
```cpp
std::string_view metadata_;
uint32_t dictionary_size_{0};
```
I wonder if it's worth changing this to the layout below, which could shrink the struct from 24 bytes to 16:

```cpp
const uint8_t* metadata_ptr_;
uint32_t metadata_size_;
uint32_t dictionary_size_;
```
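To make the size claim concrete, a sketch one could verify with static_assert (the sizes assume a typical 64-bit LP64 ABI; struct names are illustrative):

```cpp
#include <cstdint>
#include <string_view>

struct Current {                  // 16B string_view + 4B size, padded to 24B
  std::string_view metadata_;
  uint32_t dictionary_size_{0};
};

struct Proposed {                 // 8B pointer + 4B + 4B = 16B, no padding
  const uint8_t* metadata_ptr_;
  uint32_t metadata_size_;
  uint32_t dictionary_size_;
};

static_assert(sizeof(Current) == 24, "string_view layout pads to 24 bytes");
static_assert(sizeof(Proposed) == 16, "pointer + two u32s pack into 16 bytes");
```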
```diff
@@ -27,6 +27,9 @@
 #include <arrow/testing/gtest_util.h>
 #include <arrow/util/base64.h>

+#include <boost/uuid/uuid.hpp>
```
We use Boost just for testing.
@xxubai I've added some hand-written binary format data for testing.
Please let me find the time to digest the Variant spec and review this. :)
Thanks! The read/write implementation will not be too complex without shredding. However, designing a proper interface is a bit of a challenge here.
```cpp
/// \brief Get the metadata id for a given key.
/// From the discussion in ML:
/// https://lists.apache.org/thread/b68tjmrjmy64mbv9dknpmqs28vnzjj96 if
/// !sorted_and_unique(), the metadata key is not guaranteed to be unique, so we use a
```
I checked the Iceberg implementation; it just returns the first matched string index (https://github.com/apache/iceberg/blob/1911c94ea605a3d3f10a1994b046f00a5e9fdceb/api/src/main/java/org/apache/iceberg/variants/SerializedMetadata.java#L88-L102).
Kindly asking: why do we need to return a vector here?
Assume a `{"a": {"a": 1}}` value here, and assume the metadata contains duplicates (the standard does not require uniqueness if not sorted), say `"a": 0, "a": 1`. If the inner object looks up `"a"` and we only return `field_id = 0`, it cannot get any info from the inner object, which requires `1` as its field id.
IMO, the Iceberg and Arrow implementations return the first matched variant value in an object, but parquet-java is different: it returns the latest pushed value.
This is a bit confusing to me; please correct me if I misunderstood.
Refer to: https://lists.apache.org/thread/b68tjmrjmy64mbv9dknpmqs28vnzjj96

> Keys may appear in nested objects, but cannot appear in the same object. So the first example, {"a": {"a": 1}} is allowed. The second example, {"a": 1, "a": 2} is not allowed.

The parquet-java tests prevent duplicate keys in the same object. But the spec also says:

> If sorted_strings is set to 1, strings in the dictionary must be unique and sorted in lexicographic order. If the value is set to 0, readers may not make any assumptions about string order or uniqueness.
Okay. So for an object variant, is it acceptable that different reader implementations return different values if there are duplicate keys in the dictionary?
I don't know, but I think the question is whether the writer promises not to produce values like this, and whether the reader should pay the effort to handle them. Otherwise it's a bug.

For example, with `{"a": {"a": 1}}`, reading `"a"` in the inner object should find `field_id: 1` in its field-id list. But if the metadata returns only `0` for `"a"`, the lookup returns "not exists" for the inner `"a"`. In this implementation the lookup should return `field_ids: [0, 1]` and find `1` in the inner object's field-id list, as sketched below.

I think in most scenarios we don't need to care about this, so a `SmallVector` is used here. And if `sorted_and_unique` is set, key lookup can be optimized.
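A rough, free-standing sketch of the lookup under discussion (simplified: `keys` stands in for the decoded metadata dictionary, and the function name is illustrative):

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// If the dictionary is not sorted_and_unique, the same key may occur more
// than once, so every matching id is returned; the caller probes each
// candidate against the object's field-id list.
std::vector<uint32_t> GetMetadataIds(const std::vector<std::string>& keys,
                                     std::string_view key) {
  std::vector<uint32_t> ids;
  for (uint32_t i = 0; i < keys.size(); ++i) {
    if (keys[i] == key) ids.push_back(i);
  }
  return ids;
}
// For {"a": {"a": 1}} with metadata ["a", "a"], this returns [0, 1]; the
// inner object then finds id 1 in its own field-id list even though id 0
// does not match.
```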
```cpp
if (offset_sz < kMinimalOffsetSizeBytes || offset_sz > kMaximumOffsetSizeBytes) {
  throw ParquetException("Invalid Variant metadata: invalid offset size: " +
                         std::to_string(offset_sz));
}
```
Isn't `offset_sz` already checked in `readLittleEndianU32`?

```cpp
ARROW_DCHECK_LE(size, 4);
ARROW_DCHECK_GE(size, 1);
```
DCHECK means debug check; it should be regarded as a precondition assertion rather than a runtime dynamic check.
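To illustrate the distinction with a sketch (not the patch's actual code; the body of `readLittleEndianU32` and the bounds constants are assumptions): a DCHECK documents a precondition internal callers must already guarantee and compiles away in release builds, while untrusted file input must be rejected at runtime in every build.

```cpp
#include <cstdint>
#include <string>
#include "arrow/util/logging.h"   // ARROW_DCHECK_GE / ARROW_DCHECK_LE
#include "parquet/exception.h"

// Internal helper: `size` is a caller-side precondition, so debug-only
// assertions document it; they disappear in release builds.
uint32_t readLittleEndianU32(const uint8_t* data, int size) {
  ARROW_DCHECK_GE(size, 1);
  ARROW_DCHECK_LE(size, 4);
  uint32_t v = 0;
  for (int i = 0; i < size; ++i) {
    v |= static_cast<uint32_t>(data[i]) << (8 * i);
  }
  return v;
}

// Untrusted bytes from the file: checked at runtime in all builds.
void CheckOffsetSize(uint8_t offset_sz) {
  if (offset_sz < 1 || offset_sz > 4) {
    throw parquet::ParquetException(
        "Invalid Variant metadata: invalid offset size: " +
        std::to_string(offset_sz));
  }
}
```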
"Invalid Variant metadata: offset out of range: " + | ||
std::to_string((dictionary_size_ + kHeaderSizeBytes) * offset_sz) + " > " + | ||
std::to_string(metadata_.size())); | ||
} |
In the metadata I delayed validating the offsets until the getter; the check could also go here, validating that all offsets are monotonic and that the final offset equals the metadata size boundary.
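A sketch of what that eager validation could look like (a free function over already-decoded offsets; illustrative, not the patch's code):

```cpp
#include <cstdint>
#include <vector>
#include "parquet/exception.h"

// Validate that offsets are monotonically non-decreasing and that the final
// offset lands exactly on the metadata size boundary.
void ValidateOffsets(const std::vector<uint32_t>& offsets, uint32_t boundary) {
  for (size_t i = 1; i < offsets.size(); ++i) {
    if (offsets[i] < offsets[i - 1]) {
      throw parquet::ParquetException(
          "Invalid Variant metadata: offsets are not monotonic");
    }
  }
  if (!offsets.empty() && offsets.back() != boundary) {
    throw parquet::ParquetException(
        "Invalid Variant metadata: final offset does not match boundary");
  }
}
```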
Since there seems to be a controversy about a good API for this, how about we start with something lower-level, such as a SAX-like parser (i.e. event-driven)? Then we can build up a higher-level API on top of it once we know which kind of API would be efficient. An example of an event-driven API is in RapidJSON; another is in nlohmann/json.
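For reference, the kind of event-driven handler being suggested could look roughly like this; a sketch modeled on RapidJSON's Handler concept, not a proposed Arrow API:

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical event sink: a decoder walks the variant binary and fires one
// callback per token; returning false aborts the walk early.
class VariantHandler {
 public:
  virtual ~VariantHandler() = default;
  virtual bool Null() = 0;
  virtual bool Bool(bool value) = 0;
  virtual bool Int64(int64_t value) = 0;
  virtual bool Double(double value) = 0;
  virtual bool String(std::string_view value) = 0;
  virtual bool StartObject() = 0;
  virtual bool Key(std::string_view key) = 0;
  virtual bool EndObject(uint32_t num_fields) = 0;
  virtual bool StartArray() = 0;
  virtual bool EndArray(uint32_t num_elements) = 0;
};
```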
```diff
@@ -300,8 +300,7 @@ VariantType VariantValue::getType() const {
   VariantBasicType basic_type = getBasicType();
   switch (basic_type) {
     case VariantBasicType::Primitive: {
-      auto primitive_type =
-          static_cast<VariantPrimitiveType>(value_[0] >> kValueHeaderBitShift);
+      auto primitive_type = static_cast<VariantPrimitiveType>(valueHeader());
```
Any reason not to push the `static_cast<VariantPrimitiveType>` call into the `valueHeader` method?
I'm a bit confused. `valueHeader` seems to just do `>> kValueHeaderBitShift`, which means the primitive type would be `valueHeader() & 0x3f`, right?
If it's not a primitive type, `value_header` might be the length of a short string, or be used as `field_offset_length`, etc.
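For context, a sketch of the header-byte layout being discussed (per the variant spec: the low 2 bits select the basic type, the upper 6 bits are the type-dependent value header; helper names here are illustrative):

```cpp
#include <cstdint>

constexpr uint8_t kBasicTypeMask = 0x03;  // low 2 bits
constexpr int kValueHeaderBitShift = 2;   // remaining 6 bits

uint8_t BasicType(uint8_t byte0) { return byte0 & kBasicTypeMask; }
uint8_t ValueHeader(uint8_t byte0) { return byte0 >> kValueHeaderBitShift; }
// For Primitive, the 6-bit header is the primitive type id; for a short
// string it is the string length; for objects/arrays it packs flags such as
// field_offset_length. Hence the cast stays at the call site.
```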
@pitrou A SAX-style parser means consuming the whole token stream and parsing it into a variant object. That's OK when we'd like to parse the whole object 🤔, however, it might be a bit expensive when we want to visit only one or a few columns. Currently I just implement a…
It does not. See https://rapidjson.org/md_doc_sax.html
Emm, I mean: for the writer it might be good, but for the reader, getting a key does not require visiting/parsing the whole binary with…
How would you avoid parsing the whole binary?
Emm, it's by the design of the Parquet variant format. In memory there is…
The variant spec provides sufficient metadata (a dictionary of all keys and an offset to any value) to jump to a key at an arbitrary nesting level without decoding irrelevant binary data. It is already a parsed binary. This is something that XML or JSON text cannot do.
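A sketch of why no full parse is needed: an object lookup touches only the header, the field-id list, and one offset. The struct below is an illustrative shape, not the patch's code; the real reader fetches these fields lazily from the binary.

```cpp
#include <cstdint>
#include <optional>

// Illustrative shape of a decoded object header.
struct ObjectView {
  const uint32_t* field_ids;      // ids into the metadata dictionary
  const uint32_t* field_offsets;  // byte offset of each field's value
  uint32_t num_fields;
  const uint8_t* values;          // base pointer of the value region
};

// Jump straight to one field: a search over ids plus a single offset read;
// sibling values are never decoded. (Linear scan for simplicity; binary
// search applies when the ids are sorted by key.)
std::optional<const uint8_t*> FindField(const ObjectView& obj,
                                        uint32_t field_id) {
  for (uint32_t i = 0; i < obj.num_fields; ++i) {
    if (obj.field_ids[i] == field_id) {
      return obj.values + obj.field_offsets[i];
    }
  }
  return std::nullopt;
}
```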
Thanks for the explanation @wgtmac. I retract my suggestion then.
Rationale for this change
This patch adds tools to decode the Parquet variant.
What changes are included in this PR?
Tools to decode the Parquet variant.
Are these changes tested?
Yes. I use the parquet-testing data; some problems with it are listed here: apache/parquet-testing#79
I can also add some hand-written tests after the interface is agreed upon.
Are there any user-facing changes?
Yes, this adds interfaces for decoding variants.