Description
Describe the enhancement requested
The patch [1] implements the decoding logic for the Variant type. Encoding is the other half of the Variant type. Since VariantValue is a wrapper for (std::string_view, std::string_view), we should design the variant encoding tools more carefully.
The code I need to follow is from parquet-java [2], iceberg [3] and arrow-rs [4]. Some C++ APIs such as velocypack [5] are also taken into account.
Set API
The first thing we need to take care of is how the setter API works. We focus on two parts: primitive types and complex types.
Complex types
Builder style api?
We need to consider complex types before primitive types, because in a builder a complex type is opened before the primitive values it contains. The builder might have several modes:
- set a primitive type directly
- set key-value pair in object (str key, value)
- append the values in array
Besides, a key added to a nested object is always added to the metadata dictionary.
The api might be:
void startObject();
void endObject();
void startArray();
void endArray();
// Append a key in an object.
// This should be called inside an object.
void appendKey(...);
// Primitive API
// When in an object, appendInt should be called after a key has been appended.
// When in an array, just append.
// When not in an object or an array, this directly sets the int.
void appendInt(...);
We might also supply a builder helper:
void setInt(key, value);
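To make the call protocol above concrete, here is a minimal sketch of how such a builder could track nesting. It is only an illustration: the class name StreamingVariantBuilder and the debugString() accessor are hypothetical, and the output is JSON-like text rather than Variant binary, so that only the state handling (object, array, top level) is shown. Keys are not escaped.

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical sketch: method names follow the issue text, but the output is
// JSON-like debug text, not Variant bytes.
class StreamingVariantBuilder {
 public:
  void startObject() { open(Kind::kObject, '{'); }
  void endObject() { close(Kind::kObject, '}'); }
  void startArray() { open(Kind::kArray, '['); }
  void endArray() { close(Kind::kArray, ']'); }

  // Must be called inside an object, before the value it names.
  void appendKey(std::string_view key) {
    if (stack_.empty() || stack_.back().kind != Kind::kObject || pending_key_) {
      throw std::logic_error("appendKey is only valid inside an object");
    }
    pending_key_ = std::string(key);
  }

  // Object: requires a pending key. Array: appends. Top level: sets the value.
  void appendInt(int64_t v) {
    beforeValue();
    out_ += std::to_string(v);
  }

  // Helper that combines appendKey and appendInt.
  void setInt(std::string_view key, int64_t v) {
    appendKey(key);
    appendInt(v);
  }

  const std::string& debugString() const { return out_; }

 private:
  enum class Kind { kObject, kArray };
  struct Frame {
    Kind kind;
    bool has_elements;
  };

  // Validates the current state and writes the separator and key, if any.
  void beforeValue() {
    if (stack_.empty()) {
      if (!out_.empty()) throw std::logic_error("top-level value already set");
      return;
    }
    Frame& top = stack_.back();
    if (top.kind == Kind::kObject && !pending_key_) {
      throw std::logic_error("a value inside an object needs a key first");
    }
    if (top.has_elements) out_ += ',';
    top.has_elements = true;
    if (top.kind == Kind::kObject) {
      out_ += '"' + *pending_key_ + "\":";
      pending_key_.reset();
    }
  }

  void open(Kind kind, char c) {
    beforeValue();  // a container is itself a value of its parent
    stack_.push_back(Frame{kind, false});
    out_ += c;
  }

  void close(Kind kind, char c) {
    if (stack_.empty() || stack_.back().kind != kind || pending_key_) {
      throw std::logic_error("unbalanced end call or dangling key");
    }
    stack_.pop_back();
    out_ += c;
  }

  std::vector<Frame> stack_;
  std::optional<std::string> pending_key_;
  std::string out_;
};
```

With this sketch, startObject(); setInt("a", 1); appendKey("b"); startArray(); appendInt(2); appendInt(3); endArray(); endObject(); yields {"a":1,"b":[2,3]}, while calling appendInt inside an object without a key throws.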
Value style api?
We can also support a VariantBuilder. Subtypes can include VariantObjectBuilder, VariantPrimitiveContainer, and VariantArrayBuilder.
VariantBuilder can produce bytes, which allows building arrays bottom-up:
struct VariantBuilder {
  std::string build();
  uint32_t addDictionaryKey(std::string_view key);
};
struct VariantObjectBuilder {
  void set(uint32_t field_id, std::string_view);
};
struct VariantArrayBuilder {
  void append(std::string_view);
  void appends(std::span<std::string_view>);
};
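As an illustration of the dictionary part, here is one way addDictionaryKey could behave: deduplicate keys and hand out ids in insertion order. This is a sketch under my own assumptions; the class name KeyDictionary and the keys() accessor are hypothetical, and the issue does not say whether ids must be stable or whether keys are kept sorted.

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of a metadata key dictionary with deduplication.
class KeyDictionary {
 public:
  // Returns the id of `key`, assigning the next free id on first use.
  uint32_t addDictionaryKey(std::string_view key) {
    auto [it, inserted] =
        ids_.try_emplace(std::string(key), static_cast<uint32_t>(keys_.size()));
    if (inserted) keys_.push_back(it->first);
    return it->second;
  }

  // Keys in id order, i.e. keys()[id] is the key for that id.
  const std::vector<std::string>& keys() const { return keys_; }

 private:
  std::unordered_map<std::string, uint32_t> ids_;
  std::vector<std::string> keys_;
};
```

The dictionary owns copies of the keys, so callers can pass short-lived std::string_view arguments without lifetime concerns.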
Primitive types
For types like timestamp, null, and boolean (true, false), the API can set the value directly.
For types like double / float, users might set the value themselves.
For integer types, we might provide setInt8, setInt16, setInt32, setInt64, or:
void addInt(int64_t v) {
  if (v can be repr in int8) {
    // int8
  } else if (v can be repr in int16) {
    // int16
  } else {
    // ...
  }
}
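The width selection above could be sketched as follows. The function names are hypothetical. The header byte assumes, as I read the Variant encoding spec, that a primitive header is (type_id << 2) with basic type 0 and that the type ids are int8 = 3, int16 = 4, int32 = 5, int64 = 6; please double-check this against the spec before relying on it.

```cpp
#include <cstdint>
#include <limits>
#include <string>

// True if v is representable in the integer type T.
template <typename T>
bool fitsIn(int64_t v) {
  return v >= std::numeric_limits<T>::min() && v <= std::numeric_limits<T>::max();
}

// Returns the narrowest width in bytes (1, 2, 4 or 8) that can hold v.
int narrowestIntWidth(int64_t v) {
  if (fitsIn<int8_t>(v)) return 1;
  if (fitsIn<int16_t>(v)) return 2;
  if (fitsIn<int32_t>(v)) return 4;
  return 8;
}

// Encodes v as a header byte followed by the value in little-endian order.
std::string encodeInt(int64_t v) {
  const int width = narrowestIntWidth(v);
  // Assumed primitive type ids: int8 = 3, int16 = 4, int32 = 5, int64 = 6.
  const int type_id = width == 1 ? 3 : width == 2 ? 4 : width == 4 ? 5 : 6;
  std::string out;
  out.push_back(static_cast<char>(type_id << 2));  // basic type 0 = primitive
  const auto u = static_cast<uint64_t>(v);
  for (int i = 0; i < width; ++i) {
    out.push_back(static_cast<char>((u >> (8 * i)) & 0xFF));
  }
  return out;
}
```

Always choosing the narrowest width keeps values small, but the written type then depends on the value, which is one reason to also keep the explicit setInt8 ... setInt64 setters.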
Buffer and view handling
Buffer handling is also something we should consider. Assume a one-level builder for objects like:
{"a": 1}
{"a": 2, "b": 2}
{"a": 3, "b": 3}
...
We might need:
- A temporary buffer for the inner object to store field_id and offsets
- A pooled buffer for built strings
- An ordered set for sorted keys, or an unordered vector for unsorted keys
- A final buffer for output
Buffer handling is a place to consider here:
- For output, we can have two modes: (1) std::string finish() returns the output, or (2) append to an output buffer
- For inner buffers, we can use Arena-style management
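The two output modes could look roughly like the sketch below. The names RowBuilder, appendBytes, finishInto and Slice are hypothetical, and a reused std::string stands in for the inner scratch buffer; a real implementation might use an arena instead.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sketch of a builder that reuses one scratch buffer across rows.
class RowBuilder {
 public:
  // Position of one encoded value inside a shared output buffer.
  struct Slice {
    std::size_t offset;
    std::size_t length;
  };

  // Stands in for the real encoding calls: appends already-encoded bytes.
  void appendBytes(std::string_view bytes) { scratch_.append(bytes); }

  // Mode 1: return the encoded value as an owned string.
  std::string finish() {
    std::string result(scratch_);  // copy, so scratch_ keeps its capacity
    scratch_.clear();
    return result;
  }

  // Mode 2: append to a caller-owned buffer and report where the value landed.
  Slice finishInto(std::string* out) {
    Slice slice{out->size(), scratch_.size()};
    out->append(scratch_);
    scratch_.clear();
    return slice;
  }

 private:
  std::string scratch_;
};
```

Mode 2 returns an offset and a length rather than a std::string_view, because a view into the output buffer would dangle as soon as a later append reallocates it.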
Reference
[1] #46372
[2] https://github.com/apache/parquet-java/blob/1f1e07bbf750fba228851c2d63470c3da5726831/parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java#L31
[3] https://github.com/apache/iceberg/blob/1911c94ea605a3d3f10a1994b046f00a5e9fdceb/parquet/src/main/java/org/apache/iceberg/parquet/VariantWriterBuilder.java#L46
[4] apache/arrow-rs#7424
[5] https://github.com/arangodb/velocypack
Component(s)
C++, Parquet