[C++][Parquet] Encoding tools for variant type #46555

Open
@mapleFU

Description

Describe the enhancement requested

The patch [1] implements the decoding logic for the Variant type. Encoding is the other half of the variant type. Since VariantValue is a wrapper around (std::string_view, std::string_view), we should design the variant encoding tools carefully.

The code I plan to follow comes from parquet-java [2], iceberg [3] and arrow-rs [4]. Some C++ APIs such as velocypack [5] are also taken into account.

Set API

The first thing we need to take care of is how the setter API works. We focus on two parts: primitive types and complex types.

Complex types

Builder style api?

We need to consider complex types before primitive types, because in a builder a complex type comes before the primitive values it holds. The builder might have several modes:

  1. set a primitive value directly
  2. set a key-value pair in an object (string key, value)
  3. append values to an array

Besides, any key added to a nested object is always added to the metadata dictionary.

The API might be:

void startObject();
void endObject();
void startArray();
void endArray();
// Append a key in an object.
// This must be called inside an object.
void appendKey(...);
// Primitive API.
// Inside an object, appendInt must be called after a key has been appended.
// Inside an array, it just appends.
// Outside any object or array, it sets the int directly.
void appendInt(...);

We might also supply a builder helper:

void setInt(key, value);
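
The call-order rules in the comments above can be sketched with a small state stack. This is only a minimal sketch under stated assumptions, not the proposed implementation: the class name StreamingVariantBuilder, the Frame struct and trace() are hypothetical, and the code records a JSON-like trace instead of the Variant binary encoding, so it only illustrates when each call is legal.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical sketch: enforces the call-order rules and records a JSON-like
// trace. It does NOT emit the Variant binary encoding.
class StreamingVariantBuilder {
 public:
  void startObject() { open(Scope::kObject, '{'); }
  void endObject() { close(Scope::kObject, '}'); }
  void startArray() { open(Scope::kArray, '['); }
  void endArray() { close(Scope::kArray, ']'); }

  // Only legal directly inside an object, and only when no key is pending.
  void appendKey(std::string_view key) {
    assert(!stack_.empty() && stack_.back().scope == Scope::kObject);
    Frame& top = stack_.back();
    assert(!top.key_pending);
    if (top.count > 0) out_ += ',';
    out_ += '"';
    out_ += key;
    out_ += "\":";
    top.key_pending = true;
  }

  void appendInt(int64_t v) {
    beginValue();
    out_ += std::to_string(v);
    endValue();
  }

  // Helper: key and value in one call.
  void setInt(std::string_view key, int64_t v) {
    appendKey(key);
    appendInt(v);
  }

  const std::string& trace() const { return out_; }

 private:
  enum class Scope { kObject, kArray };
  struct Frame {
    Scope scope;
    int count = 0;             // values completed in this scope
    bool key_pending = false;  // object only: key written, value missing
  };

  // Checks that a value is allowed at the current position.
  void beginValue() {
    if (stack_.empty()) {
      assert(!root_set_);  // the root holds exactly one value
    } else if (stack_.back().scope == Scope::kObject) {
      assert(stack_.back().key_pending);  // a key must come first
      stack_.back().key_pending = false;
    } else if (stack_.back().count > 0) {
      out_ += ',';  // array: just append
    }
  }

  void endValue() {
    if (stack_.empty()) {
      root_set_ = true;
    } else {
      ++stack_.back().count;
    }
  }

  void open(Scope scope, char ch) {
    beginValue();  // a nested object/array is itself a value of its parent
    out_ += ch;
    stack_.push_back(Frame{scope});
  }

  void close(Scope scope, char ch) {
    assert(!stack_.empty() && stack_.back().scope == scope);
    assert(!stack_.back().key_pending);
    stack_.pop_back();
    out_ += ch;
    endValue();
  }

  std::vector<Frame> stack_;
  std::string out_;
  bool root_set_ = false;
};
```

With this sketch, calling startObject(), setInt("a", 1), appendKey("b"), startArray(), appendInt(2), appendInt(3), endArray(), endObject() passes every check, while calling appendInt directly inside an object without a preceding appendKey trips an assertion.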

Value style api?

We can also support a VariantBuilder, with subtypes such as VariantObjectBuilder, VariantPrimitiveContainer and VariantArrayBuilder.

VariantBuilder produces bytes, so arrays and objects can be built bottom-up:

struct VariantBuilder {
  std::string build();
  uint32_t addDictionaryKey(std::string_view key);
};

struct VariantObjectBuilder {
  void set(uint32_t field_id, std::string_view);
};

struct VariantArrayBuilder {
  void append(std::string_view);
  void appends(std::span<std::string_view>);
};
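
To make the bottom-up idea concrete, here is a minimal sketch of the dictionary and object parts. The names KeyDictionary, key(), size() and finish(), and the constructor taking the dictionary, are assumptions for illustration. Child values are treated as opaque, already-encoded bytes, and no real Variant layout is produced. The sort assumes that object fields must be emitted in key-name order, which is my reading of the Variant encoding spec.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical metadata dictionary: interns keys and hands out stable ids.
class KeyDictionary {
 public:
  // Returns the existing id when the key was seen before.
  uint32_t addDictionaryKey(std::string_view key) {
    auto it = ids_.find(std::string(key));
    if (it != ids_.end()) return it->second;
    const auto id = static_cast<uint32_t>(keys_.size());
    keys_.emplace_back(key);
    ids_.emplace(keys_.back(), id);
    return id;
  }

  // The returned view is only valid until the next addDictionaryKey call.
  std::string_view key(uint32_t id) const {
    assert(id < keys_.size());
    return keys_[id];
  }

  std::size_t size() const { return keys_.size(); }

 private:
  std::vector<std::string> keys_;
  std::unordered_map<std::string, uint32_t> ids_;
};

// Hypothetical object builder: collects (field_id, encoded child bytes).
class VariantObjectBuilder {
 public:
  using Field = std::pair<uint32_t, std::string>;

  explicit VariantObjectBuilder(const KeyDictionary& dict) : dict_(dict) {}

  // encoded_value is assumed to be an already-built child value.
  void set(uint32_t field_id, std::string_view encoded_value) {
    fields_.emplace_back(field_id, std::string(encoded_value));
  }

  // Returns the fields ordered by key name and resets the builder.
  std::vector<Field> finish() {
    std::sort(fields_.begin(), fields_.end(),
              [this](const Field& a, const Field& b) {
                return dict_.key(a.first) < dict_.key(b.first);
              });
    std::vector<Field> result = std::move(fields_);
    fields_.clear();
    return result;
  }

 private:
  const KeyDictionary& dict_;
  std::vector<Field> fields_;
};
```

A caller would first register keys with addDictionaryKey, build each child value to bytes, pass those bytes to set, and only then serialize the parent, which is what makes the construction bottom-up.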

Primitive types

For types like timestamp, null and boolean (true, false), the API can set the value directly.

For types like double / float, the user might set them themselves.

For integer types, we might provide setInt8, setInt16, setInt32 and setInt64, or a single method that picks the narrowest width:

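  // Requires <cstdint> for the INT*_MIN / INT*_MAX macros.
  void addInt(int64_t v) {
    if (v >= INT8_MIN && v <= INT8_MAX) {
      // encode as int8
    } else if (v >= INT16_MIN && v <= INT16_MAX) {
      // encode as int16
    } else if (v >= INT32_MIN && v <= INT32_MAX) {
      // encode as int32
    } else {
      // encode as int64
    }
  }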
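
The width selection can be turned into a small, testable helper. The sketch below is an illustration only: FitsIn, SmallestIntWidth and EncodeIntPayload are hypothetical names, and the function writes just the little-endian payload bytes, leaving out the value header byte that a real encoder would prepend.

```cpp
#include <cstdint>
#include <limits>
#include <string>

// True when v is representable in the signed integer type T.
template <typename T>
bool FitsIn(int64_t v) {
  return v >= std::numeric_limits<T>::min() &&
         v <= std::numeric_limits<T>::max();
}

// Narrowest width in bytes (1, 2, 4 or 8) that can hold v.
int SmallestIntWidth(int64_t v) {
  if (FitsIn<int8_t>(v)) return 1;
  if (FitsIn<int16_t>(v)) return 2;
  if (FitsIn<int32_t>(v)) return 4;
  return 8;
}

// Little-endian payload of v at its narrowest width.
// Assumption: the header byte is written elsewhere and is omitted here.
std::string EncodeIntPayload(int64_t v) {
  const int width = SmallestIntWidth(v);
  const auto bits = static_cast<uint64_t>(v);  // two's complement bit pattern
  std::string out;
  out.reserve(static_cast<std::size_t>(width));
  for (int i = 0; i < width; ++i) {
    out.push_back(static_cast<char>((bits >> (8 * i)) & 0xFF));
  }
  return out;
}
```

One trade-off of the automatic form is that the caller no longer controls the physical type, so a column of mostly small values with a few large ones ends up with mixed widths. Keeping the explicit setInt8 ... setInt64 setters next to it would cover callers who need a fixed type.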

Buffer and view handling

Buffer handling also needs consideration. Assume a one-level builder for objects like:

{"a": 1}
{"a": 2, "b": 2}
{"a": 3, "b": 3}
...

We might need:

  1. A temporary buffer for the inner object, storing field_ids and offsets
  2. A pooled buffer for built strings
  3. An ordered set for sorted keys, or an unordered vector for unsorted keys
  4. A final buffer for output

Buffer handling leaves a few choices open:

  1. For output, we can have two modes: (1) std::string finish(), which returns the output, or (2) appending to a caller-provided output buffer
  2. For inner buffers, we can use arena-style management
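
The two output modes and the reuse of temporary buffers across rows can be sketched as follows. This is a rough illustration under assumptions: RowScratch, addField, appendTo and numFields are hypothetical names, plain std::vector / std::string stand in for an arena, and the "output" is just the concatenated value bytes rather than the real object layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical per-row scratch space that is reused between rows.
class RowScratch {
 public:
  // Records the field id, the offset of its value, and the value bytes.
  void addField(uint32_t field_id, std::string_view value_bytes) {
    field_ids_.push_back(field_id);
    offsets_.push_back(static_cast<uint32_t>(values_.size()));
    values_.append(value_bytes);
  }

  std::size_t numFields() const { return field_ids_.size(); }

  // Mode (2): append to a caller-owned buffer, avoiding an extra copy
  // when many rows are written into one output buffer.
  void appendTo(std::string* out) {
    assert(out != nullptr);
    out->append(values_);  // placeholder for the real object layout
    reset();
  }

  // Mode (1): return a fresh string; implemented on top of mode (2).
  std::string finish() {
    std::string out;
    appendTo(&out);
    return out;
  }

 private:
  // clear() keeps the allocated capacity, so the next row reuses it.
  void reset() {
    field_ids_.clear();
    offsets_.clear();
    values_.clear();
  }

  std::vector<uint32_t> field_ids_;
  std::vector<uint32_t> offsets_;
  std::string values_;
};
```

For the rows {"a": 1}, {"a": 2, "b": 2}, ... above, one RowScratch instance would serve every row, so the field_id and offset buffers are allocated at most a few times rather than once per row.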

Reference

[1] #46372
[2] https://github.com/apache/parquet-java/blob/1f1e07bbf750fba228851c2d63470c3da5726831/parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java#L31
[3] https://github.com/apache/iceberg/blob/1911c94ea605a3d3f10a1994b046f00a5e9fdceb/parquet/src/main/java/org/apache/iceberg/parquet/VariantWriterBuilder.java#L46
[4] apache/arrow-rs#7424
[5] https://github.com/arangodb/velocypack

Component(s)

C++, Parquet