Commit e8c83b4

sfc-gh-mbojanczyk, zeroshade, and lidavidm authored
feat(parquet): add variant encoder/decoder (#344)
closes #349
closes #350

### Rationale for this change

This adds a basic Variant encoder/decoder to start the process of supporting the new [Variant encoding spec](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md) in the Apache Go Parquet library. Variants are useful for efficiently storing and accessing data, especially in things like Iceberg tables.

### What changes are included in this PR?

This adds logic to encode and decode Variants, but does not yet plumb that logic through to either Arrow or Parquet. The PR's getting beefy as is, and this seems to be a good standalone unit to get feedback on. Still to implement is the handling of decimal primitives.

For ease of implementation, the Metadata keys are only stored in unsorted order. This makes the creation of an encoded Variant simpler, as one can serialize data as it's being added. For sorted Metadata keys to work, you'd need to buffer data and only create objects at the very end so that the appropriate width of indices can be chosen.

### Are these changes tested?

There are unit tests throughout to test that marshaling produces the expected binary output as per the spec, and to ensure that unmarshaling produces the expected values. There are many levels of unit tests, from testing individual marshaling bits to testing the marshaling and unmarshaling of entire Variants.

### Are there any user-facing changes?

With this PR, no. This is simply a library to create Variants, but does not plumb the output into Parquet or Arrow.

---------

Co-authored-by: Matt Topol <zeroshade@apache.org>
Co-authored-by: Matt Topol <zotthewizard@gmail.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
1 parent d3850c3 commit e8c83b4

15 files changed: +3559 -10 lines changed

arrow/endian/big.go

Lines changed: 13 additions & 3 deletions
@@ -19,12 +19,22 @@
 
 package endian
 
-import "encoding/binary"
-
-var Native = binary.BigEndian
+import "math/bits"
 
 const (
 	IsBigEndian     = true
 	NativeEndian    = BigEndian
 	NonNativeEndian = LittleEndian
 )
+
+func FromLE[T uint16 | uint32 | uint64](x T) T {
+	switch v := any(x).(type) {
+	case uint16:
+		return T(bits.Reverse16(v))
+	case uint32:
+		return T(bits.Reverse32(v))
+	case uint64:
+		return T(bits.Reverse64(v))
+	}
+	return x
+}

arrow/endian/endian.go

Lines changed: 4 additions & 0 deletions
@@ -17,10 +17,14 @@
 package endian
 
 import (
+	"encoding/binary"
+
 	"github.com/apache/arrow-go/v18/arrow/internal/debug"
 	"github.com/apache/arrow-go/v18/arrow/internal/flatbuf"
 )
 
+var Native = binary.NativeEndian
+
 type Endianness flatbuf.Endianness
 
 const (

arrow/endian/little.go

Lines changed: 4 additions & 4 deletions
@@ -19,12 +19,12 @@
 
 package endian
 
-import "encoding/binary"
-
-var Native = binary.LittleEndian
-
 const (
 	IsBigEndian     = false
 	NativeEndian    = LittleEndian
 	NonNativeEndian = BigEndian
 )
+
+func FromLE[T uint16 | uint32 | uint64](x T) T {
+	return x
+}

dev/release/rat_exclude_files.txt

Lines changed: 2 additions & 0 deletions
@@ -34,3 +34,5 @@ parquet/internal/gen-go/parquet/GoUnusedProtection__.go
 parquet/internal/gen-go/parquet/parquet-consts.go
 parquet/internal/gen-go/parquet/parquet.go
 parquet/version_string.go
+parquet/variant/basic_type_stringer.go
+parquet/variant/primitive_type_stringer.go

go.mod

Lines changed: 2 additions & 0 deletions
@@ -66,6 +66,7 @@ require (
 	github.com/dustin/go-humanize v1.0.1 // indirect
 	github.com/fatih/color v1.15.0 // indirect
 	github.com/goccy/go-yaml v1.11.0 // indirect
+	github.com/google/go-cmp v0.7.0 // indirect
 	github.com/gookit/color v1.5.4 // indirect
 	github.com/hashicorp/golang-lru/v2 v2.0.7 // indirect
 	github.com/json-iterator/go v1.1.12 // indirect
@@ -100,3 +101,4 @@ require (
 	modernc.org/strutil v1.2.0 // indirect
 	modernc.org/token v1.1.0 // indirect
 )
+

go.sum

Lines changed: 2 additions & 2 deletions
@@ -62,8 +62,8 @@ github.com/golang/snappy v1.0.0 h1:Oy607GVXHs7RtbggtPBnr2RmDArIsAefDwvrdWvRhGs=
 github.com/golang/snappy v1.0.0/go.mod h1:/XxbfmMg8lxefKM7IXC3fBNl/7bRcc72aCRzEWrmP2Q=
 github.com/google/flatbuffers v25.2.10+incompatible h1:F3vclr7C3HpB1k9mxCGRMXq6FdUalZ6H/pNX4FP1v0Q=
 github.com/google/flatbuffers v25.2.10+incompatible/go.mod h1:1AeVuKshWv4vARoZatz6mlQ0JxURH0Kv5+zNeJKJCa8=
-github.com/google/go-cmp v0.6.0 h1:ofyhxvXcZhMsU5ulbFiLKl/XBFqE1GSq7atu8tAmTRI=
-github.com/google/go-cmp v0.6.0/go.mod h1:17dUlkBOakJ0+DkrSSNjCkIjxS6bF9zb3elmeNGIjoY=
+github.com/google/go-cmp v0.7.0 h1:wk8382ETsv4JYUZwIsn6YpYiWiBsYLSJiTsyBybVuN8=
+github.com/google/go-cmp v0.7.0/go.mod h1:pXiqmnSA92OHEEa9HXL2W4E7lf9JzCmGVUdgjX3N/iU=
 github.com/google/gofuzz v1.0.0/go.mod h1:dBl0BpW6vV/+mYPU4Po3pmUjxk6FQPldtuIdl/M65Eg=
 github.com/google/pprof v0.0.0-20221118152302-e6195bd50e26 h1:Xim43kblpZXfIBQsbuBVKCudVG457BR2GZFIz3uw3hQ=
 github.com/google/pprof v0.0.0-20221118152302-e6195bd50e26/go.mod h1:dDKJzRmX4S37WGHujM7tX//fmj1uioxKzKxz3lo4HJo=

parquet-testing

Submodule parquet-testing updated 71 files

parquet/variant/basic_type_stringer.go

Lines changed: 28 additions & 0 deletions
Some generated files are not rendered by default.
