feat(parquet): add variant encoder/decoder #344

sfc-gh-mbojanczyk · 2025-04-05T03:48:26Z

Rationale for this change

This adds a basic Variant encoder/decoder to start the process of supporting the new Variant encoding spec in the Apache Go Parquet library. Variants are useful for efficiently storing and accessing data, especially in things like Iceberg tables.

What changes are included in this PR?

This adds logic to encode and decode Variants, but does not yet plumb that logic through to either Arrow or Parquet. The PR's getting beefy as is, and this seems to be a good standalone unit to get feedback on.

Still to be implemented is the handling of decimal primitives.

For ease of implementation, the Metadata keys are only stored in unsorted order. This makes the creation of an encoded Variant simpler, as one can serialize data as it's being added. For sorted Metadata keys to work, you'd need to buffer data and only create objects at the very end so that the appropriate width of indices can be chosen.
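The width selection mentioned above can be sketched as follows. This is a minimal illustration rather than the PR's code: offsetWidth is a hypothetical helper, and the 1 to 4 byte range is an assumption about how wide an index may get. It shows why the largest value has to be known before anything is written.

```go
package main

// offsetWidth returns the smallest number of bytes (1 to 4) able to hold
// maxOffset. Hypothetical helper, for illustration only: the width can
// only be picked once the largest offset is known, which is why sorted
// keys would force buffering until the end.
func offsetWidth(maxOffset uint32) int {
	switch {
	case maxOffset <= 0xFF:
		return 1
	case maxOffset <= 0xFFFF:
		return 2
	case maxOffset <= 0xFFFFFF:
		return 3
	default:
		return 4
	}
}
```

With this helper, a largest offset of 300 would need a width of 2, while 70000 would need 3.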

Are these changes tested?

There are unit tests throughout to test that marshaling produces the expected binary output as per the spec, and to ensure that unmarshaling can spit out the expected values. There are many levels of unit tests, from testing individual marshaling bits to testing the marshaling and unmarshaling of entire Variants.

Are there any user-facing changes?

With this PR, no. This is simply a library to create Variants, but does not plumb the output into Parquet or Arrow.

zeroshade

Made an initial pass on this and left a bunch of comments. I'll try to take another look later on

zeroshade · 2025-04-11T22:13:02Z

parquet/variants/util.go

+	buf := make([]byte, size)
+	for i := range size {
+		buf[i] = byte(val)
+		val >>= 8
+	}
+	w.Write(buf)


why not use the encoding/binary functions to do this instead? such as binary.LittleEndian.Put* or binary.LittleEndian.Append* etc.

Similar to the comment in readUint(), the binary package only provides Put/Append functionality for uint{16,32,64}, but we need it for the whole range from 1-8 bytes.

you can use binary.Encode(buf, binary.LittleEndian, val) which will handle the whole range for you

In a similar vein to Decode(): this will work for the power-of-two widths, but for a width that isn't one (e.g. 3) we'd encode with additional padding, since we'd have to encode to the next power-of-two width.

This isn't the end of the world (encoding over the minimal necessary width is within spec), so I have fewer objections here. I'd argue with Decode() rolling our own is more of a necessity due to not being in control of what comes in (eg. the Java library can encode 3-byte wide numbers, so it's gotta be handled). This keeps the encode and decode logic fairly similar. Also, FWIW, I feel like the intention of the spec authors is to minimize the number of bytes used to encode things, see the existence of Short String even though a primitive String type exists.

The problem with rolling our own here is that we're going to need to manage the big/little endian logic ourselves then so that this runs properly on big-endian systems.
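One way to meet both points raised in this thread (arbitrary widths from 1 to 8 bytes, and leaving byte order to encoding/binary) is to go through an 8-byte scratch buffer. The following is a minimal sketch under that assumption. appendUint and readUint are hypothetical names and signatures, not the functions from this PR.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// appendUint appends the low `size` bytes of val to dst in little-endian
// order. size must be between 1 and 8; bits of val above that width are
// silently dropped, so the caller is expected to pick a sufficient size.
func appendUint(dst []byte, val uint64, size int) ([]byte, error) {
	if size < 1 || size > 8 {
		return dst, fmt.Errorf("invalid width %d", size)
	}
	var scratch [8]byte
	binary.LittleEndian.PutUint64(scratch[:], val)
	// In little-endian layout the least significant bytes come first,
	// so truncating the scratch buffer keeps exactly the low bytes.
	return append(dst, scratch[:size]...), nil
}

// readUint decodes a little-endian unsigned integer that occupies `size`
// bytes starting at raw[offset]. Widths such as 3 are handled by
// zero-padding the value up to 8 bytes before decoding.
func readUint(raw []byte, offset, size int) (uint64, error) {
	if size < 1 || size > 8 {
		return 0, fmt.Errorf("invalid width %d", size)
	}
	if offset < 0 || offset+size > len(raw) {
		return 0, fmt.Errorf("need %d bytes at offset %d, buffer has %d", size, offset, len(raw))
	}
	var scratch [8]byte
	copy(scratch[:], raw[offset:offset+size])
	return binary.LittleEndian.Uint64(scratch[:]), nil
}
```

Encoding 0x010203 with a width of 3 yields the bytes 03 02 01 with no padding, and reading those three bytes back returns 0x010203.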

zeroshade · 2025-04-11T22:33:03Z

parquet/variants/primitive.go

+func unmarshalUUID(raw []byte, offset int) ([]byte, error) {
+	if err := checkBounds(raw, offset, offset+17); err != nil {
+		return nil, err
+	}
+	return raw[offset+1 : offset+17], nil
+}


use [16]byte or uuid.UUID please

Done (with uuid.UUID)
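For reference, the fixed-size alternative suggested in the review could look roughly like this. It is a sketch only: the thread says the PR settled on uuid.UUID, while this version uses a plain [16]byte to stay within the standard library, and it inlines the bounds check instead of calling the PR's checkBounds helper.

```go
package main

import "fmt"

// unmarshalUUID copies the 16 payload bytes that follow the one-byte
// value header at raw[offset]. Returning an array instead of a slice
// makes the length part of the type and avoids aliasing the input buffer.
func unmarshalUUID(raw []byte, offset int) ([16]byte, error) {
	var id [16]byte
	if offset < 0 || offset+17 > len(raw) {
		return id, fmt.Errorf("UUID at offset %d needs 17 bytes, buffer has %d", offset, len(raw))
	}
	copy(id[:], raw[offset+1:offset+17])
	return id, nil
}
```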

sfc-gh-mbojanczyk · 2025-04-16T22:54:02Z

Made an initial pass on this and left a bunch of comments. I'll try to take another look later on

Thanks so much! I'll get to addressing the comments here shortly (was out for a few days)

sfc-gh-mbojanczyk

Mostly addressed your comments here, added some context for the others. The biggest diff IMO is using uuid.UUID now, which definitely simplifies some stuff around there.

I've also moved this review out of draft. Figure it's in pretty decent reviewable shape as-is.

Thanks for the first pass- looking forward to polishing this up!

sfc-gh-mbojanczyk · 2025-04-22T21:20:56Z

parquet/variants/primitive.go

+		if kind == reflect.Interface {
+			dest.Set(reflect.ValueOf(iv))


This could be me holding reflect incorrectly, but it is indeed to catch the case that it's unmarshaling into the empty interface. I need to do this to be able to support unmarshaling into a map (ie. map[string]any)

However, I'm sort of flummoxed on the best/canonical way to do this. As far as I can tell, checking kind == reflect.Interface is about as close as I'm going to get (though I guess I can also check that dest.NumMethod() == 0, which the JSON library appears to do).

Let me add that as a check (adding isEmptyInterface as a boolean up top here) to try to narrow that down a bit.

sfc-gh-mbojanczyk · 2025-04-22T21:21:35Z

parquet/variants/primitive.go

+	case primitiveNull:
+		dest.Set(reflect.Zero(dest.Type()))
+	case primitiveTrue, primitiveFalse:
+		if kind != reflect.Bool && kind != reflect.Interface {


Same as discussion below- this is to handle unmarshaling into an empty interface so this package can unmarshal into map[string]any

sfc-gh-mbojanczyk · 2025-04-22T21:28:25Z

parquet/variants/primitive.go

+	case string:
+		if allOpts&MarshalAsUUID != 0 {
+			return marshalUUID([]byte(val), w), nil
+		}
+		return marshalString(val, w), nil


We most definitely should- silly me, I didn't even think to see if there was a uuid library out there.

sfc-gh-mbojanczyk · 2025-04-22T22:02:31Z

parquet/variants/primitive.go

+		bytes, err := unmarshalUUID(raw, offset)
+		if err != nil {
+			return err
+		}
+		if kind == reflect.Slice && dest.Type().Elem().Kind() == reflect.Uint8 || kind == reflect.Interface {
+			dest.Set(reflect.ValueOf(bytes))
+		} else if kind == reflect.String {
+			dest.Set(reflect.ValueOf(string(bytes)))
+		} else {
+			return fmt.Errorf("cannot decode Variant UUID into dest %s", kind)
+		}


Updated to handle both uuid.UUID and arrays

sfc-gh-mbojanczyk · 2025-04-22T22:23:18Z

parquet/variants/primitive.go

+func unmarshalUUID(raw []byte, offset int) ([]byte, error) {
+	if err := checkBounds(raw, offset, offset+17); err != nil {
+		return nil, err
+	}
+	return raw[offset+1 : offset+17], nil
+}


Done (with uuid.UUID)

zeroshade · 2025-05-23T19:49:07Z

@sfc-gh-mbojanczyk Hey, sorry for the long delay here. I've gone through multiple iterations and discussions with people, plus pulling inspiration from the C++, Spark, and parquet-java implementations to figure out a good design. Please let me know what you think of the updated version! Thanks!

sfc-gh-mbojanczyk · 2025-05-23T23:45:37Z

@sfc-gh-mbojanczyk Hey, sorry for the long delay here. I've gone through multiple iterations and discussions with people, plus pulling inspiration from the C++, Spark, and parquet-java implementations to figure out a good design. Please let me know what you think of the updated version! Thanks!

No worries- I got caught in a whirlwind over here myself and was about to dust this off too :) Lemme take a peek here after the long weekend.

lidavidm

I like this, this is very clean and straightforward

parquet/variant/utils.go

parquet/variant/variant.go

lidavidm · 2025-05-24T06:08:05Z

parquet/variant/variant_test.go

More out of curiosity but do we have fuzzing set up for variants like we do for Parquet in general?

Also is it worth testing examples of invalid variants too?

Currently the Go implementation doesn't have any fuzzing set up. Go has a whole infrastructure for setting up fuzz testing (https://go.dev/doc/security/fuzz/); I just haven't gotten around to setting it up. It hasn't been a high priority given everything else, unless people think I should prioritize it.

That said, I think it makes sense to test some examples of invalid variants. I'll add some tests for that.

Thanks. I wonder if fuzzing the variants specifically would be a more manageable case to start with.
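A variant-only fuzz target could start as small as the sketch below. It is an assumption-laden illustration: decodeShortString is a stand-in decoder written for this example, not the package's API, and it assumes a header layout with the basic type in the low two bits and the length in the upper six. In a real package the fuzz function would live in a _test.go file and be run with go test -fuzz.

```go
package main

import (
	"errors"
	"testing"
)

const maxShortStringSize = 0x3F // six length bits, as in the PR's constant

// decodeShortString is a stand-in decoder used only for this sketch.
// Assumed layout: low 2 bits of the header are the basic type (1 for a
// short string), upper 6 bits are the byte length of the payload.
func decodeShortString(raw []byte) (string, error) {
	if len(raw) == 0 {
		return "", errors.New("empty value")
	}
	if raw[0]&0x3 != 1 {
		return "", errors.New("not a short string")
	}
	n := int(raw[0] >> 2)
	if len(raw) < 1+n {
		return "", errors.New("truncated short string")
	}
	return string(raw[1 : 1+n]), nil
}

// FuzzDecodeShortString feeds arbitrary bytes to the decoder. Returning
// an error for malformed input is acceptable; panicking or exceeding the
// length limit is not.
func FuzzDecodeShortString(f *testing.F) {
	f.Add([]byte{0x09, 'h', 'i'}) // valid: length 2, basic type 1
	f.Add([]byte{0x09, 'h'})      // invalid: truncated payload
	f.Add([]byte{})               // invalid: empty input
	f.Fuzz(func(t *testing.T, raw []byte) {
		s, err := decodeShortString(raw)
		if err != nil {
			return
		}
		if len(s) > maxShortStringSize {
			t.Fatalf("decoded %d bytes, limit is %d", len(s), maxShortStringSize)
		}
	})
}
```

The seed corpus doubles as the invalid-variant examples asked about above, since f.Add entries are also exercised by a plain go test run.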

zeroshade · 2025-05-27T15:52:23Z

I'll give @sfc-gh-mbojanczyk a chance to comment and respond here before I merge this just to make sure I get all the feedback I can.

sfc-gh-mbojanczyk

Woah, definitely a big change from where we started here :) On the whole this looks pretty clean and more performance oriented than my first stab, and this looks good to me overall.

Left a few comments- most are minor, but my biggest concern is keeping the fields for an object outside of the builder. I recognize that it's for an optimized path, but I feel like having a user keep multiple things in flight leaks an implementation detail that could have been hidden away with a different abstraction.

sfc-gh-mbojanczyk · 2025-05-27T17:49:33Z

parquet/variant/variant.go

+type PrimitiveType int
+
+const (
+	PrimitiveInvalid            PrimitiveType = iota - 1 // Unknown


I generally make the zero value invalid, that way you don't accidentally allow for an uninitialized value to do something wonky.

These constants have specific values defined by the spec, so the zero value needs to be Null as the constant for "Null" type is 0. I don't really have wiggle room to change that. Though, technically if all variant building is done through the builder, we could probably get away without exporting this enum and these constants.

Oh right! I think I ended up explicitly defining each primitive value (i.e. primitiveNull = 0, primitiveFalse = 1, etc.) because I've been bitten by relying on iota in the past (a well-meaning coworker reordered my enum to be in alphabetical order and a bunch of tests blew up). IMO, if it's defined in spec, it's safer to be explicit instead of using an enum.
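The explicit style argued for here can be sketched as follows. Only the Null value of 0 is taken from this thread; the remaining IDs are deliberately left out and would be copied one by one from the spec table.

```go
package main

// PrimitiveType mirrors a spec-defined type ID, so every constant gets
// an explicit value instead of relying on iota and declaration order.
type PrimitiveType int

const (
	// PrimitiveInvalid is not a spec value; it marks "unknown".
	PrimitiveInvalid PrimitiveType = -1
	// PrimitiveNull is fixed at 0 by the spec, which is why the zero
	// value of PrimitiveType cannot be reserved for "invalid".
	PrimitiveNull PrimitiveType = 0
	// Remaining IDs omitted in this sketch.
)
```

Because each value is written out, reordering or alphabetizing the declarations cannot change the encoded IDs.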

sfc-gh-mbojanczyk · 2025-05-27T17:55:43Z

parquet/variant/variant.go

+	offsetSizeBitShift uint8 = 6
+	supportedVersion         = 1
+	maxShortStringSize       = 0x3F
+	maxSizeLimit             = 128 * 1024 * 1024 // 128MB


Nit: Call this metadataMaxSize or something similar (looks like it only applies to the metadata section)

Also, is there actually a max size for the metadata? You can have ~4.29B entries max per the spec- not saying that you should, but still, it feels conceivable that you can have a valid metadata that's beyond 128MB.

sfc-gh-mbojanczyk · 2025-05-28T18:04:15Z

Can't actually toss in my approval since I'm the original author, but this PR looks good to me.

sfc-gh-mbojanczyk mentioned this pull request Apr 5, 2025

[Parquet] Support Variant Encoding for Parquet #310

Open

zeroshade reviewed Apr 11, 2025

sfc-gh-mbojanczyk marked this pull request as ready for review April 22, 2025 23:51

sfc-gh-mbojanczyk commented Apr 22, 2025

zeroshade reviewed Apr 28, 2025

sfc-gh-mbojanczyk and others added 4 commits May 6, 2025 13:53

feat(parquet): add variant encoder/decoder

df3a5d4

Address feedback

ec026ea

some generic cleanups

8499334

some more cleanup refactor using learnings builder and tests

refactor and redesign. add docs

c56d099

zeroshade force-pushed the variant-support branch from a4bb018 to c56d099 Compare May 23, 2025 19:39

zeroshade added 2 commits May 23, 2025 15:42

go mod tidy

616e71e

exclude generated files from RAT

71ea4d6

zeroshade requested review from kou, raulcd and assignUser as code owners May 23, 2025 19:44

zeroshade mentioned this pull request May 23, 2025

[C++][Parquet] Encoding tools for variant type apache/arrow#46555

Open

lidavidm approved these changes May 24, 2025

zeroshade and others added 2 commits May 24, 2025 13:36

Update parquet/variant/variant.go

ee7951a

Co-authored-by: David Li <li.davidm96@gmail.com>

updates and tests from feedback

18fe8d5

lidavidm approved these changes May 25, 2025

sfc-gh-mbojanczyk commented May 27, 2025

alamb mentioned this pull request May 27, 2025

Variant: Rust API to Create Variant Values apache/arrow-rs#7424

Open

zeroshade added 2 commits May 27, 2025 16:02

rename generated string files

f976e80

updates from feedback

70b7f90

zeroshade approved these changes May 28, 2025

zeroshade merged commit e8c83b4 into apache:main May 28, 2025
23 checks passed
