Skip to content

feat(parquet): add variant encoder/decoder #344

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
May 28, 2025

Conversation

sfc-gh-mbojanczyk
Copy link
Contributor

Rationale for this change

This adds a basic Variant encoder/decoder to start the process of supporting the new Variant encoding spec in the Apache Go Parquet library. Variants are useful for efficiently storing and accessing data, especially in things like Iceberg tables.

What changes are included in this PR?

This adds logic to encode and decode Variants, but does not yet plumb that logic through to either Arrow or Parquet. The PR's getting beefy as is, and this seems to be a good standalone unit to get feedback on.

Still to implement are the handling of decimal primitives.

For ease of implementation, the Metadata keys are only stored in unsorted order. This makes the creation of an encoded Variant simpler as one can serialize data as its being added. For sorted Metadata keys to work, you'd need to buffer data and only create objects at the very end so that the appropriate width of indicies can be chosen.

Are these changes tested?

There are unit tests throughout to test that marshaling produces the expected binary output as per the spec, and to ensure that unmarshaling can spit out the expected values. There are many levels of unit tests, from testing individual marshaling bits to testing the marshaling and unmarshaling of entire Variants.

Are there any user-facing changes?

With this PR, no. This is simply a library to create Variants, but does not plumb the output into Parquet or Arrow.

Copy link
Member

@zeroshade zeroshade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made an initial pass on this and left a bunch of comments. I'll try to take another look later on

Comment on lines 76 to 81
buf := make([]byte, size)
for i := range size {
buf[i] = byte(val)
val >>= 8
}
w.Write(buf)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use the encoding/binary functions to do this instead? such as binary.LittleEndian.Put* or binary.LittleEndian.Append* etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the comment in readUint(), the binary package only provides Put/Append functionality for uint{16,32,64}, but we need it for the whole range from 1-8 bytes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use binary.Encode(buf, binary.LittleEndian, val) which will handle the whole range for you

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar vein to Decode()- this will work for the powers-of-two widths, but if we've got a width that's not (ie. 3) we'd encode with additional padding since we've got to encode to the next power of two width.

This isn't the end of the world (encoding over the minimal necessary width is within spec), so I have fewer objections here. I'd argue with Decode() rolling our own is more of a necessity due to not being in control of what comes in (eg. the Java library can encode 3-byte wide numbers, so it's gotta be handled). This keeps the encode and decode logic fairly similar. Also, FWIW, I feel like the intention of the spec authors is to minimize the number of bytes used to encode things, see the existence of Short String even though a primitive String type exists.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with rolling our own here is that we're going to need to manage the big/little endian logic ourselves then so that this runs properly on big-endian systems.

Comment on lines 535 to 540
func unmarshalUUID(raw []byte, offset int) ([]byte, error) {
if err := checkBounds(raw, offset, offset+17); err != nil {
return nil, err
}
return raw[offset+1 : offset+17], nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use [16]byte or uuid.UUID please

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (with uuid.UUID)

@sfc-gh-mbojanczyk
Copy link
Contributor Author

Made an initial pass on this and left a bunch of comments. I'll try to take another look later on

Thanks so much! I'll get to addressing the comments here shortly (was out for a few days)

@sfc-gh-mbojanczyk sfc-gh-mbojanczyk marked this pull request as ready for review April 22, 2025 23:51
Copy link
Contributor Author

@sfc-gh-mbojanczyk sfc-gh-mbojanczyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly addressed your comments here, added some context for the others. The biggest diff IMO is using uuid.UUID now, which definitely simplifies some stuff around there.

I've also moved this review out of draft. Figure it's in pretty decent reviewable shape as-is.

Thanks for the first pass- looking forward to polishing this up!

Comment on lines 221 to 222
if kind == reflect.Interface {
dest.Set(reflect.ValueOf(iv))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be me holding reflect incorrectly, but it is indeed to catch the case that it's unmarshaling into the empty interface. I need to do this to be able to support unmarshaling into a map (ie. map[string]any)

However, I'm sort of flummoxed on the best/canonical way to do this. As far as I can tell, checking kind == reflect.Interface is about as close as I'm going to get (though I guess I can also check that dest.NumMethod() == 0, which the JSON library appears to do.

Let me add that as a check (adding isEmptyInterface as a boolean up top here) to try to narrow that down a bit.

case primitiveNull:
dest.Set(reflect.Zero(dest.Type()))
case primitiveTrue, primitiveFalse:
if kind != reflect.Bool && kind != reflect.Interface {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as discussion below- this is to handle unmarshaling into an empty interface so this package can unmarshal into map[string]any

Comment on lines 165 to 169
case string:
if allOpts&MarshalAsUUID != 0 {
return marshalUUID([]byte(val), w), nil
}
return marshalString(val, w), nil
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We most definitely should- silly me, I didn't even think to see if there was a uuid library out there.

Comment on lines 353 to 363
bytes, err := unmarshalUUID(raw, offset)
if err != nil {
return err
}
if kind == reflect.Slice && dest.Type().Elem().Kind() == reflect.Uint8 || kind == reflect.Interface {
dest.Set(reflect.ValueOf(bytes))
} else if kind == reflect.String {
dest.Set(reflect.ValueOf(string(bytes)))
} else {
return fmt.Errorf("cannot decode Variant UUID into dest %s", kind)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to handle both uuid.UUID and arrays

Comment on lines 535 to 540
func unmarshalUUID(raw []byte, offset int) ([]byte, error) {
if err := checkBounds(raw, offset, offset+17); err != nil {
return nil, err
}
return raw[offset+1 : offset+17], nil
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (with uuid.UUID)

@zeroshade
Copy link
Member

@sfc-gh-mbojanczyk Hey, sorry for the long delay here. I've gone through multiple iterations and discussions with people, plus pulling inspiration from the C++, Spark, and parquet-java implementations to figure out a good design. Please let me know what you think of the updated version! Thanks!

@sfc-gh-mbojanczyk
Copy link
Contributor Author

@sfc-gh-mbojanczyk Hey, sorry for the long delay here. I've gone through multiple iterations and discussions with people, plus pulling inspiration from the C++, Spark, and parquet-java implementations to figure out a good design. Please let me know what you think of the updated version! Thanks!

No worries- I got caught in a whirlwind over here myself and was about to dust this off too :) Lemme take a peek here after the long weekend.

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this, this is very clean and straightforward

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More out of curiosity but do we have fuzzing set up for variants like we do for Parquet in general?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also is it worth testing examples of invalid variants too?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the Go implementation doesn't have any fuzzing set up. Go has a whole infrastructure for setting up fuzz testing (https://go.dev/doc/security/fuzz/) I just haven't gotten around to setting it up. It just hasn't been on a high priority given everything else unless people think I should prioritize it.

That said, I think it makes sense to test some examples of invalid variants. I'll add some tests for that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I wonder if fuzzing the variants specifically would be a more manageable case to start with.

zeroshade and others added 2 commits May 24, 2025 13:36
Co-authored-by: David Li <li.davidm96@gmail.com>
@zeroshade
Copy link
Member

I'll give @sfc-gh-mbojanczyk a chance to comment and respond here before I merge this just to make sure I get all the feedback I can.

Copy link
Contributor Author

@sfc-gh-mbojanczyk sfc-gh-mbojanczyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woah, definitely a big change from where we started here :) On the whole this looks pretty clean and more performance oriented than my first stab, and this looks good to me overall.

Left a few comments- most are minor, but my biggest concern is keeping the fields for an object outside of the builder. I recognize that it's for an optimized path, but I feel like having a user keep multiple things in flight leaks an implementation detail that could have been hidden away with a different abstraction.

type PrimitiveType int

const (
PrimitiveInvalid PrimitiveType = iota - 1 // Unknown
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally make the zero value invalid, that way you don't accidentally allow for an uninitialized value to do something wonky.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These constants have specific values defined by the spec, so the zero value needs to be Null as the constant for "Null" type is 0. I don't really have wiggle room to change that. Though, technically if all variant building is done through the builder, we could probably get away without exporting this enum and these constants.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right! I think I ended up explicitly defining each primitive value (ie. primitiveNull = 0, primitiveFalse = 1, etc...) because I've been bitten by relying on iota in the past (a well meaning coworker reordered my enum to be in alphabetical order and a bunch of tests blew up). IMO, if it's defined in spec, it's safer to be explicit instead of using an enum.

offsetSizeBitShift uint8 = 6
supportedVersion = 1
maxShortStringSize = 0x3F
maxSizeLimit = 128 * 1024 * 1024 // 128MB
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Call this metadataMaxSize or something similar (looks like it only applies to the metadata section)

Also, is there actually a max size for the metadata? You can have ~4.29B entries max per the spec- not saying that you should, but still, it feels conceivable that you can have a valid metadata that's beyond 128MB.

@sfc-gh-mbojanczyk
Copy link
Contributor Author

Can't actually toss in my approval since I'm the original author, but this PR looks good to me.

@zeroshade zeroshade merged commit e8c83b4 into apache:main May 28, 2025
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants