-
Notifications
You must be signed in to change notification settings - Fork 34
feat(parquet): add variant encoder/decoder #344
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made an initial pass on this and left a bunch of comments. I'll try to take another look later on
parquet/variants/util.go
Outdated
buf := make([]byte, size) | ||
for i := range size { | ||
buf[i] = byte(val) | ||
val >>= 8 | ||
} | ||
w.Write(buf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use the encoding/binary
functions to do this instead? such as binary.LittleEndian.Put*
or binary.LittleEndian.Append*
etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the comment in readUint()
, the binary
package only provides Put/Append functionality for uint{16,32,64}
, but we need it for the whole range from 1-8 bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use binary.Encode(buf, binary.LittleEndian, val)
which will handle the whole range for you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar vein to Decode()
- this will work for the powers-of-two widths, but if we've got a width that's not (ie. 3) we'd encode with additional padding since we've got to encode to the next power of two width.
This isn't the end of the world (encoding over the minimal necessary width is within spec), so I have fewer objections here. I'd argue with Decode()
rolling our own is more of a necessity due to not being in control of what comes in (eg. the Java library can encode 3-byte wide numbers, so it's gotta be handled). This keeps the encode and decode logic fairly similar. Also, FWIW, I feel like the intention of the spec authors is to minimize the number of bytes used to encode things, see the existence of Short String
even though a primitive String
type exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with rolling our own here is that we're going to need to manage the big/little endian logic ourselves then so that this runs properly on big-endian systems.
parquet/variants/primitive.go
Outdated
func unmarshalUUID(raw []byte, offset int) ([]byte, error) { | ||
if err := checkBounds(raw, offset, offset+17); err != nil { | ||
return nil, err | ||
} | ||
return raw[offset+1 : offset+17], nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use [16]byte
or uuid.UUID
please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done (with uuid.UUID
)
Thanks so much! I'll get to addressing the comments here shortly (was out for a few days) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly addressed your comments here, added some context for the others. The biggest diff IMO is using uuid.UUID
now, which definitely simplifies some stuff around there.
I've also moved this review out of draft. Figure it's in pretty decent reviewable shape as-is.
Thanks for the first pass- looking forward to polishing this up!
parquet/variants/primitive.go
Outdated
if kind == reflect.Interface { | ||
dest.Set(reflect.ValueOf(iv)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be me holding reflect
incorrectly, but it is indeed to catch the case that it's unmarshaling into the empty interface. I need to do this to be able to support unmarshaling into a map (ie. map[string]any
)
However, I'm sort of flummoxed on the best/canonical way to do this. As far as I can tell, checking kind == reflect.Interface
is about as close as I'm going to get (though I guess I can also check that dest.NumMethod() == 0
, which the JSON library appears to do.
Let me add that as a check (adding isEmptyInterface
as a boolean up top here) to try to narrow that down a bit.
parquet/variants/primitive.go
Outdated
case primitiveNull: | ||
dest.Set(reflect.Zero(dest.Type())) | ||
case primitiveTrue, primitiveFalse: | ||
if kind != reflect.Bool && kind != reflect.Interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as discussion below- this is to handle unmarshaling into an empty interface so this package can unmarshal into map[string]any
parquet/variants/primitive.go
Outdated
case string: | ||
if allOpts&MarshalAsUUID != 0 { | ||
return marshalUUID([]byte(val), w), nil | ||
} | ||
return marshalString(val, w), nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We most definitely should- silly me, I didn't even think to see if there was a uuid library out there.
parquet/variants/primitive.go
Outdated
bytes, err := unmarshalUUID(raw, offset) | ||
if err != nil { | ||
return err | ||
} | ||
if kind == reflect.Slice && dest.Type().Elem().Kind() == reflect.Uint8 || kind == reflect.Interface { | ||
dest.Set(reflect.ValueOf(bytes)) | ||
} else if kind == reflect.String { | ||
dest.Set(reflect.ValueOf(string(bytes))) | ||
} else { | ||
return fmt.Errorf("cannot decode Variant UUID into dest %s", kind) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to handle both uuid.UUID
and arrays
parquet/variants/primitive.go
Outdated
func unmarshalUUID(raw []byte, offset int) ([]byte, error) { | ||
if err := checkBounds(raw, offset, offset+17); err != nil { | ||
return nil, err | ||
} | ||
return raw[offset+1 : offset+17], nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done (with uuid.UUID
)
some more cleanup refactor using learnings builder and tests
a4bb018
to
c56d099
Compare
@sfc-gh-mbojanczyk Hey, sorry for the long delay here. I've gone through multiple iterations and discussions with people, plus pulling inspiration from the C++, Spark, and parquet-java implementations to figure out a good design. Please let me know what you think of the updated version! Thanks! |
No worries- I got caught in a whirlwind over here myself and was about to dust this off too :) Lemme take a peek here after the long weekend. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this, this is very clean and straightforward
parquet/variant/variant_test.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More out of curiosity but do we have fuzzing set up for variants like we do for Parquet in general?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also is it worth testing examples of invalid variants too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently the Go implementation doesn't have any fuzzing set up. Go has a whole infrastructure for setting up fuzz testing (https://go.dev/doc/security/fuzz/) I just haven't gotten around to setting it up. It just hasn't been on a high priority given everything else unless people think I should prioritize it.
That said, I think it makes sense to test some examples of invalid variants. I'll add some tests for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I wonder if fuzzing the variants specifically would be a more manageable case to start with.
Co-authored-by: David Li <li.davidm96@gmail.com>
I'll give @sfc-gh-mbojanczyk a chance to comment and respond here before I merge this just to make sure I get all the feedback I can. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woah, definitely a big change from where we started here :) On the whole this looks pretty clean and more performance oriented than my first stab, and this looks good to me overall.
Left a few comments- most are minor, but my biggest concern is keeping the fields for an object outside of the builder. I recognize that it's for an optimized path, but I feel like having a user keep multiple things in flight leaks an implementation detail that could have been hidden away with a different abstraction.
type PrimitiveType int | ||
|
||
const ( | ||
PrimitiveInvalid PrimitiveType = iota - 1 // Unknown |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I generally make the zero value invalid, that way you don't accidentally allow for an uninitialized value to do something wonky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These constants have specific values defined by the spec, so the zero value needs to be Null as the constant for "Null" type is 0. I don't really have wiggle room to change that. Though, technically if all variant building is done through the builder, we could probably get away without exporting this enum and these constants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right! I think I ended up explicitly defining each primitive value (ie. primitiveNull = 0
, primitiveFalse = 1
, etc...) because I've been bitten by relying on iota
in the past (a well meaning coworker reordered my enum to be in alphabetical order and a bunch of tests blew up). IMO, if it's defined in spec, it's safer to be explicit instead of using an enum.
parquet/variant/variant.go
Outdated
offsetSizeBitShift uint8 = 6 | ||
supportedVersion = 1 | ||
maxShortStringSize = 0x3F | ||
maxSizeLimit = 128 * 1024 * 1024 // 128MB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Call this metadataMaxSize
or something similar (looks like it only applies to the metadata section)
Also, is there actually a max size for the metadata? You can have ~4.29B entries max per the spec- not saying that you should, but still, it feels conceivable that you can have a valid metadata that's beyond 128MB.
Can't actually toss in my approval since I'm the original author, but this PR looks good to me. |
Rationale for this change
This adds a basic Variant encoder/decoder to start the process of supporting the new Variant encoding spec in the Apache Go Parquet library. Variants are useful for efficiently storing and accessing data, especially in things like Iceberg tables.
What changes are included in this PR?
This adds logic to encode and decode Variants, but does not yet plumb that logic through to either Arrow or Parquet. The PR's getting beefy as is, and this seems to be a good standalone unit to get feedback on.
Still to implement are the handling of decimal primitives.
For ease of implementation, the Metadata keys are only stored in unsorted order. This makes the creation of an encoded Variant simpler as one can serialize data as its being added. For sorted Metadata keys to work, you'd need to buffer data and only create objects at the very end so that the appropriate width of indicies can be chosen.
Are these changes tested?
There are unit tests throughout to test that marshaling produces the expected binary output as per the spec, and to ensure that unmarshaling can spit out the expected values. There are many levels of unit tests, from testing individual marshaling bits to testing the marshaling and unmarshaling of entire Variants.
Are there any user-facing changes?
With this PR, no. This is simply a library to create Variants, but does not plumb the output into Parquet or Arrow.