-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(manifests): Adding implementation of manifest files #3
Conversation
@zeroshade, sorry for the delay! I'll find some time for reviews over here. |
Co-authored-by: Eduard Tudenhoefner <etudenhoefner@gmail.com>
e99537d
to
5504e28
Compare
Rebased with the merge for the partitioning stuff, so should be all good here |
io/s3.go
Outdated
} | ||
|
||
if defaultRegion, ok := props[S3Region]; ok { | ||
opts = append(opts, config.WithDefaultRegion(defaultRegion)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure, but I think we might rather want to configure the region here instead of the default region using config.WithRegion()
. Also we might want to rename defaultRegion
to region
to make a clear distinction that S3Region
stands for the region (not the default region)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so in this situation since we're not really exposing the ability for consumers to set their own options explicitly, I think it ends up being basically identical if we use WithDefaultRegion
vs WithRegion
as the only real difference between them is that any WithDefaultRegion
will always be superceded by any WithRegion
options in the config.
By the same token since it wouldn't have any real difference, I agree that it makes sense from a code perspective for me to switch these to use WithRegion
rather than WithDefaultRegion
.
@@ -0,0 +1,108 @@ | |||
// Licensed to the Apache Software Foundation (ASF) under one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall this file LGTM. what would be great is to have some tests that use S3 against a minio container, which can be done in a follow-up PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I was planning on doing that in a follow-up PR explicitly. I have the test stuff worked out already and, in conjunction with the iceberg REST API catalog impl I was able to replicate a similar integration test to what pyiceberg uses 😄
manifest.go
Outdated
Path string `avro:"manifest_path"` | ||
Len int64 `avro:"manifest_length"` | ||
PartitionSpecID int32 `avro:"partition_spec_id"` | ||
Content ManifestContent `avro:"content"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
content
isn't part of a v1 manifest file. See also https://github.com/apache/iceberg/blob/9907a97351138e676d08e02ea53b746f4e331ec6/core/src/main/java/org/apache/iceberg/V1Metadata.java#L33
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So i had a question here: according to https://iceberg.apache.org/spec/#manifests the added_snapshot_id
is required in both V1 and V2, but pyiceberg's definition of the manifest file schema (here: https://github.com/apache/iceberg/blob/master/python/pyiceberg/manifest.py#L275) marks it as optional. Which should I follow here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @Fokko
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems to be optional in the Java code as well: https://github.com/apache/iceberg/blob/5e5c6d1849dece373b0d425ebd3ba5c7e98ad0ef/api/src/main/java/org/apache/iceberg/ManifestFile.java#L52C1-L53
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so it seems that the Spec itself for v1 should have added_snapshot_id
as optional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opened apache/iceberg#8600 to fix the Spec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zeroshade just FYI that added_snapshot_id
is mandatory after all for V1 and V2, so you might want to follow-up and make it required
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure thing! i'll make a PR for that, thanks for the heads up
PartitionData map[string]any `avro:"partition"` | ||
RecordCount int64 `avro:"record_count"` | ||
FileSize int64 `avro:"file_size_in_bytes"` | ||
BlockSizeInBytes int64 `avro:"block_size_in_bytes"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this shouldn't be written for V2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the fields don't exist in the defined schema for V2 they shouldn't get written when writing the file since we write according to the schema
} | ||
|
||
type dataFile struct { | ||
Content ManifestEntryContent `avro:"content"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this only exists for V2 but not for V1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a bad thing for it to get written for a v1 file? (readers would just ignore the field, right?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
readers would/should ignore it. Would it cause too much trouble to not write this field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory, it potentially would avoid writing this field for V1 since it isn't in the schema, but I'll double check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed that if i start with a datafile containing some non-default value for Content
the round trip test comes back with the default being what the struct has. so it doesn't get written and won't be read back out since the V1 schemas don't have the field. updated the unit test accordingly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great, thanks for confirming
// SortOrderID returns the id representing the sort order for this | ||
// file, or nil if there is no sort order. | ||
SortOrderID() *int | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need DataSequenceNumber()
and FileSequenceNumber()
? See also https://github.com/apache/iceberg/blob/ebce8538db20fd13859b6af841cf433d9423b53c/api/src/main/java/org/apache/iceberg/ContentFile.java#L130-L147
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, i'm reading through the Java code and the best I can see is that the ContentFile
interface wraps around the entire ManifestEntry
objects and essentially surfaces the individual fields of the data file to the same level as the manifest entry information (which includes the DataSequenceNumber
and FileSequenceNumber
. But if you look at the spec (https://iceberg.apache.org/spec/#manifests) those values exist in the manifest_entry
struct, not the data_file
struct, which is why they aren't exposed here at this level.
Looking at the Builder for data files in the Java code (https://github.com/apache/iceberg/blob/d6bc248adb67de74e31dcb9c0af43fef68853d59/core/src/main/java/org/apache/iceberg/DataFiles.java) it also doesn't allow setting the sequence number values down there. You can see both the Data Sequence number and File Sequence Number in the ManifestEntry
interface in the Go code.
@zeroshade I did a first pass around the manifest schema and code and left a few comments. I still need to review the APIs for IO and the tests |
@nastra I've implemented the suggested fixes but I had a few questions that I posed when you get a chance. Thanks again! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @zeroshade
Adds an implementation for Manifest Lists, Manifest Entries, and Data Files along with interfaces for wrapping file system IO handling.