feat(manifests): Adding implementation of manifest files #3

zeroshade · 2023-08-21T20:46:23Z

Adds an implementation for Manifest Lists, Manifest Entries, and Data Files along with interfaces for wrapping file system IO handling.

zeroshade · 2023-08-21T20:47:39Z

CC @coded9 @Fokko @nastra @bitsondatadev @rdblue

zeroshade · 2023-08-28T23:24:09Z

@coded9 @Fokko @nastra @rdblue any chance for a review soon to get this in?

rdblue · 2023-09-07T15:59:07Z

@zeroshade, sorry for the delay! I'll find some time for reviews over here.

zeroshade · 2023-09-14T16:27:19Z

Rebased with the merge for the partitioning stuff, so should be all good here

nastra · 2023-09-15T07:54:58Z

io/s3.go

+	}
+
+	if defaultRegion, ok := props[S3Region]; ok {
+		opts = append(opts, config.WithDefaultRegion(defaultRegion))


I'm not entirely sure, but I think we'd rather configure the region here, using config.WithRegion(), instead of the default region. We might also want to rename defaultRegion to region, to make it clear that S3Region stands for the region (not the default region).

In this situation, since we're not really exposing the ability for consumers to set their own options explicitly, I think WithDefaultRegion and WithRegion end up being basically identical: the only real difference between them is that any WithDefaultRegion will always be superseded by any WithRegion option in the config.

By the same token, since it wouldn't make any real difference, I agree that it makes sense from a code perspective to switch these to use WithRegion rather than WithDefaultRegion.
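For illustration, the precedence described in this thread can be modelled roughly as below. This is a simplified, stdlib-only sketch: `loadOptions`, `withRegion`, `withDefaultRegion`, and `resolveRegion` are hypothetical stand-ins, not the aws-sdk-go-v2 types or functions, and the real SDK consults other region sources that this model leaves out.

```go
package main

import "fmt"

// loadOptions is a hypothetical stand-in for the SDK's load options.
// It only models the two settings discussed in the review thread.
type loadOptions struct {
	region        string // set by an explicit region option
	defaultRegion string // fallback, used only when region is empty
}

// option mutates loadOptions, mirroring the functional-option style.
type option func(*loadOptions)

func withRegion(r string) option {
	return func(o *loadOptions) { o.region = r }
}

func withDefaultRegion(r string) option {
	return func(o *loadOptions) { o.defaultRegion = r }
}

// resolveRegion applies every option and returns the effective region.
// An explicit region always wins over a default region, regardless of
// the order in which the options were supplied.
func resolveRegion(opts ...option) string {
	var lo loadOptions
	for _, apply := range opts {
		apply(&lo)
	}
	if lo.region != "" {
		return lo.region
	}
	return lo.defaultRegion
}

func main() {
	// Only a default region is configured, so it is used.
	fmt.Println(resolveRegion(withDefaultRegion("us-east-1")))

	// Both are configured; the explicit region supersedes the default.
	fmt.Println(resolveRegion(withRegion("eu-west-1"), withDefaultRegion("us-east-1")))
}
```

When only one of the two options is ever supplied, as in this PR, both paths resolve to the same value, which is why the switch is mostly about naming clarity.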

nastra · 2023-09-15T08:02:52Z

io/s3.go

@@ -0,0 +1,108 @@
+// Licensed to the Apache Software Foundation (ASF) under one


Overall this file LGTM. It would be great to have some tests that use S3 against a MinIO container, which can be done in a follow-up PR.

Yeah, I was planning on doing that in a follow-up PR explicitly. I have the test stuff worked out already and, in conjunction with the Iceberg REST API catalog impl, I was able to replicate an integration test similar to the one pyiceberg uses 😄

nastra · 2023-09-15T08:18:31Z

manifest.go

+	Path               string          `avro:"manifest_path"`
+	Len                int64           `avro:"manifest_length"`
+	PartitionSpecID    int32           `avro:"partition_spec_id"`
+	Content            ManifestContent `avro:"content"`


content isn't part of a v1 manifest file. See also https://github.com/apache/iceberg/blob/9907a97351138e676d08e02ea53b746f4e331ec6/core/src/main/java/org/apache/iceberg/V1Metadata.java#L33

So I had a question here: according to https://iceberg.apache.org/spec/#manifests, added_snapshot_id is required in both V1 and V2, but pyiceberg's definition of the manifest file schema (here: https://github.com/apache/iceberg/blob/master/python/pyiceberg/manifest.py#L275) marks it as optional. Which should I follow here?

it seems to be optional in the Java code as well: https://github.com/apache/iceberg/blob/5e5c6d1849dece373b0d425ebd3ba5c7e98ad0ef/api/src/main/java/org/apache/iceberg/ManifestFile.java#L52C1-L53

so it seems that the Spec itself for v1 should have added_snapshot_id as optional

I've opened apache/iceberg#8600 to fix the Spec

@zeroshade just FYI that added_snapshot_id is mandatory after all for V1 and V2, so you might want to follow-up and make it required

sure thing! i'll make a PR for that, thanks for the heads up

nastra · 2023-09-15T08:45:43Z

manifest.go

+	PartitionData    map[string]any         `avro:"partition"`
+	RecordCount      int64                  `avro:"record_count"`
+	FileSize         int64                  `avro:"file_size_in_bytes"`
+	BlockSizeInBytes int64                  `avro:"block_size_in_bytes"`


I think this shouldn't be written for V2

Since the fields don't exist in the defined schema for V2, they shouldn't get written when writing the file, because we write according to the schema.

nastra · 2023-09-15T08:46:20Z

manifest.go

+}
+
+type dataFile struct {
+	Content          ManifestEntryContent   `avro:"content"`


this only exists for V2 but not for V1

Is it a bad thing for it to get written for a v1 file? (readers would just ignore the field, right?)

readers would/should ignore it. Would it cause too much trouble to not write this field?

In theory, the writer should already skip this field for V1 since it isn't in the schema, but I'll double check.

Confirmed: if I start with a data file containing some non-default value for Content, the round-trip test comes back with the default value in the struct. So the field doesn't get written, and won't be read back out, since the V1 schemas don't have it. Updated the unit test accordingly.

great, thanks for confirming
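The round-trip behaviour confirmed above can be sketched as follows. This is a stdlib-only model, not the PR's code or any Avro library: the trimmed `dataFile` struct, the `v1Fields`/`v2Fields` lists, and the `encode`/`decode` helpers are all hypothetical, and a plain map stands in for the serialized record. It assumes the writer emits only fields named in the schema and the reader leaves missing fields at their zero value.

```go
package main

import (
	"fmt"
	"reflect"
)

// dataFile is a trimmed, hypothetical version of the struct under review.
type dataFile struct {
	Content     int32  `avro:"content"`
	Path        string `avro:"file_path"`
	RecordCount int64  `avro:"record_count"`
}

// Hypothetical field lists standing in for the real V1 and V2 schemas.
var (
	v1Fields = []string{"file_path", "record_count"}
	v2Fields = []string{"content", "file_path", "record_count"}
)

// encode copies only the struct fields whose avro tag appears in the
// schema, so a field absent from the schema is never written.
func encode(df dataFile, schema []string) map[string]any {
	inSchema := make(map[string]bool, len(schema))
	for _, name := range schema {
		inSchema[name] = true
	}
	rec := make(map[string]any)
	v := reflect.ValueOf(df)
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		if tag := t.Field(i).Tag.Get("avro"); inSchema[tag] {
			rec[tag] = v.Field(i).Interface()
		}
	}
	return rec
}

// decode fills a struct from a record; any field missing from the
// record keeps its Go zero value.
func decode(rec map[string]any) dataFile {
	var df dataFile
	v := reflect.ValueOf(&df).Elem()
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		if val, ok := rec[t.Field(i).Tag.Get("avro")]; ok {
			v.Field(i).Set(reflect.ValueOf(val))
		}
	}
	return df
}

func main() {
	in := dataFile{Content: 1, Path: "s3://bucket/a.parquet", RecordCount: 10}

	v1 := decode(encode(in, v1Fields))
	fmt.Println("v1 content after round trip:", v1.Content)

	v2 := decode(encode(in, v2Fields))
	fmt.Println("v2 content after round trip:", v2.Content)
}
```

In this model the V1 round trip loses the non-default `Content` value while the V2 round trip preserves it, which matches the outcome reported for the unit test.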

nastra · 2023-09-15T08:52:55Z

manifest.go

+	// SortOrderID returns the id representing the sort order for this
+	// file, or nil if there is no sort order.
+	SortOrderID() *int
+}


does this need DataSequenceNumber() and FileSequenceNumber()? See also https://github.com/apache/iceberg/blob/ebce8538db20fd13859b6af841cf433d9423b53c/api/src/main/java/org/apache/iceberg/ContentFile.java#L130-L147

Hmm, I'm reading through the Java code, and the best I can see is that the ContentFile interface wraps around the entire ManifestEntry object and essentially surfaces the individual fields of the data file at the same level as the manifest entry information (which includes the DataSequenceNumber and FileSequenceNumber). But if you look at the spec (https://iceberg.apache.org/spec/#manifests), those values exist in the manifest_entry struct, not the data_file struct, which is why they aren't exposed here at this level.

Looking at the Builder for data files in the Java code (https://github.com/apache/iceberg/blob/d6bc248adb67de74e31dcb9c0af43fef68853d59/core/src/main/java/org/apache/iceberg/DataFiles.java) it also doesn't allow setting the sequence number values down there. You can see both the Data Sequence number and File Sequence Number in the ManifestEntry interface in the Go code.
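A rough sketch of the layering described above, with the sequence numbers on the entry and only file-level fields on the data file. The interface and method names here are trimmed and hypothetical; they are not guaranteed to match the interfaces in this PR.

```go
package main

import "fmt"

// DataFile exposes only fields of the spec's data_file struct.
// (Hypothetical, trimmed interface for illustration.)
type DataFile interface {
	FilePath() string
	Count() int64
}

// ManifestEntry exposes the manifest_entry fields, including both
// sequence numbers, and hands out the wrapped data file.
// (Hypothetical, trimmed interface for illustration.)
type ManifestEntry interface {
	SequenceNum() int64
	FileSequenceNum() *int64 // nil when the value is not set
	DataFile() DataFile
}

type dataFile struct {
	path  string
	count int64
}

func (d *dataFile) FilePath() string { return d.path }
func (d *dataFile) Count() int64     { return d.count }

type manifestEntry struct {
	seqNum     int64
	fileSeqNum *int64
	file       dataFile
}

func (e *manifestEntry) SequenceNum() int64      { return e.seqNum }
func (e *manifestEntry) FileSequenceNum() *int64 { return e.fileSeqNum }
func (e *manifestEntry) DataFile() DataFile      { return &e.file }

func main() {
	fileSeq := int64(7)
	var entry ManifestEntry = &manifestEntry{
		seqNum:     5,
		fileSeqNum: &fileSeq,
		file:       dataFile{path: "s3://bucket/a.parquet", count: 10},
	}

	// Sequence numbers are read from the entry, not from the data file.
	fmt.Println(entry.SequenceNum(), *entry.FileSequenceNum())
	fmt.Println(entry.DataFile().FilePath(), entry.DataFile().Count())
}
```

Keeping the two levels separate mirrors the spec's struct layout, at the cost of one extra hop compared with the flattened view that Java's ContentFile provides.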

nastra · 2023-09-15T09:20:44Z

@zeroshade I did a first pass around the manifest schema and code and left a few comments. I still need to review the APIs for IO and the tests

zeroshade · 2023-09-18T16:45:58Z

@nastra I've implemented the suggested fixes, but I posed a few questions for when you get a chance. Thanks again!

internal/avro_schemas.go

nastra

LGTM, thanks @zeroshade

nastra reviewed Sep 14, 2023

io/io.go

zeroshade and others added 5 commits September 14, 2023 12:25

feat(manifest): Implement Manifest files and entries

d6935b3

adding comments

ee0a1ba

Apply suggestions from code review

639c86f

Co-authored-by: Eduard Tudenhoefner <etudenhoefner@gmail.com>

fix build

514c72e

shift S3 code to s3.go file

5504e28

zeroshade force-pushed the manifests branch from e99537d to 5504e28 Compare September 14, 2023 16:26