
Core: add schema id to snapshot #2275

Merged: 1 commit into apache:master on Jun 29, 2021

Conversation

@yyanyy (Contributor) commented Feb 26, 2021

@wypoon (Contributor) commented Mar 8, 2021

I have some questions:

1. When we switch from the v1 format to v2, and a new metadata file is written for an existing table, what schemas are written to the schemas list? And in the snapshot-log, what schema-id is written for the previous snapshots? (Is it not written, i.e., is it null? Or is it 0?)
2. In general, if we see a schema id of 0, does that ever represent a specific schema, or does it always represent some undetermined schema? Let me elaborate: (1) Will we ever see a schema-id of 0 in a metadata file, and if so, does that refer to a unique schema? (2) In code, if we have an instance of a schema and its schemaId is 0, what are the semantics of that schemaId?

assertSameSchemaMap(ImmutableMap.of(0, oldSchema, 1, updatedSchema), table.schemas());
}

private void validateSnapshotsAndHistoryEntries(int numElement) {
Contributor:

You could have this helper method take a List<Integer> containing the schema ids instead of just an int (number of entries). Then testSchemaIdChangeInSchemaUpdate would be able to use this helper method for the cases after updating the schema. snapshots.size() should then match the List's size, and the snapshot.schemaId() in each snapshot in snapshots should match the corresponding Integer in the List.
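For illustration, a rough sketch of that shape (not the exact code from this PR; it assumes the usual `table` handle from the test base class, JUnit 4 assertions, and that `table.snapshots()` iterates in commit order):

    private void validateSnapshotsAndHistoryEntries(List<Integer> expectedSchemaIds) {
      List<Snapshot> snapshots = Lists.newArrayList(table.snapshots());
      Assert.assertEquals("Number of snapshots should match",
          expectedSchemaIds.size(), snapshots.size());
      Assert.assertEquals("Number of history entries should match",
          expectedSchemaIds.size(), table.history().size());
      for (int i = 0; i < snapshots.size(); i++) {
        Assert.assertEquals("Snapshot schema id should match",
            expectedSchemaIds.get(i), snapshots.get(i).schemaId());
      }
    }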

Contributor (Author):

Great suggestion, this makes the code in this class much easier to read; thank you!

import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
import org.apache.iceberg.relocated.com.google.common.collect.Lists;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SnapshotUtil {
Contributor:

I don't see the static methods added here, snapshotIdFromTime and schemaOfSnapshot, used anywhere in this change. If that is the case, this file can be excluded from this particular PR.

Contributor (Author):

Sounds good, I will remove them and create a separate PR for that.

{
"snapshot-id": 3051729675574597004,
"timestamp-ms": 1515100955770
},
Contributor:

This suggests to me that not all snapshots in the snapshot-log need to have a schema-id. Is this what happens if the snapshot was written when the format was v1?

Contributor (Author):

Yes, we can't get the past schema/id for old snapshots, so I think this is when we need to go back to your PR's change to figure out the correct id.

@yyanyy (Contributor, Author) commented Mar 13, 2021

> I have some questions:
> 1. When we switch from the v1 format to v2, and a new metadata file is written for an existing table, what schemas are written to the schemas list? And in the snapshot-log, what schema-id is written for the previous snapshots? (Is it not written, i.e., is it null? Or is it 0?)
> 2. In general, if we see a schema id of 0, does that ever represent a specific schema, or does it always represent some undetermined schema? Let me elaborate: (1) Will we ever see a schema-id of 0 in a metadata file, and if so, does that refer to a unique schema? (2) In code, if we have an instance of a schema and its schemaId is 0, what are the semantics of that schemaId?

Thank you for the review, and sorry for the delay in responding!

I think this change applies to v1 tables as well. When the engine starts to use a release with this change, the new schemas list will be written with the current schema, with 0 as its default schema id. And in the snapshot-log, previous snapshots will have a null schema-id, since schema ids were not available when those snapshots were written.

0 is a valid schema-id and refers to a unique schema in the metadata file; if there is no schema evolution after the table starts writing schemas, 0 will be assigned to the current schema. In code, we only care about the id when interacting with table metadata; when a Schema instance is passed through the various classes that do projection and so on, its schemaId will always be 0, which is just a default value and shouldn't be relied on. #2096 has some conversation around this, and this behavior is mentioned in the Schema class.
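To make that distinction concrete, a small hedged sketch (the field definition is made up; `table` stands for any loaded Iceberg table):

    // A Schema built directly in code carries the default id 0; that value has no
    // meaning outside of table metadata and shouldn't be relied on.
    Schema adHoc = new Schema(
        Types.NestedField.required(1, "id", Types.LongType.get()));
    System.out.println(adHoc.schemaId());          // prints 0 (the default)

    // Only the ids persisted in table metadata identify schemas.
    System.out.println(table.schemas().keySet());  // e.g. [0, 1] after one evolution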


SnapshotLogEntry(long timestampMillis, long snapshotId) {
SnapshotLogEntry(long timestampMillis, long snapshotId, Integer schemaId) {
Contributor:

Why add the schema ID to the snapshot log as well as the snapshot? The snapshots in the log are all available in table metadata, so it doesn't seem like there is a benefit to adding it.

Contributor (Author):

Thank you for the review! I originally added it to the history so that we could query history entries directly to get both the snapshot ID and the schema ID for time-based time travel queries, but I didn't end up using it in the utility methods I used to have, so I think we can drop it. I'll update the PR to reflect this.
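For context, a hedged sketch of what such a history-based lookup could look like (not the actual SnapshotUtil helpers from this PR, which were dropped; the method name is illustrative):

    // Sketch: id of the latest snapshot committed at or before the given time,
    // assuming history entries are in chronological order; null if none exists.
    static Long snapshotIdAsOfTime(Table table, long timestampMillis) {
      Long snapshotId = null;
      for (HistoryEntry entry : table.history()) {
        if (entry.timestampMillis() <= timestampMillis) {
          snapshotId = entry.snapshotId();
        }
      }
      return snapshotId;
    }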

Contributor:

The title and description of the PR can be amended to reflect this then.

@wypoon (Contributor) commented Apr 3, 2021

@yyanyy thank you for responding to my question. But because I am slow, I needed to do some testing to understand for myself the effects of #2096 and this PR.
I started with Iceberg 0.11.0 and created some tables, inserting data into them, and altering the schema. Then I added the commit for #2096 and the changes here on top (actually I added the commit for #2089 first, so that the backports are clean). I made further updates to my Iceberg tables, adding data and altering the schema.
From what I see, when the table is updated and a new metadata.json file is written, the metadata.json file still has a schema field, but the schema now has a schema-id field; it also has a schemas field containing an array of schemas, and a current-schema-id field. The only schema known at the time of the switchover is the current schema of the table, so if the table change is just a change to data, the existing schema is given a schema-id of 0 and the schemas array contains a single schema; if the table change is a schema change, then the old schema is given a schema-id of 0, the new schema is given a schema-id of 1, and the schemas array contains two schemas (and current-schema-id is 1). In the snapshots field, any snapshot created before the switchover does not have a schema-id; snapshots created after the switchover are given a schema-id corresponding to the schema current at the time the snapshot is written.
When Snapshot#schemaId() is called, if the snapshot is written before the switchover, null is returned, and if written after the switchover, a non-null Integer is returned.
It is as you wrote.
With this change, I think it is straightforward to update my PR, #1508. I just need to update the following method I'm adding to BaseTable

  public Schema schemaForSnapshot(long snapshotId) {
    TableMetadata current = ops.current();
    // First, check if the snapshot has a schema id associated with it
    Snapshot snapshot = current.snapshot(snapshotId);
    Integer schemaId = snapshot.schemaId();
    if (schemaId != null) {
      return current.schemasById().get(schemaId);
    }
    // Otherwise, read each of the previous metadata files until we find one whose current
    // snapshot id is the snapshot id
    ...
  }

I rebuilt Iceberg with my change on top of the previous changes, and I was able to see that the correct schema is used when viewing any snapshot (time travel).
When this change is merged, I will update my PR. I hope that it can be considered then.

@yyanyy (Contributor, Author) commented Apr 6, 2021


Thank you for spending time verifying the changes and describing the steps here!

yyanyy changed the title from "Core: add schema id to snapshot and history entry" to "Core: add schema id to snapshot" on Apr 6, 2021
yyanyy force-pushed the schema-id-logs branch 2 times, most recently from bf53dc8 to 6be4cf9, on April 6, 2021 06:36
* @return schema id associated with this snapshot
*/
default Integer schemaId() {
return null;
Member:

In what case will this information be null? And if it's null, how can people read the correct schema for the snapshot?

Member:

Okay, I think you mean that if people read old metadata, the schema id from its snapshots will be null.

Contributor:

Yes, schemaId() returns null in the case where the snapshot was written before this change. Note, though, that even after this change, new metadata can have snapshots without a schema id (so schemaId() for those snapshots will return null) if it is metadata for a table that existed before this change.

Contributor:

I have a PR (#1508) that reads previous metadata to get the schema for the snapshot in case Snapshot#schemaId() returns null.
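For illustration, a hedged sketch of the null handling under discussion (the fall-back to the current schema here is only a placeholder; #1508 reads previous metadata files instead):

    Snapshot snapshot = table.snapshot(snapshotId);
    Integer schemaId = snapshot.schemaId();
    Schema snapshotSchema = (schemaId != null)
        ? table.schemas().get(schemaId)  // id recorded with the snapshot after this change
        : table.schema();                // placeholder fallback for pre-change snapshots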

Contributor (Author):

Sorry for the delay in response, and thank you @openinx for the review! And thank you @wypoon for responding!


map1.forEach((schemaId, schema1) -> {
Schema schema2 = map2.get(schemaId);
Assert.assertNotNull(String.format("Schema ID %s should exist in both map", schemaId), schema2);
Member:

Nit: I think we could make this error message clearer, because if the assertion fails, the given schemaId was definitely not found in map2.
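For example, one possible phrasing (purely illustrative):

    Assert.assertNotNull(
        String.format("Schema with id %s is present in the first map but missing from the second", schemaId),
        schema2);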

@@ -147,6 +147,7 @@ public String location() {
return properties;
}

// Note that schema parsed from string does not contain the correct schema ID.
Member:

What does this mean?

Contributor (Author):

This is related to the comment I had in #2465 (comment): we didn't persist the schema id within toJson(), which is called when we serialize the table.

Honestly, for this case I think we can go either way. We could use a different toJson implementation when serializing the table, since the schema at that point is almost guaranteed to be the original schema from table metadata. However, since the only usage of schemaId is for time travel queries, and that use case doesn't need the id of the current schema itself, adding it isn't necessary, and a missing schema id is expected, as this note points out.

Contributor:

I think we should use the correct toJson implementation so that we don't need to note that the ID doesn't match.

Contributor (Author):

This was actually written before we decided to persist the schema id as part of toJson; now that toJson always writes the schema id, this note is no longer relevant and I'll remove it.
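For reference, a hedged sketch of the round trip in question, assuming the parser now persists and reads back the id as described above:

    Schema schema = table.schema();             // carries a real schema id from metadata
    String json = SchemaParser.toJson(schema);  // now includes the schema-id field
    Schema parsed = SchemaParser.fromJson(json);
    Assert.assertEquals(schema.schemaId(), parsed.schemaId());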

long timestampMillis,
String operation,
Map<String, String> summary,
Integer schemaId, List<ManifestFile> dataManifests) {
Member:

Nit: Let's put those two constructor parameters on separate lines; that's clearer.

generator.writeEndObject();
}

private static void writeSnapshotRelated(TableMetadata metadata, JsonGenerator generator) throws IOException {
Contributor:

It doesn't look like this refactor is necessary any more. It just creates a method that is only used once. I don't think that this file needs to change at all.

// update schema
table.updateSchema().addColumn("data2", Types.StringType.get()).commit();

Schema updatedSchema = new Schema(1,
Contributor:

Can't we just get the table's current schema? Or is this intended to check that schema evolution produces the right ID in addition to the snapshot tracking?

Contributor (Author):

Yeah, I was hoping to check explicitly that the schema id is now different.
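i.e., roughly (a sketch only, assuming the update above bumps the id to 1):

    // Explicitly pin the expected id so the test fails if evolution stops bumping it
    Assert.assertEquals("Schema id should advance after updateSchema",
        1, table.schema().schemaId());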

@@ -308,6 +308,8 @@ void validateSnapshot(Snapshot old, Snapshot snap, Long sequenceNumber, DataFile
}

Assert.assertFalse("Should find all files in the manifest", newPaths.hasNext());

Assert.assertEquals("Schema ID should match", Integer.valueOf(table.schema().schemaId()), snap.schemaId());
Contributor:

Nit: should this just cast to Integer like the test case below?
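i.e., something along these lines:

    Assert.assertEquals("Schema ID should match",
        (Integer) table.schema().schemaId(), snap.schemaId());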


TestHelpers.assertSameSchemaMap(onlySchemaMap, table.schemas());
Assert.assertEquals("Current snapshot's schemaId should be the current",
table.schema().schemaId(), (int) table.currentSnapshot().schemaId());
Contributor:

This should be okay, but in the future you may want to instead use the Assert.assertEquals(String, Object, Object) method. That way, you get the output of the assertion if the schema ID is null, rather than a NullPointerException. I think that would be easier to debug.
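For instance, a version of the assertion that reports a null id instead of throwing (illustrative only):

    // Comparing boxed values lets a failure report a null schemaId instead of
    // throwing a NullPointerException from the (int) unboxing cast
    Assert.assertEquals("Current snapshot's schemaId should be the current",
        (Integer) table.schema().schemaId(), table.currentSnapshot().schemaId());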

rdblue merged commit d4d376b into apache:master on Jun 29, 2021
@rdblue (Contributor) commented Jun 29, 2021

Thanks for fixing this, @yyanyy! I merged it.
