
Core: add schema id to snapshot #2275

Merged: 1 commit into apache:master on Jun 29, 2021

Conversation

@yyanyy (Contributor) commented Feb 26, 2021

@wypoon (Contributor) commented Mar 8, 2021

I have some questions:

1. When we switch from the v1 format to v2, and a new metadata file is written for an existing table, what schemas are written to the schemas list? And in the snapshot-log, what schema-id is written for the previous snapshots? (Is it not written, i.e., is it null? Or is it 0?)
2. In general, if we see a schema id of 0, does that ever represent a specific schema, or does it always represent some undetermined schema? Let me elaborate: (1) Will we ever see a schema-id of 0 in a metadata file, and if so, does that refer to a unique schema? (2) In code, if we have an instance of a schema and its schemaId is 0, what are the semantics of that schemaId?

assertSameSchemaMap(ImmutableMap.of(0, oldSchema, 1, updatedSchema), table.schemas());
}

private void validateSnapshotsAndHistoryEntries(int numElement) {
Contributor:

You could have this helper method take a List<Integer> containing the schema ids instead of just an int (number of entries). Then testSchemaIdChangeInSchemaUpdate would be able to use this helper method for the cases after updating the schema. snapshots.size() should then match the List's size, and the snapshot.schemaId() in each snapshot in snapshots should match the corresponding Integer in the List.
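For illustration, a rough sketch of that shape (not the exact code from this PR; it assumes the usual `table` handle from the test base class, JUnit 4 assertions, and that `table.snapshots()` iterates in commit order):

    private void validateSnapshotsAndHistoryEntries(List<Integer> expectedSchemaIds) {
      List<Snapshot> snapshots = Lists.newArrayList(table.snapshots());
      Assert.assertEquals("Number of snapshots should match",
          expectedSchemaIds.size(), snapshots.size());
      Assert.assertEquals("Number of history entries should match",
          expectedSchemaIds.size(), table.history().size());
      for (int i = 0; i < snapshots.size(); i++) {
        Assert.assertEquals("Snapshot schema id should match",
            expectedSchemaIds.get(i), snapshots.get(i).schemaId());
      }
    }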

Contributor (Author):

Great suggestion, this makes the code in this class much easier to read; thank you!

import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
import org.apache.iceberg.relocated.com.google.common.collect.Lists;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SnapshotUtil {
Contributor:

I don't see the static methods added here, snapshotIdFromTime and schemaOfSnapshot, used anywhere in this change. If that is the case, this file can be excluded from this particular PR.

Contributor (Author):

Sounds good, I will remove them and create a separate PR for that.

{
"snapshot-id": 3051729675574597004,
"timestamp-ms": 1515100955770
},
Contributor:

This suggests to me that not all snapshots in the snapshot-log need to have a schema-id. Is this what happens if the snapshot was written when the format was v1?

Contributor (Author):

Yes, we can't get the past schema/id for old snapshots, so I think this is when we need to go back to your PR's change to figure out the correct id.

@yyanyy (Contributor, Author) commented Mar 13, 2021

> I have some questions:
> 1. When we switch from the v1 format to v2, and a new metadata file is written for an existing table, what schemas are written to the schemas list? And in the snapshot-log, what schema-id is written for the previous snapshots? (Is it not written, i.e., is it null? Or is it 0?)
> 2. In general, if we see a schema id of 0, does that ever represent a specific schema, or does it always represent some undetermined schema? Let me elaborate: (1) Will we ever see a schema-id of 0 in a metadata file, and if so, does that refer to a unique schema? (2) In code, if we have an instance of a schema and its schemaId is 0, what are the semantics of that schemaId?

Thank you for the review, and sorry for the delay in responding!

I think this change applies to v1 tables as well. When the engine starts to use a release with this change, the new schemas list will be written with the current schema, with 0 as its default schema id. And in the snapshot-log, previous snapshots will have a null schema-id, since schema ids were not available when those snapshots were written.

0 is a valid schema-id and refers to a unique schema in the metadata file; if there is no schema evolution after the table starts writing schemas, 0 will be assigned to the current schema. In code, we only care about the id when interacting with table metadata; when a Schema instance is passed through the various classes that do projection and so on, its schemaId will always be 0, which is just a default value and shouldn't be relied on. #2096 has some conversation around this, and this behavior is mentioned in the Schema class.
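To make that distinction concrete, a small hedged sketch (the field definition is made up; `table` stands for any loaded Iceberg table):

    // A Schema built directly in code carries the default id 0; that value has no
    // meaning outside of table metadata and shouldn't be relied on.
    Schema adHoc = new Schema(
        Types.NestedField.required(1, "id", Types.LongType.get()));
    System.out.println(adHoc.schemaId());          // prints 0 (the default)

    // Only the ids persisted in table metadata identify schemas.
    System.out.println(table.schemas().keySet());  // e.g. [0, 1] after one evolution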


SnapshotLogEntry(long timestampMillis, long snapshotId) {
SnapshotLogEntry(long timestampMillis, long snapshotId, Integer schemaId) {
Contributor:

Why add the schema ID to the snapshot log as well as the snapshot? The snapshots in the log are all available in table metadata, so it doesn't seem like there is a benefit to adding it.

Contributor (Author):

Thank you for the review! I originally added it to the history so that we could query history entries directly to get both the snapshot ID and the schema ID for time-based time travel queries, but I didn't end up using it in the utility methods I used to have, so I think we can drop it. I'll update the PR to reflect this.
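For context, a hedged sketch of what such a history-based lookup could look like (not the actual SnapshotUtil helpers from this PR, which were dropped; the method name is illustrative):

    // Sketch: id of the latest snapshot committed at or before the given time,
    // assuming history entries are in chronological order; null if none exists.
    static Long snapshotIdAsOfTime(Table table, long timestampMillis) {
      Long snapshotId = null;
      for (HistoryEntry entry : table.history()) {
        if (entry.timestampMillis() <= timestampMillis) {
          snapshotId = entry.snapshotId();
        }
      }
      return snapshotId;
    }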

Contributor:

The title and description of the PR can be amended to reflect this then.

@wypoon (Contributor) commented Apr 3, 2021

@yyanyy thank you for responding to my question. But because I am slow, I needed to do some testing to understand for myself the effects of #2096 and this PR.
I started with Iceberg 0.11.0 and created some tables, inserting data into them, and altering the schema. Then I added the commit for #2096 and the changes here on top (actually I added the commit for #2089 first, so that the backports are clean). I made further updates to my Iceberg tables, adding data and altering the schema.
From what I see, when the table is updated and a new metadata.json file is written, the metadata.json file still has a schema field, but the schema now has a schema-id field; it also has a schemas field containing an array of schemas, and a current-schema-id field. The only schema known at the time of the switchover is the current schema of the table, so if the table change is just a change to data, the existing schema is given a schema-id of 0 and the schemas array contains a single schema; if the table change is a schema change, then the old schema is given a schema-id of 0, the new schema is given a schema-id of 1, and the schemas array contains two schemas (and current-schema-id is 1). In the snapshots field, any snapshot created before the switchover does not have a schema-id; snapshots created after the switchover are given a schema-id corresponding to the schema current at the time the snapshot is written.
When Snapshot#schemaId() is called, if the snapshot is written before the switchover, null is returned, and if written after the switchover, a non-null Integer is returned.
It is as you wrote.
With this change, I think it is straightforward to update my PR, #1508. I just need to update the following method I'm adding to BaseTable

  public Schema schemaForSnapshot(long snapshotId) {
    TableMetadata current = ops.current();
    // First, check if the snapshot has a schema id associated with it
    Snapshot snapshot = current.snapshot(snapshotId);
    Integer schemaId = snapshot.schemaId();
    if (schemaId != null) {
      return current.schemasById().get(schemaId);
    }
    // Otherwise, read each of the previous metadata files until we find one whose current
    // snapshot id is the snapshot id
    ...
  }

I rebuilt Iceberg with my change on top of the previous changes, and I was able to see that the correct schema is used when viewing any snapshot (time travel).
When this change is merged, I will update my PR. I hope that it can be considered then.

@yyanyy (Contributor, Author) commented Apr 6, 2021


Thank you for spending time verifying the changes and describing the steps here!

yyanyy changed the title from "Core: add schema id to snapshot and history entry" to "Core: add schema id to snapshot" on Apr 6, 2021
yyanyy force-pushed the schema-id-logs branch 2 times, most recently from bf53dc8 to 6be4cf9, on April 6, 2021 06:36
* @return schema id associated with this snapshot
*/
default Integer schemaId() {
return null;
Member:

In what case will this information be null? And if it's null, how can people read the correct schema for the snapshot?

Member:

Okay, I think you mean that if people read old metadata, the schema id from its snapshots will be null.

Contributor:

Yes, schemaId() returns null in the case where the snapshot was written before this change. Note, though, that even after this change, new metadata can have snapshots without a schema id (so schemaId() for those snapshots will return null) if it is metadata for a table that existed before this change.

Contributor:

I have a PR (#1508) that reads previous metadata to get the schema for the snapshot in case Snapshot#schemaId() returns null.
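For illustration, a hedged sketch of the null handling under discussion (the fall-back to the current schema here is only a placeholder; #1508 reads previous metadata files instead):

    Snapshot snapshot = table.snapshot(snapshotId);
    Integer schemaId = snapshot.schemaId();
    Schema snapshotSchema = (schemaId != null)
        ? table.schemas().get(schemaId)  // id recorded with the snapshot after this change
        : table.schema();                // placeholder fallback for pre-change snapshots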

Contributor (Author):

Sorry for the delay in response, and thank you @openinx for the review! And thank you @wypoon for responding!


map1.forEach((schemaId, schema1) -> {
Schema schema2 = map2.get(schemaId);
Assert.assertNotNull(String.format("Schema ID %s should exist in both map", schemaId), schema2);
Member:

Nit: I think we could make this error message clearer, because if the assertion fails, the given schemaId was definitely not found in map2.
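For example, one possible phrasing (purely illustrative):

    Assert.assertNotNull(
        String.format("Schema with id %s is present in the first map but missing from the second", schemaId),
        schema2);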

@@ -147,6 +147,7 @@ public String location() {
return properties;
}

// Note that schema parsed from string does not contain the correct schema ID.
Member:

What does this mean?

Contributor (Author):

This is related to the comment I had in #2465 (comment): we didn't persist the schema id within toJson(), which is called when we serialize the table.

Honestly, for this case I think we can go either way. We could use a different toJson implementation when serializing the table, since the schema at that point is almost guaranteed to be the original schema from table metadata. However, since the only usage of schemaId is for time travel queries, and that use case doesn't need the id of the current schema itself, adding it isn't necessary, and a missing schema id is expected, as this note points out.

Contributor:

I think we should use the correct toJson implementation so that we don't need to note that the ID doesn't match.

Contributor (Author):

This was actually written before we decided to persist the schema id as part of toJson; now that toJson always writes the schema id, this note is no longer relevant and I'll remove it.
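For reference, a hedged sketch of the round trip in question, assuming the parser now persists and reads back the id as described above:

    Schema schema = table.schema();             // carries a real schema id from metadata
    String json = SchemaParser.toJson(schema);  // now includes the schema-id field
    Schema parsed = SchemaParser.fromJson(json);
    Assert.assertEquals(schema.schemaId(), parsed.schemaId());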

long timestampMillis,
String operation,
Map<String, String> summary,
Integer schemaId, List<ManifestFile> dataManifests) {
Member:

Nit: Let's put those two constructor parameters on separate lines; that's clearer.

generator.writeEndObject();
}

private static void writeSnapshotRelated(TableMetadata metadata, JsonGenerator generator) throws IOException {
Contributor:

It doesn't look like this refactor is necessary any more. It just creates a method that is only used once. I don't think that this file needs to change at all.

// update schema
table.updateSchema().addColumn("data2", Types.StringType.get()).commit();

Schema updatedSchema = new Schema(1,
Contributor:

Can't we just get the table's current schema? Or is this intended to check that schema evolution produces the right ID in addition to the snapshot tracking?

Contributor (Author):

Yeah, I was hoping to check explicitly that the schema id is now different.
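i.e., roughly (a sketch only, assuming the update above bumps the id to 1):

    // Explicitly pin the expected id so the test fails if evolution stops bumping it
    Assert.assertEquals("Schema id should advance after updateSchema",
        1, table.schema().schemaId());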

@@ -308,6 +308,8 @@ void validateSnapshot(Snapshot old, Snapshot snap, Long sequenceNumber, DataFile
}

Assert.assertFalse("Should find all files in the manifest", newPaths.hasNext());

Assert.assertEquals("Schema ID should match", Integer.valueOf(table.schema().schemaId()), snap.schemaId());
Contributor:

Nit: should this just cast to Integer like the test case below?
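i.e., something along these lines:

    Assert.assertEquals("Schema ID should match",
        (Integer) table.schema().schemaId(), snap.schemaId());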


TestHelpers.assertSameSchemaMap(onlySchemaMap, table.schemas());
Assert.assertEquals("Current snapshot's schemaId should be the current",
table.schema().schemaId(), (int) table.currentSnapshot().schemaId());
Contributor:

This should be okay, but in the future you may want to instead use the Assert.assertEquals(String, Object, Object) method. That way, you get the output of the assertion if the schema ID is null, rather than a NullPointerException. I think that would be easier to debug.
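For instance, a version of the assertion that reports a null id instead of throwing (illustrative only):

    // Comparing boxed values lets a failure report a null schemaId instead of
    // throwing a NullPointerException from the (int) unboxing cast
    Assert.assertEquals("Current snapshot's schemaId should be the current",
        (Integer) table.schema().schemaId(), table.currentSnapshot().schemaId());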

rdblue merged commit d4d376b into apache:master on Jun 29, 2021
@rdblue (Contributor) commented Jun 29, 2021

Thanks for fixing this, @yyanyy! I merged it.
