-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: add schema id to snapshot #2275
Conversation
yyanyy
commented
Feb 26, 2021
•
edited
Loading
edited
- Followup of Core: add schema id and schemas to table metadata #2096 to add schema id to snapshot.
- Also addressed the remaining comments in Core: add schema id and schemas to table metadata #2096 in this commit
- Will added some util methods for easier retrieving of snapshot/schema information in follow up PRs.
I have some questions: |
assertSameSchemaMap(ImmutableMap.of(0, oldSchema, 1, updatedSchema), table.schemas()); | ||
} | ||
|
||
private void validateSnapshotsAndHistoryEntries(int numElement) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could have this helper method take a List<Integer>
containing the schema ids instead of just an int (number of entries). Then testSchemaIdChangeInSchemaUpdate
would be able to use this helper method for the cases after updating the schema. snapshots.size()
should then match the List
's size, and the snapshot.schemaId()
in each snapshot in snapshots
should match the corresponding Integer
in the List
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion, this makes the code in this class much easier to read; thank you!
import org.apache.iceberg.relocated.com.google.common.collect.Iterables; | ||
import org.apache.iceberg.relocated.com.google.common.collect.Lists; | ||
import org.slf4j.Logger; | ||
import org.slf4j.LoggerFactory; | ||
|
||
public class SnapshotUtil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the static methods added here, snapshotIdFromTime
and schemaOfSnapshot
, used anywhere in this change. If that is the case, this file can be excluded from this particular PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good, will remove and create a separate PR for that.
{ | ||
"snapshot-id": 3051729675574597004, | ||
"timestamp-ms": 1515100955770 | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This suggests to me that not all snapshots in the snapshot-log
need to have a schema-id
. Is this what happens if the snapshot was written when the format was v1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we couldn't get the past schema/id for old snapshots, so I think this is the time when we need to go back to your PR's change to figure out the correct id.
Thank you for the review, and sorry for the delay responding! I think this change applies to v1 tables as well. When the engine starts to use a release with this change, the new 0 is a valid schema-id and it will refer to a unique schema in metadata file; if there's no schema evolution after the table starts to write |
|
||
SnapshotLogEntry(long timestampMillis, long snapshotId) { | ||
SnapshotLogEntry(long timestampMillis, long snapshotId, Integer schemaId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why add the schema ID to the snapshot log as well as the snapshot? The snapshots in the log are all available in table metadata, so it doesn't seem like there is a benefit to adding it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review! I think I originally added it to history so that we can directly query history entries to get both snapshot ID and schema ID when we do time based time travel queries, but later I didn't end up using it for the utility methods I used to have so I think we can drop it. I'll update the PR to reflect this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The title and description of the PR can be amended to reflect this then.
@yyanyy thank you for responding to my question. But because I am slow, I needed to do some testing to understand for myself the effects of #2096 and this PR.
I rebuilt Iceberg with my change on top of the previous changes, and I was able to see that the correct schema is used when viewing any snapshot (time travel). |
Thank you for spending time verifying the changes, and described the steps here! |
bf53dc8
to
6be4cf9
Compare
* @return schema id associated with this snapshot | ||
*/ | ||
default Integer schemaId() { | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the case that the information will be null
? And if it's null, then how could people read the correct schema for the snapshot ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I think you mean if people read the old metadata, its schema id from snapshots will be null
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, schemaId() returns null in the case where the snapshot was written before this change. Note though, that even after this change, new metadata can have snapshots without schema id (so schemaId() for those snapshots will return null), if it is metadata for a table existing before this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a PR (#1508) that reads previous metadata to get the schema for the snapshot in case Snapshot#schemaId()
returns null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
map1.forEach((schemaId, schema1) -> { | ||
Schema schema2 = map2.get(schemaId); | ||
Assert.assertNotNull(String.format("Schema ID %s should exist in both map", schemaId), schema2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think we could make this error message more clear here because the given schemaId is definitely not found in the map2 if the assert failure happens.
@@ -147,6 +147,7 @@ public String location() { | |||
return properties; | |||
} | |||
|
|||
// Note that schema parsed from string does not contain the correct schema ID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this mean ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is related to the comment I had in #2465 (comment) that we didn't persist schema id within toJson()
which is called when we serialize the table.
Honestly for this case I think we can go either way; we can use a different toJson
implementation when serialize the table since schema at that time almost guaranteed to be the original schema from table metadata; however since the only usage of schemaId
is for time travel queries, and this use case doesn't need id from the current schema
itself so adding it isn't necessary, and not having schema Id is not something we don't expect per this note.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use the correct toJson
implementation so that we don't need to note that the ID doesn't match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually before we decided to write down schema id as part of toJson
; now that toJson
always writes schema id, this note is no longer relevant and I'll remove it.
long timestampMillis, | ||
String operation, | ||
Map<String, String> summary, | ||
Integer schemaId, List<ManifestFile> dataManifests) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Let's use two separate lines for those two constructor variables, that's more clear.
generator.writeEndObject(); | ||
} | ||
|
||
private static void writeSnapshotRelated(TableMetadata metadata, JsonGenerator generator) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like this refactor is necessary any more. It just creates a method that is only used once. I don't think that this file needs to change at all.
// update schema | ||
table.updateSchema().addColumn("data2", Types.StringType.get()).commit(); | ||
|
||
Schema updatedSchema = new Schema(1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just get the table's current schema? Or is this intended to check that schema evolution produces the right ID in addition to the snapshot tracking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was hoping to make the check explicitly that schema id is now different
@@ -308,6 +308,8 @@ void validateSnapshot(Snapshot old, Snapshot snap, Long sequenceNumber, DataFile | |||
} | |||
|
|||
Assert.assertFalse("Should find all files in the manifest", newPaths.hasNext()); | |||
|
|||
Assert.assertEquals("Schema ID should match", Integer.valueOf(table.schema().schemaId()), snap.schemaId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: should this just cast to Integer
like the test case below?
|
||
TestHelpers.assertSameSchemaMap(onlySchemaMap, table.schemas()); | ||
Assert.assertEquals("Current snapshot's schemaId should be the current", | ||
table.schema().schemaId(), (int) table.currentSnapshot().schemaId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be okay, but in the future you may want to instead use the Assert.assertEquals(String, Object, Object)
method. That way, you get the output of the assertion if the schema ID is null, rather than a NullPointerException
. I think that would be easier to debug.
Thanks for fixing this, @yyanyy! I merged it. |