
Core: add schema id and schemas to table metadata #2096

Merged
merged 6 commits into apache:master on Feb 23, 2021

Conversation

Contributor
@yyanyy commented Jan 16, 2021

This PR adds current-schema-id and schemas to table metadata. It also introduces a wrapper around schema to associate a table schema with its ID.
The reason for not adding the ID directly to Schema is that schema creation is currently widely used as a convenience for many operations that don't involve a "real" table schema.
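
For illustration, a minimal sketch of how the writer side might emit the new fields (the field names come from this PR; the surrounding JsonGenerator code mirrors the style of TableMetadataParser but is an assumption):

generator.writeNumberField("current-schema-id", metadata.currentSchemaId());
generator.writeArrayFieldStart("schemas");
for (Schema schema : metadata.schemas()) {
  // each entry is written with its own schema-id so old schemas stay addressable
  SchemaParser.toJson(schema, generator);
}
generator.writeEndArray();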

Next steps:

  • add schema-id to the Snapshot interface and populate it
  • use history entries and schemas to look up the right schema in time travel queries; this may mean adding schemas() to the Table API
  • update the spec
  • add schema ID to historyTable (discussed further below)

Open questions:

  1. The current approach writes the newly introduced fields to JSON by default, even in v1, and there is a forward/backward compatibility concern: if a new writer writes a schema (with ID 0) and then updates it (to ID 1), the metadata will store both schemas 0 and 1, with 1 as the current ID. If an old writer then reads and rewrites the metadata for any change, schema 0 is dropped from the metadata. When a new writer picks up the metadata again, the original schema 0 is gone and schema 1 has been renumbered to ID 0. This could cause schema ID consistency issues among different writers.
    • Since the ID is introduced in this PR, there is no metadata table yet that exposes these inconsistent schema IDs, so we may not have this problem for now. However, when we start to add schema ID to the historyTable metadata table, ID 0 could mean different things at different times in that table. We could work around this by exposing the schemaID field in historyTable only for v2 tables, or by mentioning this caveat in the spec.
    • Alternatively, we could expose these two fields only in v2 tables, and have time travel queries in v1 always rely on looking at old table metadata files, as implemented in "Use schema at the time of the snapshot when reading a snapshot" #1508. This would mean that any future change depending on schema ID could not be introduced in v1.
  2. Do we want to add a last-assigned-schema-id to table metadata? My answer would be yes, for a reason similar to the one mentioned in this comment.
  3. Do we want to assign special schema IDs to metadata tables, to avoid potential collision on schema IDs? I don't think this will be a problem, since different data tables could also collide; I plan to assign 0 to all metadata tables' schemas for now, but wanted to bring it up since this may not be easy to fix if we revisit the decision later.
  4. Do we want to add schema-id only to history entries, or to the Snapshot interface? I think the former is enough to fix the time travel query problem, but we might want the latter in case we need this information later.
  5. I think that currently, when replacing a table, earlier history entries/snapshotLog will be reset to empty (second-to-last argument here). Is this expected? Do we want to fix this as a separate issue?

@@ -242,6 +255,7 @@ public static TableMetadata read(FileIO io, InputFile file) {
}
}

@SuppressWarnings("checkstyle:CyclomaticComplexity")
Contributor

I remember this also showed up when adding primary key support in #2010; I think we should probably find some way to refactor this code a bit.

Contributor Author

Yeah, I imagine there will be some conflict when either of the PRs gets merged. I think the main problem with refactoring this is that a lot of the things it parses need to be parsed together and validate each other; e.g. parsing the schema array has to come together with the current schema ID, since both will either be present at the same time or be derived from a different field, and the same goes for sort order/partition spec. Since Java only allows returning a single object per method, we either have to refactor the code a bit and ensure the logic flow doesn't change and stays readable, or have the new methods return things like Tuple<Integer, List<Schema>>; I'm not sure which way is cleaner.

Contributor

I think we could have a method that parses schemas and a separate one that returns the current schema ID. That said, I'm fine with suppressing the complexity warning for now.
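
For example, the split could look roughly like this (a sketch with hypothetical helper names; JsonUtil.getInt and SchemaParser.fromJson are existing Iceberg utilities):

private static List<Schema> parseSchemas(JsonNode node) {
  ImmutableList.Builder<Schema> builder = ImmutableList.builder();
  for (JsonNode schemaNode : node.get("schemas")) {
    builder.add(SchemaParser.fromJson(schemaNode));
  }
  return builder.build();
}

private static int parseCurrentSchemaId(JsonNode node) {
  return JsonUtil.getInt("current-schema-id", node);
}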

/**
* A wrapper around Schema to include its schema ID
*/
public class VersionedSchema implements Serializable {
Contributor

The biggest question I have for this PR is whether we should have this wrapper class for adding the schema ID, or whether we should just add the schema ID to the Schema class itself, maybe as an optional Integer. Because Iceberg has its own Jackson parser for serde, it is not hard to ignore the schemaId field or set a default for it for backwards compatibility. What is the benefit of the VersionedSchema approach?

Contributor Author

I mentioned this briefly in the PR description:

The reason for not adding the ID directly to Schema is that schema creation is currently widely used as a convenience for many operations that don't involve a "real" table schema.

The main concern I had with adding the schema ID is that it might be hard to tell from the code base when a schema should have an ID and when it shouldn't. A schema only needs an ID when it is really a table schema strictly associated with table metadata; outside of that context, the ID is not useful in any "real" operation on the schema, especially considering that schemas are often created to conveniently mutate fields for things like projection. Maybe we should rename VersionedSchema to something like TableSchema to better reflect this?

Contributor
@rdblue Jan 30, 2021

I think I agree with @jackye1995 here. We could add the ID to Schema and just have it use either null or 0 if there is no schema ID. I think we can use 0.

If there is no schema ID, then there will be no schemas list, and none of the snapshots will have a schema ID. So if an older writer updates a table, we can expect the current schema ID to be dropped, older schemas to be dropped, snapshot schema IDs to be dropped, etc. That is consistent with having just one schema, and it should be safe for v1 because we won't leave any dangling IDs that could reference the wrong schema later.
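
A simplified, self-contained sketch of that direction (SchemaSketch and its fields are illustrative stand-ins, not the actual Iceberg class):

import java.io.Serializable;
import java.util.List;

public class SchemaSketch implements Serializable {
  public static final int DEFAULT_SCHEMA_ID = 0;  // constant for the "no real table schema" case

  private final int schemaId;
  private final List<String> columns;  // stand-in for List<Types.NestedField>

  public SchemaSketch(List<String> columns) {
    this(DEFAULT_SCHEMA_ID, columns);  // existing constructors default the ID to 0
  }

  public SchemaSketch(int schemaId, List<String> columns) {
    this.schemaId = schemaId;
    this.columns = columns;
  }

  public int schemaId() {
    return schemaId;
  }
}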

Contributor

My comment above is thinking through your question 1 in the description. I think it is okay to do this as you have it here.

@Override
public int hashCode() {
return Objects.hash(struct, aliasToId);
}
Contributor

Why add equals and hashCode to Schema? The reason why we didn't already have this implemented is that you can compare the type using schema.asStruct().equals(...). And we don't want to define whether two schemas are "equal" based on the aliases because the same schema can have different original aliases.

Contributor Author

In the new change I introduced, we can reuse the old schema ID during a schema update if the exact same schema already exists in the schema list. I think in this case we may need to confirm that both schemas' aliases are the same? Or do you think there shouldn't be a case where the schemas are the same but the aliases are not?

Contributor

I don't think that should use aliases. Aliases are populated when converting from a file schema to Iceberg, so that we can look up the file's column names from IDs and vice versa. That information is used by integration and not by Iceberg itself. The aliases are not serialized when writing a schema into Iceberg metadata.

For the purposes of maintaining the schema list, we can ignore aliases and just use schema.asStruct().

@@ -161,8 +163,19 @@ private static void toJson(TableMetadata metadata, JsonGenerator generator) thro
generator.writeNumberField(LAST_UPDATED_MILLIS, metadata.lastUpdatedMillis());
generator.writeNumberField(LAST_COLUMN_ID, metadata.lastColumnId());

generator.writeFieldName(SCHEMA);
SchemaParser.toJson(metadata.schema(), generator);
// for older readers, continue writing the current schema as "schema"
Contributor

Might want to note that this is done until v2 because support for schemas and current-schema-id is required in v2 and later.

Contributor
@rdblue commented Jan 30, 2021

Do we want to add a last-assigned-schema-id to table metadata?

My initial thought was no because the schema IDs are completely local to a metadata file. We can reassign all IDs from one file to the next and as long as they are internally consistent we are okay. The IDs are not embedded in separate files that might get improperly used.

But, the Nessie comment you pointed to is a good one to think about. I think it is still okay because Nessie merges would either be fast-forward, or would re-apply changes and create entirely new metadata files. So the internal consistency argument still holds.

Do we want to assign special schema IDs to metadata tables, to avoid potential collision on schema IDs?

I don't think so.

Do we want to add schema-id only to history entries, or to the Snapshot interface?

I would add it to Snapshot, not just History. Snapshot is what will be used for time travel queries. I would avoid adding too much metadata to history.

I think that currently, when replacing a table, earlier history entries/snapshotLog will be reset to empty (second-to-last argument here). Is this expected? Do we want to fix this as a separate issue?

I think we should follow up and keep the history. I think the reason for this is because we previously didn't have any compatibility across schemas (IDs were completely reassigned, so time travel would be incorrect). But that's fixed now.

Contributor Author
@yyanyy commented Feb 2, 2021

Do we want to add a last-assigned-schema-id to table metadata?

My initial thought was no because the schema IDs are completely local to a metadata file. We can reassign all IDs from one file to the next and as long as they are internally consistent we are okay. The IDs are not embedded in separate files that might get improperly used.

But, the Nessie comment you pointed to is a good one to think about. I think it is still okay because Nessie merges would either be fast-forward, or would re-apply changes and create entirely new metadata files. So the internal consistency argument still holds.

Do we want to assign special schema IDs to metadata tables, to avoid potential collision on schema IDs?

I don't think so.

Do we want to add schema-id only to history entries, or to the Snapshot interface?

I would add it to Snapshot, not just History. Snapshot is what will be used for time travel queries. I would avoid adding too much metadata to history.

I think that currently, when replacing a table, earlier history entries/snapshotLog will be reset to empty (second-to-last argument here). Is this expected? Do we want to fix this as a separate issue?

I think we should follow up and keep the history. I think the reason for this is because we previously didn't have any compatibility across schemas (IDs were completely reassigned, so time travel would be incorrect). But that's fixed now.

Thank you for the review and the input! I'll create a separate issue for the history problem and link it here. Do you have any suggestions on whether we want to add schema ID to the history table?

Contributor
@rdblue commented Feb 3, 2021

Do you have any suggestions on whether we want to add schema ID to the history table?

I don't think that we want to add it to history. The snapshot log should refer to snapshots by ID and we can get the schema easily enough from there. For the actual metadata table, we don't want to expose schema ID because that's something that users should ideally never need to know much about. We've avoided adding IDs in other similar situations. For example, you can rewrite manifests for a specific partition spec, by ID. But we didn't include that in the stored procedure because we haven't exposed the partition spec ID anywhere to SQL users.

Contributor
@jackye1995 left a comment

The code looks good to me. I am testing locally with this branch to see if there are any other potential issues and will report back after that.

@@ -157,6 +157,21 @@ public static Schema assignFreshIds(Schema schema, NextID nextId) {
.fields());
}

/**
* Assigns fresh ids from the {@link NextID nextId function} for all fields in a schema.
Contributor

nit: function should be outside the bracket

Contributor Author
@yyanyy Feb 5, 2021

Hmm, actually I think either way may make sense? We either link "nextId function" or just "nextId" to the NextID interface, and I think both are valid interpretations. I'm also a bit hesitant to change this since I copied it from another method, and there are a few others that use this style.

@@ -134,6 +143,10 @@ static void toJson(Type type, JsonGenerator generator) throws IOException {
}
}

public static void toJsonWithVersion(Schema schema, JsonGenerator generator) throws IOException {
Contributor

nit: should this be called toJsonWithId, just to match the fromJsonWithId below?

Contributor Author

Good point, will update

Contributor
@jackye1995 commented

Works fine for me so far; I tested a few operations among a few clusters of different versions.

One behavior I see is that if an old v1 writer is somehow still used to create a new table version after a v2 writer has been used, all schema history will be dropped, so new schemas written by the v2 writer will have schema IDs starting from 0 again, which can break time travel queries.

In theory, once a table is written by a v2 writer, it should never be touched by a v1 writer, but we cannot prevent people from doing that in real life, especially when the organization is large and people use different versions of the same software all the time. So I feel we still need something like a last-assigned-schema-id, so that schemas don't accidentally share the same ID and we can backfill the earlier schemas and their IDs if necessary. But the last-assigned-schema-id field should not be part of the table metadata, because it would be dropped by an old writer. One option might be to put it in table properties, but there might be better ways. What do you think?

Contributor Author
@yyanyy commented Feb 9, 2021

Works fine for me so far; I tested a few operations among a few clusters of different versions.

One behavior I see is that if an old v1 writer is somehow still used to create a new table version after a v2 writer has been used, all schema history will be dropped, so new schemas written by the v2 writer will have schema IDs starting from 0 again, which can break time travel queries.

In theory, once a table is written by a v2 writer, it should never be touched by a v1 writer, but we cannot prevent people from doing that in real life, especially when the organization is large and people use different versions of the same software all the time. So I feel we still need something like a last-assigned-schema-id, so that schemas don't accidentally share the same ID and we can backfill the earlier schemas and their IDs if necessary. But the last-assigned-schema-id field should not be part of the table metadata, because it would be dropped by an old writer. One option might be to put it in table properties, but there might be better ways. What do you think?

Thanks for putting effort into testing this!

I think the problem you described seems similar to my question 1, and I think we might be fine with that, since the IDs within a metadata file should always be consistent and we don't expose them in metadata tables, as Ryan mentioned above. I don't think it would break time travel queries: after the old writer writes the data, the schema ID associated with the snapshot gets dropped, in which case we would likely need to rely on #1508 for time travel?

Contributor
@jackye1995 commented

I think the problem you described seems similar to my question 1, and I think we might be fine with that, since the IDs within a metadata file should always be consistent and we don't expose them in metadata tables, as Ryan mentioned above. I don't think it would break time travel queries: after the old writer writes the data, the schema ID associated with the snapshot gets dropped, in which case we would likely need to rely on #1508 for time travel?

I see, that makes sense.

@@ -54,6 +58,7 @@
private transient Map<Integer, String> idToName = null;

public Schema(List<NestedField> columns, Map<String, Integer> aliases) {
this.schemaId = 0;
Contributor

I think we should add a constant for the default schema ID. That way we can reference it in other places when we fill in the default.

@@ -134,6 +143,10 @@ static void toJson(Type type, JsonGenerator generator) throws IOException {
}
}

public static void toJsonWithId(Schema schema, JsonGenerator generator) throws IOException {
Contributor

This is only called in TableMetadataParser and I don't see why there should be two toJson methods. It isn't a problem for the v1 metadata to be written with the schema ID because extra fields are ignored. So I think that this method isn't needed and toJson should always pass the ID to the struct writer.
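
A sketch of what that would look like, delegating to the struct-level writer shown below:

public static void toJson(Schema schema, JsonGenerator generator) throws IOException {
  toJson(schema.asStruct(), schema.schemaId(), generator);  // always pass the ID through
}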

toJson(struct, null, generator);
}

static void toJson(Types.StructType struct, Integer schemaId, JsonGenerator generator) throws IOException {
Contributor

Looks like the struct toJson methods aren't called outside of this file. What do you think about making them private?

@@ -253,4 +266,15 @@ public static Schema fromJson(String json) {
}
});
}

public static Schema fromJsonWithId(int schemaId, JsonNode json) {
Contributor

I'd prefer if all of the parsing methods returned the correct ID instead of having some that use a version that ignores the ID.

I think you chose to use two methods because there are some situations where the ID wasn't written and you don't want the schema to have the wrong ID. But, because the fromJson method uses the Schema constructor that does not pass the ID, the schemas will have the default ID, 0, anyway.

As long as the schemas have an ID, I think there is no need to have multiple parser methods, both with and without ID. Also, we need to make sure that the IDs are consistent when reading the schema from files that were written before IDs. That should just be manifest files, where the schema is embedded in the header.

For manifests, we need to add the schema ID when writing (https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestWriter.java#L190). And when reading, we will need to update how the spec is constructed. If the spec is loaded by ID then it's fine. Otherwise we should load the schema by ID to parse the spec, and if that doesn't work we should parse both the schema and the spec (https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestReader.java#L115-L120). I think we should probably also try to take the parsed schema and find the correct one from table metadata so that we don't have an incorrect ID anywhere.
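
A rough sketch of that read-side fallback (the metadata map and the specsById/schemasById lookups are assumptions about the reader's surrounding state; PartitionSpecParser.fromJsonFields and SchemaParser.fromJson are existing parsers):

String schemaIdStr = metadata.get("schema-id");
PartitionSpec spec;
if (specsById != null && specsById.containsKey(specId)) {
  // spec loaded by ID from table metadata: nothing else to reconcile
  spec = specsById.get(specId);
} else if (schemaIdStr != null && schemasById.containsKey(Integer.parseInt(schemaIdStr))) {
  // load the schema by ID, then use it to parse the embedded spec
  Schema schema = schemasById.get(Integer.parseInt(schemaIdStr));
  spec = PartitionSpecParser.fromJsonFields(schema, specId, metadata.get("partition-spec"));
} else {
  // fall back to parsing both the schema and the spec from the manifest header
  Schema schema = SchemaParser.fromJson(metadata.get("schema"));
  spec = PartitionSpecParser.fromJsonFields(schema, specId, metadata.get("partition-spec"));
}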

Contributor Author

I have removed fromJsonWithId and updated fromJson to use the constructor with the schema ID if an ID is present in the JSON blob, or default to the constructor that doesn't require an ID. Please let me know if this is what you expected!

Regarding persisting the schema ID in manifests: since schema and partition-spec are required in manifest metadata in both v1 and v2, I think it is unlikely that we would be missing the schema itself but have a schema ID for lookup? And I think in the manifest reader the schema is only used for constructing the partition spec, so having a correct schema ID might not be strictly necessary.

Today there are a lot of places that rely on toJson/fromJson for schemas, e.g. Avro metadata, Hive table properties, scan tasks. In this PR, except for the table metadata parser, those all use the toJson version that does not write schemaId at all, which means that when the strings are read back into Schema objects they will always have an ID of 0; and this goes back to the question above about persisting the schema ID: we do not persist the ID in the manifest header, so we are not persisting an incorrect ID, but instead construct a schema without an ID that defaults to 0 like everywhere else.

My original thinking was that the schema ID should always be ignored/considered inaccurate unless we get it directly from table metadata, since after all kinds of projections it might be hard to track whether the ID is still correct. And the ID is only used for lookup, so if we already have the schema struct, it's fine for the ID to be incorrect since we are unlikely to use it. But I might be overcomplicating the problem. Do you think we are able to guarantee the ID is right everywhere, and do you have comments about the current state of the toJson variations (only the one used by the metadata parser writes out the ID)?

lastColumnId.get(), freshSchema, INITIAL_SPEC_ID, ImmutableList.of(freshSpec),
freshSpec.lastAssignedFieldId(), freshSortOrderId, ImmutableList.of(freshSortOrder),
lastColumnId.get(), freshSchema.schemaId(), ImmutableList.of(freshSchema),
INITIAL_SPEC_ID, ImmutableList.of(freshSpec), freshSpec.lastAssignedFieldId(),
Contributor

I like that you use freshSchema.schemaId() instead of using INITIAL_SCHEMA_ID again. Should we do the same for INITIAL_SPEC_ID?


private int reuseOrCreateNewSchemaId(Schema newSchema) {
// if the schema already exists, use its id; otherwise use the highest id + 1
int newSchemaId = currentSchemaId;
Contributor

Shouldn't this be currentSchemaId + 1?

Contributor Author

In the for loop below, each existing schema's ID is checked; if it is greater than or equal to newSchemaId, newSchemaId is bumped to that ID + 1. So at the end of the method it will be currentSchemaId + 1 if no reuse happens?
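
Put together, the method being discussed would look roughly like this (a sketch; the merged implementation may differ in detail):

private int reuseOrCreateNewSchemaId(Schema newSchema) {
  // if the same schema already exists, reuse its ID; otherwise use the highest existing ID + 1
  int newSchemaId = currentSchemaId;
  for (Schema schema : schemas) {
    if (schema.asStruct().equals(newSchema.asStruct())) {
      return schema.schemaId();  // structurally identical schema found: reuse its ID
    } else if (schema.schemaId() >= newSchemaId) {
      newSchemaId = schema.schemaId() + 1;  // keep the candidate above every existing ID
    }
  }
  return newSchemaId;
}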

}
}

private int reuseOrCreateNewSchemaId(Schema newSchema) {
Contributor

Can you move this down to the bottom with other private helper methods?

}

ImmutableList.Builder<Schema> builder = ImmutableList.<Schema>builder()
.addAll(schemas);
Contributor

Can this be on the previous line?

defaultSortOrderId, sortOrders, properties,
currentSnapshotId, snapshots, snapshotLog, addPreviousFile(file, lastUpdatedMillis));
defaultSortOrderId, sortOrders, properties, currentSnapshotId, snapshots, snapshotLog,
addPreviousFile(file, lastUpdatedMillis));
Contributor

It looks like this change wasn't needed and just moves 3 variables?

Contributor Author

Sorry about these; I got confused about which changes were mine when resolving merge conflicts. I'll be more careful next time.

snapshots, newSnapshotLog, addPreviousFile(file, lastUpdatedMillis));
lastSequenceNumber, System.currentTimeMillis(), lastColumnId, currentSchemaId, schemas, defaultSpecId, specs,
lastAssignedPartitionId, defaultSortOrderId, sortOrders, properties, currentSnapshotId, snapshots,
newSnapshotLog, addPreviousFile(file, lastUpdatedMillis));
Contributor

Did snapshots need to move?

snapshots, snapshotLog, addPreviousFile(file, lastUpdatedMillis));
lastSequenceNumber, System.currentTimeMillis(), lastColumnId, currentSchemaId, schemas, defaultSpecId, specs,
lastAssignedPartitionId, defaultSortOrderId, sortOrders, properties, currentSnapshotId, snapshots, snapshotLog,
addPreviousFile(file, lastUpdatedMillis));
Contributor

Looks like it isn't necessary to move snapshots and snapshotLog.

snapshots, snapshotLog, addPreviousFile(file, lastUpdatedMillis));
lastSequenceNumber, System.currentTimeMillis(), lastColumnId, currentSchemaId, schemas, defaultSpecId, specs,
lastAssignedPartitionId, defaultSortOrderId, sortOrders, properties, currentSnapshotId, snapshots, snapshotLog,
addPreviousFile(file, lastUpdatedMillis));
Contributor

Same here. Could you remove unnecessary reformatting?

Preconditions.checkArgument(formatVersion == 1,
"%s must exist in format v%s", SCHEMAS, formatVersion);

currentSchemaId = TableMetadata.INITIAL_SCHEMA_ID;
Contributor

I think this should be schema.schemaId() instead of referencing the constant.


List<Schema> schemas;
int currentSchemaId;
Schema schema = null;
Contributor

Why set this to null but not the two others?

Contributor Author

This is because in the first if statement, after parsing the schema array and assigning the schema, I check that the schema holding the current schema ID exists in the schema array. Without the null initialization, the Preconditions.checkArgument(schema != null, ...) would be flagged with the compiler error "schema variable might not be initialized".
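
A condensed sketch of that flow (illustrative):

Schema schema = null;  // definite initial value so the compiler accepts the check below
for (Schema s : schemas) {
  if (s.schemaId() == currentSchemaId) {
    schema = s;
    break;
  }
}
Preconditions.checkArgument(schema != null,
    "Cannot find schema with current-schema-id=%s in schemas", currentSchemaId);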

snapshots, entries.build(), metadataEntries.build());
lastSequenceNumber, lastUpdatedMillis, lastAssignedColumnId, currentSchemaId, schemas, defaultSpecId, specs,
lastAssignedPartitionId, defaultSortOrderId, sortOrders, properties, currentVersionId, snapshots,
entries.build(), metadataEntries.build());
Contributor

Snapshots doesn't need to move.

@@ -100,8 +101,15 @@ public void testJsonConversion() throws Exception {
.add(new SnapshotLogEntry(currentSnapshot.timestampMillis(), currentSnapshot.snapshotId()))
.build();

Schema schema = new Schema(6,
Types.NestedField.required(10, "x", Types.StringType.get()));

Contributor

Nit: extra newline.

@@ -78,4 +82,21 @@ public void testSerialization() throws Exception {
Lists.transform(result.snapshots(), Snapshot::snapshotId));
Assert.assertEquals("History should match", meta.snapshotLog(), result.snapshotLog());
}

private void assertSameSchemaList(List<Schema> list1, List<Schema> list2) {
Contributor

Is it possible to factor this into a common util rather than copying it?

Contributor Author

I didn't find a test util class in core, but it looks like there's a TestHelpers class in the api module, so I'll put it there.
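
A possible shape for the shared helper (a sketch; it compares by schema ID and struct, since Schema intentionally doesn't override equals):

public static void assertSameSchemaList(List<Schema> list1, List<Schema> list2) {
  Assert.assertEquals("Should have same number of schemas", list1.size(), list2.size());
  for (int i = 0; i < list1.size(); i += 1) {
    Assert.assertEquals("Should have matching schema id",
        list1.get(i).schemaId(), list2.get(i).schemaId());
    Assert.assertEquals("Should have matching schema struct",
        list1.get(i).asStruct(), list2.get(i).asStruct());
  }
}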

Contributor
@rdblue commented Feb 13, 2021

Thanks @yyanyy! I think this is getting close. The main things are not exposing extra ID-aware methods in the parser, and reading/writing manifests with IDs. Adding schema ID to snapshots will be a follow-up, right?

Contributor Author
@yyanyy commented Feb 18, 2021

Thanks @yyanyy! I think this is getting close. The main things are not exposing extra ID-aware methods in the parser, and reading/writing manifests with IDs. Adding schema ID to snapshots will be a follow-up, right?

Thanks for the review, and sorry for the delayed response! And yes, adding schema ID to snapshots will be a follow-up PR.

TableMetadata freshTable = TableMetadata.newTableMetadata(
schema, PartitionSpec.unpartitioned(), null, ImmutableMap.of());
Assert.assertEquals("Should use TableMetadata.INITIAL_SCHEMA_ID for current schema id",
TableMetadata.INITIAL_SCHEMA_ID, freshTable.currentSchemaId());
Contributor

Minor: The schema passed in uses the same ID as the one that is validated (INITIAL_SCHEMA_ID=0). I think to test this, it would be better to pass in a schema with a non-zero ID.

Contributor Author

Do you mean changing the schema ID supplied to TableMetadata.newTableMetadata to be non-zero? Actually, in TableMetadata the input schema ID is ignored and IDs are always assigned from 0, so changing this would break the test... Although I think the current behavior should be fine, since this method is only called when creating a new table, so there shouldn't be any existing schema in the table that could collide on schema ID.

);
TableMetadata sameSchemaTable = twoSchemasTable.updateSchema(sameSchema2, 2);
Assert.assertEquals("Should return same table metadata",
twoSchemasTable, sameSchemaTable);
Contributor

Should this use assertSame since the object must be the same TableMetadata instance and not just equal to it?

Types.NestedField.required(4, "x", Types.StringType.get()),
Types.NestedField.required(6, "z", Types.IntegerType.get())
);
TableMetadata threeSchemaTable = revertSchemaTable.updateSchema(schema3, 3);
Contributor

Nit: the last assigned ID should be 6 since z has ID 6.

@rdblue rdblue merged commit 255e952 into apache:master Feb 23, 2021
Contributor
@rdblue commented Feb 23, 2021

Looks great, thanks for updating this, @yyanyy!

There were a few minor comments in tests, but we can pick up those fixes later. I think overall this looks great!

Contributor
@wypoon left a comment

A couple of minor nits. I know this has been merged, but I understand that you're planning a follow-up to address nits in the tests that Ryan Blue pointed out.

SEQ_NO, System.currentTimeMillis(), 3,
7, ImmutableList.of(TEST_SCHEMA, schema),
5, ImmutableList.of(SPEC_5), SPEC_5.lastAssignedFieldId(),

Contributor

Nit: extraneous blank line.

@@ -223,6 +237,7 @@ public static String toJsonWithoutSpecList(TableMetadata metadata) {
generator.writeNumberField(LAST_UPDATED_MILLIS, metadata.lastUpdatedMillis());
generator.writeNumberField(LAST_COLUMN_ID, metadata.lastColumnId());

// mimic an old writer by writing only schema and not the current ID or schema list
generator.writeFieldName(SCHEMA);
SchemaParser.toJson(metadata.schema(), generator);
Contributor

I think that to properly test backward compatibility, the schema in this schema field should not have a schema-id, so this should be SchemaParser.toJson(metadata.schema().asStruct(), generator);

Contributor
@lintingbin commented

@yyanyy @rdblue Have you considered cleaning up the schemas field? A schema that is not referenced by any snapshot will be kept forever, which can cause the metadata file to grow too large. #5219
