
Core: add row identifier to schema #2465

Merged (8 commits, May 2, 2021)

Conversation

@jackye1995 (Contributor, author) commented Apr 12, 2021:

Continuation of #2354

@yyanyy @rdblue @openinx @aokolnychyi

Spec with row identifier:

{
  "type": "struct",
  "schema-id": 1,
  "row-identifiers": [
    1,
    2
  ],
  "fields": [
    {
      "id": 1,
      "name": "x",
      "required": true,
      "type": "long"
    },
    {
      "id": 2,
      "name": "y",
      "required": true,
      "type": "long",
      "doc": "comment"
    },
    {
      "id": 3,
      "name": "z",
      "required": true,
      "type": "long"
    }
  ]
}

New Schema toString:

table {
  fields {
    1: x: required long
    2: y: required long (comment)
    3: z: required long
  }
  row identifiers { 1,2 }
}

Update rules for row identifiers (a usage sketch follows the list):

  1. A row identifier is added through UpdateSchema.addRowIdentifier(columnName).
  2. The column added must exist in the schema or be part of the newly added columns (so that adding a new primary key is a single atomic update).
  3. Renaming or moving a column does not affect the row identifier, because the identifier references field IDs.
  4. A row identifier is dropped through UpdateSchema.deleteRowIdentifier(columnName).
  5. Only an existing row identifier column can be dropped this way.
  6. A row identifier column cannot be deleted from the schema unless it is first dropped from the row identifiers list. This satisfies both use cases: (1) the user actually wants to drop that row identifier column as an atomic update, and (2) the user is prevented from directly dropping the column without knowing the implications.
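
For concreteness, a minimal usage sketch of these rules, assuming the API names from this PR (addRowIdentifier/deleteRowIdentifier) and a hypothetical existing identifier column old_key:

    // rule 2: the identifier may reference a column added in the same update;
    // rule 6: drop the identifier before dropping its column
    Schema newSchema = new SchemaUpdate(schema, lastColumnId)
        .addColumn("account_id", Types.LongType.get())
        .addRowIdentifier("account_id")
        .deleteRowIdentifier("old_key")
        .deleteColumn("old_key")
        .apply();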

@@ -158,7 +169,7 @@ public static String toJson(Schema schema, boolean pretty) {
     if (pretty) {
       generator.useDefaultPrettyPrinter();
     }
-    toJson(schema.asStruct(), generator);
+    toJson(schema.asStruct(), schema.schemaId(), schema.rowIdentifiers(), generator);
jackye1995 (Contributor, author):

Not sure why this method was not updated in the past. Is there any risk in using the v2 JSON writer?

Member:

I think this PR forgot to refresh this toJson method.

@yyanyy (Contributor) commented Apr 13, 2021:

Haven't reviewed the entire PR yet, just for this specific comment - the reason was mentioned in #2096 (comment). Basically, there are a lot of places where this method is called for serialization/deserialization of the schema as well as to persist the schema to some file metadata, and at that point the schema object is sometimes already transformed/projected, so it is no longer an existing schema of the table and thus shouldn't have a schema ID. Since the schema ID is in theory only useful when retrieved directly from table metadata to get the right schema version for time travel, I think it's better to not explicitly write a schema ID except when writing to the table metadata file, to avoid persisting an incorrect version.

For the case of supporting row identifiers in the schema, I wonder if we want to do something similar. The row identifier is mostly useful during writing (to check the table for the default row identifier for deletes) and not during reading (data/delete files should have this preserved in another place) or any kind of projection (especially since, with the current change, we lose the row identifier information during projection). To ensure we don't run into complications when working with a mutated schema object, I wonder if we want a separate toJson/fromJson to handle the row identifier parsing just for table metadata; once we translate the table metadata file to its model, we read the row identifier from the table instead of the schema, with something like table.getRowIdentifier, which may help with table serialization/deserialization. But this approach seems to contradict some aspects of #2354 (comment).

@rdblue, since you were involved in both conversations, could you please take a look? Thank you!

Contributor:

I think we do want to pass the schema ID here. Whenever we can, we should preserve the ID. My comment on the other PR was recommending that we preserve the ID where possible even when reading and writing manifest files.

I think I missed @yyanyy's follow-up comment about other places that embed the schema ID. I tend to agree with that line of reasoning, but because of it I'm coming to a different conclusion. Rather than removing the schema ID from those places, I think we should try to maintain them.

The risk to keeping schema IDs in other files is that another file may be written with a schema ID that is lost because an older writer (e.g. v1) didn't write out the schema list, only the current schema. If that happens, then an Avro file might write a schema with ID 1, which is then lost and later redefined. But as Yan noted, the actual schema (fields) are stored in those files and we don't need to look up the schema from the ID. So we have a choice of whether to discard the schema ID or not. I'd prefer not to drop it, since we can easily detect whether the schema actually matches the one with its ID and because we won't really need to do the lookup. I think it adds complexity to remember whether or not to drop the ID.

Contributor:

Thanks for the response, @rdblue! Just want to mention an additional case: in addition to the possibility of writing an incorrect ID for a table schema, writing the ID here also makes us persist IDs for things not related to the table schema itself, e.g. manifest lists and delete files. But if we are fine with writing an incorrect ID, we are probably also fine with writing IDs for non-schema structs...

I guess essentially we are choosing between 1) simpler code with potentially confusing IDs and 2) code complexity (to avoid writing incorrect/unuseful IDs); I personally would prefer the latter, but I don't strongly oppose the former if we think it's easier to reason about in code. Also, we might want to involve people from Hive in this discussion, since they use this method as well to preserve the schema (example).

     Types.StructType struct = TypeUtil
         .visit(schema, new ApplyChanges(deletes, updates, adds, moves))
         .asNestedType().asStructType();
-    return new Schema(struct.fields());
+    Set<Integer> refreshedRowIdentifiers = rowIdentifierNames.stream()
+        .map(name -> struct.field(name).fieldId())
Member:

If we delete column b using the SchemaUpdate#deleteColumn API, then the struct variable that has had those changes applied won't include the field with the column name b, right? If so, how could we locate the field ID for that given column name b in this statement?

jackye1995 (Contributor, author):

Yes, that is why I decided to throw an exception when deleteColumn is called and the column is a row identifier. There are two possible situations (see the sketch below):

  1. The user actually wants to delete the column; then he/she should call:
     update.deleteRowIdentifier(columnName).deleteColumn(columnName);
  2. The user does not know the column is used as a row identifier; in that case, just calling deleteColumn results in an exception.
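
A hypothetical sketch of that guard in deleteColumn (not the PR's exact code; identifierFieldNames is the tracking set discussed later in this thread):

    @Override
    public UpdateSchema deleteColumn(String name) {
      // reject deleting a column that is still part of the row identifier
      Preconditions.checkArgument(!identifierFieldNames.contains(name),
          "Cannot delete column %s: it is used as a row identifier; remove the identifier first", name);
      // ... existing deletion logic ...
      return this;
    }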

Member:

Should we add unit tests for the above cases? I mean, deleting the row identifier field and the column in the same transaction.

Contributor:

Yes, we should definitely have a test for this case.
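
A minimal sketch of such a test, assuming the method names above and a SCHEMA fixture whose column "id" is a row identifier:

    @Test
    public void testDeleteRowIdentifierAndColumnInSameUpdate() {
      // dropping the identifier first makes deleting the column legal
      Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
          .deleteRowIdentifier("id")
          .deleteColumn("id")
          .apply();
      Assert.assertNull(newSchema.findField("id"));

      // deleting an identifier column directly should fail
      AssertHelpers.assertThrows("should fail to delete an identifier column directly",
          IllegalArgumentException.class, "row identifier",
          () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).deleteColumn("id").apply());
    }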

Comment on lines 140 to 141
Supplier<C> builder, Consumer<T> appender,
Function<JsonNode, T> getter, Consumer<JsonNode> validator,
Member:

The builder, appender, getter, and validator make the method really hard to follow. Could we just define a simple visitor interface to visit the JsonNodes in this collection and collect the final result in the upper-layer method?

jackye1995 (Contributor, author):

Thanks for the suggestion; please see if the new approach looks better.

Set<Integer> refreshedRowIdentifiers = rowIdentifierNames.stream()
    .map(name -> struct.field(name).fieldId())
    .collect(Collectors.toSet());
return new Schema(struct.fields(), refreshedRowIdentifiers);
Contributor:

I think we may want a change to TableMetadata to ensure that an update to only the row identifier still creates a new schema. I think currently, if a schema update does not change the underlying struct, we consider the operation a no-op (comparison on structs is also why we don't have equals/hashCode in Schema).

jackye1995 (Contributor, author):

Thanks! I completely forgot that part... I added a new method, sameSchema, and now use that to check equality in table metadata.
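
A hedged sketch of what sameSchema could look like, using the final identifierFieldIds naming (the merged implementation may differ):

    // true only when both the underlying struct and the identifier field IDs match
    public boolean sameSchema(Schema other) {
      return asStruct().equals(other.asStruct()) &&
          identifierFieldIds().equals(other.identifierFieldIds());
    }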

* @param columnName name of the column to add as a row identifier
* @return this for method chaining
*/
UpdateSchema addRowIdentifier(String columnName);
Contributor:

Here also, I don't think that "row identifier" is the right noun. I'd prefer addIdentifierField and removeIdentifierField. I'd use remove rather than delete because it is more clear that the field itself will not be deleted, just removed from the identifier. I also think that we should make sure that behavior (removing from the identifier field list, not dropping the column) is clear in the Javadoc.

What about also adding setIdentifierFields? I think that a bulk set/replace operation fits more cleanly with the use case where users are setting the identifier for the first time or replacing it. I think that this will tend to be idempotent: set the identifier fields to account_id and profile_id.

jackye1995 (Contributor, author):

> I'd use remove rather than delete because it is more clear that the field itself will not be deleted

Sounds good. I also called it remove in the doc; delete was trying to match deleteColumn, but I agree it would cause some confusion.

> What about also adding setIdentifierFields

I thought about setIdentifierFields; the reason I did not add it was that the UpdateSchema interface seems to be all individual operations rather than bulk operations (except for unionByNameWith), and our DDL SQL statements are also not bulk updates.

> I think that this will tend to be idempotent

I think both approaches are idempotent. You can write it as:

update.setIdentifierFields(Sets.newHashSet("profile_id")).apply();

or

update.addIdentifierField("profile_id").removeIdentifierField("account_id").apply();

I am fine with both approaches. Technically, if we go with setIdentifierFields, then we do not need addIdentifierField and removeIdentifierField, and I can try to figure out the diff for every update.

@openinx, do you think there is any advantage to one particular approach for the update API?

Member:

Functionally, there should be no difference between addIdentifierField/removeIdentifierField and setIdentifierFields, but from the perspective of usability, if people want to reset to a totally different set of identifier fields, then they will need to:

    Schema schema = ...
    Set<Integer> newFieldIds = ...

    UpdateSchema update = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID);
    for (Integer identifier : schema.rowIdentifiers()) {
      String columnName = schema.findColumnName(identifier);
      Preconditions.checkNotNull(columnName, "Can not find the column with field id %s", identifier);

      update.deleteRowIdentifier(columnName);
    }

    // this loop should walk the new field IDs, not the identifiers just deleted
    for (Integer identifier : newFieldIds) {
      String columnName = schema.findColumnName(identifier);
      Preconditions.checkNotNull(columnName, "Can not find the column with field id %s", identifier);

      update.addRowIdentifier(columnName);
    }

    update.apply();

It does make the API more difficult to use. Besides the setIdentifierFields API, do you think it's necessary to introduce a Set<String> identifierColumns() in Schema? I find that the Schema always returns the fieldId while the UpdateSchema always uses the columnName, so we have to do the conversion from fieldId to columnName everywhere.

Contributor:

@jackye1995, I don't think that the add/remove example you gave is idempotent because of the check identifierFieldNames.contains(name) that will result in an error: "Cannot remove column from identifier fields account_id because it is not an identifier field". @openinx gives a good example of why it's awkward to make users fall back to remove. And it is easy to make set work internally because you just clear the field ID set before calling add in a loop.
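
As a sketch of that point (assumed internals, not the merged code):

    // a set operation built on add: clear the tracked identifiers, then re-add each one
    public UpdateSchema setIdentifierFields(Set<String> names) {
      identifierFieldNames.clear();
      names.forEach(this::addIdentifierField);
      return this;
    }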

jackye1995 (Contributor, author):

Yes, it is not idempotent, and using set to write add/remove is easier than using add/remove to write set. I will change to set then; thanks for the feedback.

@@ -193,6 +204,11 @@ public UpdateSchema renameColumn(String name, String newName) {
       updates.put(fieldId, Types.NestedField.of(fieldId, field.isOptional(), newName, field.type(), field.doc()));
     }

+    if (identifierFieldNames.contains(name)) {
Member:

Should we throw an exception if the newName already exists in identifierFieldNames (or maybe actually in the schema)?

Contributor:

I agree with this. I think it would make sense to check whether newName is already a field in the schema. But that is catching a problem with a rename earlier and isn't really related to this PR.

@jackye1995 (Contributor, author) commented Apr 19, 2021:

Yeah, I was expecting the lines above to catch this, but surprisingly they did not do this check. I can put up another small PR to fix it.

sb.append("\n }\n identifier fields { ");
sb.append(COMMA.join(identifierFieldIds()));
sb.append(" }\n}");
return sb.toString();
Contributor:

Minor: I'd prefer a smaller change to the toString format. Maybe something like adding (id) to identifier fields in the list, rather than adding the identifier fields { ... }.

Preconditions.checkArgument(field != null,
"Cannot find column of the given name in existing or newly added columns: %s", name);
Preconditions.checkArgument(field.isRequired(),
"Cannot add column %s as an identifier field because it is not a required field", name);
Contributor:

I don't think that this requirement is correct. In the case where we're adding a column and then making it part of the identifier, values for all of the existing data will be null. I don't see a reason why we can't consider, for example, (1, null) to be an identifier.

The strongest argument for making this a required field is SQL's boolean logic, where null is not equal to itself. But in Iceberg, we translate those expressions so that we can use normal boolean logic throughout the library. Equality deletes consider null equal to itself in Iceberg, just like set operations in SQL use null-safe equality.
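
A plain-Java illustration of the null-safe equality described here (not Iceberg's internal comparator):

    import java.util.Objects;

    class NullSafeEquality {
      // SQL's `=` yields unknown for (null, null); equality deletes instead treat
      // null as equal to itself, the same behavior as Objects.equals
      static boolean identifierMatches(Long a, Long b) {
        return Objects.equals(a, b); // true when both values are null
      }
    }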

jackye1995 (Contributor, author):

Sounds good. I was trying to match the primary key spec in relational SQL, but I agree that in our upsert case null still makes sense. I will update the logic.

"Cannot add column %s as an identifier field because it is not a primitive field", name);
Preconditions.checkArgument(!identifierFieldNames.contains(name),
"Cannot add column %s as an identifier field because it already is", name);
identifierFieldNames.add(name);
Contributor:

I think this should also check that the column isn't nested within a map or list because those contain repeated keys, values, or elements. Anything within a repeated field can't be part of an identifier because there is more than one.

if (field == null) {
  field = adds.get(TABLE_ROOT_ID).stream()
      .filter(f -> f.name().equals(name))
      .findAny().orElse(null);
Contributor:

This assumes that the field is not a newly added field within a struct.

I think that this should use the same strategy that we use for addColumn. There are two variations of addColumn, one that accepts a parent name and a field name, and one that accepts only a field name. If the latter is called with a name that contains ., then an exception is thrown because the reference is ambiguous. For example, a.b could be a top-level field named a.b or could be field b nested within field a.

By structuring the methods that way, we always know the parent field and can use that instead of assuming TABLE_ROOT_ID here.

jackye1995 (Contributor, author):

Combining this with your last comment, it seems like the best way to go is to just check that the column isn't nested in a map/list, in which case the root will always be TABLE_ROOT_ID. With that check added, I feel it is a bit redundant to have two methods like addColumn, because the private addIdentifierField method would always use root as the parent. Let me know what you think of the new code.

Contributor:

A column could be nested in a struct without being in a map or a list. For example, if you had a table of HTTP logs, you might have a column of request information with request.path and request.headers, along with a response column that has response.code. It would be valid to refer to response.code because there is only one per row. That's the case where we need the parent so we can distinguish between a top-level response.code and code within the response struct.
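
The HTTP-log example as an Iceberg schema (field IDs and types here are illustrative; required is a static import of Types.NestedField.required):

    Schema logs = new Schema(
        required(1, "request", Types.StructType.of(
            required(2, "path", Types.StringType.get()),
            required(3, "headers", Types.MapType.ofRequired(4, 5,
                Types.StringType.get(), Types.StringType.get())))),
        required(6, "response", Types.StructType.of(
            required(7, "code", Types.IntegerType.get()))));
    // "response.code" is nested only in structs, so there is one value per row
    // and it can be an identifier field; anything under "headers" (a map) cannot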

Member:

I did not realize that we could add a single column named response.code with this addColumn API (checked the code; it sounds like the correct way to add response.code as a top-level column is: addColumn(null, "response.code", type, doc)).

If we want to distinguish between nested columns and top-level columns whose names contain a DOT, then we will need to add an extra API for this, right?

// This is used to add columns whose names do not contain a DOT.
UpdateSchema setIdentifierFields(Collection<String> names);

// This is used to add nested columns, or top-level columns whose names contain a DOT.
UpdateSchema setIdentifierFields(Collection<Pair<String, String>> parentAndNames);

@rdblue (Contributor) commented Apr 21, 2021:

@openinx, that's what I was thinking at first as well, but I don't think that we need a version with parent now.

The reason is that once a column is added, it is uniquely identified by a dotted name: a.b might be a struct named a with a field named b or might be a field named a.b, but it cannot be both. We guarantee this so that we never have ambiguous names.

Because we know that the string passed in here will always identify just one field, we don't need to handle the ambiguous case by adding an explicit parent like we do with addColumn.

I think that means we can have a much nicer API (usage sketch after the list):

  • setIdentifierFields(String...) and setIdentifierFields(Collection<String>)
  • addIdentifierField(String)
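
How that shape reads at the call site (a sketch):

    table.updateSchema()
        .setIdentifierFields("account_id", "profile_id")
        .commit();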

@@ -51,6 +53,7 @@
   private final TableMetadata base;
   private final Schema schema;
   private final Map<Integer, Integer> idToParent;
+  private final Set<String> identifierFieldNames;
Contributor:

Why did you choose to use names instead of ids here? I think IDs would be a little simpler because you wouldn't need to translate in more cases. For example, rename wouldn't need to modify the identifier field ID set because it doesn't change the field IDs. You'd also only need to copy the set of field IDs returned by the current schema in the constructor.

I was looking for a reason why you might have chosen to use names, but I think that it is straightforward to use IDs instead of names in all cases. For example, the check in deleteColumn validates the name passed in directly, which seems easier; but the name has already been looked up in the schema so we have a field and could validate identifierFieldIds.contains(field.fieldId()) instead like the other checks do.
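
For illustration, the ID-based check could look like this (a sketch with assumed field names):

    Types.NestedField field = schema.findField(name);
    Preconditions.checkArgument(field != null, "Cannot delete missing column: %s", name);
    Preconditions.checkArgument(!identifierFieldIds.contains(field.fieldId()),
        "Cannot delete identifier field: %s", name);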

jackye1995 (Contributor, author):

Yes, with the set approach the concern that led me to use names is gone; I will change to use IDs.

@rdblue (Contributor) commented Apr 18, 2021:

Thanks, @jackye1995! I think overall the direction is looking right with identifierFieldIds. I just noted a few things to fix:

  • Need to add setIdentifierFields
  • Needs some implementation fixes (track by ID, etc.)
  • Need to refactor tests into separate cases

@jackye1995 (Contributor, author) commented Apr 22, 2021:

@rdblue @openinx sorry for the delayed update. I tried multiple different ways to implement this and went back and forth, and in the end the cleanest one I found was to use names to record the set of identifier fields. At applyChanges time, deletion of an identifier column is checked before generating the new schema struct, to validate deletion of identifier fields; other validations are done after the new schema struct is generated.

Ryan talked about using two sets, one for existing IDs and one for new names. The issue with that approach was that the validation checks done for the two sets are very similar. For existing IDs, I need to check recursively that all the parents of the field up to the root are structs, which can be done using the idToParent index. For new names, I need to do something very similar, but also merge idToParent with the parent information of newly added fields. So in the end the code was very similar to just creating a new schema and recreating the idToParent index, and a bit redundant. Combining those two facts, I think it is much cleaner to just store everything as names, then validate the names and resolve them back to IDs after the new schema struct is generated.

Ryan also talked about building an index of added names to IDs when adding new columns; with this approach, that is not needed because the new schema already has that index when calling findField. I think it is a win to not increase complexity in addColumn and to avoid new indexes in UpdateSchema.

I could also do the deletes check in the validation function, but in the end I chose to do it in a separate code block before generating the schema struct, because doing it in the validation method requires passing in many arguments, and that flow was not intuitive to code readers; it only increased the complexity of the code and reduced readability.

There are some other methods I tried, such as filtering all potential identifier fields in the schema and then checking whether the identifiers are in that potential set. But that approach could not produce informative error messages, so I did not go that route.

I have added some more tests to cover all possible use cases; please let me know if you have any additional concerns with the approach.


SchemaUpdate(TableOperations ops) {
  this.ops = ops;
  this.base = ops.current();
  this.schema = base.schema();
  this.lastColumnId = base.lastColumnId();
  this.idToParent = Maps.newHashMap(TypeUtil.indexParents(schema.asStruct()));
  this.identifierNames = schema.identifierFieldIds().stream()
Member:

Nit: it may be unrelated to this PR, but I think it's good to merge these two constructors into one so that we could have a simpler version of the code:

SchemaUpdate(TableOperations ops) {
    this(ops, ops.current(), ops.current().schema(), ops.current().lastColumnId());
  }

  /**
   * For testing only.
   */
  SchemaUpdate(Schema schema, int lastColumnId) {
    this(null, null, schema, lastColumnId);
  }

  private SchemaUpdate(TableOperations ops, TableMetadata base, Schema schema, int lastColumnId) {
    this.ops = ops;
    this.base = base;
    this.schema = schema;
    this.lastColumnId = lastColumnId;
    this.idToParent = Maps.newHashMap(TypeUtil.indexParents(schema.asStruct()));
    this.identifierNames = schema.identifierFieldIds().stream()
        .map(id -> schema.findField(id).name())
        .collect(Collectors.toSet());
  }

Preconditions.checkArgument(field.type().isPrimitiveType(),
"Cannot add field %s as an identifier field: not a primitive type field", name);
Map<Integer, Integer> newIdToParent = TypeUtil.indexParents(schema.asStruct());
validateIdentifierFieldParent(field.name(), field.fieldId(), newIdToParent, schema);
Member:

I would prefer to use a while loop (rather than a recursive method) to describe the parent field validation logic, because it is clearer:

  private static int validateIdentifierField(String name, Schema schema) {
    Types.NestedField field = schema.findField(name);
    Preconditions.checkArgument(field != null,
        "Cannot add field %s as an identifier field, not found in current schema or added columns", name);
    Preconditions.checkArgument(field.type().isPrimitiveType(),
        "Cannot add field %s as an identifier field: not a primitive type field", name);

    // Check that the nested column sits only within a chain of struct fields.
    Map<Integer, Integer> newIdToParent = TypeUtil.indexParents(schema.asStruct());
    Integer parentId = newIdToParent.get(field.fieldId());
    while (parentId != null) {
      Types.NestedField parent = schema.findField(parentId);
      ValidationException.check(parent.type().isStructType(),
          "Cannot add field %s as an identifier field: must not be nested in %s", name, parent);
      parentId = newIdToParent.get(parent.fieldId());
    }
    return field.fieldId();
  }

@openinx (Member) commented Apr 23, 2021:

The current patch looks good to me overall; I just left a few minor comments.

jackye1995 (Contributor, author):

@openinx thanks, fixed based on your comments

@openinx (Member) commented Apr 26, 2021:

@rdblue, would you have any other concerns? If not, I plan to get this merged. Thanks.

* @param names names of the columns to set as identifier fields
* @return this for method chaining
*/
UpdateSchema setIdentifierFields(Set<String> names);
Contributor:

Why require this to be a set instead of accepting any Collection or Iterable? Would a user never need to pass a List?

jackye1995 (Contributor, author):

Set was used to explicitly tell users that identifier fields need to be unique, instead of giving people the illusion that the update operation can still succeed with repeated values.

Technically we can use List, Iterable, or Collection. If we would like to support a broader use case for the API, I can document the behavior in the Javadoc.

Contributor:

I would prefer to use Collection. It's fine to collapse that into a set internally since a set of columns or a multi-set of columns has the same identity behavior. Collection is easier to use for the caller.
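
A sketch of the Collection-accepting variant (the merged code may differ):

    @Override
    public UpdateSchema setIdentifierFields(Collection<String> names) {
      Set<String> unique = Sets.newHashSet(names); // duplicates collapse harmlessly
      identifierFieldNames.clear();
      unique.forEach(this::addIdentifierField);    // per-field validation still applies
      return this;
    }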

))
.apply();

AssertHelpers.assertThrows("add a nested field in struct of a map should fail",
Contributor:

Is this different than the points.element.x case?

jackye1995 (Contributor, author):

This is trying to test struct -> list -> struct, instead of just list -> struct, to trigger the while loop.

.setIdentifierFields(Sets.newHashSet("locations"))
.apply());

AssertHelpers.assertThrows("add a nested field in map should fail",
Contributor:

Can you add a test for trying to reference a field in a map value struct?

jackye1995 (Contributor, author):

added

jackye1995 (Contributor, author):

@rdblue updated based on your comments. For Set versus List or anything else in the API, please let me know what you think the general guidance should be for this type of change in Iceberg.

@yyanyy (Contributor) commented Apr 29, 2021:

LGTM overall. I just want to bring the comment below (not really related to the focus of this PR itself), which I had originally here, to people's attention and make sure we are all aligned, since this PR does change what we preserve in file metadata. If we go with the current approach, I'll raise a separate PR to mention the usage of schema IDs and the caveat in the spec.

Also cc @pvary since the hive module uses schema toJson, as indicated in the comment below.

> Basically, there are a lot of places where this method is called for serialization/deserialization of the schema as well as to persist the schema to some file metadata, and at that point the schema object is sometimes already transformed/projected, so it is no longer an existing schema of the table and thus shouldn't have a schema ID. Since the schema ID is in theory only useful when retrieved directly from table metadata to get the right schema version for time travel, I think it's better to not explicitly write a schema ID except when writing to the table metadata file, to avoid persisting an incorrect version.
> ...
> Thanks for the response, @rdblue! Just want to mention an additional case: in addition to the possibility of writing an incorrect ID for a table schema, writing the ID here also makes us persist IDs for things not related to the table schema itself, e.g. manifest lists and delete files. But if we are fine with writing an incorrect ID, we are probably also fine with writing IDs for non-schema structs...
>
> I guess essentially we are choosing between 1) simpler code with potentially confusing IDs and 2) code complexity (to avoid writing incorrect/unuseful IDs); I personally would prefer the latter, but I don't strongly oppose the former if we think it's easier to reason about in code. Also, we might want to involve people from Hive in this discussion, since they use this method as well to preserve the schema (example).

@pvary (Contributor) commented Apr 29, 2021:

> Also cc @pvary since the hive module uses schema toJson, as indicated in the comment below.

Thanks @yyanyy for the notification.

I checked the Hive-related code paths and I do not see any issues with the new ID that are not covered by the tests. We do not persist the schema JSON; we use toJson to serialize the schema during the create table call, which should be OK, I think.

@rdblue (Contributor) commented May 2, 2021:

Overall this looks correct to me. I think it would be nice to replace the Set with Collection in the UpdateSchema API, but that can be done later. Thanks for building this, @jackye1995!
