
Core: add row identifier to schema #2465

Merged (8 commits, May 2, 2021)

Conversation

@jackye1995 (Contributor, author) commented Apr 12, 2021:

Continuation of #2354

@yyanyy @rdblue @openinx @aokolnychyi

Spec with row identifier:

{
  "type": "struct",
  "schema-id": 1,
  "row-identifiers": [
    1,
    2
  ],
  "fields": [
    {
      "id": 1,
      "name": "x",
      "required": true,
      "type": "long"
    },
    {
      "id": 2,
      "name": "y",
      "required": true,
      "type": "long",
      "doc": "comment"
    },
    {
      "id": 3,
      "name": "z",
      "required": true,
      "type": "long"
    }
  ]
}

New Schema toString:

table {
  fields {
    1: x: required long
    2: y: required long (comment)
    3: z: required long
  }
  row identifiers { 1,2 }
}

Update rules for row identifiers (a usage sketch follows the list):

  1. A row identifier is added through UpdateSchema.addRowIdentifier(columnName).
  2. The column added must exist in the schema or be part of the newly added columns (so that adding a new primary key is a single atomic update).
  3. Renaming or moving a column does not affect the row identifier, because the identifier references field IDs.
  4. A row identifier is dropped through UpdateSchema.deleteRowIdentifier(columnName).
  5. Only an existing row identifier column can be dropped this way.
  6. A row identifier column cannot be deleted from the schema unless it is first dropped from the row identifiers list. This satisfies both use cases: (1) the user actually wants to drop that row identifier column as an atomic update, and (2) the user is prevented from directly dropping the column without knowing the implications.
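
For concreteness, a minimal usage sketch of these rules, assuming the API names from this PR (addRowIdentifier/deleteRowIdentifier) and a hypothetical existing identifier column old_key:

    // rule 2: the identifier may reference a column added in the same update;
    // rule 6: drop the identifier before dropping its column
    Schema newSchema = new SchemaUpdate(schema, lastColumnId)
        .addColumn("account_id", Types.LongType.get())
        .addRowIdentifier("account_id")
        .deleteRowIdentifier("old_key")
        .deleteColumn("old_key")
        .apply();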

@@ -158,7 +169,7 @@ public static String toJson(Schema schema, boolean pretty) {
     if (pretty) {
       generator.useDefaultPrettyPrinter();
     }
-    toJson(schema.asStruct(), generator);
+    toJson(schema.asStruct(), schema.schemaId(), schema.rowIdentifiers(), generator);
jackye1995 (Contributor, author):

Not sure why this method was not updated in the past. Is there any risk in using the v2 JSON writer?

Member:

I think this PR forgot to refresh this toJson method.

@yyanyy (Contributor) commented Apr 13, 2021:

Haven't reviewed the entire PR yet, just for this specific comment - the reason was mentioned in #2096 (comment). Basically, there are a lot of places where this method is called for serialization/deserialization of the schema as well as to persist the schema to some file metadata, and at that point the schema object is sometimes already transformed/projected, so it is no longer an existing schema of the table and thus shouldn't have a schema ID. Since the schema ID is in theory only useful when retrieved directly from table metadata to get the right schema version for time travel, I think it's better to not explicitly write a schema ID except when writing to the table metadata file, to avoid persisting an incorrect version.

For the case of supporting row identifiers in the schema, I wonder if we want to do something similar. The row identifier is mostly useful during writing (to check the table for the default row identifier for deletes) and not during reading (data/delete files should have this preserved in another place) or any kind of projection (especially since, with the current change, we lose the row identifier information during projection). To ensure we don't run into complications when working with a mutated schema object, I wonder if we want a separate toJson/fromJson to handle the row identifier parsing just for table metadata; once we translate the table metadata file to its model, we read the row identifier from the table instead of the schema, with something like table.getRowIdentifier, which may help with table serialization/deserialization. But this approach seems to contradict some aspects of #2354 (comment).

@rdblue, since you were involved in both conversations, could you please take a look? Thank you!

Contributor:

I think we do want to pass the schema ID here. Whenever we can, we should preserve the ID. My comment on the other PR was recommending that we preserve the ID where possible even when reading and writing manifest files.

I think I missed @yyanyy's follow-up comment about other places that embed the schema ID. I tend to agree with that line of reasoning, but because of it I'm coming to a different conclusion. Rather than removing the schema ID from those places, I think we should try to maintain them.

The risk to keeping schema IDs in other files is that another file may be written with a schema ID that is lost because an older writer (e.g. v1) didn't write out the schema list, only the current schema. If that happens, then an Avro file might write a schema with ID 1, which is then lost and later redefined. But as Yan noted, the actual schema (fields) are stored in those files and we don't need to look up the schema from the ID. So we have a choice of whether to discard the schema ID or not. I'd prefer not to drop it, since we can easily detect whether the schema actually matches the one with its ID and because we won't really need to do the lookup. I think it adds complexity to remember whether or not to drop the ID.

Contributor:

Thanks for the response, @rdblue! Just want to mention an additional case: in addition to the possibility of writing an incorrect ID for a table schema, writing the ID here also makes us persist IDs for things not related to the table schema itself, e.g. manifest lists and delete files. But if we are fine with writing an incorrect ID, we are probably also fine with writing IDs for non-schema structs...

I guess essentially we are choosing between 1) simpler code with potentially confusing IDs and 2) code complexity (to avoid writing incorrect/unuseful IDs); I personally would prefer the latter, but I don't strongly oppose the former if we think it's easier to reason about in code. Also, we might want to involve people from Hive in this discussion, since they use this method as well to preserve the schema (example).

     Types.StructType struct = TypeUtil
         .visit(schema, new ApplyChanges(deletes, updates, adds, moves))
         .asNestedType().asStructType();
-    return new Schema(struct.fields());
+    Set<Integer> refreshedRowIdentifiers = rowIdentifierNames.stream()
+        .map(name -> struct.field(name).fieldId())
Member:

If we delete column b using the SchemaUpdate#deleteColumn API, then the struct variable that has had those changes applied won't include the field with the column name b, right? If so, how could we locate the field ID for that given column name b in this statement?

jackye1995 (Contributor, author):

Yes, that is why I decided to throw an exception when deleteColumn is called and the column is a row identifier. There are two possible situations (see the sketch below):

  1. The user actually wants to delete the column; then he/she should call:
     update.deleteRowIdentifier(columnName).deleteColumn(columnName);
  2. The user does not know the column is used as a row identifier; in that case, just calling deleteColumn results in an exception.
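
A hypothetical sketch of that guard in deleteColumn (not the PR's exact code; identifierFieldNames is the tracking set discussed later in this thread):

    @Override
    public UpdateSchema deleteColumn(String name) {
      // reject deleting a column that is still part of the row identifier
      Preconditions.checkArgument(!identifierFieldNames.contains(name),
          "Cannot delete column %s: it is used as a row identifier; remove the identifier first", name);
      // ... existing deletion logic ...
      return this;
    }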

Member:

Should we add unit tests for the above cases? I mean, deleting the row identifier field and the column in the same transaction.

Contributor:

Yes, we should definitely have a test for this case.
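
A minimal sketch of such a test, assuming the method names above and a SCHEMA fixture whose column "id" is a row identifier:

    @Test
    public void testDeleteRowIdentifierAndColumnInSameUpdate() {
      // dropping the identifier first makes deleting the column legal
      Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
          .deleteRowIdentifier("id")
          .deleteColumn("id")
          .apply();
      Assert.assertNull(newSchema.findField("id"));

      // deleting an identifier column directly should fail
      AssertHelpers.assertThrows("should fail to delete an identifier column directly",
          IllegalArgumentException.class, "row identifier",
          () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).deleteColumn("id").apply());
    }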

Comment on lines 140 to 141
Supplier<C> builder, Consumer<T> appender,
Function<JsonNode, T> getter, Consumer<JsonNode> validator,
Member:

The builder, appender, getter, and validator make the method really hard to follow. Could we just define a simple visitor interface to visit the JsonNodes in this collection and collect the final result in the upper-layer method?

jackye1995 (Contributor, author):

Thanks for the suggestion; please see if the new approach looks better.

Set<Integer> refreshedRowIdentifiers = rowIdentifierNames.stream()
    .map(name -> struct.field(name).fieldId())
    .collect(Collectors.toSet());
return new Schema(struct.fields(), refreshedRowIdentifiers);
Contributor:

I think we may want a change to TableMetadata to ensure that an update to only the row identifier still creates a new schema. I think currently, if a schema update does not change the underlying struct, we consider the operation a no-op (comparison on structs is also why we don't have equals/hashCode in Schema).

jackye1995 (Contributor, author):

Thanks! I completely forgot that part... I added a new method, sameSchema, and now use that to check equality in table metadata.
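
A hedged sketch of what sameSchema could look like, using the final identifierFieldIds naming (the merged implementation may differ):

    // true only when both the underlying struct and the identifier field IDs match
    public boolean sameSchema(Schema other) {
      return asStruct().equals(other.asStruct()) &&
          identifierFieldIds().equals(other.identifierFieldIds());
    }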

* @param columnName name of the column to add as a row identifier
* @return this for method chaining
*/
UpdateSchema addRowIdentifier(String columnName);
Contributor:

Here also, I don't think that "row identifier" is the right noun. I'd prefer addIdentifierField and removeIdentifierField. I'd use remove rather than delete because it is more clear that the field itself will not be deleted, just removed from the identifier. I also think that we should make sure that behavior (removing from the identifier field list, not dropping the column) is clear in the Javadoc.

What about also adding setIdentifierFields? I think that a bulk set/replace operation fits more cleanly with the use case where users are setting the identifier for the first time or replacing it. I think that this will tend to be idempotent: set the identifier fields to account_id and profile_id.

jackye1995 (Contributor, author):

> I'd use remove rather than delete because it is more clear that the field itself will not be deleted

Sounds good. I also called it remove in the doc; delete was trying to match deleteColumn, but I agree it would cause some confusion.

> What about also adding setIdentifierFields

I thought about setIdentifierFields; the reason I did not add it was that the UpdateSchema interface seems to be all individual operations rather than bulk operations (except for unionByNameWith), and our DDL SQL statements are also not bulk updates.

> I think that this will tend to be idempotent

I think both approaches are idempotent. You can write it as:

update.setIdentifierFields(Sets.newHashSet("profile_id")).apply();

or

update.addIdentifierField("profile_id").removeIdentifierField("account_id").apply();

I am fine with both approaches. Technically, if we go with setIdentifierFields, then we do not need addIdentifierField and removeIdentifierField, and I can try to figure out the diff for every update.

@openinx, do you think there is any advantage to one particular approach for the update API?

Member:

Functionally, there should be no difference between addIdentifierField/removeIdentifierField and setIdentifierFields, but from the perspective of usability, if people want to reset to a totally different set of identifier fields, then they will need to:

    Schema schema = ...
    Set<Integer> newFieldIds = ...

    UpdateSchema update = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID);
    for (Integer identifier : schema.rowIdentifiers()) {
      String columnName = schema.findColumnName(identifier);
      Preconditions.checkNotNull(columnName, "Can not find the column with field id %s", identifier);

      update.deleteRowIdentifier(columnName);
    }

    // this loop should walk the new field IDs, not the identifiers just deleted
    for (Integer identifier : newFieldIds) {
      String columnName = schema.findColumnName(identifier);
      Preconditions.checkNotNull(columnName, "Can not find the column with field id %s", identifier);

      update.addRowIdentifier(columnName);
    }

    update.apply();

It does make the API more difficult to use. Besides the setIdentifierFields API, do you think it's necessary to introduce a Set<String> identifierColumns() in Schema? I find that the Schema always returns the fieldId while the UpdateSchema always uses the columnName, so we have to do the conversion from fieldId to columnName everywhere.

Contributor:

@jackye1995, I don't think that the add/remove example you gave is idempotent because of the check identifierFieldNames.contains(name) that will result in an error: "Cannot remove column from identifier fields account_id because it is not an identifier field". @openinx gives a good example of why it's awkward to make users fall back to remove. And it is easy to make set work internally because you just clear the field ID set before calling add in a loop.
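
As a sketch of that point (assumed internals, not the merged code):

    // a set operation built on add: clear the tracked identifiers, then re-add each one
    public UpdateSchema setIdentifierFields(Set<String> names) {
      identifierFieldNames.clear();
      names.forEach(this::addIdentifierField);
      return this;
    }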

jackye1995 (Contributor, author):

Yes, it is not idempotent, and using set to write add/remove is easier than using add/remove to write set. I will change to set then; thanks for the feedback.

@@ -193,6 +204,11 @@ public UpdateSchema renameColumn(String name, String newName) {
       updates.put(fieldId, Types.NestedField.of(fieldId, field.isOptional(), newName, field.type(), field.doc()));
     }

+    if (identifierFieldNames.contains(name)) {
Member:

Should we throw an exception if the newName already exists in identifierFieldNames (or maybe actually in the schema)?

Contributor:

I agree with this. I think it would make sense to check whether newName is already a field in the schema. But that is catching a problem with a rename earlier and isn't really related to this PR.

@jackye1995 (Contributor, author) commented Apr 19, 2021:

Yeah, I was expecting the lines above to catch this, but surprisingly they did not do this check. I can put up another small PR to fix it.

sb.append("\n }\n identifier fields { ");
sb.append(COMMA.join(identifierFieldIds()));
sb.append(" }\n}");
return sb.toString();
Contributor:

Minor: I'd prefer a smaller change to the toString format. Maybe something like adding (id) to identifier fields in the list, rather than adding the identifier fields { ... }.

Preconditions.checkArgument(field != null,
"Cannot find column of the given name in existing or newly added columns: %s", name);
Preconditions.checkArgument(field.isRequired(),
"Cannot add column %s as an identifier field because it is not a required field", name);
Contributor:

I don't think that this requirement is correct. In the case where we're adding a column and then making it part of the identifier, values for all of the existing data will be null. I don't see a reason why we can't consider, for example, (1, null) to be an identifier.

The strongest argument for making this a required field is SQL's boolean logic, where null is not equal to itself. But in Iceberg, we translate those expressions so that we can use normal boolean logic throughout the library. Equality deletes consider null equal to itself in Iceberg, just like set operations in SQL use null-safe equality.
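
A plain-Java illustration of the null-safe equality described here (not Iceberg's internal comparator):

    import java.util.Objects;

    class NullSafeEquality {
      // SQL's `=` yields unknown for (null, null); equality deletes instead treat
      // null as equal to itself, the same behavior as Objects.equals
      static boolean identifierMatches(Long a, Long b) {
        return Objects.equals(a, b); // true when both values are null
      }
    }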

jackye1995 (Contributor, author):

Sounds good. I was trying to match the primary key spec in relational SQL, but I agree that in our upsert case null still makes sense. I will update the logic.

"Cannot add column %s as an identifier field because it is not a primitive field", name);
Preconditions.checkArgument(!identifierFieldNames.contains(name),
"Cannot add column %s as an identifier field because it already is", name);
identifierFieldNames.add(name);
Contributor:

I think this should also check that the column isn't nested within a map or list because those contain repeated keys, values, or elements. Anything within a repeated field can't be part of an identifier because there is more than one.

if (field == null) {
  field = adds.get(TABLE_ROOT_ID).stream()
      .filter(f -> f.name().equals(name))
      .findAny().orElse(null);
Contributor:

This assumes that the field is not a newly added field within a struct.

I think that this should use the same strategy that we use for addColumn. There are two variations of addColumn, one that accepts a parent name and a field name, and one that accepts only a field name. If the latter is called with a name that contains ., then an exception is thrown because the reference is ambiguous. For example, a.b could be a top-level field named a.b or could be field b nested within field a.

By structuring the methods that way, we always know the parent field and can use that instead of assuming TABLE_ROOT_ID here.

jackye1995 (Contributor, author):

Combining this with your last comment, it seems like the best way to go is to just check that the column isn't nested in a map/list, in which case the root will always be TABLE_ROOT_ID. With that check added, I feel it is a bit redundant to have two methods like addColumn, because the private addIdentifierField method would always use root as the parent. Let me know what you think of the new code.

Contributor:

A column could be nested in a struct without being in a map or a list. For example, if you had a table of HTTP logs, you might have a column of request information with request.path and request.headers, along with a response column that has response.code. It would be valid to refer to response.code because there is only one per row. That's the case where we need the parent so we can distinguish between a top-level response.code and code within the response struct.
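
The HTTP-log example as an Iceberg schema (field IDs and types here are illustrative; required is a static import of Types.NestedField.required):

    Schema logs = new Schema(
        required(1, "request", Types.StructType.of(
            required(2, "path", Types.StringType.get()),
            required(3, "headers", Types.MapType.ofRequired(4, 5,
                Types.StringType.get(), Types.StringType.get())))),
        required(6, "response", Types.StructType.of(
            required(7, "code", Types.IntegerType.get()))));
    // "response.code" is nested only in structs, so there is one value per row
    // and it can be an identifier field; anything under "headers" (a map) cannot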

Member:

I did not realize that we could add a single column named response.code with this addColumn API (checked the code; it sounds like the correct way to add response.code as a top-level column is: addColumn(null, "response.code", type, doc)).

If we want to distinguish between nested columns and top-level columns whose names contain a DOT, then we will need to add an extra API for this, right?

// This is used to add columns whose names do not contain a DOT.
UpdateSchema setIdentifierFields(Collection<String> names);

// This is used to add nested columns, or top-level columns whose names contain a DOT.
UpdateSchema setIdentifierFields(Collection<Pair<String, String>> parentAndNames);

@rdblue (Contributor) commented Apr 21, 2021:

@openinx, that's what I was thinking at first as well, but I don't think that we need a version with parent now.

The reason is that once a column is added, it is uniquely identified by a dotted name: a.b might be a struct named a with a field named b or might be a field named a.b, but it cannot be both. We guarantee this so that we never have ambiguous names.

Because we know that the string passed in here will always identify just one field, we don't need to handle the ambiguous case by adding an explicit parent like we do with addColumn.

I think that means we can have a much nicer API (usage sketch after the list):

  • setIdentifierFields(String...) and setIdentifierFields(Collection<String>)
  • addIdentifierField(String)
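
How that shape reads at the call site (a sketch):

    table.updateSchema()
        .setIdentifierFields("account_id", "profile_id")
        .commit();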

@@ -51,6 +53,7 @@
   private final TableMetadata base;
   private final Schema schema;
   private final Map<Integer, Integer> idToParent;
+  private final Set<String> identifierFieldNames;
Contributor:

Why did you choose to use names instead of ids here? I think IDs would be a little simpler because you wouldn't need to translate in more cases. For example, rename wouldn't need to modify the identifier field ID set because it doesn't change the field IDs. You'd also only need to copy the set of field IDs returned by the current schema in the constructor.

I was looking for a reason why you might have chosen to use names, but I think that it is straightforward to use IDs instead of names in all cases. For example, the check in deleteColumn validates the name passed in directly, which seems easier; but the name has already been looked up in the schema so we have a field and could validate identifierFieldIds.contains(field.fieldId()) instead like the other checks do.
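
For illustration, the ID-based check could look like this (a sketch with assumed field names):

    Types.NestedField field = schema.findField(name);
    Preconditions.checkArgument(field != null, "Cannot delete missing column: %s", name);
    Preconditions.checkArgument(!identifierFieldIds.contains(field.fieldId()),
        "Cannot delete identifier field: %s", name);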

jackye1995 (Contributor, author):

Yes, with the set approach the concern that led me to use names is gone; I will change to use IDs.

@rdblue (Contributor) commented Apr 18, 2021:

Thanks, @jackye1995! I think overall the direction is looking right with identifierFieldIds. I just noted a few things to fix:

  • Need to add setIdentifierFields
  • Needs some implementation fixes (track by ID, etc.)
  • Need to refactor tests into separate cases

@jackye1995 (Contributor, author) commented Apr 22, 2021:

@rdblue @openinx sorry for the delayed update. I tried multiple different ways to implement this and went back and forth, and in the end the cleanest one I found was to use names to record the set of identifier fields. At applyChanges time, deletion of an identifier column is checked before generating the new schema struct, to validate deletion of identifier fields; other validations are done after the new schema struct is generated.

Ryan talked about using two sets, one for existing IDs and one for new names. The issue with that approach was that the validation checks done for the two sets are very similar. For existing IDs, I need to check recursively that all the parents of the field up to the root are structs, which can be done using the idToParent index. For new names, I need to do something very similar, but also merge idToParent with the parent information of newly added fields. So in the end the code was very similar to just creating a new schema and recreating the idToParent index, and a bit redundant. Combining those two facts, I think it is much cleaner to just store everything as names, then validate the names and resolve them back to IDs after the new schema struct is generated.

Ryan also talked about building an index of added names to IDs when adding new columns; with this approach, that is not needed because the new schema already has that index when calling findField. I think it is a win to not increase complexity in addColumn and to avoid new indexes in UpdateSchema.

I could also do the deletes check in the validation function, but in the end I chose to do it in a separate code block before generating the schema struct, because doing it in the validation method requires passing in many arguments, and that flow was not intuitive to code readers; it only increased the complexity of the code and reduced readability.

There are some other methods I tried, such as filtering all potential identifier fields in the schema and then checking whether the identifiers are in that potential set. But that approach could not produce informative error messages, so I did not go that route.

I have added some more tests to cover all possible use cases; please let me know if you have any additional concerns with the approach.


SchemaUpdate(TableOperations ops) {
  this.ops = ops;
  this.base = ops.current();
  this.schema = base.schema();
  this.lastColumnId = base.lastColumnId();
  this.idToParent = Maps.newHashMap(TypeUtil.indexParents(schema.asStruct()));
  this.identifierNames = schema.identifierFieldIds().stream()
Member:

Nit: it may be unrelated to this PR, but I think it's good to merge these two constructors into one so that we could have a simpler version of the code:

SchemaUpdate(TableOperations ops) {
    this(ops, ops.current(), ops.current().schema(), ops.current().lastColumnId());
  }

  /**
   * For testing only.
   */
  SchemaUpdate(Schema schema, int lastColumnId) {
    this(null, null, schema, lastColumnId);
  }

  private SchemaUpdate(TableOperations ops, TableMetadata base, Schema schema, int lastColumnId) {
    this.ops = ops;
    this.base = base;
    this.schema = schema;
    this.lastColumnId = lastColumnId;
    this.idToParent = Maps.newHashMap(TypeUtil.indexParents(schema.asStruct()));
    this.identifierNames = schema.identifierFieldIds().stream()
        .map(id -> schema.findField(id).name())
        .collect(Collectors.toSet());
  }

Preconditions.checkArgument(field.type().isPrimitiveType(),
"Cannot add field %s as an identifier field: not a primitive type field", name);
Map<Integer, Integer> newIdToParent = TypeUtil.indexParents(schema.asStruct());
validateIdentifierFieldParent(field.name(), field.fieldId(), newIdToParent, schema);
Member:

I would prefer to use a while loop (rather than a recursive method) to describe the parent field validation logic, because it is clearer:

  private static int validateIdentifierField(String name, Schema schema) {
    Types.NestedField field = schema.findField(name);
    Preconditions.checkArgument(field != null,
        "Cannot add field %s as an identifier field, not found in current schema or added columns", name);
    Preconditions.checkArgument(field.type().isPrimitiveType(),
        "Cannot add field %s as an identifier field: not a primitive type field", name);

    // Check that the nested column sits only within a chain of struct fields.
    Map<Integer, Integer> newIdToParent = TypeUtil.indexParents(schema.asStruct());
    Integer parentId = newIdToParent.get(field.fieldId());
    while (parentId != null) {
      Types.NestedField parent = schema.findField(parentId);
      ValidationException.check(parent.type().isStructType(),
          "Cannot add field %s as an identifier field: must not be nested in %s", name, parent);
      parentId = newIdToParent.get(parent.fieldId());
    }
    return field.fieldId();
  }

@openinx (Member) commented Apr 23, 2021:

The current patch looks good to me overall; I just left a few minor comments.

jackye1995 (Contributor, author):

@openinx thanks, fixed based on your comments

@openinx (Member) commented Apr 26, 2021:

@rdblue, would you have any other concerns? If not, I plan to get this merged. Thanks.

* @param names names of the columns to set as identifier fields
* @return this for method chaining
*/
UpdateSchema setIdentifierFields(Set<String> names);
Contributor:

Why require this to be a set instead of accepting any Collection or Iterable? Would a user never need to pass a List?

jackye1995 (Contributor, author):

Set was used to explicitly tell users that identifier fields need to be unique, instead of giving people the illusion that the update operation can still succeed with repeated values.

Technically we can use List, Iterable, or Collection. If we would like to support a broader use case for the API, I can document the behavior in the Javadoc.

Contributor:

I would prefer to use Collection. It's fine to collapse that into a set internally since a set of columns or a multi-set of columns has the same identity behavior. Collection is easier to use for the caller.
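
A sketch of the Collection-accepting variant (the merged code may differ):

    @Override
    public UpdateSchema setIdentifierFields(Collection<String> names) {
      Set<String> unique = Sets.newHashSet(names); // duplicates collapse harmlessly
      identifierFieldNames.clear();
      unique.forEach(this::addIdentifierField);    // per-field validation still applies
      return this;
    }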

))
.apply();

AssertHelpers.assertThrows("add a nested field in struct of a map should fail",
Contributor:

Is this different than the points.element.x case?

jackye1995 (Contributor, author):

This is trying to test struct -> list -> struct, instead of just list -> struct, to trigger the while loop.

.setIdentifierFields(Sets.newHashSet("locations"))
.apply());

AssertHelpers.assertThrows("add a nested field in map should fail",
Contributor:

Can you add a test for trying to reference a field in a map value struct?

jackye1995 (Contributor, author):

added

jackye1995 (Contributor, author):

@rdblue updated based on your comments. For Set versus List or anything else in the API, please let me know what you think the general guidance should be for this type of change in Iceberg.

@yyanyy (Contributor) commented Apr 29, 2021:

LGTM overall. I just want to bring the comment below (not really related to the focus of this PR itself), which I had originally here, to people's attention and make sure we are all aligned, since this PR does change what we preserve in file metadata. If we go with the current approach, I'll raise a separate PR to mention the usage of schema IDs and the caveat in the spec.

Also cc @pvary since the hive module uses schema toJson, as indicated in the comment below.

> Basically, there are a lot of places where this method is called for serialization/deserialization of the schema as well as to persist the schema to some file metadata, and at that point the schema object is sometimes already transformed/projected, so it is no longer an existing schema of the table and thus shouldn't have a schema ID. Since the schema ID is in theory only useful when retrieved directly from table metadata to get the right schema version for time travel, I think it's better to not explicitly write a schema ID except when writing to the table metadata file, to avoid persisting an incorrect version.
> ...
> Thanks for the response, @rdblue! Just want to mention an additional case: in addition to the possibility of writing an incorrect ID for a table schema, writing the ID here also makes us persist IDs for things not related to the table schema itself, e.g. manifest lists and delete files. But if we are fine with writing an incorrect ID, we are probably also fine with writing IDs for non-schema structs...
>
> I guess essentially we are choosing between 1) simpler code with potentially confusing IDs and 2) code complexity (to avoid writing incorrect/unuseful IDs); I personally would prefer the latter, but I don't strongly oppose the former if we think it's easier to reason about in code. Also, we might want to involve people from Hive in this discussion, since they use this method as well to preserve the schema (example).

@pvary (Contributor) commented Apr 29, 2021:

> Also cc @pvary since the hive module uses schema toJson, as indicated in the comment below.

Thanks @yyanyy for the notification.

I checked the Hive-related code paths and I do not see any issues with the new ID that are not covered by the tests. We do not persist the schema JSON; we use toJson to serialize the schema during the create table call, which should be OK, I think.

@rdblue (Contributor) commented May 2, 2021:

Overall this looks correct to me. I think it would be nice to replace the Set with Collection in the UpdateSchema API, but that can be done later. Thanks for building this, @jackye1995!
