Core: add row identifier to schema #2465
Conversation
```diff
@@ -158,7 +169,7 @@ public static String toJson(Schema schema, boolean pretty) {
     if (pretty) {
       generator.useDefaultPrettyPrinter();
     }
-    toJson(schema.asStruct(), generator);
+    toJson(schema.asStruct(), schema.schemaId(), schema.rowIdentifiers(), generator);
```
Not sure why this method was not updated in the past; is there any risk in using the v2 JSON writer?
I think this PR forgot to update this `toJson` method.
Haven't reviewed the entire PR yet, just for this specific comment: the reason was mentioned in #2096 (comment). Basically, I think there are a lot of places where this method is called for serialization/deserialization of the schema, as well as to persist the schema into some file metadata, and at that point the `Schema` object has sometimes already been transformed/projected, so it is no longer an existing schema of the table and thus shouldn't have a schema ID. Since the schema ID is in theory only useful when retrieved directly from table metadata to get the right schema version for time travel, I think it's better not to explicitly write a schema ID except when writing to the table metadata file, to avoid persisting an incorrect version.

For the case of adding the row identifier to the schema, I wonder if we want to do something similar. The row identifier is mostly useful during writing (to check the table for the default row identifier for deletes), not during reading (since data/delete files should have this preserved elsewhere) or during any kind of projection (especially since with the current change we lose row identifier information during projection). So to ensure we don't run into complications when working with a mutated schema object, I wonder if we want a separate `toJson`/`fromJson` to handle row identifier parsing just for table metadata. Once we translate the table metadata file to its model, we would read the row ID from the table instead of the schema with something like `table.getRowIdentifier`, which may help with table serialization/deserialization. But this approach seems to contradict some aspects of #2354 (comment).

@rdblue since you were involved in both conversations, could you please take a look? Thank you!
I think we do want to pass the schema ID here. Whenever we can, we should preserve the ID. My comment on the other PR was recommending that we preserve the ID where possible even when reading and writing manifest files.
I think I missed @yyanyy's follow-up comment about other places that embed the schema ID. I tend to agree with that line of reasoning, but because of it I'm coming to a different conclusion. Rather than removing the schema ID from those places, I think we should try to maintain them.
The risk to keeping schema IDs in other files is that another file may be written with a schema ID that is lost because an older writer (e.g. v1) didn't write out the schema list, only the current schema. If that happens, then an Avro file might write a schema with ID 1, which is then lost and later redefined. But as Yan noted, the actual schema (fields) are stored in those files and we don't need to look up the schema from the ID. So we have a choice of whether to discard the schema ID or not. I'd prefer not to drop it, since we can easily detect whether the schema actually matches the one with its ID and because we won't really need to do the lookup. I think it adds complexity to remember whether or not to drop the ID.
Thanks for the response @rdblue! Just want to mention an additional case: in addition to the possibility of writing an incorrect ID for a table schema, writing the ID here also makes us persist IDs for things not related to the table schema itself, e.g. manifest lists and delete files. But if we are fine with writing an incorrect ID, we are probably also fine with writing IDs for non-schema structs...

I guess essentially we are choosing between 1) simpler code with potentially confusing IDs, and 2) code complexity (from avoiding writing incorrect/unuseful IDs). I personally would prefer the latter, but I don't strongly oppose the former if we think it's easier to reason about in code. Also, we might want to involve people from Hive in this discussion, since they are using this method as well to preserve the schema (example).
```java
Types.StructType struct = TypeUtil
    .visit(schema, new ApplyChanges(deletes, updates, adds, moves))
    .asNestedType().asStructType();
return new Schema(struct.fields());
Set<Integer> refreshedRowIdentifiers = rowIdentifierNames.stream()
    .map(name -> struct.field(name).fieldId())
```
If we delete column `b` using the `SchemaUpdate#deleteColumn` API, then the `struct` variable to which those changes have been applied won't include the field with column name `b`, right? If so, then how could we locate the field ID for that given column name `b` here?
Yes, that is why I decided to throw an exception when `deleteColumn` is called and the column is a row identifier. There are two possible situations:

- the user actually wants to delete the column, in which case he/she should call `update.deleteRowIdentifier(columnName).deleteColumn(columnName);`
- the user does not know the column is used as a row identifier, and in that case just calling `deleteColumn` would result in an exception.
Should we add unit tests for the above cases? I mean deleting the row identifier field and the column in the same transaction.
Yes, we should definitely have a test for this case.
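For illustration, here is a minimal sketch of what such a test could assert, modeled with a toy `SchemaUpdate`-like class using plain Java collections rather than the real Iceberg API (all names are hypothetical):

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the rule under discussion: deleting a column that is still a
// row identifier must fail; removing it from the identifier set first succeeds.
class ToySchemaUpdate {
    private final Set<String> columns = new HashSet<>();
    private final Set<String> identifierFields = new HashSet<>();

    ToySchemaUpdate addColumn(String name) { columns.add(name); return this; }

    ToySchemaUpdate addRowIdentifier(String name) {
        identifierFields.add(name);
        return this;
    }

    ToySchemaUpdate deleteRowIdentifier(String name) {
        identifierFields.remove(name);
        return this;
    }

    ToySchemaUpdate deleteColumn(String name) {
        if (identifierFields.contains(name)) {
            throw new IllegalArgumentException("Cannot delete identifier field: " + name);
        }
        columns.remove(name);
        return this;
    }

    boolean hasColumn(String name) { return columns.contains(name); }
}

public class DeleteIdentifierTest {
    public static void main(String[] args) {
        ToySchemaUpdate update = new ToySchemaUpdate()
            .addColumn("b").addRowIdentifier("b");

        // Deleting the column while it is still an identifier should fail.
        boolean threw = false;
        try {
            update.deleteColumn("b");
        } catch (IllegalArgumentException e) {
            threw = true;
        }
        if (!threw) throw new AssertionError("expected deleteColumn to fail for an identifier field");

        // Removing it from the identifier set first makes the delete legal.
        update.deleteRowIdentifier("b").deleteColumn("b");
        if (update.hasColumn("b")) throw new AssertionError();
        System.out.println("ok");
    }
}
```

The real test would do the same two-step check against `SchemaUpdate` in a single transaction.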
```java
Supplier<C> builder, Consumer<T> appender,
Function<JsonNode, T> getter, Consumer<JsonNode> validator,
```
The `builder`, `appender`, `getter` and `validator` arguments make the method really hard to follow. Could we just define a simple visitor interface to visit each `JsonNode` in this collection and collect the final results in the calling method?
Thanks for the suggestion; please see if the new approach looks better.
```java
Set<Integer> refreshedRowIdentifiers = rowIdentifierNames.stream()
    .map(name -> struct.field(name).fieldId())
    .collect(Collectors.toSet());
return new Schema(struct.fields(), refreshedRowIdentifiers);
```
I think we may want a change to `TableMetadata` to ensure that an update to only the row identifier still creates a new schema. I think currently, if a schema update does not change the underlying struct, we consider the operation a no-op (the comparison on `struct` is also why we don't have `equals`/`hashCode` in `Schema`).
Thanks! I completely forgot that part... I added a new method `sameSchema` and now use that to check equality in table metadata.
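For context, a minimal plain-Java sketch of the kind of equality check being described: compare both the struct and the identifier field IDs, without adding `equals`/`hashCode` to `Schema` itself (toy names, not the actual Iceberg code):

```java
import java.util.List;
import java.util.Set;

// Toy stand-in for Schema: a field list plus identifier field IDs.
record ToySchema(List<String> fields, Set<Integer> identifierFieldIds) {

    // Two schemas are "the same" only if both the struct and the identifier
    // field IDs match; this is what a sameSchema-style check needs so that
    // an identifier-only update is not treated as a no-op.
    boolean sameSchema(ToySchema other) {
        return fields.equals(other.fields)
            && identifierFieldIds.equals(other.identifierFieldIds);
    }
}

public class SameSchemaDemo {
    public static void main(String[] args) {
        ToySchema a = new ToySchema(List.of("id", "data"), Set.of(1));
        ToySchema b = new ToySchema(List.of("id", "data"), Set.of(1, 2));

        // Same struct, different identifiers: must NOT compare as the same schema.
        if (a.sameSchema(b)) throw new AssertionError("should differ");
        System.out.println("identifier-only change detected");
    }
}
```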
```java
 * @param columnName name of the column to add as a row identifier
 * @return this for method chaining
 */
UpdateSchema addRowIdentifier(String columnName);
```
Here also, I don't think that "row identifier" is the right noun. I'd prefer `addIdentifierField` and `removeIdentifierField`. I'd use `remove` rather than `delete` because it is more clear that the field itself will not be deleted, just removed from the identifier. I also think that we should make sure that behavior (removing from the identifier field list, not dropping the column) is clear in the Javadoc.

What about also adding `setIdentifierFields`? I think that a bulk set/replace operation fits more cleanly with the use case where users are setting the identifier for the first time or replacing it. I think that this will tend to be idempotent: set the identifier fields to `account_id` and `profile_id`.
> I'd use remove rather than delete because it is more clear that the field itself will not be deleted

Sounds good. I also called it `remove` in the doc; `delete` was trying to match `deleteColumn`, but I agree it would cause some confusion.

> What about also adding setIdentifierFields

I thought about `setIdentifierFields`; the reason I did not add it was that the `UpdateSchema` interface seems to be all individual operations instead of bulk operations (except for `unionByNameWith`), and our DDL SQL statements are also not bulk updates.

> I think that this will tend to be idempotent

I think both approaches are idempotent. You can write it as:

```java
update.setIdentifierFields(Sets.newHashSet("profile_id")).apply();
```

or

```java
update.addIdentifierField("profile_id").removeIdentifierField("account_id").apply();
```

I am fine with both approaches. Technically, if we go with `setIdentifierFields`, then we do not need `addIdentifierField` and `removeIdentifierField`, and I can try to figure out the diff for every update.

@openinx do you think there is any advantage to either approach for the update API?
Functionally, there should be no difference between `addIdentifierField`/`removeIdentifierField` and `setIdentifierFields`, but from the perspective of usability, if people want to reset to a totally different set of identifier fields, they will need to:

```java
Schema schema = ...
Set<Integer> newFieldIds = ...
UpdateSchema update = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID);
for (Integer identifier : schema.rowIdentifiers()) {
  String columnName = schema.findColumnName(identifier);
  Preconditions.checkNotNull(columnName, "Can not find the column with field id %s", identifier);
  update.deleteRowIdentifier(columnName);
}
for (Integer identifier : newFieldIds) {
  String columnName = schema.findColumnName(identifier);
  Preconditions.checkNotNull(columnName, "Can not find the column with field id %s", identifier);
  update.addRowIdentifier(columnName);
}
update.apply();
```

It does make the API more difficult to use. Besides the `setIdentifierFields` API, do you think it's necessary to introduce a `Set<String> identifierColumns()` in `Schema`? I find that the schema always returns the `fieldId` while `UpdateSchema` always uses the `columnName`, so we have to convert from `fieldId` to `columnName` everywhere.
@jackye1995, I don't think that the add/remove example you gave is idempotent, because of the check `identifierFieldNames.contains(name)` that will result in an error: "Cannot remove column from identifier fields account_id because it is not an identifier field". @openinx gives a good example of why it's awkward to make users fall back to `remove`. And it is easy to make `set` work internally, because you just clear the field ID set before calling `add` in a loop.
Yes, it is not idempotent, and using set to implement add/remove is easier than using add/remove to implement set. I will change to set then, thanks for the feedback.
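The clear-then-add idea can be sketched with plain Java collections (a hypothetical toy class, not the real `SchemaUpdate`). Note that the `set` variant stays idempotent, while a bare `remove` fails for a field that is not currently an identifier:

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class ToyIdentifierUpdate {
    private final Set<String> identifierFields = new LinkedHashSet<>();

    ToyIdentifierUpdate addIdentifierField(String name) {
        identifierFields.add(name);
        return this;
    }

    ToyIdentifierUpdate removeIdentifierField(String name) {
        if (!identifierFields.contains(name)) {
            throw new IllegalArgumentException(name + " is not an identifier field");
        }
        identifierFields.remove(name);
        return this;
    }

    // Bulk set implemented internally as clear + add in a loop; taking a
    // Collection (rather than Set) keeps the call site flexible for the caller.
    ToyIdentifierUpdate setIdentifierFields(Collection<String> names) {
        identifierFields.clear();
        for (String name : names) {
            addIdentifierField(name);
        }
        return this;
    }

    Set<String> apply() { return new HashSet<>(identifierFields); }
}

public class SetVsAddRemoveDemo {
    public static void main(String[] args) {
        ToyIdentifierUpdate update = new ToyIdentifierUpdate();
        update.setIdentifierFields(List.of("profile_id"));
        // Re-running the same set call is idempotent.
        update.setIdentifierFields(List.of("profile_id"));
        if (!update.apply().equals(Set.of("profile_id"))) throw new AssertionError();

        // Re-running add/remove is not: removing a field that is no longer
        // an identifier throws.
        boolean threw = false;
        try {
            update.removeIdentifierField("account_id");
        } catch (IllegalArgumentException e) {
            threw = true;
        }
        if (!threw) throw new AssertionError();
        System.out.println("ok");
    }
}
```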
```diff
@@ -193,6 +204,11 @@ public UpdateSchema renameColumn(String name, String newName) {
       updates.put(fieldId, Types.NestedField.of(fieldId, field.isOptional(), newName, field.type(), field.doc()));
     }

+    if (identifierFieldNames.contains(name)) {
```
Should we throw an exception if the `newName` already exists in `identifierFieldNames` (or maybe actually in `schema`)?
I agree with this. I think it would make sense to check whether `newName` is already a field in the schema. But that is catching a rename problem earlier and isn't really related to this PR.
Yeah, I was expecting the lines above to catch this, but surprisingly it did not do this check. I can put up another small PR to fix it.
```java
sb.append("\n }\n identifier fields { ");
sb.append(COMMA.join(identifierFieldIds()));
sb.append(" }\n}");
return sb.toString();
```
Minor: I'd prefer a smaller change to the `toString` format. Maybe something like adding `(id)` to identifier fields in the list, rather than adding the `identifier fields { ... }` block.
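The `(id)` suggestion might render along these lines (a toy formatter to show the idea, not the actual `Schema.toString` implementation):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ToStringSketch {
    // Append an "(id)" marker to identifier fields inside the existing field
    // list, instead of adding a separate "identifier fields { ... }" block.
    static String format(List<String> fields, Set<String> identifierFields) {
        return fields.stream()
            .map(f -> identifierFields.contains(f) ? f + " (id)" : f)
            .collect(Collectors.joining("\n  ", "table {\n  ", "\n}"));
    }

    public static void main(String[] args) {
        System.out.println(format(
            List.of("1: id: required long", "2: data: optional string"),
            Set.of("1: id: required long")));
    }
}
```

This prints the field list with `1: id: required long (id)` marked inline and the non-identifier field unchanged.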
```java
Preconditions.checkArgument(field != null,
    "Cannot find column of the given name in existing or newly added columns: %s", name);
Preconditions.checkArgument(field.isRequired(),
    "Cannot add column %s as an identifier field because it is not a required field", name);
```
I don't think that this requirement is correct. In the case where we're adding a column and then making it part of the identifier, values for all of the existing data will be null. I don't see a reason why we can't consider, for example, (1, null)
to be an identifier.
The strongest argument for making this a required field is SQL's boolean logic, where null
is not equal to itself. But in Iceberg, we translate those expressions so that we can use normal boolean logic throughout the library. Equality deletes consider null equal to itself in Iceberg, just like set operations in SQL use null-safe equality.
Sounds good, I was trying to match the primary key spec in relational SQL, but I agree that in our case of upsert null still makes sense. I will update the logic.
```java
    "Cannot add column %s as an identifier field because it is not a primitive field", name);
Preconditions.checkArgument(!identifierFieldNames.contains(name),
    "Cannot add column %s as an identifier field because it already is", name);
identifierFieldNames.add(name);
```
I think this should also check that the column isn't nested within a map or list because those contain repeated keys, values, or elements. Anything within a repeated field can't be part of an identifier because there is more than one.
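The ancestor check being asked for can be modeled as a walk up a parent index (a plain-Java toy, not the Iceberg implementation): every ancestor of an identifier field must be a struct, because anything under a map or list is repeated.

```java
import java.util.Map;

public class AncestorCheckDemo {
    // Walk the idToParent index from the candidate field up to the root;
    // every ancestor must be a struct, because a field under a map or list
    // is repeated and cannot identify a single row.
    static boolean validIdentifier(int fieldId,
                                   Map<Integer, Integer> idToParent,
                                   Map<Integer, String> typeOf) {
        Integer parent = idToParent.get(fieldId);
        while (parent != null) {
            if (!"struct".equals(typeOf.get(parent))) {
                return false;
            }
            parent = idToParent.get(parent);
        }
        return true;
    }

    public static void main(String[] args) {
        // 1: struct "request" -> 2: "path"
        // 3: list "points" -> 4: element struct -> 5: "x"
        Map<Integer, Integer> idToParent = Map.of(2, 1, 4, 3, 5, 4);
        Map<Integer, String> typeOf = Map.of(1, "struct", 3, "list", 4, "struct");

        // Nested in a struct chain: allowed.
        if (!validIdentifier(2, idToParent, typeOf)) throw new AssertionError();
        // Nested under a list: rejected.
        if (validIdentifier(5, idToParent, typeOf)) throw new AssertionError();
        System.out.println("ok");
    }
}
```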
```java
if (field == null) {
  field = adds.get(TABLE_ROOT_ID).stream()
      .filter(f -> f.name().equals(name))
      .findAny().orElse(null);
```
This assumes that the field is not a newly added field within a struct.

I think that this should use the same strategy that we use for `addColumn`. There are two variations of `addColumn`: one that accepts a parent name and a field name, and one that accepts only a field name. If the latter is called with a name that contains `.`, then an exception is thrown because the reference is ambiguous. For example, `a.b` could be a top-level field named `a.b`, or it could be field `b` nested within field `a`.

By structuring the methods that way, we always know the parent field and can use that instead of assuming `TABLE_ROOT_ID` here.
Combining this with your last comment, it seems like the best way to go is to just check that the column isn't nested in a map/list, in which case the root will always be `TABLE_ROOT_ID`. With that check added, it feels a bit redundant to have two methods like `addColumn`, because the private `addIdentifierField` method would always use root as parent. Let me know what you think of the new code.
A column could be nested in a struct without being in a map or a list. For example, if you had a table of HTTP logs, you might have a column of `request` information with `request.path` and `request.headers`, along with a `response` column that has `response.code`. It would be valid to refer to `response.code` because there is only one per row. That's the case where we need the parent, so we can distinguish between a top-level `response.code` and `code` within the `response` struct.
I did not realize that we could add a single column name `response.code` in this `addColumn` API (checked the code; it sounds like the correct way to add `response.code` as a top-level column is `addColumn(null, "response.code", type, doc)`).

If we want to distinguish between nested columns and top-level columns whose names contain a dot, then we will need to add an extra API for this, right?

```java
// This is used to add columns whose names do not contain a dot.
UpdateSchema setIdentifierFields(Collection<String> names);
// This is used to add nested columns or top-level columns whose names contain a dot.
UpdateSchema setIdentifierFields(Collection<Pair<String, String>> parentAndNames);
```
@openinx, that's what I was thinking at first as well, but I don't think that we need a version with a parent now.

The reason is that once a column is added, it is uniquely identified by a dotted name: `a.b` might be a struct named `a` with a field named `b`, or might be a field named `a.b`, but it cannot be both. We guarantee this so that we never have ambiguous names.

Because we know that the string passed in here will always identify just one field, we don't need to handle the ambiguous case by adding an explicit parent like we do with `addColumn`.

I think that means we can have a much nicer API:

- `setIdentifierFields(String...)` and `setIdentifierFields(Collection<String>)`
- `addIdentifierField(String)`
```diff
@@ -51,6 +53,7 @@
   private final TableMetadata base;
   private final Schema schema;
   private final Map<Integer, Integer> idToParent;
+  private final Set<String> identifierFieldNames;
```
Why did you choose to use names instead of IDs here? I think IDs would be a little simpler because you wouldn't need to translate in as many cases. For example, rename wouldn't need to modify the identifier field ID set because it doesn't change the field IDs. You'd also only need to copy the set of field IDs returned by the current schema in the constructor.

I was looking for a reason why you might have chosen to use names, but I think that it is straightforward to use IDs instead of names in all cases. For example, the check in `deleteColumn` validates the name passed in directly, which seems easier; but the name has already been looked up in the schema, so we have a field and could validate `identifierFieldIds.contains(field.fieldId())` instead, like the other checks do.
Yes, with the set approach the concern that led me to use names is gone; I will change to use IDs.
Thanks, @jackye1995! I think overall the direction is looking right.

@rdblue @openinx sorry for the delayed update. I tried multiple different ways to implement this and went back and forth, and in the end the cleanest one I found was to use names to record the identifier field set. Ryan talked about using 2 sets, one for existing IDs and one for new names. The issue with that approach was that the validation checks done for the 2 sets are very similar: for existing IDs, I need to check recursively that all the parents of the field up to the root are structs. Ryan also talked about building an index of added name to ID when adding new columns; with this approach that is not needed, because the new schema would already have that index.

There are some other methods I tried, such as filtering all potential identifier fields in the schema and then checking if the identifiers are in that potential set. But that approach could not produce an informative error message, so I did not go with that route.

I have added some more tests to cover all possible use cases; please let me know if you have any additional concerns with the approach.
```java
SchemaUpdate(TableOperations ops) {
  this.ops = ops;
  this.base = ops.current();
  this.schema = base.schema();
  this.lastColumnId = base.lastColumnId();
  this.idToParent = Maps.newHashMap(TypeUtil.indexParents(schema.asStruct()));
  this.identifierNames = schema.identifierFieldIds().stream()
```
Nit: it may be unrelated to this PR, but I think it's good to merge these two constructors into one so that we could have a simpler version of the code:
```java
SchemaUpdate(TableOperations ops) {
  this(ops, ops.current(), ops.current().schema(), ops.current().lastColumnId());
}

/**
 * For testing only.
 */
SchemaUpdate(Schema schema, int lastColumnId) {
  this(null, null, schema, lastColumnId);
}

private SchemaUpdate(TableOperations ops, TableMetadata base, Schema schema, int lastColumnId) {
  this.ops = ops;
  this.base = base;
  this.schema = schema;
  this.lastColumnId = lastColumnId;
  this.idToParent = Maps.newHashMap(TypeUtil.indexParents(schema.asStruct()));
  this.identifierNames = schema.identifierFieldIds().stream()
      .map(id -> schema.findField(id).name())
      .collect(Collectors.toSet());
}
```
```java
Preconditions.checkArgument(field.type().isPrimitiveType(),
    "Cannot add field %s as an identifier field: not a primitive type field", name);
Map<Integer, Integer> newIdToParent = TypeUtil.indexParents(schema.asStruct());
validateIdentifierFieldParent(field.name(), field.fieldId(), newIdToParent, schema);
```
I would prefer a while loop (rather than a recursive method) for the parent field validation logic, because it is clearer:

```java
private static int validateIdentifierField(String name, Schema schema) {
  Types.NestedField field = schema.findField(name);
  Preconditions.checkArgument(field != null,
      "Cannot add field %s as an identifier field, not found in current schema or added columns", name);
  Preconditions.checkArgument(field.type().isPrimitiveType(),
      "Cannot add field %s as an identifier field: not a primitive type field", name);

  // Check whether the nested column is in a chain of struct fields.
  Map<Integer, Integer> newIdToParent = TypeUtil.indexParents(schema.asStruct());
  Integer parentId = newIdToParent.get(field.fieldId());
  while (parentId != null) {
    Types.NestedField parent = schema.findField(parentId);
    ValidationException.check(parent.type().isStructType(),
        "Cannot add field %s as an identifier field: must not be nested in %s", name, parent);
    parentId = newIdToParent.get(parent.fieldId());
  }

  return field.fieldId();
}
```
The current patch looks good to me overall; I just left a few minor comments.

@openinx thanks, fixed based on your comments.

@rdblue, do you have any other concerns? If not, I will plan to get this merged. Thanks.
```java
 * @param names names of the columns to set as identifier fields
 * @return this for method chaining
 */
UpdateSchema setIdentifierFields(Set<String> names);
```
Why require this to be a `Set` instead of accepting any `Collection` or `Iterable`? Would a user never need to pass a `List`?
`Set` was used to explicitly tell users that identifier fields need to be unique, instead of giving people the illusion that the update operation can still succeed with repeated values.

Technically, we can use `List`, `Iterable`, or `Collection`. If we would like to support a broader use case for the API, I can document the behavior in the Javadoc.
I would prefer `Collection`. It's fine to collapse that into a set internally, since a set of columns and a multi-set of columns have the same identity behavior. `Collection` is easier for the caller to use.
```java
    ))
    .apply();

AssertHelpers.assertThrows("add a nested field in struct of a map should fail",
```
Is this different from the `points.element.x` case?
This is trying to test struct -> list -> struct, instead of just list -> struct to trigger the while loop.
```java
    .setIdentifierFields(Sets.newHashSet("locations"))
    .apply());

AssertHelpers.assertThrows("add a nested field in map should fail",
```
Can you add a test for trying to reference a field in a map value struct?
added
@rdblue updated based on your comments.
LGTM overall. I just want to bring the below comment (not really related to the focus of this PR itself) that I had originally here to people's attention, and make sure we are all aligned, since this PR does change the behavior of what we preserve in metadata fields. If we do go with the current approach, I'll raise a separate PR to mention the usage of the schema ID and the caveat in the spec. Also cc @pvary, since the Hive module uses the schema as well.

Thanks @yyanyy for the notification. I checked the Hive-related codepaths and I do not see any issue with the new ID that is not covered by the tests.

Overall this looks correct to me.
Continuation of #2354

@yyanyy @rdblue @openinx @aokolnychyi

Spec with row identifier:

New Schema toString:

Update row identifier rules:

- `UpdateSchema.addRowIdentifier(columnName)`
- `UpdateSchema.deleteRowIdentifier(columnName)`