
Core: Validate conflicting delete files in RowDelta and OverwriteFiles #3069

Closed

Conversation

aokolnychyi
Contributor

This PR adds validation for concurrently added delete files in RowDelta and OverwriteFiles. Previously, if we had a conflicting row delta operation, we would ignore it and corrupt the table.
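
A rough sketch of how a writer could opt into these checks (validateNoConflictingDeleteFiles is the method added in this PR; the other calls are existing RowDelta APIs, and the files, filter, and snapshot id are placeholders):

// sketch only, not the exact Spark call site
RowDelta rowDelta = table.newRowDelta()
    .addRows(newDataFile)                                   // rewritten rows
    .addDeletes(newDeleteFile)                              // position deletes for replaced rows
    .validateFromSnapshot(baseSnapshotId)                   // validate changes since the read snapshot
    .validateNoConflictingAppends(conflictDetectionFilter)  // existing check for concurrent appends
    .validateNoConflictingDeleteFiles(conflictDetectionFilter); // new check for concurrent delete files
rowDelta.commit();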

@@ -64,7 +64,7 @@
   private static final Set<String> VALIDATE_DATA_FILES_EXIST_SKIP_DELETE_OPERATIONS =
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE);
   // delete files can be added in "overwrite" or "delete" operations
-  private static final Set<String> VALIDATE_REPLACED_DATA_FILES_OPERATIONS =
+  private static final Set<String> VALIDATE_ADDED_DELETE_FILES_OPERATIONS =
Contributor Author

I renamed it to match VALIDATE_ADDED_FILES_OPERATIONS.

@aokolnychyi aokolnychyi force-pushed the validate-new-delete-files branch from 81212ce to d34be6f on September 3, 2021 18:50

@rdblue rdblue added this to the Java 0.12.1 Release milestone Sep 3, 2021
@@ -122,27 +122,30 @@
*
* @param conflictDetectionFilter an expression on rows in the table
* @return this for method chaining
* @deprecated this will be removed in 0.14.0;
* use {@link #validateNoConflictingOperations(Expression)} instead
Contributor

Do we also need to update ReplacePartitions?

Contributor Author

Let me check during the weekend

Contributor Author

@aokolnychyi aokolnychyi Sep 6, 2021

I am not sure we would do anything differently in ReplacePartitions if anyone commits delete files concurrently. Did you have any particular thoughts in mind, @rdblue?

Contributor

@szehon-ho is adding support for detecting conflicts in ReplacePartitions. The original behavior was basically snapshot isolation like Hive. But you could argue that it would be valuable to support serializable by ensuring the replaced partitions haven't changed since some starting snapshot. This is the PR: #2925

Contributor Author

I forgot about that PR. I'll take a look on Monday.

    if (conflictDetectionFilter != null) {
      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
    }

    if (conflictDetectionFilter != null && validateNoConflictingDeleteFiles) {
      validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
Contributor Author

@aokolnychyi aokolnychyi Sep 6, 2021

I won't be able to use validateNoConflictingAppends here as RowDelta works with paths only.
What about using DataFile in validateDataFilesExist, @rdblue?
Any particular reasons for using paths only?

Contributor Author

I don't think we will be able to use DataFile for RowDelta that easily, as we are using a regular scan and the delete writers only give us referenced data file locations. So I'd say we probably still need the new method here.
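
For reference, the locations come from the delete writers (a sketch; assumes the DeleteWriteResult API, and positionDeleteWriter is a placeholder):

DeleteWriteResult result = positionDeleteWriter.result();
// only the paths of the data files the position deletes reference, not DataFile objects
CharSequenceSet referencedDataFiles = result.referencedDataFiles();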

Contributor

If I understand correctly, the motivation for updating RowDelta is the case where we have two concurrent delta commits? So an UPDATE and a MERGE at the same time might both rewrite a row, which could cause a duplicate:

INSERT INTO t VALUES (1, 'a'), (2, 'b'), (3, 'c');

-- running these concurrently causes a problem
UPDATE t SET data = 'x' WHERE id = 1;
UPDATE t SET data = 'y' WHERE id = 1;

If I ran the updates concurrently, both would delete id=1 and each would add a new file, one with (1, 'x') and one with (1, 'y'), right?

The validation here is that the file created by the initial insert doesn't have any new delete files written against it. It seems like we want to just call validateNoNewDeletesForDataFiles and pass referencedFiles in, right? Maybe I'm missing something?

We might want to make this a separate issue to keep changes smaller and reviews easier.

Contributor Author

We only know the locations of the referenced files. We need actual DataFile objects to leverage the delete file index. My question is why we went with locations only in RowDelta and whether we should switch to DataFile objects instead.

Contributor Author

Well, I think we could approach it differently. Let me update and then we can discuss more.

Contributor

@rdblue rdblue Sep 12, 2021

The reason why we only have file locations in RowDelta is that we get the set of referenced files from writing position deletes, which are (location, position). We don't require carrying DataFile through delta operations.

We could probably wrap the location in a dummy DataFile, or add a way to query the DeleteFileIndex by location.

Contributor

In fact, if we create a DataFile, we can copy the column stats from the DeleteFile so that we can do stats overlap comparisons.
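
A rough sketch of that idea (spec, location, and metrics are placeholders; the builder calls are the existing DataFiles API):

// wrap a referenced location in a dummy DataFile so it can be checked against the delete file index
DataFile dummy = DataFiles.builder(spec)
    .withPath(referencedDataFileLocation)
    .withFormat(FileFormat.PARQUET)           // placeholder; the format is unknown from the location alone
    .withFileSizeInBytes(0)
    .withRecordCount(0)
    .withMetrics(metricsCopiedFromDeleteFile) // lower/upper bounds enable stats overlap comparisons
    .build();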

Contributor Author

I had a chance to think more about the required validation. While using DataFiles instead of locations would give us min/max filtering, we can probably do better than that if all delete files are position-based.

We can do something like this if we are working with position-based delete files:

  • Read all position deletes we are trying to commit and build a map from file location to a list of deleted positions.
  • Iterate through concurrently added delete files, applying min/max filtering (secondary indexes in the future), and find those that may conflict.
  • Read the files that potentially conflict and verify that the (file, pos) pairs don't overlap.

We may need to add some limit on the overall size of deletes we can scan, but we can figure that out.
Overall, this will allow us to resolve conflicts within the same partition (see the sketch below).
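
A rough sketch of those steps (readPositionDeletes and mayReferenceAny are hypothetical helpers for reading a position delete file and applying file_path min/max filtering; they are not existing APIs):

// 1. read the position deletes we are trying to commit and group deleted positions by data file location
Map<String, Set<Long>> pendingPositions = new HashMap<>();
for (DeleteFile pending : pendingDeleteFiles) {
  for (PositionDelete<?> delete : readPositionDeletes(pending)) {
    pendingPositions
        .computeIfAbsent(delete.path().toString(), path -> new HashSet<>())
        .add(delete.pos());
  }
}

for (DeleteFile concurrent : concurrentlyAddedDeleteFiles) {
  // 2. min/max filtering on file_path (secondary indexes in the future) to skip files that cannot conflict
  if (!mayReferenceAny(concurrent, pendingPositions.keySet())) {
    continue;
  }

  // 3. read the potentially conflicting files and verify that the (file, pos) pairs don't overlap
  for (PositionDelete<?> delete : readPositionDeletes(concurrent)) {
    Set<Long> ourPositions = pendingPositions.get(delete.path().toString());
    ValidationException.check(
        ourPositions == null || !ourPositions.contains(delete.pos()),
        "Found a conflicting delete for %s at position %s", delete.path(), delete.pos());
  }
}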

*/
@Deprecated
OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
OverwriteFiles validateNoConflictingOperations(Expression conflictDetectionFilter);
Member

@openinx openinx Sep 6, 2021

Looks like this method is designed to validate whether copy-on-write batch overwrite operations conflict with RowDelta operations? I will need some time to read the whole copy-on-write code path.

Member

Okay, any operation that adds data/delete files matching the files scanned by the copy-on-write operation should conflict.

@@ -129,6 +119,7 @@ protected void validate(TableMetadata base) {

    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
Member

Q: Should OverwriteFiles also conflict with the REPLACE operation when checking the added data files in validateAddedDataFiles? Take the following example:

  1. The table has 2 data files: FILE_A, FILE_B;
  2. Start a copy-on-write operation to delete all rows in FILE_A. Call it txn1; it has started but not committed;
  3. Someone starts a REPLACE txn to rewrite FILE_A + FILE_B into FILE_C. Call it txn2; started but not committed;
  4. txn2 commits;
  5. txn1 commits.

Finally, we will see all the rows from FILE_A again because the REPLACE operation has added them to the table again.

I raise this question because I see that validateAddedDataFiles only checks the APPEND and OVERWRITE operations (it's VALIDATE_ADDED_FILES_OPERATIONS). Should REPLACE operations also be considered?

Contributor Author

This use case is already covered, as we always guarantee that all files we overwrite in OverwriteFiles still exist. It is literally new records (added, not rewritten) that we are trying to catch here.

Contributor Author

In the example above, txn1 will fail as FILE_A is not there.
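
Roughly, the sequence from the example looks like this (a sketch; FILE_A, FILE_B, FILE_C, and scanSnapshotId are the placeholders from the scenario):

// txn2: the REPLACE (compaction) rewrites FILE_A + FILE_B into FILE_C and commits first
table.newRewrite()
    .rewriteFiles(ImmutableSet.of(FILE_A, FILE_B), ImmutableSet.of(FILE_C))
    .commit();

// txn1: the copy-on-write DELETE tries to remove FILE_A; its commit fails with a
// ValidationException because FILE_A no longer exists in the table
table.newOverwrite()
    .deleteFile(FILE_A)
    .validateFromSnapshot(scanSnapshotId)
    .commit();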

* @param conflictDetectionFilter an expression used to find new conflicting delete files
* @param caseSensitive whether expression evaluation should be case sensitive
*/
protected void validateAddedDeleteFiles(TableMetadata base, Long startingSnapshotId,
Contributor

Could you update this to validateNoAddedDeleteFiles or something similar? It isn't clear from the method name that this is checking that there aren't any. We may want to rename validateAddedDataFiles as well.

Contributor Author

I went for validateNoNewDeletes to match validateNoNewDeletesForDataFiles.

@aokolnychyi aokolnychyi force-pushed the validate-new-delete-files branch from d34be6f to fca719f on September 13, 2021 18:50
    }

    if (deleteConflictDetectionFilter != null) {
      validateNoNewDeletes(base, startingSnapshotId, deleteConflictDetectionFilter, caseSensitive);
Contributor Author

This check is quite trivial. For example, we won't be able to resolve conflicts within the same partition. I outlined a way to optimize it in an earlier comment.

Collaborator

Do you mean it cannot resolve within the same data file (I thought we are passing a data filter)? Or within the same partition?

And also, for my learning: you mean it will be over-aggressive and report false positives even if rows do not actually conflict, until we make the optimization?

Contributor Author

you mean it will be over-aggressive and report false positives even if rows do not actually conflict, until we make the optimization?

Yeah, it may report false positives. The data filter is helpful, but I think it won't help much within the same partition. Position deletes are scoped to a partition, so the data filter should help us when there is a concurrent delete in another partition. Within the partition, though, most position deletes will match that row filter since we don't persist the deleted row (by default).

Contributor

@jackye1995 jackye1995 Sep 24, 2021

A bit late to the whole discussion. Regarding the check, I read the outlined way to optimize it and just want to share some thoughts based on what I am doing for position deletes in my internal distribution today.

In my system, each position delete file contains exactly one file_path value, which avoids the spec requirement to sort by file path and also greatly simplifies validation during concurrent commits: each check can easily find all position deletes for each data file and compare just the position min/max to see if there is any potential overlap of the position ranges. Of course, this cannot be applied to the general case; it was implemented just to see what can be achieved with a closed system where all delete writers only write that specific type of position delete file.

When I started to compact position delete files to contain multiple file_path values, it became very easy to get false positives, especially in the object storage mode where the file_path min and max do not really mean anything anymore. So, at least for the object storage use case, a secondary index with much better file-skipping ability is a must-have to make the described strategy work efficiently.

@aokolnychyi aokolnychyi force-pushed the validate-new-delete-files branch from fca719f to c1d0229 on September 13, 2021 19:56
Collaborator

@szehon-ho szehon-ho left a comment

Looks mostly great, just a few questions

* @param conflictDetectionFilter an expression on rows in the table
* @return this for method chaining
*/
RowDelta validateNoConflictingDeleteFiles(Expression conflictDetectionFilter);
Collaborator

Nit: should we make it a bit more consistent with the above (i.e., omit 'files' from the name)?

Contributor Author

I did that in the first place but then started to worry it may be confusing. For example, here we refer to concurrently added delete files vs. concurrent delete operations that removed data files.

I do prefer consistency too, but I am not sure whether the shorter name is confusing. What do you think, @szehon-ho?

Collaborator

Yes, fine with me then, thanks for clarifying.

    long startingSequenceNumber = startingSequenceNumber(base, startingSnapshotId);
    DeleteFileIndex deletes = buildDeleteFileIndex(deleteManifests, startingSequenceNumber, dataFilter, caseSensitive);

    ValidationException.check(deletes.isEmpty(),
Collaborator

Thanks for adding this!


@@ -86,6 +86,20 @@ public boolean isEmpty() {
     return (globalDeletes == null || globalDeletes.length == 0) && sortedDeletesByPartition.isEmpty();
   }

+  public List<DeleteFile> referencedDeleteFiles() {
+    List<DeleteFile> deleteFiles = Lists.newArrayList();
Collaborator

Optional comment: a small optimization can be done by knowing the initial length and checking isEmpty:

if (isEmpty()) {
  return Collections.emptyList();
} else {
  List<DeleteFile> deleteFiles = Lists.newArrayListWithCapacity(globalDeletes.length + sortedDeletesByPartition.size());
  ...
}

Contributor Author

Will do!

Contributor Author

I was about to implement this, but then I realized that sortedDeletesByPartition.size() is not an accurate estimate, as each entry contains an array of delete files. To compute a correct estimate, I'd need to iterate through the map.
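
Something like this would be needed (a sketch; deleteFilesByPartition stands for a hypothetical view of the map values as DeleteFile arrays):

int estimatedSize = globalDeletes == null ? 0 : globalDeletes.length;
for (DeleteFile[] partitionDeletes : deleteFilesByPartition) {
  estimatedSize += partitionDeletes.length;
}
List<DeleteFile> deleteFiles = Lists.newArrayListWithCapacity(estimatedSize);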

      // we must fail this operation as it would undelete rows that were removed concurrently
      if (deletedDataFiles.size() > 0) {
        validateNoNewDeletesForDataFiles(
            base, startingSnapshotId, conflictDetectionFilter, deletedDataFiles, caseSensitive);
Contributor

I'm not sure that I agree that we only need to check the files that were directly deleted.

For example, you can use Spark to overwrite rows matching an expression using DataFrameWriterV2:

df.writeTo(t).overwrite($"date" === "2021-10-01")

Whether the deleted files should be checked for row-level deletes depends on whether the written df is the result of reading the table. If you're performing your own merge, then it should be. I realize that there's currently no way to enable the validation, but we could support write properties for this eventually:

df.writeTo(t)
    .option("validate-snapshot-id", 1234L) // use serializable isolation
    .overwrite($"date" === "2021-10-01")

Is this something that we don't need to support right now because no one is calling it? I think we should still consider adding the validation for later.

Contributor

By the way, this is a way to add serializable isolation for replacing partitions as well.

@szehon-ho ^^

Contributor Author

Okay, if it is something we want to support, it means there are 3 cases:

  • Case 1: copy-on-write MERGE -> we must validate deletes.
  • Case 2: overwrite by filter with serializable isolation -> we must validate deletes.
  • Case 3: overwrite by filter with snapshot isolation -> we must NOT validate deletes.

Since delete validation becomes optional, it probably means we need a separate method to trigger it.
Thoughts, @rdblue @szehon-ho?

Contributor Author

Basically, OverwriteFiles will now match RowDelta, which has a separate method to trigger the validation.

Contributor Author

I have updated the PR to match what I described above.

Contributor

@rdblue rdblue left a comment

I think this is about ready, but I'd like to see an additional validation in OverwriteFiles that can run the new validateNoNewDeletes that is used in RowDelta. I gave an example case and it would be good to hear whether you agree it's valid.

@aokolnychyi aokolnychyi force-pushed the validate-new-delete-files branch from 57d99ba to 078aa70 on September 21, 2021 01:38
@aokolnychyi
Contributor Author

Ready for another round.

@@ -365,6 +365,7 @@ private void commitWithSerializableIsolation(OverwriteFiles overwriteFiles,

     Expression conflictDetectionFilter = conflictDetectionFilter();
     overwriteFiles.validateNoConflictingAppends(conflictDetectionFilter);
+    overwriteFiles.validateNoConflictingDeleteFiles(conflictDetectionFilter);
Contributor Author

@rdblue, could you double check this place?

Contributor

The check looks good to me.

    }

    Expression conflictDetectionFilter = conflictDetectionFilter();
    overwriteFiles.validateNoConflictingDeleteFiles(conflictDetectionFilter);
Collaborator

@szehon-ho szehon-ho Sep 21, 2021

Maybe I'm mistaken, but why are we checking for conflicting delete files in snapshot isolation? (I thought we must not check.)

Contributor Author

Consider a data file A and a copy-on-write operation that overwrites it with a data file B. If a concurrent operation adds a delete file that references records from data file A, committing the original copy-on-write operation (i.e. the overwrite) would undelete the rows that were deleted concurrently.

It seems we always have to validate the delete files whenever we overwrite specific files during DELETE/MERGE.
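
A sketch of that race using the API from this PR (FILE_A, FILE_B, POS_DELETE_A, and the filter/snapshot values are placeholders):

// a concurrent engine commits position deletes that reference data file A
table.newRowDelta()
    .addDeletes(POS_DELETE_A)
    .commit();

// the copy-on-write overwrite of A -> B must now fail; otherwise the rows removed by
// POS_DELETE_A would come back through FILE_B
table.newOverwrite()
    .deleteFile(FILE_A)
    .addFile(FILE_B)
    .validateFromSnapshot(scanSnapshotId)
    .validateNoConflictingDeleteFiles(conflictDetectionFilter) // throws ValidationException
    .commit();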

Collaborator

Got it, yeah. I double-checked the Postgres docs for REPEATABLE READ (snapshot) and it makes sense: deletes and updates to the same rows must not be overridden in this mode.

Contributor

I agree. Deletes should always be validated.

@@ -95,11 +101,18 @@ public OverwriteFiles caseSensitive(boolean isCaseSensitive) {
   @Override
   public OverwriteFiles validateNoConflictingAppends(Expression newConflictDetectionFilter) {
     Preconditions.checkArgument(newConflictDetectionFilter != null, "Conflict detection filter cannot be null");
Contributor

Nit: Should this be "Append conflict detection filter cannot be null" now that we have both appendConflictDetectionFilter and deleteConflictDetectionFilter?

Contributor Author

I'll update if it fits on the same line.

    failMissingDeletePaths();
    return this;
  }

  @Override
  public OverwriteFiles validateNoConflictingDeleteFiles(Expression newConflictDetectionFilter) {
    Preconditions.checkArgument(newConflictDetectionFilter != null, "Conflict detection filter cannot be null");
Contributor

Nit: Same comment about saying "Delete conflict detection filter cannot be null" instead of leaving it unqualified and ambiguous.

@@ -95,11 +101,18 @@ public OverwriteFiles caseSensitive(boolean isCaseSensitive) {
   @Override
   public OverwriteFiles validateNoConflictingAppends(Expression newConflictDetectionFilter) {
     Preconditions.checkArgument(newConflictDetectionFilter != null, "Conflict detection filter cannot be null");
-    this.conflictDetectionFilter = newConflictDetectionFilter;
+    this.appendConflictDetectionFilter = newConflictDetectionFilter;
     failMissingDeletePaths();
Contributor

Question: Does this call to failMissingDeletePaths() still belong here or should it be moved to the validateNoConflictingDeleteFiles call?

Contributor Author

I'll double check.

      Snapshot startingSnapshot = metadata.snapshot(staringSnapshotId);
      return startingSnapshot.sequenceNumber();
    } else {
      return 0;
Contributor

nit: can use TableMetadata.INITIAL_SEQUENCE_NUMBER and remove the comment

      overwriteFiles.validateFromSnapshot(scanSnapshotId);
    }

    Expression conflictDetectionFilter = conflictDetectionFilter();
Contributor

nit: can combine L384 and L385, conflictDetectionFilter only used once


* <p>
* This method must be called when the table is queried to produce a row delta for UPDATE and
* MERGE operations independently of the isolation level. Calling this method isn't required
* for DELETE operations as it is OK when a particular record we are trying to delete
Contributor

Nit: use of "we" in javadoc is unnecessary. It is simpler to say "it is OK to delete a record that is also deleted concurrently".

      validateAddedDataFiles(base, startingSnapshotId, appendConflictDetectionFilter, caseSensitive);
    }

    boolean validateNewDeletes = deleteConflictDetectionFilter != null && base.currentSnapshot() != null;
Contributor

I think that the behavior here should be slightly different. There are two concerns: 1) whether to check delete files for snapshot isolation and 2) what conflict detection filter to use. Basing validateNewDeletes on whether the conflict detection filter was set doesn't seem correct to me.

I don't think there is a case where we don't want to validate delete files if we have called validateFromSnapshot to set the base snapshot. I think that we should add this as a boolean field that is set when validateFromSnapshot is called.

Then, if we are validating delete files, we should have two separate checks. First, if there are any files in deletedDataFiles, then we perform the validation below. If the conflict detection filter wasn't set, then we should use Expressions.alwaysTrue to find candidate delete files. Second, if an overwrite filter was set, then we should run validateNoNewDeletes with either the delete filter or the delete conflict detection filter. The conflict detection filter should be an optimization, not a way to turn off delete validations.

I think that makes the API more understandable and consistent.

Contributor

Here's what I changed this to locally while thinking through it:

    // validateDeletes is set to true in validateFromSnapshot. Maybe we should default it if that method isn't called?
    if (validateDeletes) {
      if (deletedDataFiles.size() > 0) {
        validateNoNewDeletesForDataFiles(
            base, startingSnapshotId, deleteConflictDetectionFilter,
            deletedDataFiles, caseSensitive);
      }

      if (rowFilter() != Expressions.alwaysFalse()) {
        if (deleteConflictDetectionFilter != null) {
          validateNoNewDeletes(base, startingSnapshotId, deleteConflictDetectionFilter, caseSensitive);
        } else {
          validateNoNewDeletes(base, startingSnapshotId, rowFilter(), caseSensitive);
        }
      }
    }

Contributor Author

Basing validateNewDeletes on whether the conflict detection filter was set doesn't seem correct to me.

If I got you correctly, you are proposing that validateFromSnapshot will now indicate whether we should validate delete files. I think that is different compared to how RowDelta and OverwriteFiles work right now. I'd actually say calling validateFromSnapshot is an optimization that tells us from which snapshot to start looking. We never validate new appends if the append conflict detection filter is null. Moreover, it is not always possible to set the starting snapshot. If we start on an empty table, we must validate all snapshots. Here is our copy-on-write commit logic.

Long scanSnapshotId = scan.snapshotId();
if (scanSnapshotId != null) {
  overwriteFiles.validateFromSnapshot(scanSnapshotId);
}

Expression conflictDetectionFilter = conflictDetectionFilter();
overwriteFiles.validateNoConflictingAppends(conflictDetectionFilter);

Also, in your snippet, why call validateNoNewDeletesForDataFiles if we already know the overwrite filter is set? I think validateNoNewDeletesForDataFiles is simply a more efficient version of validateNoNewDeletes that can open delete files that match the filter to check their content for conflicts. The problem is that we can use validateNoNewDeletesForDataFiles only if we overwrite specific files.

* @param dataFilter an expression used to find new conflicting delete files
* @param caseSensitive whether expression evaluation should be case-sensitive
*/
protected void validateNoNewDeletes(TableMetadata base, Long startingSnapshotId,
Contributor

I think a slightly more accurate name would be validateNoNewDeleteFiles since this checks that there aren't any new delete files, but data files could have been deleted.

@aokolnychyi
Contributor Author

Closing this as it was split into multiple PRs and merged.

@aokolnychyi aokolnychyi closed this Oct 1, 2021
@rdblue rdblue removed this from the Java 0.12.1 Release milestone Oct 26, 2021