Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: Add a scan for changes #4870
API: Add a scan for changes #4870
Changes from 1 commit
e8b1100
7b6e580
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we restrict this scan task for one snapshot? I feel like itself and its subclass can be used for a scan across multiple snapshots, which is useful, for example, to scan position deletes across snapshots. In that case, commitSnapshotId is going to be optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we scan for changes across snapshots, I think we should output all changes per snapshot by default. If I understand you correctly, you want to support use cases when the user is only interested in net changes across snapshots. I think that's possible too but the records produced by a task would still have some snapshot ID when they were committed.
For instance, we add data file A in S1. Then we add a delete file D in S2 that matches data file A. In this case, we would output
AddedDataFileScanTask
with commit snapshot S1 but it will have a delete file D associated with it so that we will only output records that remain there in S2.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will
commitSnapshotId()
return in that case? s1 or s2, or a list of snapshot ids? I'd think the snapshot ids may not be useful if we want to calculate the net changes across snapshots. For example,IncrementalAppendScan
could be considered as one of cases that outputs scan tasks across snapshots, which doesn't need the snapshot id.One solution is to add another layer of interface to support the net changes across snapshots. For example, we can use
ChangelogScanTask
as a parent for both cases of one snapshot and multiple snapshots, and then create a new interfaceSnapshotChangelogScanTask
to extend it for use case of a single snapshot.I think the current solution is good enough for CDC implementation. But, since it is an interface change, we may want to make sure it is extendable for the future use cases. If it is something we will do to support net changes across snapshots, we may choose the right the right hierarchy and names now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking more about this, I agree that commit snapshot ID doesn't make sense for net changes across multiple snapshots in some cases. For instance, when we have 2 snapshots that add delete files for the same data file. In this case, we will have a single task with one data file and two deletes added in multiple snapshots.
Having said that, I am still not sure having separate tasks is a good idea. Apart from more tasks to support in readers, what schema to produce? Our
changelog
metadata table should have the same schema independently whether we include only net changes or all changes per each snapshot.We can make the commit snapshot ID null when only net changes are produced, skip the commit snapshot ID in both cases or offer some sort of a map of metadata where the snapshot ID may or may not be set.
Let me think more about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's OK to leave
commitSnapshotId
andcommitOrder
null in case of net changes. I assume the intent to have methodcommitSnapshotId
andcommitOrder
is to enable reader project them.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@flyrain @rdblue, I agree with you but what about
DeletedRowsScanTask
?Suppose I have a data file
file_A
:And then I have two delete snapshots S1 and S2. S1 deletes IDs 1, 2 and S2 deletes IDs 3, 4. If we are to assign snapshot IDs to tasks, we will have to produce two
DeletedRowsScanTask
s:Even if we combine these two tasks in one group, it will be suboptimal as we will scan through the same data file twice. It would be more efficient to combine these deletes against the same data file and do one pass to determine deleted records.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aokolnychyi, I'd say that's a special form of combining. We could introduce a
CombinedDeletedRowsScanTask
that takes care of the snapshot handling.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We won't be able to extend
ChangelogScanTask
inCombinedDeletedRowsScanTask
, though? Unless we remove the commit snapshot ID fromChangelogScanTask
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point, but I'm thinking that we would just throw an exception if it were called. I'd also be up for returning null and passing each sub-task's ID and ordinal separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An exception sounds reasonable. Using nulls in a public API is also not the best idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is about setting the "end" snapshot ID correct? We should probably note that here since that vocab is used in the previous snapshots.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I aligned methods above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want
changelog
or simplychange
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stevenzwu, how do you feel about the names? You know more about CDC use cases. I think I'd slightly prefer shorter names like
ChangeScan
andChangeScanTask
but ifchangelog
is really widely used, I have no problem using it as well.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems
changelog
is pretty popular. Let's keep it as-is.