You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[Spark] Fix O(n^2) issue in find last complete checkpoint before (#3060)
<!--
Thanks for sending a pull request! Here are some tips for you:
1. If this is your first time, please read our contributor guidelines:
https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md
2. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]
Your PR title ...'.
3. Be sure to keep the PR description updated to reflect all changes.
4. Please write your PR title to summarize what this PR proposes.
5. If possible, provide a concise example to reproduce the issue for a
faster review.
6. If applicable, include the corresponding issue number in the PR title
and link it in the body.
-->
#### Which Delta project/connector is this regarding?
- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)
## Description
This PR fixes an O(n^2) issue in `find last complete checkpoint before`
method.
Today the `findLastCompleteCheckpointBefore` tries to find the last
checkpoint before a given version. In order to do this, it do following:
findLastCompleteCheckpointBefore(10000):
1. List from 9000
2. List from 8000
3. List from 7000
...
Each of these listing today lists to the end as they completely ignore
delta files and try to list with takeWhile with version clause:
```
listFrom(..)
.filter { file => isCheckpointFile(file) && file.getLen != 0 }
.map{ file => CheckpointInstance(file.getPath) }
.takeWhile(tv => (cur == 0 || tv.version <= cur) && tv < upperBoundCv)
```
This PR tries to fix this issue by terminating each listing early by
checking if we have crossed a deltaFile for untilVersion.
In addition to this, we also optimize how much to list in each
iteration.
E.g. After this PR, findLastCompleteCheckpointBefore(10000) will need:
1. Iteration-1 lists from 9000 to 10000.
2. Iteration-2 lists from 8000 to 9000.
3. Iteration-3 lists from 7000 to 8000.
4. and so on...
## How was this patch tested?
UT
0 commit comments