API: Refactor FileScanTask #5077

aokolnychyi · 2022-06-17T18:00:08Z

No description provided.

api/src/main/java/org/apache/iceberg/InputSplit.java

aokolnychyi · 2022-06-17T20:04:19Z

api/src/main/java/org/apache/iceberg/Scan.java

@@ -29,7 +29,7 @@
 * Scan objects are immutable and can be shared between threads. Refinement methods, like
 * {@link #select(Collection)} and {@link #filter(Expression)}, create new TableScan instances.
 */
-public interface Scan<T extends Scan<T>> {
+public interface Scan<ThisT, T extends ScanTask, S extends InputSplit<T>> {


I am debating whether we need to attach a bound to ThisT. It wouldn't hurt, but we don't do that anywhere else, and it would make the definition more cumbersome now that Scan has 3 type parameters.

This is still open.
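For illustration, a minimal sketch of the trade-off, assuming simplified stand-ins: `Task`, `FileTask`, `FileScan` and the `option` method are hypothetical and are not the interfaces in this PR. It leaves `ThisT` unbounded and refinement methods still chain, because the implementing class binds `ThisT` to itself:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ScanParamsSketch {
  // Hypothetical task marker, standing in for a scan task type.
  interface Task {}

  // ThisT is deliberately left unbounded; only the task type carries a bound.
  interface Scan<ThisT, T extends Task> {
    ThisT option(String key, String value);

    List<T> planTasks();
  }

  static final class FileTask implements Task {}

  // The implementation binds ThisT to itself, so refinements return FileScan.
  static final class FileScan implements Scan<FileScan, FileTask> {
    final Map<String, String> options;

    FileScan(Map<String, String> options) {
      this.options = Collections.unmodifiableMap(new HashMap<>(options));
    }

    @Override
    public FileScan option(String key, String value) {
      // Scans are immutable: each refinement returns a new instance.
      Map<String, String> copy = new HashMap<>(options);
      copy.put(key, value);
      return new FileScan(copy);
    }

    @Override
    public List<FileTask> planTasks() {
      return Collections.singletonList(new FileTask());
    }
  }

  public static void main(String[] args) {
    FileScan base = new FileScan(Collections.emptyMap());
    FileScan refined = base.option("a", "1").option("b", "2");
    System.out.println(base.options.size() + " " + refined.options.size());
  }
}
```

A bound such as `ThisT extends Scan<ThisT, T>` would additionally reject an implementation that binds `ThisT` to an unrelated type, at the cost of a longer declaration.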

api/src/main/java/org/apache/iceberg/Scan.java

core/src/main/java/org/apache/iceberg/util/TableScanUtil.java

core/src/main/java/org/apache/iceberg/BaseIncrementalAppendScan.java

api/src/main/java/org/apache/iceberg/BaseInputSplit.java

aokolnychyi · 2022-06-17T22:29:19Z

core/src/main/java/org/apache/iceberg/BaseFileScanTask.java

@@ -217,6 +218,18 @@ public Iterable<FileScanTask> split(long splitSize) {
      throw new UnsupportedOperationException("Cannot split a task which is already split");
    }

+    @Override
+    public boolean isAdjacent(FileScanTask other) {
+      return file().equals(other.file()) && offset + len == other.start();


I copied this logic from an existing place but we have to remember that files will be equal only if references are equal, which I think is true here. Our data and delete file implementations don't override equals.

I think this is correct.

In the other code we assume there is an ordering of the files.

1 is adjacent to 2
2 is not adjacent to 1

We may want to update the API definition to note that adjacency is not commutative.

What about updating the code to cover both in the future?
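To make the ordering point concrete, here is a small sketch, assuming a hypothetical `Range` class in place of a real split task. It uses the same check as the diff above (same file, and the other range starts exactly where this one ends), with a plain `Object` as the file so that `equals` falls back to reference equality:

```java
public class AdjacencySketch {
  // Hypothetical stand-in for a split of a file: the byte range [start, start + length).
  static final class Range {
    final Object file;
    final long start;
    final long length;

    Range(Object file, long start, long length) {
      this.file = file;
      this.start = start;
      this.length = length;
    }

    // True only when `other` belongs to the same file and begins where this range ends.
    boolean isAdjacent(Range other) {
      return file.equals(other.file) && start + length == other.start;
    }
  }

  public static void main(String[] args) {
    Object file = new Object(); // Object.equals compares references
    Range first = new Range(file, 0, 100);
    Range second = new Range(file, 100, 50);

    System.out.println(first.isAdjacent(second));  // first is followed by second
    System.out.println(second.isAdjacent(first));  // the reverse does not hold
  }
}
```

Covering both directions would mean also checking `other.start + other.length == start`, which is not what the current check does.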

rdblue · 2022-06-17T22:57:45Z

api/src/main/java/org/apache/iceberg/SplittableScanTask.java

+
+package org.apache.iceberg;
+
+public interface SplittableScanTask<ThisT> extends ScanTask {


I like the addition of SplittableScanTask. This looks pretty good.

rdblue · 2022-06-17T23:01:07Z

Overall I think this looks good. Are you happy with the direction this is going, @aokolnychyi?

aokolnychyi · 2022-06-17T23:14:47Z

@rdblue, I spent quite some time looking at it so I am no longer sure :) We need to sort out some details but it does unlock having tasks that are not FileScanTask and also parametrizes our scans. It is probably worth it.

api/src/main/java/org/apache/iceberg/ContentScanTask.java

api/src/main/java/org/apache/iceberg/FileScanTask.java

api/src/main/java/org/apache/iceberg/SplittableScanTask.java

RussellSpitzer · 2022-06-21T15:42:54Z

api/src/main/java/org/apache/iceberg/SplittableScanTask.java

+
+package org.apache.iceberg;
+
+public interface SplittableScanTask<ThisT> extends ScanTask {


Do we need this to be SplittableScanTask, given that I think our only usage is in combining? I do like the idea here, though. I just wonder whether we will actually need to re-split, or whether we should always produce minimally sized splits (one row group) and then combine them.

Avro is still an issue I guess ...

We actually leverage split before combining in TableScanUtil.

Avro is still splittable, right? I am not sure whether we persist split offsets in the metadata so planning may not be as optimal as for Parquet.

Our Avro code, I believe, just splits wherever the offsets fall. I just meant the difference between row-group splittable and generically splittable.

RussellSpitzer

Left some comments. My biggest questions are about some of our API contracts. I just want to make sure we are clear on what we expect as args and what we expect to return.

szehon-ho

Generally looks good, added a few comments.

Will there be quite a bit of code to change to adopt these new ways to split/combine tasks?

api/src/main/java/org/apache/iceberg/Scan.java

api/src/main/java/org/apache/iceberg/BaseScanTaskGroup.java

szehon-ho · 2022-06-21T22:27:44Z

api/src/main/java/org/apache/iceberg/BaseScanTaskGroup.java

+
+  public BaseScanTaskGroup(List<T> tasks) {
+    Preconditions.checkNotNull(tasks, "tasks cannot be null");
+    this.tasks = Lists.newArrayList(tasks);


Could we just use ImmutableList.copyOf() and then return it directly from tasks()?

This is on purpose, to avoid Kryo serialization issues. We usually rely on arrays, but generics complicate things: I would need to pass around Class<T> to make arrays work. I added tests to make sure Kryo works with mutable lists. I think that's also true for Flink, but I can switch to arrays if mutable lists are a problem.

What about using this?

List<T> asList = Lists.newArrayList(tasks);
Preconditions.checkArgument(asList.size() > 0, "...");
this.taskArray = (T[]) Array.newInstance(asList.get(0).getClass(), asList.size());

@rdblue, I am not sure this will be safe, unfortunately. If we are to support lists with multiple task types, we can't assume the rest of the list has the same type as the first element. We may end up with an array store exception at runtime.

Suppose we have List<ParentTask> with two elements of type ChildTask1 and ChildTask2. If we create an array of type ChildTask1, we won't be able to store ChildTask2 in it (even if we cast the array to the parent interface). It will compile but fail at runtime with an ArrayStoreException.
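A self-contained sketch of that failure mode, using only the JDK. `ParentTask`, `ChildTask1`, `ChildTask2` and the helper method names are the hypothetical types from the comment above, not Iceberg classes:

```java
import java.lang.reflect.Array;
import java.util.Arrays;
import java.util.List;

public class ArrayStoreSketch {
  interface ParentTask {}

  static final class ChildTask1 implements ParentTask {}

  static final class ChildTask2 implements ParentTask {}

  // Builds the array from the runtime class of the first element, as proposed above.
  @SuppressWarnings("unchecked")
  static <T> T[] copyUsingFirstElementType(List<T> tasks) {
    T[] array = (T[]) Array.newInstance(tasks.get(0).getClass(), tasks.size());
    for (int i = 0; i < tasks.size(); i++) {
      // The JVM checks the array's runtime component type on every store.
      array[i] = tasks.get(i);
    }
    return array;
  }

  // Returns true if copying the list hits an ArrayStoreException.
  static boolean failsToCopy(List<ParentTask> tasks) {
    try {
      copyUsingFirstElementType(tasks);
      return false;
    } catch (ArrayStoreException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    List<ParentTask> sameType = Arrays.asList(new ChildTask1(), new ChildTask1());
    List<ParentTask> mixed = Arrays.asList(new ChildTask1(), new ChildTask2());

    System.out.println(failsToCopy(sameType));
    System.out.println(failsToCopy(mixed));
  }
}
```

The array created for the mixed list has runtime type `ChildTask1[]`, so storing a `ChildTask2` is rejected even though the code compiles without errors.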

We could just use Object[] then?

Yeah, we can use an object array. I also added a transient list to avoid building a list on each call.

api/src/main/java/org/apache/iceberg/CombinedScanTask.java

api/src/main/java/org/apache/iceberg/Scan.java

api/src/main/java/org/apache/iceberg/ContentScanTask.java

flyrain · 2022-06-22T18:20:19Z

LGTM

core/src/main/java/org/apache/iceberg/util/TableScanUtil.java

api/src/main/java/org/apache/iceberg/CombinableScanTask.java

core/src/main/java/org/apache/iceberg/util/TableScanUtil.java

rdblue · 2022-06-24T17:45:04Z

core/src/main/java/org/apache/iceberg/util/TableScanUtil.java

+    CombinableScanTask<? extends T> lastCombinableTask = null;
+
+    for (T task : tasks) {
+      if (task instanceof CombinableScanTask<?>) {


I don't think this is correct. It doesn't matter if the next task is combinable. It only matters if the last task was. And we can't keep around the last combinable task because then we would possibly combine tasks out of order.

I think lastCombinableTask should be lastTask and this should check whether lastTask is combinable in order to try combining with the current task. The new task, if not combined, should always be set as the new lastTask.

You are right. The old logic assumed we can combine only tasks of the same type so this logic worked. If we decide to go this route and have CombinableScanTask, I'll adapt. I didn't want to change this logic completely before discussing CombinableScanTask.
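A rough sketch of the loop described above, assuming hypothetical `Task`, `MergeableTask`, `Range` and `Opaque` types rather than the final API. It tracks only `lastTask`, tries to merge the current task into it, and otherwise emits `lastTask` and moves on, so tasks are never combined across an intervening task:

```java
import java.util.ArrayList;
import java.util.List;

public class MergeSketch {
  interface Task {}

  // Hypothetical contract: a task that may absorb the task that directly follows it.
  interface MergeableTask extends Task {
    boolean canMerge(Task other);

    Task merge(Task other);
  }

  // A byte range that merges with a range starting exactly where it ends.
  static final class Range implements MergeableTask {
    final long start;
    final long length;

    Range(long start, long length) {
      this.start = start;
      this.length = length;
    }

    @Override
    public boolean canMerge(Task other) {
      return other instanceof Range && start + length == ((Range) other).start;
    }

    @Override
    public Task merge(Task other) {
      return new Range(start, length + ((Range) other).length);
    }
  }

  // A task that can be neither split nor merged.
  static final class Opaque implements Task {}

  static List<Task> mergeAdjacent(List<Task> tasks) {
    List<Task> merged = new ArrayList<>();
    Task lastTask = null;

    for (Task task : tasks) {
      if (lastTask instanceof MergeableTask && ((MergeableTask) lastTask).canMerge(task)) {
        // Only the previous task decides whether the current one can be absorbed.
        lastTask = ((MergeableTask) lastTask).merge(task);
      } else {
        if (lastTask != null) {
          merged.add(lastTask);
        }
        // A task that was not merged always becomes the new lastTask.
        lastTask = task;
      }
    }

    if (lastTask != null) {
      merged.add(lastTask);
    }
    return merged;
  }
}
```

With input ranges [0, 10) and [10, 20), then an `Opaque` task, then [20, 30), the first two ranges merge, but the last range stays separate because the `Opaque` task sits between them.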

core/src/main/java/org/apache/iceberg/util/TableScanUtil.java

aokolnychyi · 2022-06-26T21:13:12Z

api/src/main/java/org/apache/iceberg/MergeableScanTask.java

+ *
+ * @param <ThisT> the child Java API class
+ */
+public interface MergeableScanTask<ThisT> extends ScanTask {


I kept it separate from SplittableScanTask, as FileScanTask is splittable but SplitScanTask is mergeable.

aokolnychyi · 2022-06-26T21:14:32Z

core/src/test/java/org/apache/iceberg/util/TestTableScanUtil.java

@@ -88,4 +94,109 @@ public void testPlanTaskWithDeleteFiles() {
          expectedCombinedTasks.get(i).files(), combinedScanTasks.get(i).files());
    }
  }
+
+  @Test
+  public void testTaskGroupPlanning() {


These are tests where only some tasks are splittable and mergeable.

rdblue · 2022-06-28T03:02:34Z

api/src/main/java/org/apache/iceberg/BaseScanTaskGroup.java

+  @SuppressWarnings("unchecked")
+  public Collection<T> tasks() {
+    if (taskList == null) {
+      synchronized (this) {


I don't have a problem with this, but do you really expect this to be accessed from different threads after construction?

I did it just in case, given how hard it would be to debug such issues. Better safe than sorry :)
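For reference, a simplified sketch of the pattern under discussion, assuming a hypothetical `TaskGroupSketch` class built only on the JDK (no Guava or Iceberg types). Tasks are kept in an `Object[]`, which avoids needing `Class<T>`, and an immutable list view is built lazily with double-checked locking on a `transient volatile` field:

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

public class TaskGroupSketch<T> implements Serializable {
  // Serialized state: a plain Object[] can hold any mix of task types.
  private final Object[] tasks;

  // Derived state: rebuilt on demand, so it is skipped during serialization.
  private transient volatile List<T> taskList;

  public TaskGroupSketch(Collection<T> tasks) {
    if (tasks == null) {
      throw new NullPointerException("tasks cannot be null");
    }
    this.tasks = tasks.toArray(new Object[0]);
  }

  @SuppressWarnings("unchecked")
  public Collection<T> tasks() {
    if (taskList == null) {
      synchronized (this) {
        // Second check: another thread may have built the list while we waited.
        if (taskList == null) {
          List<T> list = new ArrayList<>(tasks.length);
          for (Object task : tasks) {
            list.add((T) task);
          }
          taskList = Collections.unmodifiableList(list);
        }
      }
    }
    return taskList;
  }

  public static void main(String[] args) {
    TaskGroupSketch<String> group = new TaskGroupSketch<>(Arrays.asList("a", "b"));
    System.out.println(group.tasks());
  }
}
```

The `volatile` modifier is what makes the unsynchronized first check safe; without it, another thread could observe a partially published list.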

rdblue · 2022-06-28T03:06:17Z

Looks good to me! Merge when you're ready.

Do we also want to deprecate CombinedScanTask so we can just use the group after 1.0?

aokolnychyi · 2022-06-28T03:51:34Z

I haven't made up my mind on deprecating CombinedScanTask because it is so widely used...
We will also have to break TableScan if we switch to ScanTaskGroup.

aokolnychyi · 2022-06-28T03:58:28Z

Thanks for reviewing, @RussellSpitzer @szehon-ho @flyrain @rdblue! I know it was a tricky one.

rdblue · 2022-06-28T23:50:52Z

> I haven't made up my mind on deprecating CombinedScanTask because it is so widely used...
> We will also have to break TableScan if we switch to ScanTaskGroup.

Let's not then. Not worth it.

github-actions bot added API core flink labels Jun 17, 2022

aokolnychyi mentioned this pull request Jun 17, 2022

API: Add a scan for changes #4870

Merged

aokolnychyi force-pushed the refactor-file-scan-task branch from 5f1c64d to 1e61473 Compare June 17, 2022 19:50

API: Refactor FileScanTask

79699fc

aokolnychyi force-pushed the refactor-file-scan-task branch from 1e61473 to 79699fc Compare June 17, 2022 19:53
