
API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors #8755

Merged (5 commits) on Jan 16, 2024

Conversation

@aokolnychyi (Contributor, author) commented Oct 9, 2023:

This PR has code to parallelize reading of deletes and enable caching them on executors.

I also have a follow-up change to assign tasks for one partition to the same executor, similar to KafkaRDD. There is no way to express task affinity so we can only rely on task locality. The solution in KafkaRDD is simple to implement but won't work well if dynamic allocation is enabled (so it should be hidden under a flag).

More thoughts in this doc.

new ConfigEntry<>(
"iceberg.worker.delete-num-threads",
"ICEBERG_WORKER_DELETE_NUM_THREADS",
4 * Runtime.getRuntime().availableProcessors(),
@aokolnychyi (author) commented Oct 9, 2023:

This value may sound ridiculous, but here is my thought process: there is one such thread pool per JVM, and each core in an executor can get a data task that may need to load one to many delete files. These tasks are I/O intensive. This value essentially means we can try to load 4 delete files concurrently per data task. The cache is also blocking to prevent reading the same files twice.

A reviewer commented:

There is really no good way to pick the default for this, since thread_pool_size = fn(cores, io_wait_time / compute_time), and your guess is as good as anyone else's whether 4 is a good number for the environment the code is going to run on. Assuming io_wait_time / compute_time == 5, a factor of 4 above would give you a utilization of 80%, which sounds pretty good.
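The sizing heuristic the reviewer alludes to can be sketched as follows. This is illustrative, not Iceberg code; the class and method names are made up, and the formula is the classic threads = cores * (1 + waitTime / computeTime) rule of thumb:

```java
// Minimal sketch of the pool-sizing heuristic (illustrative names, not Iceberg code).
public class PoolSizing {
    // waitOverCompute is the assumed io_wait_time / compute_time ratio.
    static int poolSize(int cores, double waitOverCompute) {
        return (int) Math.max(2, Math.round(cores * (1 + waitOverCompute)));
    }

    public static void main(String[] args) {
        // With 8 cores and I/O wait 5x compute, the heuristic suggests 48 threads;
        // a fixed 4x multiplier would allocate 32 threads on the same machine.
        System.out.println(poolSize(8, 5.0)); // prints 48
    }
}
```

The point of the heuristic is exactly the reviewer's: without knowing the wait/compute ratio of the target environment, any fixed multiplier is a guess.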

Another reviewer commented:

Is this end user configurable? If not then it probably needs to be.

@aokolnychyi (author):

Yeah, it is configurable, we just need to make sure the default value is reasonable.

@@ -236,6 +236,8 @@ private TableProperties() {}
public static final String DELETE_PLANNING_MODE = "read.delete-planning-mode";
public static final String PLANNING_MODE_DEFAULT = PlanningMode.AUTO.modeName();

public static final String SPARK_EXECUTOR_CACHE_ENABLED = "read.spark.executor-cache.enabled";
@aokolnychyi (author):

We will need to discuss how to enable/disable and configure this cache.

A reviewer commented:

Passed in as a hadoop conf property, or a catalog property since we want this to be end user configurable? So probably not as part of table properties?

@@ -125,6 +126,25 @@ public static StructLikeSet toEqualitySet(
}
}

public static <T extends StructLike> CharSequenceMap<PositionDeleteIndex> toPositionIndexes(
@aokolnychyi (author) commented Oct 9, 2023:

Unlike toPositionIndex below, this one does not filter the deletes for a particular data file. Instead, it builds an index for each referenced data file and returns a map. It is useful when the entire delete file can be cached.
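The shape of that operation can be sketched like this. This is a hedged approximation, not the actual Iceberg API: a sorted set of longs stands in for the bitmap-backed PositionDeleteIndex, and the input is simplified to (dataFilePath, position) pairs:

```java
// Illustrative sketch: group position-delete records into one index per referenced
// data file, with no per-file filtering up front (names are not the Iceberg API).
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class PositionIndexSketch {
    static Map<String, SortedSet<Long>> toPositionIndexes(List<Map.Entry<String, Long>> deletes) {
        Map<String, SortedSet<Long>> indexes = new HashMap<>();
        for (Map.Entry<String, Long> delete : deletes) {
            // one index per referenced data file path
            indexes.computeIfAbsent(delete.getKey(), k -> new TreeSet<>()).add(delete.getValue());
        }
        return indexes;
    }
}
```

Returning the whole map is what makes caching the entire delete file possible: a later task for a different data file can hit the same cached entry.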

A collaborator commented:

This seems useful as a javadoc

@aokolnychyi (author):

Added.

* @return an {@link ExecutorService} that uses the delete worker pool
* @see SystemConfigs#DELETE_WORKER_THREAD_POOL_SIZE
*/
public static ExecutorService getDeleteWorkerPool() {
@aokolnychyi (author):

We don't usually use get but we have getWorkerPool() above so I matched it for consistency.

A reviewer commented:

You might have to. And maybe will need to pass in the configured thread pool size as an argument.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class BaseDeleteLoader implements DeleteLoader {
@aokolnychyi (author):

I added this abstraction because it was too much to put into DeleteFilter. Take a look at the parent interface.

@aokolnychyi (author):

I questioned this a bit and do believe it is better to have this logic separately from DeleteFilter. That said, the exact API and whether it should be a top-level class are still open questions.

return createDeleteIterable(records, isDeleted);
}

return hasIsDeletedColumn
@aokolnychyi (author):

This is a separate discussion but I think we should drop streaming position deletes. We originally added them when we used a set, not bitmaps.

import org.apache.iceberg.deletes.PositionDeleteIndex;
import org.apache.iceberg.util.StructLikeSet;

public interface DeleteLoader {
@aokolnychyi (author) commented Oct 9, 2023:

See here.

@singhpk234 (Contributor) commented Oct 10, 2023:

Should we apply some intelligence to how we distribute the tasks so that we can get the most out of the executor cache? For example, we could prefer sending sets of data files that have a lot of overlapping delete files, or that belong to the same partition (for example, for position deletes), to the same executor.

@aokolnychyi (author) commented Oct 10, 2023:

> Should we apply some intelligence on how we are distributing the tasks so that we could utilize the max from the executor cache? For ex: lets say we could prefer sending those set of data files which have a lot of overlapping delete files or may be belong to some partition (for ex: position deletes)?

@singhpk234, I have a follow-up change to do that. Unfortunately, it is a bit controversial. There is no way to express task affinity in Spark, only locality. The best option for us is to implement what KafkaRDD does. The problem is that it only works well if dynamic allocation is disabled. Even without that, this feature should be useful. The change to assign tasks for the same partition to one executor is around 20 lines of code.
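The KafkaRDD-style approach mentioned here can be sketched in a few lines. This is illustrative only, not Spark or Iceberg code: the executor list is a stand-in for what Spark would provide, and the whole trick depends on that list being stable, which is why dynamic allocation breaks it:

```java
// Illustrative sketch: hash each partition key onto a stable, sorted executor list
// so every task for that partition reports the same preferred location.
import java.util.List;

public class PartitionLocality {
    static String preferredExecutor(String partitionKey, List<String> sortedExecutors) {
        // floorMod keeps the index non-negative for negative hash codes
        int index = Math.floorMod(partitionKey.hashCode(), sortedExecutors.size());
        return sortedExecutors.get(index);
    }
}
```

Because Spark only honors locality preferences best-effort, this expresses "tasks for the same partition should land together" without any hard affinity guarantee, matching the comment above.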

@singhpk234 (Contributor):

Thanks @aokolnychyi looking forward to it :) !

@aokolnychyi (author):

I tested this PR on a cluster a bit. It would be nice if someone could also play around with it in their environment.




public static final int DELETE_WORKER_THREAD_POOL_SIZE =
SystemConfigs.DELETE_WORKER_THREAD_POOL_SIZE.value();

private static final ExecutorService DELETE_WORKER_POOL =

A reviewer commented:

If the size of the thread pool is end user configurable this will not work. But you could initialize the thread pool lazily in getDeleteWorkerPool() and presumably there will be some way to read the end user configured value at that point.
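The lazy initialization the reviewer suggests could look roughly like this. A hedged sketch with illustrative names, not the actual Iceberg code: the point is only that pool creation is deferred to first use, when a user-configured size is available, via double-checked locking on a volatile field:

```java
// Illustrative sketch of a lazily initialized delete worker pool.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LazyDeleteWorkerPool {
    private static volatile ExecutorService pool;

    static ExecutorService getDeleteWorkerPool(int configuredSize) {
        if (pool == null) {
            synchronized (LazyDeleteWorkerPool.class) {
                if (pool == null) {
                    // configuredSize is only consulted on first use,
                    // so it can come from end-user configuration
                    pool = Executors.newFixedThreadPool(configuredSize);
                }
            }
        }
        return pool;
    }
}
```

One consequence worth noting: whichever caller wins the race fixes the size for the JVM's lifetime, so later callers passing a different size still get the original pool.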


@aokolnychyi aokolnychyi force-pushed the executor-delete-cache branch from 5ec9097 to 30802cc Compare November 19, 2023 20:46
@aokolnychyi aokolnychyi changed the title [WIP] API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors Nov 19, 2023
@@ -452,6 +454,59 @@ private static void checkSchemaCompatibility(
}
}

public static long defaultSize(Types.NestedField field) {
@aokolnychyi (author):

I think we have 2 options to limit the size of the cache:

  • Add ways to limit the size of equality and position delete files to be loaded.
  • Add ways to indicate the maximum cache size in bytes.

The first one is easy, but we need to account for the fact that delete files may store extra columns, and it will be hard to configure these options correctly because users have no idea how much the cached representation will actually occupy. That's why I went with the estimation approach and let users configure the max cache size in bytes.

A member commented:

Why do we need to cache the extra columns? We wouldn't be using them?

@aokolnychyi (author):

We don't cache them. Only equality columns. However, fileSizeInBytes includes the total size, not the projection.

A collaborator commented:

So, to understand: the user configures the size for equality deletes (but based on whole row sizes), and for position deletes based on 2 * record count? Would it be easier to consider having two configured cache sizes (one for position deletes, one for equality deletes)?

@aokolnychyi (author):

The idea is to ask the user for the total size of the cache in memory and estimate the actual size to honor those values. For equality deletes, we will rely on the number of records and types of equality columns.
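The estimation approach described above can be sketched as follows. All numbers and names here are illustrative assumptions, not the actual Iceberg estimates: the point is only that the cache is charged an estimated in-memory size per loaded delete file, so users reason about one total byte budget instead of on-disk file sizes:

```java
// Illustrative sketch of per-entry size estimation for a byte-bounded cache.
public class CacheSizeEstimate {
    static final long OBJECT_HEADER = 16; // assumed per-object overhead

    // position deletes: roughly proportional to the number of deleted positions
    static long estimatePositionIndex(long deletedPositions) {
        return OBJECT_HEADER + 8 * deletedPositions; // ~8 bytes per position, ignoring bitmap compression
    }

    // equality deletes: record count times the summed default width
    // of the equality columns only (extra stored columns are not cached)
    static long estimateEqualitySet(long recordCount, long[] equalityColumnWidths) {
        long rowWidth = OBJECT_HEADER;
        for (long width : equalityColumnWidths) {
            rowWidth += width;
        }
        return recordCount * rowWidth;
    }
}
```

Estimating from record counts and column types, rather than fileSizeInBytes, matches the earlier point that the on-disk size includes columns that never reach the cache.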

*/
public static final ConfigEntry<Integer> DELETE_WORKER_THREAD_POOL_SIZE =
new ConfigEntry<>(
"iceberg.worker.delete-num-threads",
"ICEBERG_WORKER_DELETE_NUM_THREADS",
Math.max(2, Runtime.getRuntime().availableProcessors()),
Math.max(2, 4 * Runtime.getRuntime().availableProcessors()),
A member commented:

Not a huge deal, but we are sidestepping the revapi check here by keeping the multiplier inline. We should probably move the 4 into a field so future modifications trigger the revapi checker.

@aokolnychyi (author):

I can do that, but is it something we want to expose to others? Would the goal be to bring attention to changes or to prohibit future modifications?

A member commented:

You can keep it private, it's to prohibit future mods

@aokolnychyi (author):

If we keep it private, it won't break revapi. I am not sure about a public one. I'll check other places we have.

@aokolnychyi (author):

Looks like we don't do that for other properties either. I'd be open to exploring that, but for all properties in a separate PR.

@@ -27,6 +27,15 @@ class BitmapPositionDeleteIndex implements PositionDeleteIndex {
roaring64Bitmap = new Roaring64Bitmap();
}

void merge(PositionDeleteIndex other) {
A member commented:

Why not just only allow BitmapPositionDeleteIndex here? Do we not have the type when we call merge?

@aokolnychyi (author):

There is also EmptyPositionDeleteIndex. I actually started with BitmapPositionDeleteIndex. I may need to go back and check with fresh eyes.

@aokolnychyi (author):

I switched to accepting BitmapPositionDeleteIndex here and moved casts to the utility class.
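The final shape described here can be sketched like this. A hedged approximation, not the actual Iceberg classes: java.util.BitSet stands in for Roaring64Bitmap to keep the sketch dependency-free, and the key point is that merge() takes the concrete bitmap-backed type so the union goes straight to the underlying bitmap:

```java
// Illustrative sketch: a bitmap-backed position delete index with a concrete-typed merge.
import java.util.BitSet;

public class BitmapIndexSketch {
    private final BitSet bitmap = new BitSet();

    void delete(int position) {
        bitmap.set(position);
    }

    void merge(BitmapIndexSketch other) {
        bitmap.or(other.bitmap); // in-place union of the two sets of deleted positions
    }

    boolean isDeleted(int position) {
        return bitmap.get(position);
    }
}
```

Accepting the concrete type avoids an instanceof check inside merge; callers that hold an EmptyPositionDeleteIndex or another implementation do the cast (or skip the merge) at the call site, as the comment describes.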

@aokolnychyi aokolnychyi force-pushed the executor-delete-cache branch 6 times, most recently from c497048 to 5e0392b Compare January 9, 2024 21:51
long start = System.currentTimeMillis();
V value = valueSupplier.get();
long end = System.currentTimeMillis();
LOG.info("Loaded value for {} with size {} in {} ms", key, valueSize, (end - start));
@aokolnychyi (author):

I am going back and forth on the log level here. I'd say it is a fragile place and it is better to always have more logs for now. I don't expect a huge number of these lines. We do have pretty detailed logs for broadcasts in Spark, for instance.

That said, we can switch to debug if everyone thinks it would be better.

@aokolnychyi aokolnychyi force-pushed the executor-delete-cache branch 2 times, most recently from f0615aa to d743936 Compare January 10, 2024 09:25
@szehon-ho (Collaborator) left a comment:

Left one comment, but rest looks good to me now. Thanks for the changes

long entrySize = OBJECT_HEADER + defaultSize(map.keyType()) + defaultSize(map.valueType());
return OBJECT_HEADER + 5 * entrySize;
default:
return 16;
A collaborator commented:

I see in a lot of places in the code (SparkValueConverters, ExpressionUtil) that if we don't match the type, we throw UnsupportedOperationException. I just felt it's better to realize we missed a type here than to give a random estimate and never know to fix it.
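The fail-fast alternative the reviewer is arguing for looks like this in miniature. The type names and byte widths are illustrative, not the actual Iceberg type system:

```java
// Illustrative sketch: fail fast on an unhandled type instead of returning a
// catch-all estimate that silently hides a missing case.
public class DefaultSizes {
    static long defaultSize(String typeName) {
        switch (typeName) {
            case "int":
            case "float":
                return 4;
            case "long":
            case "double":
                return 8;
            default:
                // surfaces a missed type immediately rather than mis-estimating forever
                throw new UnsupportedOperationException("No size estimate for type: " + typeName);
        }
    }
}
```

The trade-off is that a thrown exception fails the read path outright, whereas the catch-all `return 16` degrades only the cache accounting, which is presumably why the PR kept a default.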

@aokolnychyi aokolnychyi force-pushed the executor-delete-cache branch from d743936 to d5d7f79 Compare January 12, 2024 22:03
@aokolnychyi aokolnychyi force-pushed the executor-delete-cache branch from d5d7f79 to 16486d3 Compare January 14, 2024 23:10
@aokolnychyi aokolnychyi reopened this Jan 16, 2024
@aokolnychyi (author):

I gave this PR a round of testing on the cluster and it seems to work as expected.

@aokolnychyi aokolnychyi merged commit 684f7a7 into apache:main Jan 16, 2024
@aokolnychyi (author):

Thanks for reviewing, @szehon-ho @RussellSpitzer!

aokolnychyi added a commit that referenced this pull request Feb 2, 2024
devangjhabakh pushed a commit to cdouglas/iceberg that referenced this pull request Apr 22, 2024
5 participants