support adding extra commit metadata with SQL in Spark by CodingCat · Pull Request #4956 · apache/iceberg

CodingCat · 2022-06-03T15:58:04Z

This PR lets users add extra commit metadata when operating on tables with SQL. It also lets users commit data to tables from multiple threads while keeping the metadata thread-local.

New usage:

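import scala.collection.JavaConverters._

import org.apache.iceberg.spark.CommitMetadata
import org.apache.spark.sql.SparkSession

(0 until 10).foreach { _ =>
  new Thread() {
    override def run(): Unit = {
      // Properties set here are visible only to commits made on this thread.
      CommitMetadata.withCommitProperties(
        Map("metadata-key" -> "thread-local-metadata-value").asJava,
        () => {
          SparkSession.getActiveSession.get.sql("INSERT INTO target VALUES (3, 'c'), (4, 'd')")
        })
    }
  }.start()
}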
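For readers who want to see the idea behind the usage above, here is a minimal Java sketch of a thread-local holder that scopes properties to a callable. `ScopedCommitProps` and its method bodies are hypothetical stand-ins written for illustration; they are not the code in this PR, and the real class may differ (for example, in how it handles exceptions).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical stand-in for the PR's CommitMetadata; not the actual Iceberg class.
final class ScopedCommitProps {
  // Each thread sees its own map; the default is an empty map.
  private static final ThreadLocal<Map<String, String>> PROPS =
      ThreadLocal.withInitial(Collections::emptyMap);

  private ScopedCommitProps() {}

  // Returns the properties visible to the calling thread only.
  static Map<String, String> commitProperties() {
    return PROPS.get();
  }

  // Runs the callable with the given properties set, then restores the previous value.
  static <R> R withCommitProperties(Map<String, String> props, Callable<R> callable)
      throws Exception {
    Map<String, String> previous = PROPS.get();
    PROPS.set(Collections.unmodifiableMap(new HashMap<>(props)));
    try {
      return callable.call();
    } finally {
      // Restore even if the callable throws, so properties never leak past the call.
      PROPS.set(previous);
    }
  }
}
```

Because the properties are restored in a `finally` block, they are visible only while the callable runs on the calling thread.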

CodingCat · 2022-06-03T16:00:27Z

Hi @kbendick @rdblue, I just opened this PR as a follow-up to #4795. Please help review it, and thanks in advance! Once we agree on the approach here, I will add the changes to the other Spark versions.

rdblue · 2022-06-03T16:04:07Z

spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestDataSourceOptions.java

+    String tableLocation = temp.newFolder("iceberg-table").toString();
+    HadoopTables tables = new HadoopTables(CONF);
+    int threadsCount = 3;
+    ExecutorService executorService = Executors.newFixedThreadPool(threadsCount, new ThreadFactory() {


Can you use the helper in ThreadPools and also wrap it in a try/finally to close the threadpool?

I'm also not entirely sure this requires a threadpool to test. I think it would be fine to test a single write in the current thread.

I would also prefer a test that uses a single write in the current thread without any additional threading business. I worry that CommitMetadata doesn't seem thread local to users.

And then if a multi-threaded test is needed, using the helpers from ThreadPools as suggested.

Updated to use ThreadPools. I think multi-threaded testing is still necessary, though: we need something guarding that the commit metadata change is thread-safe, whether we use ThreadLocal as we do now or later switch to something else for any reason.

I don't think multi-threaded testing is needed. It's enough to know that we're using a thread-local. This also is not guaranteed to run the way this test assumes that it will. There is not a guarantee that the thread pool will scale all the way up, and there's no guarantee that the tasks will each run in a separate thread. I think it's likely that those will happen, but this could still be a source of flakiness later on. Also, this doesn't necessarily test that the thread-local is working properly because there's no guarantee of concurrency across tasks.

While it's probably working the way you expect, there's no guarantee that it must. So I'd prefer to keep the test simple.
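If a multi-threaded test is kept, the try/finally cleanup requested in this thread might look like the sketch below. It uses only `java.util.concurrent` so that it is self-contained; it does not use Iceberg's ThreadPools helper, and `PoolCleanupSketch` is a hypothetical name.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical example class; illustrates closing a pool in try/finally.
final class PoolCleanupSketch {
  private PoolCleanupSketch() {}

  static int runOnPool(int threads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      // Stand-in for the write under test.
      Future<Integer> result = pool.submit(() -> 1 + 1);
      return result.get(10, TimeUnit.SECONDS);
    } finally {
      // Always release the threads, even when the task or the wait fails.
      pool.shutdownNow();
    }
  }
}
```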


kbendick · 2022-06-03T17:37:17Z

spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/CommitMetadata.java

+/**
+ * utility class to accept thread local commit properties
+ */
+public class CommitMetadata {


Does this class need to be public? Could it be made package-private?

I have concerns about using ThreadLocal for something that in most cases doesn't need to be thread-local. I don't want to give users too much room to hurt themselves: they may not realize that CommitMetadata is thread-local only, and then their writes won't work properly in the common case of writes without user-side multithreading (e.g. the metadata gets set in one thread somewhere, but another thread is used for the commit).

EDIT - Since this takes a Callable, it's less of a concern. I would still name it in a way that's a bit more reflective of its thread-local nature (especially if we one day wanted a CommitMetadata class that doesn't require a callable and is persistent). That, and I always prefer things to be package-private if possible.

sure, changed to CallerWithCommitMetadata....but...eh...not sure if it is better or worse....

Does this class need to be public? Could it be made package-private?

Yes, this does need to be public because it is a way for Iceberg users to pass metadata.
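The concern raised in this thread, that a value set on one thread is not seen by a commit running on another, can be illustrated with plain Java. This is a sketch only: `ThreadLocalVisibilitySketch` and its `VALUE` field are hypothetical and stand in for thread-local commit properties.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical example class; not part of the PR.
final class ThreadLocalVisibilitySketch {
  // Plays the role of commit properties kept in a plain (non-inheritable) ThreadLocal.
  static final ThreadLocal<String> VALUE = ThreadLocal.withInitial(() -> "unset");

  private ThreadLocalVisibilitySketch() {}

  // Sets the value on the calling thread, then reads it from a different thread.
  static String readFromOtherThread(String valueSetByCaller) throws InterruptedException {
    VALUE.set(valueSetByCaller);
    try {
      AtomicReference<String> seen = new AtomicReference<>();
      Thread other = new Thread(() -> seen.set(VALUE.get()));
      other.start();
      other.join();
      // The other thread gets its own initial value, not the caller's.
      return seen.get();
    } finally {
      VALUE.remove();
    }
  }
}
```

Taking a callable avoids this trap as long as the commit happens on the thread that runs the callable, which is why the callable-based API makes the concern smaller.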


kbendick

Thinking this over more, there's really no way to achieve this use-case without this. So outside of the style nits Ryan mentioned, this looks pretty much good to me.

Thanks @CodingCat!

CodingCat · 2022-06-04T05:22:04Z

Thanks, @rdblue and @kbendick! I just updated the PR.

CodingCat · 2022-06-05T23:11:55Z

Just to confirm: it seems we don't support SQL-based table insert/merge in Spark 2.4 at all, right? (So this functionality is not relevant there.)

rdblue

What was wrong with the CommitMetadata name? I think we should have a simpler name than CallerWithCommitMetadata. That's a bit too confusing. I think the original name was good.

CodingCat · 2022-06-06T02:51:13Z

What was wrong with the CommitMetadata name? I think we should have a simpler name than CallerWithCommitMetadata. That's a bit too confusing. I think the original name was good.

#4956 (comment): @kbendick mentioned a potential conflict with SS's CommitMetadata here. I don't have a strong opinion on this.

singhpk234 · 2022-06-06T16:44:43Z

spark/v3.0/spark/src/main/java/org/apache/iceberg/spark/CallerWithCommitMetadata.java

+      return callable.call();
+    } catch (Throwable e) {
+      ExceptionUtil.castAndThrow(e, exClass);
+      return null;


[minor] Is this required, given that we throw on the line above?

I don't think the compiler sees that castAndThrow will always throw, so it needs this to know what to do.
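The point about the compiler can be reproduced in isolation. In the sketch below, `castAndThrow` is a hypothetical analog of the helper used in the diff (its real implementation may differ). Because it is declared `void`, javac cannot tell that it never returns normally, so the catch block still needs a `return`.

```java
import java.util.concurrent.Callable;

// Hypothetical example class; mirrors the shape of the code under review.
final class CastAndThrowSketch {
  private CastAndThrowSketch() {}

  // Always throws, but the void signature hides that fact from the compiler.
  static <E extends Exception> void castAndThrow(Throwable t, Class<E> exClass) throws E {
    if (t instanceof RuntimeException) {
      throw (RuntimeException) t;
    } else if (t instanceof Error) {
      throw (Error) t;
    } else if (exClass.isInstance(t)) {
      throw exClass.cast(t);
    }
    throw new RuntimeException(t);
  }

  static <R, E extends Exception> R call(Callable<R> callable, Class<E> exClass) throws E {
    try {
      return callable.call();
    } catch (Throwable e) {
      castAndThrow(e, exClass);
      // Never reached at runtime; without it javac reports "missing return statement".
      return null;
    }
  }
}
```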

singhpk234 · 2022-06-06T16:50:57Z

spark/v3.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java

+    if (!CallerWithCommitMetadata.commitProperties().isEmpty()) {
+      CallerWithCommitMetadata.commitProperties().forEach(operation::set);
+    }


[doubt] (Probably not in scope for this PR.) Should we add validation that the keys passed from here, as well as those in extraSnapshotMetadata, don't override keys already present in the snapshot summary? Or is overriding the intended behavior?

I don't think that we need to worry about this. It is unlikely to conflict and if it does conflict, it's up to the caller to decide what to do.
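To make the override question concrete, here is a small sketch of what applying two property maps with `forEach(operation::set)` does when keys collide. `Operation` is a hypothetical stand-in, and the sketch assumes `set` behaves like `Map.put`; how Iceberg actually treats keys it writes to the snapshot summary itself is not shown in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example class; not the Iceberg snapshot operation API.
final class SummaryOverrideSketch {
  private SummaryOverrideSketch() {}

  // Assumption: set() behaves like Map.put, so the last write for a key wins.
  static final class Operation {
    final Map<String, String> summary = new HashMap<>();

    void set(String key, String value) {
      summary.put(key, value);
    }
  }

  static Map<String, String> apply(Map<String, String> existing, Map<String, String> extra) {
    Operation operation = new Operation();
    existing.forEach(operation::set);
    // A colliding key from the second map silently replaces the earlier value.
    extra.forEach(operation::set);
    return operation.summary;
  }
}
```

Under that assumption no error is raised on a collision, which matches the reply above that resolving conflicts is left to the caller.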

singhpk234 · 2022-06-06T16:54:51Z

spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java

+    if (!CallerWithCommitMetadata.commitProperties().isEmpty()) {
+      CallerWithCommitMetadata.commitProperties().forEach(operation::set);
+    }


[question] Should we also add this to SparkPositionDeltaWrite

iceberg/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWrite.java

Line 250 in b06a89c

extraSnapshotMetadata.forEach(operation::set);

since, starting with Spark 3.2, Iceberg supports MOR with position deletes?

Yes, that's a good idea.

rdblue · 2022-06-06T17:11:36Z

@CodingCat, I think renaming the class back is about the only thing left to fix.

kbendick · 2022-06-06T17:12:52Z

What was wrong with the CommitMetadata name? I think we should have a simpler name than CallerWithCommitMetadata. That's a bit too confusing. I think the original name was good.

#4956 (comment): @kbendick mentioned a potential conflict with SS's CommitMetadata here. I don't have a strong opinion on this.

I’m good with the original name too (and prefer it to the new one).

CodingCat · 2022-06-06T22:00:24Z

@kbendick @rdblue @singhpk234 updated the PR accordingly, thanks!

rdblue · 2022-06-06T22:48:26Z

Thanks, @CodingCat!

apache#4956) This is needed because Spark cannot pass additional metadata for some operations.

CodingCat added 6 commits May 31, 2022 22:03

thread local

807cbd1

temp

df42e42

test

54cea12

formatting

22cc18a

test

8ee624c

move to spark-core

dcb888e

github-actions bot added the spark label Jun 3, 2022

CodingCat mentioned this pull request Jun 3, 2022

expose the latest snapshot id committed within a thread #4795

Closed

rdblue reviewed Jun 3, 2022
