[SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode #45607
Conversation
Ack, @dbtsai.
Thank you for making a PR, @leletan .
This PR seems to mix two independent themes. Please create another JIRA for the following; we can proceed with that one first.
Stop appending primary resource to spark.jars to avoid duplicating the primary resource jar in spark.jars.
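In code terms, the proposed fix amounts to guarding the append. A minimal sketch, assuming a helper that mirrors the `isKubernetesClusterModeDriver` flag from the PR (the helper itself is illustrative, not the actual `SparkSubmit` code):

```scala
// Minimal sketch of the proposed change (illustrative; not the actual SparkSubmit code).
// Under isKubernetesClusterModeDriver, spark.jars already contains the primary
// resource (it was appended during the first, cluster-mode submission), so the
// second, in-driver submission must not append it again.
def mergeJars(
    existingJars: Seq[String],
    primaryResource: String,
    isKubernetesClusterModeDriver: Boolean): Seq[String] = {
  if (isKubernetesClusterModeDriver) existingJars
  else existingJars :+ primaryResource
}

// Second (in-driver) submission: the jar is already present as a driver-local copy.
val merged = mergeJars(
  Seq("/opt/spark/work-dir/app.jar"), "s3a://bucket/app.jar",
  isKubernetesClusterModeDriver = true)
assert(merged == Seq("/opt/spark/work-dir/app.jar"))
```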
      "For use in cases when the jars are big and executor counts are high, " +
      "concurrent download causes network saturation and timeouts. " +
      "Wildcard '*' is denoted to not downloading jars for any the schemes.")
    .version("2.3.0")
This should be 4.0.0
Good catch.
Will fix and move this to another JIRA & PR.
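Applying the review comment, a newly introduced config should carry the version of the release it first ships in. A hedged sketch of what the corrected entry might look like, assuming Spark's `ConfigBuilder` style (the config key name and surrounding builder calls are assumptions; only the doc string and version appear in the diff above):

```scala
// Sketch of the corrected config entry (ConfigBuilder-style; key name assumed).
private[spark] val JAR_AVOID_DOWNLOAD_SCHEMES =
  ConfigBuilder("spark.jars.avoidDownloadSchemes")
    .doc("Comma-separated list of schemes for which jars will not be downloaded to " +
      "the driver. For use in cases when the jars are big and executor counts are high, " +
      "concurrent download causes network saturation and timeouts. " +
      "Wildcard '*' is denoted to not downloading jars for any the schemes.")
    .version("4.0.0") // new config, so it targets the upcoming release, not 2.3.0
    .stringConf
    .toSequence
    .createWithDefault(Nil)
```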
@dongjoon-hyun Thanks for your advice.
+CC @zhouyejoe
Thank you for updating, @leletan. I'll resume the review this weekend.
I just assigned this to me in order not to forget. It doesn't block any community reviews.
@@ -504,6 +504,25 @@ class SparkSubmitSuite
    }
  }

  test("SPARK-47475: Not to add primary resource to jars again" +
Oh, this JIRA ID is wrong. We need to have SPARK-47495 like the PR title.
Good catch!!!
Thanks for fixing this!
Merged to master because the last commit only changes the JIRA ID in the test case name.
Welcome to the Apache Spark community, @leletan. I added you to the Apache Spark contributor group (in JIRA) and assigned SPARK-47495 to you. Congratulations on your first commit, @leletan.
Closes apache#45607 from leletan/fix_k8s_submit_jar_distribution.

Lead-authored-by: jiale_tan <jiale_tan@apple.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?

In `SparkSubmit`, for the `isKubernetesClusterModeDriver` code path, stop appending the primary resource to `spark.jars`, to avoid duplicating the primary resource jar in `spark.jars`.

Why are the changes needed?
Context:

To submit Spark jobs to Kubernetes in cluster mode, spark-submit is called twice. The first time, SparkSubmit runs in k8s cluster mode: it appends the primary resource to `spark.jars` and calls `KubernetesClientApplication::start` to create a driver pod. The driver pod then runs spark-submit again with the updated configuration (the same application jar, which is now also listed in `spark.jars`). This time SparkSubmit runs in client mode with `spark.kubernetes.submitInDriver` set to `true`. In this mode, all jars in `spark.jars` are downloaded to the driver and their URLs are replaced with driver-local paths. SparkSubmit then appends the primary resource to `spark.jars` again, so `spark.jars` ends up holding two copies of the primary resource: one with the original URL the user submitted, the other with the driver-local file path. When the driver starts the `SparkContext`, it copies all of `spark.jars` into `spark.app.initial.jar.urls` and replaces the driver-local jar paths in `spark.app.initial.jar.urls` with driver file-service paths, from which the executors can download those driver-local jars.

Issues:

The executor downloads two duplicate copies of the primary resource, one from the original URL the user submitted and the other from the driver-local file path, which wastes network and storage resources.
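The two-submission flow and the resulting duplication can be simulated in a few lines. A self-contained sketch; the paths and the rewrite step are illustrative stand-ins for the behavior described above, not actual Spark internals:

```scala
// Simulates the duplication described above (illustrative paths).
val userJarUrl = "s3a://bucket/app.jar"             // primary resource as submitted
val driverLocalPath = "/opt/spark/work-dir/app.jar" // after download to the driver

// First submission (cluster mode): primary resource appended to spark.jars.
val afterFirstSubmit = Seq(userJarUrl)

// Second submission (client mode inside the driver pod): jars are downloaded
// and their URLs rewritten to driver-local paths.
val downloaded = afterFirstSubmit.map(_ => driverLocalPath)

// Before the fix: the primary resource is appended a second time.
val beforeFix = downloaded :+ userJarUrl
assert(beforeFix.size == 2) // executors would fetch the same jar twice

// After the fix: the second append is skipped, leaving a single copy.
val afterFix = downloaded
assert(afterFix == Seq(driverLocalPath))
```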
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit test added.
Was this patch authored or co-authored using generative AI tooling?
No