
[SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode #45607

Closed

Conversation

leletan

@leletan leletan commented Mar 20, 2024

What changes were proposed in this pull request?

In SparkSubmit, on the isKubernetesClusterModeDriver code path, stop appending the primary resource to spark.jars, so the primary resource jar is no longer duplicated in spark.jars.
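For illustration, a minimal, self-contained sketch of the guard this change introduces (names and structure are simplified for this example and are not the exact code in SparkSubmit):

```scala
// Simplified sketch, not the actual patch: the primary resource is only appended to
// spark.jars on the first (outside-the-pod) submission; the in-driver submission skips it
// because jar localization has already placed the primary jar in spark.jars.
object PrimaryResourceJars {
  def resolveJars(
      existingJars: Seq[String],
      primaryResource: String,
      isKubernetesClusterModeDriver: Boolean): Seq[String] = {
    if (isKubernetesClusterModeDriver) {
      existingJars // already contains the (localized) primary resource jar
    } else {
      existingJars :+ primaryResource // first submission: record the primary resource
    }
  }
}
```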

Why are the changes needed?

Context:

To submit Spark jobs to Kubernetes in cluster mode, spark-submit is effectively run twice:

1. The first run executes SparkSubmit in k8s cluster mode. It appends the primary resource to spark.jars and calls KubernetesClientApplication::start to create the driver pod. The driver pod then runs spark-submit again with the updated configuration, so the application jar is already listed in spark.jars.
2. The second run executes SparkSubmit in client mode with spark.kubernetes.submitInDriver set to true. In this mode, every jar in spark.jars is downloaded to the driver and its URL is replaced with the driver-local path. SparkSubmit then appends the primary resource to spark.jars again, leaving spark.jars with two entries for the same jar: the original URL the user submitted and the driver-local file path.

Later, when the driver starts the SparkContext, it copies spark.jars into spark.app.initial.jar.urls and replaces the driver-local paths there with driver file-service paths, from which the executors can download those driver-local jars.

Issues:

The executor downloads two copies of the primary resource, one from the original URL the user submitted and one from the driver-local file path, which wastes network and storage resources.
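To make the duplication concrete, here is a hypothetical example (the jar paths are invented, not taken from the PR) of what spark.jars ends up holding inside the driver pod before and after this change:

```scala
import org.apache.spark.SparkConf

// Hypothetical illustration only; the jar paths below are made up.
// Before the fix: the localized copy plus the re-appended original URL of the same jar.
val before = new SparkConf().set("spark.jars",
  "file:/opt/spark/work-dir/app.jar," + // driver-local copy produced by jar localization
  "s3a://bucket/path/app.jar")          // original URL appended again as the primary resource

// After the fix: only one entry remains, so each executor fetches the jar once.
val after = new SparkConf().set("spark.jars", "file:/opt/spark/work-dir/app.jar")
```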

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test added.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Mar 20, 2024
@HyukjinKwon HyukjinKwon changed the title [Spark-47475][Core] Fix Executors Scaling Issues Caused by Jar Download Under K8s Cluster Mode [SPARK-47475][CORE] Fix Executors Scaling Issues Caused by Jar Download Under K8s Cluster Mode Mar 20, 2024
@dbtsai
Member

dbtsai commented Mar 20, 2024

cc @dongjoon-hyun

@dongjoon-hyun
Member

Ack, @dbtsai .

Member

@dongjoon-hyun dongjoon-hyun left a comment


Thank you for making a PR, @leletan .

This PR seems to mix two independent themes. Please create another JIRA for the following; we can proceed with that first.

Stop appending primary resource to spark.jars to avoid duplicating the primary resource jar in spark.jars.

"For use in cases when the jars are big and executor counts are high, " +
"concurrent download causes network saturation and timeouts. " +
"Wildcard '*' is denoted to not downloading jars for any the schemes.")
.version("2.3.0")
Member

This should be 4.0.0.

Author


Good catch.

Author

@leletan leletan Mar 21, 2024


Will fix and move this to another JIRA & PR.

@leletan leletan changed the title [SPARK-47475][CORE] Fix Executors Scaling Issues Caused by Jar Download Under K8s Cluster Mode [SPARK-47475][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode Mar 21, 2024
@leletan leletan changed the title [SPARK-47475][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode [SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode Mar 21, 2024
@leletan
Author

leletan commented Mar 21, 2024

@dongjoon-hyun Thanks for your advice.
Updated this PR to focus only on the "duplicate primary resource download" issue and associated it with a new JIRA: https://issues.apache.org/jira/browse/SPARK-47495.
Updated the previous JIRA https://issues.apache.org/jira/browse/SPARK-47475 by reducing its scope to only the scaling issue. Will create another PR for that JIRA later.
Please let me know if this looks good to you. Thanks!

@leletan leletan requested a review from dongjoon-hyun March 21, 2024 04:21
@mridulm
Contributor

mridulm commented Mar 21, 2024

+CC @zhouyejoe

@dongjoon-hyun
Member

Thank you for updating, @leletan .

I'll resume the review this weekend.

@dongjoon-hyun dongjoon-hyun self-assigned this Mar 22, 2024
@dongjoon-hyun
Member

I just assigned this to myself so I don't forget. It doesn't block any community reviews.

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM for Apache Spark 4.0.0.

Thank you, @leletan , @dbtsai , @mridulm .

Merged to master.

@@ -504,6 +504,25 @@ class SparkSubmitSuite
}
}

test("SPARK-47475: Not to add primary resource to jars again" +
Member


Oh, this JIRA ID is wrong. We need to have SPARK-47495 like the PR title.

Author


Good catch!!!
Thanks for fixing this!

@dongjoon-hyun
Member

Merged to master because the last commit only changes the JIRA ID in the test case name.

@dongjoon-hyun
Member

Welcome to the Apache Spark community, @leletan .

I added you to the Apache Spark contributor group (in JIRA) and assigned SPARK-47495 to you.

Congratulations on your first commit, @leletan.

@leletan leletan deleted the fix_k8s_submit_jar_distribution branch March 26, 2024 03:34
sweisdb pushed a commit to sweisdb/spark that referenced this pull request Apr 1, 2024
…e under k8s cluster mode

Closes apache#45607 from leletan/fix_k8s_submit_jar_distribution.

Lead-authored-by: jiale_tan <jiale_tan@apple.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun dongjoon-hyun removed their assignment Apr 14, 2024
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Aug 7, 2024
…e under k8s cluster mode
