[SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode #45607
Conversation
Ack, @dbtsai.
Thank you for making a PR, @leletan .
This PR seems to mix two independent themes. Please create another JIRA for the following; we can proceed with that one first.
Stop appending primary resource to spark.jars to avoid duplicating the primary resource jar in spark.jars.
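In code terms, the proposed fix amounts to guarding the append. A minimal sketch, assuming a helper that mirrors the `isKubernetesClusterModeDriver` flag from the PR (the helper itself is illustrative, not the actual `SparkSubmit` code):

```scala
// Minimal sketch of the proposed change (illustrative; not the actual SparkSubmit code).
// Under isKubernetesClusterModeDriver, spark.jars already contains the primary
// resource (it was appended during the first, cluster-mode submission), so the
// second, in-driver submission must not append it again.
def mergeJars(
    existingJars: Seq[String],
    primaryResource: String,
    isKubernetesClusterModeDriver: Boolean): Seq[String] = {
  if (isKubernetesClusterModeDriver) existingJars
  else existingJars :+ primaryResource
}

// Second (in-driver) submission: the jar is already present as a driver-local copy.
val merged = mergeJars(
  Seq("/opt/spark/work-dir/app.jar"), "s3a://bucket/app.jar",
  isKubernetesClusterModeDriver = true)
assert(merged == Seq("/opt/spark/work-dir/app.jar"))
```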
      "For use in cases when the jars are big and executor counts are high, " +
      "concurrent download causes network saturation and timeouts. " +
      "Wildcard '*' is denoted to not downloading jars for any the schemes.")
    .version("2.3.0")
This should be 4.0.0
Good catch.
Will fix and move this to another JIRA & PR.
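Applying the review comment, a newly introduced config should carry the version of the release it first ships in. A hedged sketch of what the corrected entry might look like, assuming Spark's `ConfigBuilder` style (the config key name and surrounding builder calls are assumptions; only the doc string and version appear in the diff above):

```scala
// Sketch of the corrected config entry (ConfigBuilder-style; key name assumed).
private[spark] val JAR_AVOID_DOWNLOAD_SCHEMES =
  ConfigBuilder("spark.jars.avoidDownloadSchemes")
    .doc("Comma-separated list of schemes for which jars will not be downloaded to " +
      "the driver. For use in cases when the jars are big and executor counts are high, " +
      "concurrent download causes network saturation and timeouts. " +
      "Wildcard '*' is denoted to not downloading jars for any the schemes.")
    .version("4.0.0") // new config, so it targets the upcoming release, not 2.3.0
    .stringConf
    .toSequence
    .createWithDefault(Nil)
```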
@dongjoon-hyun Thanks for your advice.
+CC @zhouyejoe
Thank you for updating, @leletan. I'll resume the review this weekend.
I just assigned this to me in order not to forget. It doesn't block any community reviews.
@@ -504,6 +504,25 @@ class SparkSubmitSuite
    }
  }

  test("SPARK-47475: Not to add primary resource to jars again" +
Oh, this JIRA ID is wrong. We need to have SPARK-47495 like the PR title.
Good catch!!!
Thanks for fixing this!
Merged to master because the last commit only changes the JIRA ID in the test case name.
Welcome to the Apache Spark community, @leletan. I added you to the Apache Spark contributor group (in JIRA) and assigned SPARK-47495 to you. Congratulations on your first commit, @leletan.
Closes apache#45607 from leletan/fix_k8s_submit_jar_distribution.

Lead-authored-by: jiale_tan <jiale_tan@apple.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?

In `SparkSubmit`, for the `isKubernetesClusterModeDriver` code path, stop appending the primary resource to `spark.jars`, to avoid duplicating the primary resource jar in `spark.jars`.

Why are the changes needed?
Context:

To submit Spark jobs to Kubernetes in cluster mode, spark-submit is called twice. The first time, SparkSubmit runs in k8s cluster mode: it appends the primary resource to `spark.jars` and calls `KubernetesClientApplication::start` to create a driver pod. The driver pod then runs spark-submit again with the updated configuration (the same application jar, which is now also listed in `spark.jars`). This time SparkSubmit runs in client mode with `spark.kubernetes.submitInDriver` set to `true`. In this mode, all jars in `spark.jars` are downloaded to the driver and their URLs are replaced with driver-local paths. SparkSubmit then appends the primary resource to `spark.jars` again, so `spark.jars` ends up holding two copies of the primary resource: one with the original URL the user submitted, the other with the driver-local file path. When the driver starts the `SparkContext`, it copies all of `spark.jars` into `spark.app.initial.jar.urls` and replaces the driver-local jar paths in `spark.app.initial.jar.urls` with driver file-service paths, from which the executors can download those driver-local jars.

Issues:

The executor downloads two duplicate copies of the primary resource, one from the original URL the user submitted and the other from the driver-local file path, which wastes network and storage resources.
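The two-submission flow and the resulting duplication can be simulated in a few lines. A self-contained sketch; the paths and the rewrite step are illustrative stand-ins for the behavior described above, not actual Spark internals:

```scala
// Simulates the duplication described above (illustrative paths).
val userJarUrl = "s3a://bucket/app.jar"             // primary resource as submitted
val driverLocalPath = "/opt/spark/work-dir/app.jar" // after download to the driver

// First submission (cluster mode): primary resource appended to spark.jars.
val afterFirstSubmit = Seq(userJarUrl)

// Second submission (client mode inside the driver pod): jars are downloaded
// and their URLs rewritten to driver-local paths.
val downloaded = afterFirstSubmit.map(_ => driverLocalPath)

// Before the fix: the primary resource is appended a second time.
val beforeFix = downloaded :+ userJarUrl
assert(beforeFix.size == 2) // executors would fetch the same jar twice

// After the fix: the second append is skipped, leaving a single copy.
val afterFix = downloaded
assert(afterFix == Seq(driverLocalPath))
```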
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit test added.
Was this patch authored or co-authored using generative AI tooling?
No