Conversation

@TongWei1105 (Contributor) commented May 28, 2025

What changes were proposed in this pull request?

This PR fixes a bug where submitting a Spark job with the --files option and then calling SparkContext.addFile() for a file with the same name causes Spark to throw an exception:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: File a.text was already registered with a different path (old path = /tmp/spark-6aa5129d-5bbb-464a-9e50-5b6ffe364ffb/a.text, new path = /opt/spark/work-dir/a.text)

Why are the changes needed?

  1. Submit a Spark application using spark-submit with the --files option:
    bin/spark-submit --files s3://bucket/a.text --class testDemo app.jar
  2. In the testDemo application code, call:
    sc.addFile("a.text", true)
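
For reference, a minimal sketch of what such a testDemo application might look like (illustrative only; the body of the class is an assumption based on the repro steps above):

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object testDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("testDemo"))
    // "a.text" was already distributed via --files; registering it again
    // by name is what triggers the duplicate-path check described below.
    sc.addFile("a.text", true)
    println(SparkFiles.get("a.text"))
    sc.stop()
  }
}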

This works correctly in YARN mode, but throws an error in Kubernetes mode.
After SPARK-33782, in Kubernetes mode, --files, --jars, --archiveFiles, and --pyFiles are all downloaded to the working directory.

However, in the code, args.files is set to filesLocalFiles, and filesLocalFiles refers to the temporary download path, not the working directory.
This causes issues when user code like testDemo calls sc.addFile("a.text", true), resulting in an error such as:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: File a.text was already registered with a different path (old path = /tmp/spark-6aa5129d-5bbb-464a-9e50-5b6ffe364ffb/a.text, new path = /opt/spark/work-dir/a.text)
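
Conceptually, the fix is to record the working-directory copy, not the temporary download, as each resource's location once it has been localized. A minimal sketch of that idea (the localizedUri wrapper and its parameter names are illustrative, though the copy-then-return-dest shape mirrors the diff discussed below):

import java.io.File
import java.nio.file.Files

// source: the temporary download under /tmp/spark-.../
// dest:   the copy placed in the container working directory
def localizedUri(source: File, dest: File): java.net.URI = {
  Files.copy(source.toPath, dest.toPath)
  // Return the working-directory URI so that args.files points at the same
  // path a later sc.addFile("a.text") will resolve to.
  dest.toURI
}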

Does this PR introduce any user-facing change?

Yes. After this PR, the scenario above no longer throws an exception in Kubernetes mode.

How was this patch tested?

Existing unit tests, plus a new unit test added during review.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions bot added the CORE label May 28, 2025
@TongWei1105 (Contributor, Author) commented:

Hi @dongjoon-hyun,
Could you please review this PR when you have time? Many thanks!

@TongWei1105 changed the title [SPARK-52334][CORE][K8S] update all files, jars, archiveFiles, and pyFiles to reference the working directory after they are downloaded → [SPARK-52334][CORE][K8S] update all files, jars, and pyFiles to reference the working directory after they are downloaded May 28, 2025
@dongjoon-hyun (Member) left a comment:

Do you think you can write a test case, @TongWei1105?

@TongWei1105 (Contributor, Author) commented May 29, 2025

Do you think you can write a test case, @TongWei1105?

@dongjoon-hyun Thanks for the suggestion! I've added a unit test; please feel free to review it when convenient. Appreciate your feedback!

@TongWei1105 requested a review from dongjoon-hyun May 29, 2025 08:54
@mridulm (Contributor) commented Jun 2, 2025:

Can you update this test to handle archives as well?

Contributor Author

Can you update this test to handle archive as well ?

Done

Contributor

log4j2.properties is not an archive file, and so it ends up getting copied to the destination (a variant of the existing cases).
I am trying to ensure that if (isArchive) { works as expected when the file actually gets unpacked.

Contributor Author

log4j2.properties is not an archive file, and so it ends up getting copied to the destination (a variant of the existing cases). I am trying to ensure that if (isArchive) { works as expected when the file actually gets unpacked.

The isArchive logic does get triggered in this case — the test has been updated to cover that scenario accordingly.
Thank you for your suggestion.
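
For context, the branch under discussion has roughly this shape (a simplified sketch, not the verbatim Spark source; the localize wrapper and its parameters are illustrative):

import java.io.File
import java.nio.file.Files
import org.apache.spark.util.Utils

def localize(source: File, dest: File, isArchive: Boolean): java.net.URI = {
  if (isArchive) {
    // Archives are unpacked into the working directory, so the test needs
    // a real archive (e.g. a zip) to exercise this branch; a plain file
    // like log4j2.properties only hits the copy path below.
    Utils.unpack(source, dest)
  } else {
    Files.copy(source.toPath, dest.toPath)
  }
  dest.toURI
}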

@TongWei1105 requested a review from mridulm June 3, 2025 04:54
@TongWei1105 force-pushed the SPARK-52334 branch 5 times, most recently from 818ff74 to 4195283 on June 4, 2025 10:19
@TongWei1105 (Contributor, Author) commented:

When you have a moment, could you please take another look at this PR? Thanks!
@dongjoon-hyun @mridulm

@TongWei1105 closed this Jun 9, 2025
@mridulm reopened this Jun 21, 2025
@mridulm (Contributor) commented Jun 21, 2025

Sorry for the delay, @TongWei1105. I assume you closed the PR due to lack of traction; if so, I am reopening it.
The change looks reasonable to me.
I would prefer that someone more familiar with K8s also takes a look.

+CC @dongjoon-hyun, @HyukjinKwon

localResources
} else {
  Files.copy(source.toPath, dest.toPath)
  dest.toURI
Member

Shouldn't it be source.toURI?

Member

Using dest.toURI, we are picking up the file names from the current working directory. I mean, yeah, this should also work, but I wonder why we should do it.

Member

Can you explain how this fixes the issue?

Contributor Author

Can you explain how this fixes the issue?

Thank you for your reply.
In Kubernetes mode, when using --files or --jars, Spark first stores a copy under the local /tmp directory and also copies it to /opt/spark/work-dir/. When addFile(file) is then called inside the SparkContext, the file path resolves to /opt/spark/work-dir/file. In NettyStreamManager, however, the file entry is still recorded under the original /tmp path, so the guard against duplicate file registrations sees a mismatch and throws the exception above.

Therefore, I believe that in Kubernetes mode, these paths should be unified to /opt/spark/work-dir/.
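
For reference, the duplicate-registration guard in NettyStreamManager has roughly this shape (a paraphrased sketch, not the verbatim source):

import java.io.File
import java.util.concurrent.ConcurrentHashMap

// Paraphrased shape of NettyStreamManager's file registry.
val files = new ConcurrentHashMap[String, File]()

def addFile(file: File): Unit = {
  val existing = files.putIfAbsent(file.getName, file)
  // If the same name arrives with a different path (the /tmp download vs.
  // the /opt/spark/work-dir copy), this requirement fails with the
  // exception quoted in the PR description.
  require(existing == null || existing == file,
    s"File ${file.getName} was already registered with a different path " +
      s"(old path = $existing, new path = $file)")
}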

@HyukjinKwon (Member) left a comment:

OK, I think it looks fine to me, but I will leave this to @dongjoon-hyun, who uses K8S in production more.

@TongWei1105 (Contributor, Author) commented:

@dongjoon-hyun, when you're free, could you help review this?

@dongjoon-hyun (Member) commented:

Sorry, guys, for missing the ping here. I can test this today.

@dongjoon-hyun self-assigned this Dec 5, 2025
@dongjoon-hyun (Member) left a comment:

The main body itself looks good to me.

Please try to avoid adding a new binary file like archive1.zip. You can create one simply, like the following, @TongWei1105.

val zipFile1 = File.createTempFile("test1_", ".zip", dir)
TestUtils.createJar(Seq(text1, json1), zipFile1)
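
(This works because a jar is itself a zip archive, so TestUtils.createJar yields a file the archive-unpacking path accepts, without checking a binary fixture into the repository.)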

Commits: "…rking directory after they are downloaded." · "fix" · "add ut" · "add ut for archives" · "fix"
@TongWei1105 (Contributor, Author) commented:

The main body itself looks good to me.

Please try to avoid adding a new binary file like archive1.zip. You can create one simply, like the following, @TongWei1105.

val zipFile1 = File.createTempFile("test1_", ".zip", dir)
TestUtils.createJar(Seq(text1, json1), zipFile1)

@dongjoon-hyun done

@dongjoon-hyun (Member) left a comment:

+1, LGTM. Thank you, @TongWei1105 and all.

Merged to master/4.1 for Apache Spark 4.1.0.

dongjoon-hyun pushed a commit that referenced this pull request Dec 5, 2025
…ence the working directory after they are downloaded

Closes #51037 from TongWei1105/SPARK-52334.

Authored-by: TongWei1105 <vvtwow@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit dd418e3)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun removed their assignment Dec 5, 2025