
[Bug] [Worker] Failed to submit Spark task in cluster mode #16987



Search before asking

  • I had searched in the issues and found no similar issues.

What happened

DolphinScheduler version: 3.2.2
Deployment: pseudo-cluster
Spark is deployed in a standalone cluster, version: 3.5.4
Resource files are stored using MinIO S3
The modified configuration files are under api-server/conf/ and worker-server/conf/. The main changed values are: <minio access key>, <minio secret key>, <ip>:9000

The rest of the configuration was left at its defaults. After starting the service, the jar file can be uploaded normally.
Then add a SPARK task to the workflow, select the jar package uploaded to MinIO, and select cluster as the deploy mode.
Then run the workflow instance, and the output log attachment is as follows:


The important error information is:

[INFO] 2025-01-24 13:53:34.674 +0800 - *********************************  Execute task instance  *************************************
[INFO] 2025-01-24 13:53:34.675 +0800 - ***********************************************************************************************
[INFO] 2025-01-24 13:53:34.677 +0800 - Final Shell file is: 
[INFO] 2025-01-24 13:53:34.677 +0800 - ****************************** Script Content *****************************************************************
[INFO] 2025-01-24 13:53:34.677 +0800 - #!/bin/bash
BASEDIR=$(cd `dirname $0`; pwd)
export SPARK_HOME=/opt/spark-3.5.4-bin-hadoop3
${SPARK_HOME}/bin/spark-submit --master spark:// --deploy-mode cluster --class org.apache.spark.examples.JavaSparkPi --conf spark.driver.cores=1 --conf spark.driver.memory=512M --conf spark.executor.instances=2 --conf spark.executor.cores=2 --conf spark.executor.memory=2G /tmp/dolphinscheduler/exec/process/default/131329535157952/131329769571008_2/6/6/spark-examples_2.12-3.5.4.jar
[INFO] 2025-01-24 13:53:34.678 +0800 - ****************************** Script Content *****************************************************************
[INFO] 2025-01-24 13:53:34.678 +0800 - Executing shell command : sudo -u default -i /tmp/dolphinscheduler/exec/process/default/131329535157952/131329769571008_2/6/6/
[INFO] 2025-01-24 13:53:34.687 +0800 - process start, process id is: 172698
[INFO] 2025-01-24 13:53:37.688 +0800 -  -> 
	25/01/24 13:53:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
	25/01/24 13:53:37 INFO SecurityManager: Changing view acls to: default
	25/01/24 13:53:37 INFO SecurityManager: Changing modify acls to: default
	25/01/24 13:53:37 INFO SecurityManager: Changing view acls groups to: 
	25/01/24 13:53:37 INFO SecurityManager: Changing modify acls groups to: 
	25/01/24 13:53:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: default; groups with view permissions: EMPTY; users with modify permissions: default; groups with modify permissions: EMPTY
[INFO] 2025-01-24 13:53:38.691 +0800 -  -> 
	25/01/24 13:53:37 INFO Utils: Successfully started service 'driverClient' on port 39639.
	25/01/24 13:53:37 INFO TransportClientFactory: Successfully created connection to / after 57 ms (0 ms spent in bootstraps)
	25/01/24 13:53:38 INFO ClientEndpoint: ... waiting before polling master for driver state
	25/01/24 13:53:38 INFO ClientEndpoint: Driver successfully submitted as driver-20250124135338-0056
[INFO] 2025-01-24 13:53:43.693 +0800 -  -> 
	25/01/24 13:53:43 INFO ClientEndpoint: State of driver-20250124135338-0056 is ERROR
	25/01/24 13:53:43 ERROR ClientEndpoint: Exception from cluster was: java.nio.file.NoSuchFileException: /tmp/dolphinscheduler/exec/process/default/131329535157952/131329769571008_2/6/6/spark-examples_2.12-3.5.4.jar
	java.nio.file.NoSuchFileException: /tmp/dolphinscheduler/exec/process/default/131329535157952/131329769571008_2/6/6/spark-examples_2.12-3.5.4.jar
		at sun.nio.fs.UnixException.translateToIOException(
		at sun.nio.fs.UnixException.rethrowAsIOException(
		at sun.nio.fs.UnixException.rethrowAsIOException(
		at sun.nio.fs.UnixCopyFile.copy(
		at sun.nio.fs.UnixFileSystemProvider.copy(
		at java.nio.file.Files.copy(
		at org.apache.spark.util.Utils$.copyRecursive(Utils.scala:681)
		at org.apache.spark.util.Utils$.copyFile(Utils.scala:652)
		at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:725)
		at org.apache.spark.util.Utils$.fetchFile(Utils.scala:467)
		at org.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:162)
		at org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:179)
		at org.apache.spark.deploy.worker.DriverRunner$$anon$
	25/01/24 13:53:43 INFO ShutdownHookManager: Shutdown hook called
	25/01/24 13:53:43 INFO ShutdownHookManager: Deleting directory /tmp/spark-2af4f41d-c583-4698-9d8e-546a656bcf17
[INFO] 2025-01-24 13:53:43.695 +0800 - process has exited. execute path:/tmp/dolphinscheduler/exec/process/default/131329535157952/131329769571008_2/6/6, processId:172698 ,exitStatusCode:255 ,processWaitForStatus:true ,processExitValue:255
[INFO] 2025-01-24 13:53:43.697 +0800 - Start finding appId in /opt/apache-dolphinscheduler-3.2.2-bin/worker-server/logs/20250124/131329769571008/2/6/6.log, fetch way: log 
[INFO] 2025-01-24 13:53:43.698 +0800 - 
[INFO] 2025-01-24 13:53:43.699 +0800 - *********************************  Finalize task instance  ************************************
[INFO] 2025-01-24 13:53:43.699 +0800 - ***********************************************************************************************

The error message shows that, although the jar package on MinIO was selected when configuring the workflow, DolphinScheduler still passed a path under the worker's local temporary directory to spark-submit at runtime. In cluster deploy mode the Spark driver is launched on a Spark worker node, where that path does not exist, so the driver could not fetch the jar and failed with NoSuchFileException.
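The Spark documentation states that the application jar URL must be globally visible inside the cluster (for example an hdfs:// path, or a file:// path present on all nodes). A minimal sketch of a pre-submit check based on that rule is below; classify_jar_uri is a hypothetical helper written for illustration and is not part of DolphinScheduler or Spark.

```shell
#!/bin/bash
# Hypothetical helper (illustration only): classify an application-jar argument
# by whether a driver started on another node could plausibly fetch it.
classify_jar_uri() {
  local uri="$1"
  case "$uri" in
    hdfs://*|s3a://*|http://*|https://*)
      echo "remote" ;;   # fetched over the network by the node that runs the driver
    file:*|/*)
      echo "local" ;;    # only valid if the same path exists on every worker node
    *)
      echo "unknown" ;;
  esac
}

# The jar path taken from the failing command in the log above.
jar="/tmp/dolphinscheduler/exec/process/default/131329535157952/131329769571008_2/6/6/spark-examples_2.12-3.5.4.jar"
if [ "$(classify_jar_uri "$jar")" = "local" ]; then
  echo "WARN: $jar is a node-local path; in cluster deploy mode the driver may start on another node" >&2
fi
```

With the path from the log, the check prints the warning, which matches the failure seen here.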

What you expected to happen

Tasks can be submitted and run normally.

How to reproduce

You can reproduce it by following the steps above.

Anything else

The problem occurs whenever the DolphinScheduler worker and the Spark driver do not run on the same node.
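For comparison, here is a dry-run sketch of what a submit command could look like if the jar were referenced by a cluster-visible URI instead of the worker-local path. Everything marked as a placeholder is an assumption: the master host, the bucket name and the s3a:// URI are hypothetical, and this is not what DolphinScheduler generates today. It also assumes the Spark worker nodes already have the S3A connector and the MinIO endpoint and credentials configured, which this sketch does not cover.

```shell
#!/bin/bash
# Dry run: prints a spark-submit command line, does not execute it.
# Placeholder values (assumptions, not taken from the report):
SPARK_HOME="${SPARK_HOME:-/opt/spark-3.5.4-bin-hadoop3}"
MASTER_URL="${MASTER_URL:-spark://MASTER_HOST:7077}"   # hypothetical master address
JAR_URI="${JAR_URI:-s3a://example-bucket/spark-examples_2.12-3.5.4.jar}"   # hypothetical bucket

build_submit_cmd() {
  # Same class and resource settings as the failing command; only the jar
  # argument is a URI that any node in the cluster could resolve.
  printf '%s ' \
    "${SPARK_HOME}/bin/spark-submit" \
    --master "${MASTER_URL}" \
    --deploy-mode cluster \
    --class org.apache.spark.examples.JavaSparkPi \
    --conf spark.driver.cores=1 \
    --conf spark.driver.memory=512M \
    --conf spark.executor.instances=2 \
    --conf spark.executor.cores=2 \
    --conf spark.executor.memory=2G \
    "${JAR_URI}"
  printf '\n'
}

build_submit_cmd
```

Another option that fits the same rule is to keep the local path but make sure the same file exists at that path on every Spark worker node.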



Are you willing to submit PR?

  • Yes I am willing to submit a PR!
