Spark Submit Configuration
- --master: Parameter used to set the URL of the master (head) node. Allowed syntax is:
  - local: Used for executing your code on your local machine. If you pass local, Spark will run in a single thread (without leveraging any parallelism). On a multi-core machine you can either specify the exact number of cores for Spark to use with local[n], where n is the number of cores, or have Spark spin up as many threads as there are cores on the machine with local[*].
  - spark://host:port: The URL and port of a Spark standalone cluster (one that does not run a job scheduler such as Mesos or YARN).
  - mesos://host:port: The URL and port of a Spark cluster deployed on Mesos.
  - yarn: Used to submit jobs from a head node that runs YARN as the resource manager.
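For illustration, these are typical invocations for each master URL form; the host names are placeholders, while 7077 and 5050 are the default ports of the standalone and Mesos masters, respectively:
- spark-submit --master local[4] entry.py
- spark-submit --master spark://head-node:7077 entry.py
- spark-submit --master mesos://head-node:5050 entry.py
- spark-submit --master yarn entry.py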
- --deploy-mode: Parameter that allows you to decide whether to launch the Spark driver process locally (using client) or on one of the worker machines inside the cluster (using cluster). The default for this parameter is client.
- --name: Name of your application. Note that if you specified the name of your app programmatically when creating the SparkSession, the parameter from the command line will be overridden. We will explain the precedence of parameters shortly, when discussing the --conf parameter.
- --py-files: Comma-delimited list of .py, .egg, or .zip files to include for Python apps. These files will be delivered to each executor for use. Later in this chapter we will show you how to package your code into a module.
- --files: Comma-delimited list of files that will also be delivered to each executor to use.
- --conf: Parameter to change a configuration of your app dynamically from the command line. The syntax is property=value. For example, you can pass --conf spark.local.dir=/home/SparkTemp/ or --conf spark.app.name=learningPySpark; the latter would be equivalent to submitting the --name property as explained previously.
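Note that each property needs its own --conf flag, so a submission setting both of the properties above might look like this:
- spark-submit --master local --conf spark.local.dir=/home/SparkTemp/ --conf spark.app.name=learningPySpark entry.py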
- --properties-file: File containing the configuration. It should have the same set of properties as the conf/spark-defaults.conf file, as it will be read instead of it.
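A minimal properties file might look like the following; entries are whitespace-separated key-value pairs, just as in conf/spark-defaults.conf, and the values here are only illustrative:
spark.master local[2]
spark.app.name learningPySpark
spark.executor.memory 2g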
- --driver-memory: Parameter that specifies how much memory to allocate for the application on the driver. Allowed values use a syntax such as 1000M or 2G. The default is 1024M.
- --executor-memory: Parameter that specifies how much memory to allocate for the application on each of the executors. The default is 1G.
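For example, a submission that raises both memory limits might look like this (the values are illustrative):
- spark-submit --master yarn --driver-memory 2G --executor-memory 4G entry.py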
- --help: Shows the help message and exits.
- --verbose: Prints additional debug information when running your app.
- --version: Prints the version of Spark.
In Spark standalone with cluster deploy mode only, or on a cluster deployed over YARN, you can use --driver-cores, which allows you to specify the number of cores for the driver (the default is 1). In Spark standalone or Mesos with cluster deploy mode only, you also have the opportunity to use either of these:
- --supervise: Parameter that, if specified, will restart the driver if it is lost or fails. A similar effect can be achieved on YARN by setting --deploy-mode to cluster.
- --kill: Kills the process given its submission_id.
- --status: If this command is specified, it will request the status of the specified app.
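For example, on a standalone cluster you could stop or inspect a previously submitted driver as follows; the submission ID here is a placeholder for the one printed when the job was submitted:
- spark-submit --master spark://head-node:7077 --kill driver-20230101000000-0001
- spark-submit --master spark://head-node:7077 --status driver-20230101000000-0001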
In addition, when submitting to a YARN cluster you can specify:
- --queue: This parameter specifies a queue on YARN to submit the job to (the default is default).
- --num-executors: Parameter that specifies how many executor machines to request for the job. If dynamic allocation is enabled, the initial number of executors will be at least the number specified.

Now that we have discussed all the parameters, it is time to put them into practice. The first example below submits to YARN in cluster deploy mode, packaging additional code as an .egg file, while the second runs the same app locally:
- spark-submit --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=false --num-executors 2 --executor-cores 2 --executor-memory 2g --py-files Titanic_Survival_Prediction-0.1.dev0-py2.7.egg entry.py
- spark-submit --master local --py-files TSP/additionalCode/dist/Titanic_Survival_Prediction-0.1.dev0-py2.7.egg entry.py