This is a quick description of how to configure Spark when using `ml`. Since the source{d} `engine` uses Spark to load git repositories, this document also covers its usage in the first section. For more information about how the `engine` works, you can check out its repository.
- `--repository-format`: The `engine` expects you to point to folders containing Siva files by default; with this flag you can also choose to use regular git repositories with `standard`, or bare git repositories with `bare`;
- `--bblfsh`: If you are not running the Babelfish daemon locally, for example if you are using a container, you can specify the server's address with this flag;
- `--explain`: PySpark optimizes the data pipeline under the hood; if you wish to see in what way, you can print the execution plan with this flag;
- `--engine`: If you wish to provide Spark with a different `engine` version than the installed one, you can specify it with this flag.
- `-s`/`--spark`: Specify Spark's master address; defaults to `local[*]` (your local Spark with all available cores);
- `--package`: Shortcut to specify the name of any additional Spark package you may want to add to the workers;
- `--pause`: If you wish to not terminate the Spark session at the end (it waits until you press `Enter`); useful to inspect the Spark UI;
- `--dep-zip`: If given, dependencies will be zipped; should only be used in cluster mode (this flag might change soon);
- `-m`/`--memory`: Shortcut to specify values for the executor memory, the driver memory and the maximum driver result size, in this order. If you do not know what these values correspond to, the links in the next section will help;
- `--spark-local-dir`: Shortcut to specify the directory Spark uses to store data; defaults to `/tmp/spark`;
- `--spark-log-level`: Shortcut to specify the Spark log level, one of "ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE" or "WARN"; defaults to "WARN";
- `--persist`: Shortcut to enable caching and specify the Spark storage level. Caching is done by a `Cacher` transformer in the code, which simply calls the Spark RDD `persist` method with the storage level specified by this flag; if the flag is omitted, it does nothing.
Here is a quick recap of the different Spark storage levels:
| Level               | Space used | CPU time | In memory | On disk | Serialized |
|---------------------|------------|----------|-----------|---------|------------|
| MEMORY_ONLY         | High       | Low      | Y         | N       | N          |
| MEMORY_ONLY_SER     | Low        | High     | Y         | N       | Y          |
| MEMORY_AND_DISK     | High       | Medium   | Some      | Some    | Some       |
| MEMORY_AND_DISK_SER | Low        | High     | Some      | Some    | Y          |
| DISK_ONLY           | Low        | High     | N         | Y       | Y          |
As you can see, if you have a ton of RAM and relatively small data, "MEMORY_ONLY" is a good choice; if you have a ton of data, enough disk space and a slow CPU, "DISK_ONLY" might be preferable. Tests are advised to find what works best; however, when running on large data this flag should not be omitted, in order to avoid recomputing some of the transformations, which can be a lengthy process.
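To make the `--persist` behavior concrete, here is a minimal PySpark sketch of what a `Cacher`-style step boils down to; it is only an illustration of calling `persist` with a storage level resolved from its name, not the actual `ml` transformer:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(1000))

# Resolve the storage level from its name, the way a CLI flag would supply it.
level_name = "MEMORY_AND_DISK"
rdd.persist(getattr(StorageLevel, level_name))

# Later actions reuse the cached data instead of recomputing the whole pipeline.
print(rdd.count())
print(rdd.sum())
```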
- `--config`: This flag allows you to directly specify any of the parameters of the Spark configuration, in `key=value` format (e.g. `--config spark.executor.memory=4G spark.driver.memory=10G spark.driver.maxResultSize=4G`).
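As a rough illustration of what these shortcuts amount to (this is not the actual `ml` launcher code), the same settings can be applied to a plain PySpark session as below; the three memory keys come from the example above, while the mapping of `--spark-local-dir` to `spark.local.dir` is an assumption:

```python
from pyspark.sql import SparkSession

session = (
    SparkSession.builder
    .master("local[*]")                          # -s/--spark
    .config("spark.executor.memory", "4G")       # -m/--memory: executor memory
    .config("spark.driver.memory", "10G")        # -m/--memory: driver memory
    .config("spark.driver.maxResultSize", "4G")  # -m/--memory: max driver result size
    .config("spark.local.dir", "/tmp/spark")     # --spark-local-dir (assumed key)
    .getOrCreate()
)

session.sparkContext.setLogLevel("WARN")         # --spark-log-level
```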
Most of the tuning that can be done on `ml` is Spark related, so here are some useful resources to learn how to do that:
- The top answer to this StackOverflow question explains how to tune all executor-related parameters. It is not up to date regarding how to specify the number of executors, which Spark now calculates as `floor(spark.cores.max / spark.executor.cores)`, so you can just specify both parameters (a small worked example follows this list);
- Tuning driver parameters is a bit more tricky, given that it really depends on the job you are running. You can check out answers to this StackOverflow question for more details;
- The `Repartionner` transformer is upcoming; however, if you already wish to tune your data's partitioning, this StackOverflow question explains the difference between the two main parameters that you can use, and if you do not know why partitioning matters, then this may help;
- This talk, and the ones you can find from it, are pretty useful if you plan on contributing to `ml`.
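As a quick sanity check of the executor formula mentioned above, here is a tiny worked example; the resource numbers are arbitrary placeholders:

```python
import math

spark_cores_max = 16       # spark.cores.max
spark_executor_cores = 4   # spark.executor.cores

# Spark derives the number of executors from these two parameters.
num_executors = math.floor(spark_cores_max / spark_executor_cores)
print(num_executors)  # -> 4
```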