Build your own Spark cluster setup in Docker.
A multinode Spark installation where each node of the network runs in its own separate Docker container.
The installation takes care of the Hadoop & Spark configuration, providing:
- a Debian image with Scala and Java (scalabase image)
- four fully configured Spark nodes running on Hadoop (sparkbase image), as sketched below:
  - nodemaster (master node)
  - node2 (slave)
  - node3 (slave)
  - node4 (slave)
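For reference, `cluster.sh` creates a user-defined Docker network and attaches each node to it with a fixed IP. The snippet below is only a rough sketch of that idea, not the script's actual commands: the network name `spark-net` is made up, and the worker IPs are guesses based on the nodemaster address (172.18.1.1) shown in the admin URLs further down.

```sh
# Rough sketch only: cluster.sh does the real work, the names/IPs here are illustrative
docker network create --subnet=172.18.0.0/16 spark-net

# One container per node, each with a fixed IP on the shared network
docker run -d --name nodemaster --hostname nodemaster --net spark-net --ip 172.18.1.1 sparkbase
docker run -d --name node2 --hostname node2 --net spark-net --ip 172.18.1.2 sparkbase
docker run -d --name node3 --hostname node3 --net spark-net --ip 172.18.1.3 sparkbase
docker run -d --name node4 --hostname node4 --net spark-net --ip 172.18.1.4 sparkbase
```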
You can run Spark in a (boring) standalone setup or create your own network to hold a full cluster setup inside Docker instead.
I find the latter much more fun:
- you can experiment with a more realistic network setup
- tweak each node's configuration
- simulate scalability, downtime and rebalancing by adding or removing nodes from the network automagically (see the sketch after this list)
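For example, stopping and restarting a worker container is enough to watch the cluster notice the outage and recover; the container names come from the node list above.

```sh
# Simulate a worker outage: node3 drops out of the Hadoop/Spark admin UIs
docker stop node3

# Bring it back and watch it rejoin the cluster
docker start node3
```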
There is a related Medium article: https://medium.com/@rubenafo/running-a-spark-cluster-setup-in-docker-containers-573c45cceabf
- Clone this repository
- `cd scalabase`
- `./build.sh` # This builds the base Java + Scala Debian image from openjdk9
- `cd ../spark`
- `./build.sh` # This builds the sparkbase image
- `./cluster.sh deploy`
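Once the deploy step finishes, a quick way to confirm that all four node containers are up (names taken from the node list above):

```sh
# All four containers should report an "Up ..." status
docker ps --format '{{.Names}}\t{{.Status}}' | grep -E 'nodemaster|node[234]'
```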
- The script finishes by displaying the Hadoop and Spark admin URLs (a smoke-test sketch follows this list):
- Hadoop info @ nodemaster: http://172.18.1.1:8088/cluster
- Spark info @ nodemaster : http://172.18.1.1:8080/
- DFS Health @ nodemaster : http://172.18.1.1:9870/dfshealth.html
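To smoke-test the cluster you can submit the bundled SparkPi example from inside the master container. This is a sketch under two assumptions about the image layout that are not confirmed here: that `spark-submit` is on the PATH inside nodemaster, and that the example jar sits under `$SPARK_HOME/examples/jars`. Since the nodes run Spark on Hadoop, the job goes to YARN.

```sh
# Open a shell on the master node...
docker exec -it nodemaster bash

# ...and submit the SparkPi example to YARN (jar path assumed, adjust to the image)
spark-submit --master yarn --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  "$SPARK_HOME"/examples/jars/spark-examples_*.jar 100
```

The running job should then show up under the Hadoop URL above (port 8088).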
cluster.sh stop    # Stop the cluster
cluster.sh start   # Start the cluster
cluster.sh info    # Show handy URLs of the running cluster

# Warning! This will remove everything from HDFS
cluster.sh deploy  # Format the cluster and deploy the images again
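Since `deploy` reformats HDFS, it is worth checking what is stored there before running it. A minimal sketch, assuming the `hdfs` client is on the PATH inside nodemaster:

```sh
# List the HDFS root; deploy will erase all of it
docker exec nodemaster hdfs dfs -ls /
```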