With so many Hadoop and Spark builds out there on Docker, why would we create another one? There were two reasons:
- We had a lot of trouble getting many of these builds to run and work.
- Many of the builds were using older versions of Hadoop and Spark.
Therefore, we decided to comb through many of the builds looking for pieces we could stitch together that would work with the latest versions and actually deploy seamlessly on a distributed cluster. We ended up borrowing ideas and parts of builds from a handful of developers, as well as adding our own workarounds.
I tried to remember and list everyone I used as a reference in the resources; however, if you notice a developer we forgot to attribute, send us the link to their (or your) GitHub and Docker repos. Seeing the name will help us recall whether we went through that code.
I had to adapt my setup from many different configurations. Here is the short list of what I can remember:
- big-data-europe
- Getty Images
- SequenceIQ
- Jamie Pillora for DNSmasq
- Eric Hough for NFS server
- Portainer
Before you start using these images or Dockerfiles, make sure to go through the scripts, configuration files, and Dockerfiles looking for network-specific or user-specific settings to change. Most will be marked `<name>`.
- DNSmasq directory
  - dnsmasq.conf
  - dnsmasq-run.sh
- Hadoop directory
  - Dockerfile
  - entrypoint.sh
- datanode directory
  - Dockerfile
  - run.sh
- namenode directory
  - Dockerfile
  - run.sh
- Spark directory
  - Dockerfile
- Anaconda Python directory
  - Dockerfile
- Data Science directory
  - Dockerfile
  - requirements.txt
  - conda_requirement_install.sh
- Grafana directory
  - docker-compose-monitor.yml
  - Docker Swarm Dashboard-1537760402324.json
  - swarm-dashboard_rev1.json
- build.sh
- docker-compose.yml
- hadoop.env
- pyspark_example.py
Run the build script from the directory containing the cluster directories. The build script takes six arguments; in our case, we would name them:
- hadoop
- namenode
- datanode
- spark
- anaconda
- datascience
The usage would be `sh ./build.sh hadoop namenode datanode spark anaconda datascience`, where the argument names are optional; the default values are shown above. This will build, tag, and push our images to our Docker registry at `<dns_name>:5000`.
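For reference, here is a minimal sketch of what a build script like this typically does; the actual `build.sh` in the repo is the source of truth, and the one-directory-per-image layout and registry address below are assumptions based on the description above.

```bash
#!/usr/bin/env bash
# Sketch only -- assumes one build directory per image name and a registry
# reachable at <dns_name>:5000, as described above.
set -e

REGISTRY="<dns_name>:5000"

if [ "$#" -eq 0 ]; then
  IMAGES=(hadoop namenode datanode spark anaconda datascience)  # defaults
else
  IMAGES=("$@")
fi

for img in "${IMAGES[@]}"; do
  docker build -t "$REGISTRY/$img" "./$img"   # build and tag the image
  docker push "$REGISTRY/$img"                # push it to the local registry
done
```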
- Move the `.conf` and `.sh` files to the machine that will act as the DNS server.
- Set up the desired configuration in the `.conf` file.
- Run `sh ./dnsmasq-run.sh`.
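Once dnsmasq is up, a quick way to confirm it is answering is to query it directly; the hostname and server IP below are placeholders for whatever you configured in `dnsmasq.conf`.

```bash
# Query the dnsmasq server directly (hostname and server IP are placeholders)
dig @<dnsmasq_server_ip> <cluster_hostname> +short
# or, if dig is not installed:
nslookup <cluster_hostname> <dnsmasq_server_ip>
```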
- Update the `docker-compose.yml` with the desired settings, specify the image names, and set the replica number for the datanodes. The replica count must be less than or equal to the number of available datanodes.
- Update the `hadoop.env` file to configure Hadoop.
- On the machine that will be used as the manager node, launch the Docker compose file with `docker stack deploy -c <file.yml> <stack name>`, e.g. `docker stack deploy -c docker-compose.yml cluster`.
- For the first run on the system, the `join-token` will be needed for the workers. On the manager node, run `docker swarm join-token worker` to see the token. A consolidated first-run sequence is sketched after this list.
- SSH into all the worker nodes and paste the join command into the command line.
- Enter the container with `docker exec -it cluster_master.... bash`; note that you can auto-complete the container name with `tab`.
- Test that Hadoop is running correctly with `$HADOOP_HOME/bin/hdfs dfs -cat /user/test.md`, which will print `Success` to the command line.
- To drop a worker, SSH into the desired machine and enter `docker swarm leave`.
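As referenced above, here is the consolidated first-run sequence as a sketch; the swarm-init step only applies if the swarm does not exist yet, and `<manager_ip>` is a placeholder.

```bash
# On the manager node (only needed if the swarm has not been created yet)
docker swarm init --advertise-addr <manager_ip>

# Print the join command for workers, then paste it on each worker node
docker swarm join-token worker

# Back on the manager: deploy the stack and wait for all replicas to come up
docker stack deploy -c docker-compose.yml cluster
docker service ls
```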
Will add later
- Get a list of the running services with `docker service ls`.
- Run `docker service rm <name, name, ...>` to remove individual services.
- Run `docker stack rm <stack-name>` (e.g. `docker stack rm cluster`) to remove the whole stack.
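For example, tearing down everything started in this README would look like the following; the stack names are the ones used elsewhere in this document.

```bash
docker service ls        # see what is currently running
docker stack rm monitor  # remove the monitoring stack, if deployed
docker stack rm cluster  # remove the Hadoop/Spark stack
docker service ls        # confirm everything is gone
```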
Once the cluster is running, we access the manager node and enter the following commands:
- Launch Portainer:

      docker service create \
        --name portainer \
        --network hadoop-spark-swarm-network \
        --publish 9000:9000 \
        --mount src=portainer_data,dst=/data \
        --replicas=1 \
        --constraint 'node.role == manager' \
        portainer/portainer -H "tcp://tasks.portainer_agent:9001" --tlsskipverify

- Launch the Portainer agent:

      docker service create \
        --name portainer_agent \
        --network hadoop-spark-swarm-network \
        -e AGENT_CLUSTER_ADDR=tasks.portainer_agent \
        --mode global \
        --constraint 'node.platform.os == linux' \
        --mount type=bind,src=//var/run/docker.sock,dst=/var/run/docker.sock \
        --mount type=bind,src=//var/lib/docker/volumes,dst=/var/lib/docker/volumes \
        portainer/agent

- Check the Portainer UI at `<dns_name>:9000` for the join status of the workers.
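A quick way to confirm both services came up, using the standard Docker CLI:

```bash
docker service ls --filter name=portainer   # should list portainer and portainer_agent
docker service ps portainer                 # the portainer replica should be on the manager
```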
Once the cluster is running, we access the manager node and enter the following commands:
- Deploy Grafana, InfluxDB, and cAdvisor with `docker stack deploy -c docker-compose-monitor.yml monitor`.
- Create the `cadvisor` database in the InfluxDB container (`docker exec -it <influxDB container ID> bash`) with `influx -execute 'CREATE DATABASE cadvisor'`.
- In your browser, access the Grafana dashboard at `ip:80`.
- Log in with `user: admin` and `password: admin`, then change the password.
- Add a data source: set Name to `influx`, Type to `influxDB`, URL to `http://influx:8086`, and Database to `cadvisor`, then click the `save and test` button.
- Import the dashboard with the `Manage` button (in the left panel) using `Docker Swarm Dashboard-<gen_id>.json`.
<ip>:9001
https://botleg.com/stories/monitoring-docker-swarm-with-cadvisor-influxdb-and-grafana/
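As an optional sanity check for the monitoring stack above, you can confirm the `cadvisor` database exists; this assumes InfluxDB's port 8086 is published by `docker-compose-monitor.yml`, otherwise run the query from inside the InfluxDB container as shown above.

```bash
# List databases over InfluxDB's HTTP API (the published port is an assumption)
curl -G "http://<dns_name>:8086/query" --data-urlencode "q=SHOW DATABASES"
```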
In the `docker-compose-spark.yml`, we needed to add:

    volumes:
      nfsshare:
        driver: local
        driver_opts:
          type: "nfs"
          o: "addr=<ip_add>,nfsvers=4"
          device: ":/"

Next, log into the manager node, `<dns_name>`, and run:

    docker run -e NFS_EXPORT_0='/nfsshare *(rw,fsid=0,async,no_subtree_check,no_auth_nlm,insecure,no_root_squash)' --privileged -p 2049:2049 -d --name glcalcs erichough/nfs-server
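Before relying on the Docker volume, it is worth verifying the export from another node; this assumes the NFS client tools (`nfs-common`/`nfs-utils`) are installed and `<ip_add>` is the machine running the container above.

```bash
showmount -e <ip_add>                        # should list the /nfsshare export
sudo mkdir -p /mnt/nfstest
sudo mount -t nfs4 <ip_add>:/ /mnt/nfstest   # NFSv4 root export, matching device ":/" above
ls /mnt/nfstest
sudo umount /mnt/nfstest
```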
Once the cluster is running, if a new worker needs to be added, we can run `docker service scale <container_name>=<num>`, where `<container_name>` is the service you would like to spin up another worker for (Hadoop, Spark) and `<num>` is an integer greater than the replica count set in the `docker-compose.yml` but less than or equal to the total number of workers available. For example, if we have a cluster of 5 workers and would like to scale Spark to 7 workers, we would run `docker service scale spark_swarm_worker=7`.
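After scaling, you can confirm where the new replicas landed with the standard Docker CLI; `spark_swarm_worker` is the service name from the example above.

```bash
docker service ps spark_swarm_worker                           # shows each replica and its node
docker service inspect --pretty spark_swarm_worker | grep -i replicas
```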
By replacing the token with a password, we can bookmark `http://dns_name:8888` in our browser and simply navigate to this address, as opposed to copying the address and token each time Jupyter Lab or Notebook launches. This is optional; simply comment out the lines at the bottom of the `Dockerfile` in the anaconda folder if you don't want to set this up.
- Run `jupyter notebook --generate-config` to generate the config file.
- Run `jupyter notebook password` and enter your password twice.
- Copy `jupyter_notebook_config.py` to the anaconda folder containing the `Dockerfile`.
- Build the Docker image.
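Put together, the sequence looks roughly like this; the config path is Jupyter's default, and the image tag and build context are assumptions to adjust to your layout.

```bash
jupyter notebook --generate-config                     # creates ~/.jupyter/jupyter_notebook_config.py
jupyter notebook password                              # writes the password hash to ~/.jupyter/jupyter_notebook_config.json
cp ~/.jupyter/jupyter_notebook_config.py ./anaconda/   # copy next to the anaconda Dockerfile, per the steps above
docker build -t <dns_name>:5000/anaconda ./anaconda    # image tag and build context are assumptions
```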
We can run a Jupyter notebook or lab from a bash shell in the anaconda or datascience containers. From bash, run the command `jupyter-lab --allow-root --ip=0.0.0.0`. Additionally, the environment variable `JUPYTER_LAB` has been created to take care of the options that need to be passed. Then copy the URL into your local web browser and change the IP address to the host machine's IP or DNS name.
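For example, a session might look like the following; the container name is a placeholder (auto-complete it with tab), and the exact contents of `JUPYTER_LAB` come from the image.

```bash
docker exec -it <anaconda_or_datascience_container> bash
jupyter-lab --allow-root --ip=0.0.0.0
# or, assuming JUPYTER_LAB holds the option flags shown above:
jupyter-lab $JUPYTER_LAB
```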
Will add later
Much of this couldn't have been done without the help of Yok.