'Sagan Observatory' is concerned only with stellar data streamed from the SDSS survey, which also includes quasars and galaxies that are not needed for their use case. They need this data in a versatile SQL/table format, since flat files are poorly suited to analysis and ML training, and they need to query the data and perform analyses, CNN training, etc. through Jupyter notebooks, their primary tool. Scalability is also required, as they plan to ingest stellar data from other surveys, which would mean multiple sources writing into the same table or database.
A robust, easy-to-install Docker Swarm setup built from Python-Jupyter-Kafka-Spark-PostgreSQL images whose configurations are not ephemeral. GNU Make adds automation, and the Dockerfiles act as IaC, letting the scientists stand up the architecture with minimal steps and know-how. The setup is highly scalable with Docker Swarm, where multiple nodes can be added, and highly automated through the Makefile and Dockerfiles as IaC. Spark and Kafka handle data streaming and batch processing. As SQL/table-style data storage is the most popular in astroinformatics, querying is handled by PostgreSQL, with pgAdmin installed for ease of administration. All of this is used through Jupyter notebooks, which the scientists are already familiar with. A pre-made pgAdmin server with connection details and configuration already works with the current notebooks, but everything remains fully customisable with Jupyter Python code.
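For example, once the stellar data is in PostgreSQL, a scientist could pull it straight into a notebook for analysis or ML feature preparation. A minimal sketch, assuming pandas, SQLAlchemy and psycopg2 are available in the Jupyter image; the table name "stars" is hypothetical, and the host/credentials are the ones configured in this stack:

```python
import pandas as pd
from sqlalchemy import create_engine

# Assumes the PostgreSQL service is reachable as "postgres" on the compose/swarm
# network and uses the credentials listed later in this README.
engine = create_engine("postgresql+psycopg2://student:password@postgres:5432/spark_data")

# Pull only stellar rows and the columns needed as analysis/ML features
# ("stars" is a hypothetical table name; use whatever table you created).
features = pd.read_sql(
    "SELECT ra, dec, u, g, r, i, z, redshift FROM stars WHERE class = 'STAR';",
    engine,
)
print(features.describe())
```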
- Clustering: Docker Swarm with containers
- IaC and deployment automation: Dockerfile for building the setup images, docker-compose and GNU Make
- Big data streaming and ingress: Spark, Kafka (with zookeeper)
- Big data storage: PostgreSQL (with pgAdmin)
- Big data retrieval and queries: Jupyter Notebook
Host should have the following installed:
- Docker Swarm
- Docker Desktop
- Docker compose
- unix shell / Git Bash

There are not many specific requirements or prerequisites for this setup, as it installs everything (including jar and tgz files) by itself each time it is launched, which fulfils the IaC requirement. This build was developed and tested on Docker Desktop for Windows but is based on a Linux environment; other than the make commands (which are customisable), everything is platform-independent and can be deployed on any platform supported by Docker.
Project
├───conf                        # Configuration files
│   └───spark-defaults.conf     # Spark configuration file
├───data                        # Data-related files and folders (mounted only to Spark, not Jupyter or Kafka)
│   └───results                 # Output directory for Spark app results (spark_app script results show up here)
├───notebooks                   # Jupyter notebooks folder (make notebooks here for persistence)
│   ├───.ipynb_checkpoints      # Jupyter Notebook checkpoint files
│   └───data                    # Shared/mounted folder
├───pgadmin                     # pgAdmin configuration
│   └───servers.json            # pgAdmin server configuration file for persistence
├───spark_apps                  # Folder containing Spark application scripts
├───.env.spark                  # Environment variables for Spark
├───docker-compose-swarm.yml    # Docker Compose configuration file for the Docker Swarm setup and environment
├───docker-compose.yml          # Docker Compose configuration file for containers and environment
├───Dockerfile                  # Dockerfile for setting up the container
├───entrypoint.sh               # Entrypoint script for the Docker container, mounted in the Dockerfile
├───Makefile                    # Makefile for build automation
├───README.md                   # Project report and documentation
└───requirements.txt            # Python dependencies mounted in the Dockerfile for installation
- Simply clone the repository to your host. Ensure the prerequisites exist and the Docker Desktop application is running.
- Inside the directory, open a command shell (on Windows) and enter the following command:
note: Most Makefile commands are Windows-shell based.
make setup
- This will take some time on the first run. note: If/when you make changes to the Dockerfile, run the 'make build' command again.
Setup is complete!
Check if the services are running:
make check

This setup can also be run without Swarm by simply running make run after downloading.
- Ensure Docker is running.
- Check the Docker Swarm status:
docker info | findstr "Swarm"
- Swarm should show as:
Swarm: active
- Verify that the local registry is up:
curl http://localhost:5000/v2/
Open Jupyter Notebook
- Access your Jupyter Notebook on localhost:8000
- Go to the Jupyter container in Docker and, inside the logs console, copy the token and paste it into the web page. Set the password to whatever you want!
- Click on the 'csv_to_postgre' notebook.
Start the Spark Job
- Run the Spark cell in the notebook to start the job.
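For reference, the "Spark cell" of a notebook like csv_to_postgre would typically look something like the sketch below; the master URL, app name and connector package versions here are assumptions (the actual values may already be handled by spark-defaults.conf), not the exact contents of the provided notebook.

```python
from pyspark.sql import SparkSession

# Build the session against the standalone master and pull in the Kafka and
# PostgreSQL connectors. Master URL and package versions are assumptions.
spark = (
    SparkSession.builder
    .appName("csv_to_postgre")
    .master("spark://spark-master:7077")
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0,"
        "org.postgresql:postgresql:42.7.3",
    )
    .getOrCreate()
)
```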
Open a Terminal in the Kafka Container or Host Shell
- Execute the following command to access the Kafka container:
docker exec -it <kafka-container-id> bash

Create a Kafka Topic
- Check the already created topics:
kafka-topics --list --bootstrap-server localhost:9092
- Create a new topic (note: make sure this topic name matches the one you add in your notebook under the kafka_df options):
kafka-topics --create --topic <topic-name> --bootstrap-server localhost:9092
- Verify the topic creation using the list command. The console will also confirm whether the topic was created successfully.
Update Kafka Topic in Notebook
- Add the Kafka topic name to the options in the kafka_df dictionary of the notebook.
- Run the cell.
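A sketch of what the kafka_df definition with these options might look like, assuming the SparkSession from the earlier sketch; the broker address "kafka:9092" and the topic name "sdss_stars" are placeholders that must match your docker-compose service name and the topic you created:

```python
# Structured Streaming source reading raw messages from the Kafka topic.
kafka_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # assumed broker address
    .option("subscribe", "sdss_stars")                  # must match your topic
    .option("startingOffsets", "earliest")
    .load()
)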
Start the Kafka Producer
- Open the Kafka producer with the following command:
kafka-console-producer --topic <topic-name> --bootstrap-server localhost:9092
- This opens the producer, where you can submit data. For this script, enter a row of the SDSS data file (ensure it is a 'STAR' object the first time so you can see the result). note: do not run the rest of the script yet.
PostgreSQL Initialisation
- Open the PostgreSQL container, or run the following to open psql in a shell:
docker exec -it <postgres_container_name> psql -U student -d spark_data
note: database 'spark_data' is already created in the docker-compose under postgres parameters.
- Create the table:
CREATE TABLE <table-name> (
    objid     BIGINT,
    ra        DOUBLE PRECISION,
    dec       DOUBLE PRECISION,
    u         DOUBLE PRECISION,
    g         DOUBLE PRECISION,
    r         DOUBLE PRECISION,
    i         DOUBLE PRECISION,
    z         DOUBLE PRECISION,
    run       INTEGER,
    rerun     INTEGER,
    camcol    INTEGER,
    field     INTEGER,
    specobjid BIGINT,
    class     TEXT,
    redshift  DOUBLE PRECISION,
    plate     INTEGER,
    mjd       INTEGER,
    fiberid   INTEGER
);
- Verify the table was created:
\dt
- Run the rest of the script.
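The rest of the script is essentially the write side of the stream. A hedged sketch of how it could parse each produced CSV row and append it to the PostgreSQL table via foreachBatch and JDBC; the table name "stars", the checkpoint path and the connection details are assumptions, not the exact notebook contents:

```python
from pyspark.sql import functions as F

# Columns in the same order as the CREATE TABLE statement above.
schema_ddl = (
    "objid BIGINT, ra DOUBLE, `dec` DOUBLE, u DOUBLE, g DOUBLE, r DOUBLE, "
    "i DOUBLE, z DOUBLE, run INT, rerun INT, camcol INT, field INT, "
    "specobjid BIGINT, class STRING, redshift DOUBLE, plate INT, "
    "mjd INT, fiberid INT"
)

# Each Kafka message is one comma-separated SDSS row; parse it into columns.
rows = (
    kafka_df.selectExpr("CAST(value AS STRING) AS csv_row")
    .select(F.from_csv("csv_row", schema_ddl).alias("row"))
    .select("row.*")
)

def write_to_postgres(batch_df, batch_id):
    # Append each micro-batch into the table created above via JDBC.
    (
        batch_df.write.format("jdbc")
        .option("url", "jdbc:postgresql://postgres:5432/spark_data")
        .option("dbtable", "stars")          # hypothetical table name
        .option("user", "student")
        .option("password", "password")
        .option("driver", "org.postgresql.Driver")
        .mode("append")
        .save()
    )

query = (
    rows.writeStream
    .foreachBatch(write_to_postgres)
    .option("checkpointLocation", "/opt/spark/data/checkpoints/stars")  # assumed path
    .start()
)
```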
Check if Data is Streamed into PostgreSQL
- Now, enter another row inside the producer in the Kafka container/terminal.
- Come back to the SQL container and view the table with the following command:
SELECT * FROM <table-name>;
We should be able to see the row we entered. Data streaming is complete! We can also check the database on pgAdmin by logging in with
user: admin@admin.com
password: admin
Credentials for the database are
user: student
password: password
Data streamed to Database on pgAdmin:
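The same sanity check can also be run from the notebook instead of psql or pgAdmin; a minimal sketch, assuming psycopg2 is installed in the Jupyter image and re-using the credentials above with the hypothetical table name "stars":

```python
import psycopg2

# Connection details match the credentials above; "stars" is the hypothetical
# table name used in the earlier sketches.
with psycopg2.connect(
    host="postgres", dbname="spark_data", user="student", password="password"
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM stars;")
        print("rows streamed so far:", cur.fetchone()[0])
```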
- Database name, user, password, etc. are set in 'docker-compose.yml' and './conf/spark-defaults.conf'.
- Make command descriptions are given in the Makefile.
- Most of the time, errors happen due to wrong indentation in scripts and yml files.
- Check port numbers, file names, mounted directory names, or container IDs if something does not seem to connect.
- Easy to set-up
- Failover capability with replicas for HA clustering.
- Scalable
- make run-scaled can add as many Spark workers as needed through a parameter in the 'Makefile'
- docker swarm can set up as many nodes under the registry as needed.
- Non-ephemeral: configurations persist thanks to the images.
- This setup can also be used without Kafka or notebooks by placing a Python script inside spark_apps and running that instead (see the sketch below).
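A minimal sketch of such a spark_apps script, assuming the data/ folder is mounted at /opt/spark/data inside the Spark containers; the CSV file name, table name and package version are placeholders:

```python
from pyspark.sql import SparkSession

# Standalone batch job: read an SDSS CSV from the mounted data/ folder and
# append the stellar rows to PostgreSQL. No Kafka or notebook involved.
spark = (
    SparkSession.builder
    .appName("sdss_csv_batch_load")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
    .getOrCreate()
)

stars = (
    spark.read.option("header", True).option("inferSchema", True)
    .csv("/opt/spark/data/sdss.csv")       # assumed mount point and file name
    .filter("class = 'STAR'")              # keep only stellar objects
)

(
    stars.write.format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/spark_data")
    .option("dbtable", "stars")            # hypothetical table name
    .option("user", "student")
    .option("password", "password")
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save()
)

spark.stop()
```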
This project started from a docker-spark image and was then expanded to employ Jupyter and Kafka for real-time streaming; later, configurations and integrations for PostgreSQL were added, and finally the entire setup was converted to Docker Swarm. It is a complete stack with high deployability, automation and scalability. Even though the demonstration does not show it, the infrastructure to stream data into the producer also already exists in this setup and can be configured with a simple Jupyter notebook Python script (a sketch follows below). The integration of Jupyter makes the setup extremely customisable: with the correct script and notebook, any data can be streamed through Spark-Kafka into PostgreSQL.
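A minimal sketch of such a notebook producer cell, assuming the kafka-python library is available in the Jupyter image; the CSV path and topic name are placeholders:

```python
import csv
from kafka import KafkaProducer  # kafka-python, assumed to be installed via requirements.txt

producer = KafkaProducer(bootstrap_servers="kafka:9092")  # assumed broker address

# Send each CSV row as the same comma-separated string the console producer expects.
with open("data/sdss.csv", newline="") as f:              # placeholder file path
    reader = csv.reader(f)
    next(reader)                                           # skip the header row
    for row in reader:
        producer.send("sdss_stars", ",".join(row).encode("utf-8"))  # placeholder topic

producer.flush()
```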
In regards to future work, the notebooks are currently stored locally in hardcoded paths in the Docker image; for complete cloud-based utility it might be best to integrate cloud storage methods. Throughout development, many issues (for example, with network setups and Docker files) had to be worked through. The Makefile commands are Windows-based, so another Makefile with Linux- and macOS-compatible commands could be added in future updates. Only one replica is supported for now. Scripting could enable automatic CSV data injection via Jupyter, as it is currently manual.
Sloan Digital Sky Survey (SDSS) data was used courtesy of the SDSS and the University of Chicago, who made the dataset freely and publicly available on the internet. Custom GPTs were also used for code/script debugging.
https://medium.com/@dhirajmishra57/streaming-data-using-kafka-postgresql-spark-streaming-airflow-and-docker-33c43bfa609d
https://medium.com/@MarinAgli1/setting-up-a-spark-standalone-cluster-on-docker-in-layman-terms-8cbdc9fdd14b
https://github.com/subhamkharwal/spark-streaming-with-pyspark/
https://hub.docker.com/r/apache/spark
https://docs.docker.com/engine/swarm/stack-deploy/
https://spark.apache.org/examples.html
https://stackoverflow.com
https://www.youtube.com/watch?v=-vPJ6VT9SGs
