
A production-style real-time ETL pipeline for classifying stellar data. The system ingests survey CSVs via Kafka, processes them with Spark Structured Streaming, writes normalized tables to PostgreSQL (with pgAdmin), and exposes interactive analysis via Jupyter, all reproducible with Docker Swarm and a Makefile for quick deployment.


Distributed ETL Streaming Pipeline for Astroinformatics

Video Demo

1. Problem Statement:

The 'Sagan Observatory' is concerned only with the stellar data in the stream from the SDSS survey, which also includes quasars and galaxies that are not needed for the observatory's use case. The observatory needs this data in a versatile SQL/table format, since flat files are poorly suited to analysis and ML training, and it must be able to query the data and perform analyses, CNN training, etc. through Jupyter notebooks as its primary tool. Scalability is also required, as the observatory plans to take stellar data from other surveys, which means multiple sources writing into the same table or database.

2. Solution:

A robust, easy-to-install Docker Swarm setup built from Python, Jupyter, Kafka, Spark, and PostgreSQL images, with configuration that is not ephemeral. GNU Make adds automation, and the Dockerfiles act as IaC that lets the scientists stand up the architecture with minimal steps and know-how. The setup is highly scalable with Docker Swarm, where multiple nodes can be added, and highly automated through the Makefile and Dockerfiles as IaC. Spark and Kafka handle data streaming and batch processing. Since SQL/table-style storage is the most popular in astroinformatics, querying is handled by PostgreSQL, with pgAdmin installed for ease of administration. All of this is used through Jupyter notebooks, which the scientists are already familiar with. A premade pgAdmin server with connection details and configuration is included and works with the current notebooks, but everything is fully customisable with Jupyter Python code.
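
As a concrete illustration, the core notebook logic can be sketched as a Structured Streaming read from Kafka, a parse of each raw CSV row, a filter down to stellar objects, and a per-micro-batch JDBC write into PostgreSQL. This is only a minimal sketch: the service hostnames (kafka, postgres), topic name (sdss_stream) and table name (stars) are illustrative assumptions, and the actual 'csv_to_postgre' notebook may differ.

    # Minimal sketch of the Kafka -> Spark -> PostgreSQL flow (illustrative names).
    # Requires the Kafka connector and PostgreSQL JDBC jars that the setup installs.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, col

    spark = SparkSession.builder.appName("csv_to_postgre").getOrCreate()

    # Read raw CSV rows published to Kafka (broker hostname and topic are assumptions)
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "kafka:9092")
           .option("subscribe", "sdss_stream")
           .load())

    # Each Kafka message value is one CSV row of the SDSS schema
    line = raw.selectExpr("CAST(value AS STRING) AS line")
    parts = split(col("line"), ",")
    stars = (line
             .select(parts.getItem(0).cast("bigint").alias("objid"),
                     parts.getItem(1).cast("double").alias("ra"),
                     parts.getItem(2).cast("double").alias("dec"),
                     parts.getItem(13).alias("class"),
                     parts.getItem(14).cast("double").alias("redshift"))
             # remaining SDSS columns omitted for brevity; add them to match the full table
             .filter(col("class") == "STAR"))    # drop quasars and galaxies

    def write_batch(df, epoch_id):
        # JDBC sink per micro-batch; credentials match docker-compose.yml
        (df.write.format("jdbc")
           .option("url", "jdbc:postgresql://postgres:5432/spark_data")
           .option("dbtable", "stars")
           .option("user", "student")
           .option("password", "password")
           .mode("append")
           .save())

    query = stars.writeStream.foreachBatch(write_batch).start()
    query.awaitTermination()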

2.1. Technologies:

  1. Clustering: Docker Swarm with containers
  2. IaC and deployment automation: Dockerfile and docker-compose for building the setup images, plus a GNU Makefile
  3. Big data streaming and ingress: Spark, Kafka (with ZooKeeper)
  4. Big data storage: PostgreSQL (with pgAdmin)
  5. Big data retrieval and queries: Jupyter Notebook

2.2. Diagram:

Architecture Diagram

2.3. Requirements:

Host should have the following installed:

  • Docker Swarm
  • Docker Desktop
  • Docker Compose
  • unix shell/git bash

There are not many specific requirements or prerequisites for this setup, as it installs everything (including jar and tgz files) by itself each time it is launched, which fulfils the IaC requirement. This build was developed and tested on Docker Desktop for Windows but is based on a Linux environment; other than the make commands (which are customisable), everything is platform-independent and can be deployed on any platform supported by Docker.

3. Directory layout:

Project                                                    
├───conf                                # Configuration files
│   └───spark-defaults.conf             # Spark configuration file
├───data                                # Data-related files and folders (mounted only to spark, not jupyter or kafka)
│   └───results                         # Output directory for Spark app results
│       └───(spark_app script results show up here) 
├───notebooks                           # Jupyter notebooks folder (make notebooks here for persistence)
│   ├───.ipynb_checkpoints              # Jupyter Notebook checkpoint files
│   └───data                            # Shared/mounted folder
│        └───(shared/mounted folder)
├───pgadmin                             # pgAdmin configuration
│   └───servers.json                    # pgAdmin server configuration file for persistence
├───spark_apps                          # Folder containing Spark application scripts
├───.env.spark                          # Environment variables for Spark
├───docker-compose-swarm.yml            # Docker Compose configuration file for docker swarm set up and environment
├───docker-compose.yml                  # Docker Compose configuration file for containers and environment
├───Dockerfile                          # Dockerfile for setting up the container
├───entrypoint.sh                       # Entrypoint script for the Docker container mounted in Dockerfile
├───Makefile                            # Makefile for build automation
├───README.md                           # Project Report and documentation
├───requirements.txt                    # Python dependencies mounted in Dockerfile for installation

4. Steps to Replicate Docker-Swarm Setup

  1. Simply clone the repository to your host. Ensure the prerequisites exist and the Docker Desktop application is running.

  2. Inside the directory, open a command shell (on Windows) and enter the following command:
    note: Most Makefile commands are Windows-shell based.

make setup
  • This will take some time on the first run. Note: if/when changes are made to the Dockerfile, run the 'make build' command again.

Setup is complete!

Check that the services are running:

make check

Note: this setup can also be run without Swarm by simply running make run after downloading.

Troubleshooting:

  • Ensure Docker is running
  • Check the Docker Swarm status:
  • docker info | findstr "Swarm"
  • The output should show Swarm: active

Verify that the local registry is reachable:

curl http://localhost:5000/v2/

5. Steps to Perform Jobs

Submit a Spark job as a Kafka consumer, stream to the Jupyter console, and write to the database

  1. Open Jupyter Notebook

    • Access your Jupyter Notebook on localhost:8000
      • Go to the Jupyter container in Docker and, in the logs console, copy the token and paste it into the web page. Set the password to whatever you want!
    • Click on the 'csv_to_postgre' notebook.
  2. Start the Spark Job

    • Run the Spark streaming cell in the notebook to start the job (the sketch in section 2 shows the general shape of this cell).
  3. Open a Terminal in Kafka Container or Host Shell

    • Execute the following command to access the Kafka container:
      docker exec -it <kafka-container-id> bash

    3.1. Create Kafka Topic

    • Check already created topics:

      kafka-topics --list --bootstrap-server localhost:9092
    • Create a new topic: note: make sure this topic name matches the one you add in your notebook under kafka_df options.

      kafka-topics --create --topic <topic-name> --bootstrap-server localhost:9092
    • Verify the topic creation using the list command. The console will also confirm if the topic is successfully created.

  4. Update Kafka Topic in Notebook

    • Add the Kafka topic name to the options in the kafka_df dictionary of the notebook.
    • Run the cell.
  5. Start Kafka Producer

    • Open the Kafka producer with the following command:
      kafka-console-producer --topic <topic-name> --bootstrap-server localhost:9092
    • This opens the producer, where you can submit data. For this script, enter a row of the SDSS CSV file (ensure it is a 'STAR' object the first time so you can see the result). Note: do not run the rest of the script yet.
  6. PostgreSQL initialisation

    • Open the PostgreSQL container, or run the following to open it in a shell:
    docker exec -it <postgres_container_name> psql -U student -d spark_data

    note: database 'spark_data' is already created in the docker-compose under postgres parameters.

    • Create table
    CREATE TABLE <table-name> (
       objid BIGINT,
       ra DOUBLE PRECISION,
       dec DOUBLE PRECISION,
       u DOUBLE PRECISION,
       g DOUBLE PRECISION,
       r DOUBLE PRECISION,
       i DOUBLE PRECISION,
       z DOUBLE PRECISION,
       run INTEGER,
       rerun INTEGER,
       camcol INTEGER,
       field INTEGER,
       specobjid BIGINT,
       class TEXT,
       redshift DOUBLE PRECISION,
       plate INTEGER,
       mjd INTEGER,
       fiberid INTEGER
       );
    • Verify the table was created:
    \dt
  7. Run the rest of the script.

  8. Check that data is streamed into PostgreSQL

    • Now, enter another row inside the producer in the Kafka container/terminal.
    • Come back to the SQL container and view the table with the following command:
    SELECT * FROM <table-name>;

We should be able to see the row we entered. Data streaming is complete! We can also check the database on pgAdmin by logging in with

user: admin@admin.com
password: admin

Credentials for the database are

user: student
password: password
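
Once rows are in the table, they can also be pulled back into the notebook for queries, analysis, or ML feature preparation, which is the observatory's main workflow. A minimal sketch follows; it assumes the table name used above and reuses the PostgreSQL JDBC driver that the streaming cell needs (the pandas conversion additionally assumes pandas is installed in the notebook image).

    # Hypothetical notebook cell: read the stored rows back into Spark for analysis.
    # Table name, hostname, and credentials follow the examples above; adjust as needed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("query_stars").getOrCreate()

    stars = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://postgres:5432/spark_data")
             .option("dbtable", "stars")          # the table created in step 6
             .option("user", "student")
             .option("password", "password")
             .load())

    stars.createOrReplaceTempView("stars")
    spark.sql("SELECT class, COUNT(*) AS n FROM stars GROUP BY class").show()

    # Convert to pandas for plotting or ML/CNN feature preparation (assumes pandas is installed)
    pdf = stars.select("u", "g", "r", "i", "z", "redshift").toPandas()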

Screenshots of implementation:

Docker Swarm status: Screenshot 2024-12-12 230316

Kafka Producer: Screenshot 2024-12-12 201422

Data Inside PostgreSQL: Screenshot 2024-12-12 201415

Data streamed to Database on pgAdmin: Screenshot 2024-12-12 201904

6. Miscellaneous Notes and Troubleshooting

  1. Database name, user, password, etc. are set in 'docker-compose.yml' and './conf/spark-defaults.conf'.
  2. Descriptions of the make commands are given in the Makefile.
  3. Most errors are caused by wrong indentation in scripts and yml files.
  4. Check port numbers, file names, mounted directory names, and container IDs if components don't seem to connect to each other.

Benefits and other features of this set-up:

  1. Easy to set up
  2. Failover capability with replicas for HA clustering.
  3. Scalable
    • make run-scaled can add as many Spark workers as needed through a parameter in the 'Makefile'
    • Docker Swarm can set up as many nodes under the registry as needed.
  4. Non-ephemeral thanks to images.
  5. This setup can also be used without Kafka or notebooks by placing a Python script inside spark_apps and submitting that instead (a minimal sketch follows this list).
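
A minimal sketch of such a batch script is shown here. It assumes the host ./data folder is mounted at /opt/spark/data inside the Spark containers; the CSV file name, output table name, and connection details are illustrative and should be adjusted to match docker-compose.yml and your mounts.

    # Hypothetical spark_apps/ batch job: read an SDSS CSV, keep only stars,
    # and write them to PostgreSQL (and/or the mounted results folder).
    # Paths, file names and connection details are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("sdss_batch_filter").getOrCreate()

    df = (spark.read
          .option("header", True)        # assumes the CSV has a header row
          .option("inferSchema", True)
          .csv("/opt/spark/data/sdss.csv"))

    stars = df.filter(col("class") == "STAR")    # drop quasars and galaxies

    (stars.write.format("jdbc")
          .option("url", "jdbc:postgresql://postgres:5432/spark_data")
          .option("dbtable", "stars_batch")
          .option("user", "student")
          .option("password", "password")
          .mode("append")
          .save())

    # Optionally also drop a CSV copy into the mounted results directory
    stars.write.mode("overwrite").csv("/opt/spark/data/results/stars_csv")

    spark.stop()

Such a script would be submitted with spark-submit from a Spark container; the exact invocation depends on your Makefile targets.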

Conclusions and Future Improvements:

This project started from a docker-spark image and was then expanded to employ Jupyter and Kafka for real-time streaming; later, configurations and integrations for PostgreSQL were added, and the entire setup was converted to Docker Swarm. It is a complete stack with high deployability, automation, and scalability. Even though the demonstration does not show it, the infrastructure to stream data into the producer already exists in this setup and can be configured with a simple Jupyter notebook Python script (a minimal sketch follows below). The integration of Jupyter makes the setup extremely customisable: with the correct script and notebook, any data can be streamed through Spark and Kafka into PostgreSQL.
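
For example, the manual console-producer step could be replaced by a small notebook cell that publishes CSV rows to Kafka. This is only a sketch under assumptions: it uses the kafka-python package (which may need to be added to requirements.txt), and the broker address, topic name, and file path are illustrative.

    # Hypothetical notebook cell that automates CSV injection into Kafka,
    # replacing the manual console producer. All names here are assumptions.
    import csv
    from kafka import KafkaProducer   # kafka-python, assumed to be installed

    producer = KafkaProducer(bootstrap_servers="kafka:9092")

    with open("data/sdss.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)                              # skip the header row
        for row in reader:
            # Send the raw CSV line, matching what the console producer expects
            producer.send("sdss_stream", ",".join(row).encode("utf-8"))

    producer.flush()
    producer.close()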

Regarding future work: the notebooks are currently stored locally on hardcoded drives in the Docker image, but for complete cloud-based utility it might be best to integrate cloud storage methods. Throughout development, many errors came from things like network setup and Docker files; in addition, the Makefile commands are Windows-based, so another Makefile with Linux- and macOS-based commands can be added in future updates. Only one replica is compatible for now. Scripting can enable automatic CSV data injection via Jupyter (as sketched above), as it is currently manual.

Acknowledgements:

Sloan Digital Sky Survey (SDSS) data was used courtesy of the SDSS and the University of Chicago, who made the dataset freely and publicly available on the internet. Custom GPTs were also used for code/script debugging.

References:

https://medium.com/@dhirajmishra57/streaming-data-using-kafka-postgresql-spark-streaming-airflow-and-docker-33c43bfa609d
https://medium.com/@MarinAgli1/setting-up-a-spark-standalone-cluster-on-docker-in-layman-terms-8cbdc9fdd14b
https://github.com/subhamkharwal/spark-streaming-with-pyspark/
https://hub.docker.com/r/apache/spark
https://docs.docker.com/engine/swarm/stack-deploy/
https://spark.apache.org/examples.html
https://stackoverflow.com
https://www.youtube.com/watch?v=-vPJ6VT9SGs
