This project demonstrates a real-time streaming pipeline for processing unstructured data using Apache Spark, AWS S3, AWS Glue, and AWS Athena. It continuously consumes TXT, CSV, and JSON files, converts them into a structured format, and stores them in Parquet on S3 for analytics.
## Table of Contents

- Project Overview
- Architecture
- AWS Services
- Project Setup
- Spark Job Overview
- UDFs and Regex Extraction
- Running the Pipeline
- Conclusion
## Project Overview

Unstructured data cannot be organized into the rows and columns of a relational database. Examples include:
- Text files, JSON, CSV
- Spreadsheets
- Logs and job postings
Challenges of unstructured data:
- Requires more storage
- Harder to manage and protect with legacy solutions
This project builds a scalable, fault-tolerant pipeline to process unstructured data in real-time and store it in a structured format.
## Architecture

The project uses a Spark master-worker cluster running on Docker:
- `spark-master`: master node that manages the cluster
- `spark-worker-1/2/3`: worker nodes that execute Spark jobs
The following Docker Compose configuration spins up the cluster:

```yaml
version: "3.8"

services:
  spark-master:
    image: bitnami/spark:3.5.1
    container_name: spark-master
    environment:
      SPARK_MODE: master
    ports:
      - "9090:8080"
      - "7077:7077"
      - "4040:4040"
    volumes:
      - ./jobs:/opt/bitnami/spark/jobs
    networks:
      - spark-network
    restart: always

  spark-worker-1:
    image: bitnami/spark:3.5.1
    environment:
      SPARK_MODE: worker
      SPARK_MASTER_URL: spark://spark-master:7077
    networks:
      - spark-network
    restart: always

  # ...additional worker nodes similar to spark-worker-1

networks:
  spark-network:
    driver: bridge
```

Run the cluster:

```bash
docker compose up -d
```

## AWS Services

### AWS S3

- Stores raw unstructured files (JSON, TXT)
- Acts as a data lake for processed data
- Output stored as Parquet files for analytics
### AWS Glue

- Fully managed ETL service
- Uses a crawler to scan S3 and infer schema
- Stores metadata in the Glue Data Catalog
- Creates tables (Bronze layer) for structured data
### AWS Athena

- Serverless SQL querying over data in S3
- Queries processed, structured data
- Integrated with Glue Data Catalog for table discovery
## Project Setup

- Create the project directory and a virtual environment:

```bash
mkdir aws_spark_unstructured
cd aws_spark_unstructured
python -m venv venv
```

- Activate the virtual environment:

```bash
# Linux/macOS
source venv/bin/activate
# Windows
venv\Scripts\activate
```

- Install dependencies:

```bash
pip install pyspark boto3
```

- Set AWS credentials in `config.py`:

```python
configuration = {
    'AWS_ACCESS_KEY': '--AWS ACCESS KEY--',
    'AWS_SECRET_KEY': '--AWS SECRET KEY--'
}
```

## Spark Job Overview

File: `main.py`
- Reads TXT and JSON files as streams
- Applies UDFs and regex to extract fields such as:
  - File name, Position, Class Code
  - Salary, Start/End dates
  - Requirements, Duties, Notes
  - Selection, Experience length, Education length
  - Application location
- Unions the text and JSON streams into a single DataFrame
- Writes output to S3 in Parquet format with checkpointing
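Before the Parquet sink can write to the bucket, the credentials from `config.py` have to reach Spark's S3A connector. A minimal sketch of that wiring, using the standard `hadoop-aws` option names; the `s3a_conf` helper itself is hypothetical, not confirmed by the project:

```python
# Map the config.py dictionary onto Spark's S3A settings. The option keys are
# standard hadoop-aws configuration names; wiring them through a helper like
# this is an assumption about main.py.
def s3a_conf(cfg):
    return {
        "spark.hadoop.fs.s3a.access.key": cfg["AWS_ACCESS_KEY"],
        "spark.hadoop.fs.s3a.secret.key": cfg["AWS_SECRET_KEY"],
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.auth.SimpleAWSCredentialsProvider",
    }

# Usage inside main.py (sketch):
# builder = SparkSession.builder.appName("UnstructuredStream")
# for key, value in s3a_conf(configuration).items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```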
## UDFs and Regex Extraction

UDF (User-Defined Function):

- A custom Spark function for complex parsing
- Handles unstructured fields that Spark cannot parse natively

Regex is used to extract structured information, for example:

- Salary: `₹50,000 - ₹70,000`
- Date: `Posted: 2023-08-01`
- Company name, location, job requirements, etc.
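The project's actual patterns are not shown in this README; the extractors below are illustrative stand-ins that handle the two examples above. In `main.py`, each function would be registered with `pyspark.sql.functions.udf` before being applied to a column.

```python
import re
from datetime import date

# Illustrative patterns only -- the project's real regexes are not published.
# \D* skips any currency symbol (₹, $, ...) before the second figure.
SALARY_RE = re.compile(r"(\d[\d,]*)\s*-\s*\D*(\d[\d,]*)")
DATE_RE = re.compile(r"Posted:\s*(\d{4})-(\d{2})-(\d{2})")

def extract_salary(text):
    """Return (low, high) salary as integers, or None if no range is found."""
    m = SALARY_RE.search(text or "")
    if not m:
        return None
    low, high = (int(g.replace(",", "")) for g in m.groups())
    return low, high

def extract_posted_date(text):
    """Return the posting date, or None if the 'Posted:' marker is absent."""
    m = DATE_RE.search(text or "")
    return date(*map(int, m.groups())) if m else None
```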
Pipeline Flow:
- Text and JSON read as streams
- Apply UDFs + Regex to extract structured columns
- Union all sources into a single DataFrame
- Write continuously to S3 Parquet files
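The four steps above can be sketched as a single job. The folder names, bucket URI, and the single `parse_salary` field are assumptions for illustration; the real `main.py` extracts many more columns and defines a fuller JSON schema:

```python
import re

SALARY_RE = re.compile(r"(\d[\d,]*)\s*-\s*\D*(\d[\d,]*)")

def parse_salary(text):
    """Pure helper later wrapped as a Spark UDF."""
    m = SALARY_RE.search(text or "")
    return [int(g.replace(",", "")) for g in m.groups()] if m else None

def main():
    # PySpark imports live inside main() so the helper above can be reused
    # (and unit-tested) without a Spark installation.
    from pyspark.sql import SparkSession, functions as F, types as T

    spark = SparkSession.builder.appName("UnstructuredStream").getOrCreate()
    salary_udf = F.udf(parse_salary, T.ArrayType(T.IntegerType()))

    # 1. Read text and JSON folders as streams (streaming JSON needs a schema).
    txt = spark.readStream.text("input_text/").withColumnRenamed("value", "raw")
    json_schema = T.StructType([T.StructField("raw", T.StringType())])
    js = spark.readStream.schema(json_schema).json("input_json/")

    # 2. Apply the UDF to derive structured columns.
    txt = txt.withColumn("salary", salary_udf("raw"))
    js = js.withColumn("salary", salary_udf("raw"))

    # 3. Union both sources into one DataFrame.
    unioned = txt.unionByName(js)

    # 4. Continuously write Parquet to S3 with checkpointing.
    (unioned.writeStream
        .format("parquet")
        .option("path", "s3a://your-bucket/bronze/")            # assumed bucket
        .option("checkpointLocation", "s3a://your-bucket/checkpoints/")
        .outputMode("append")
        .start()
        .awaitTermination())

if __name__ == "__main__":
    main()
```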
## Running the Pipeline

Run locally:

```bash
spark-submit main.py
```

Or inside the Docker cluster:

```bash
docker exec -it spark-master spark-submit --master spark://spark-master:7077 jobs/main.py
```

- The job picks up new files from the `input_text/` and `input_json/` folders
- Processes and streams them to S3
- Monitor the logs for schema and streaming output
## Conclusion

This project integrates Apache Spark, AWS S3, Glue, and Athena to create a real-time streaming pipeline for unstructured data:
- Continuous ingestion of TXT, CSV, JSON
- Structured transformation using UDFs + Regex
- Storage in Parquet on S3 (Bronze layer)
- Automatic schema discovery with Glue
- Serverless querying via Athena
Next steps:
- Transform Bronze layer into Silver/Gold layers
- Load processed data into Redshift
- Connect to BI tools like QuickSight or Power BI