shrutipitale/AWS-real-time-data-streaming-and-analytics-with-unstructured-data

Real-Time Streaming with Unstructured Data on AWS using Apache Spark

This project demonstrates a real-time streaming pipeline for processing unstructured data using Apache Spark, AWS S3, AWS Glue, and AWS Athena. It continuously consumes TXT, CSV, and JSON files, converts them into a structured format, and stores them in Parquet on S3 for analytics.




Project Overview

Unstructured data cannot be organized into the rows and columns of a relational database. Examples include:

  • Text files, JSON, CSV
  • Spreadsheets
  • Logs and job postings

Challenges of unstructured data:

  • Requires more storage
  • Harder to manage and protect with legacy solutions

This project builds a scalable, fault-tolerant pipeline to process unstructured data in real-time and store it in a structured format.


Architecture

The project uses a Spark Master-Worker cluster running on Docker:

  • spark-master: Master node managing the cluster
  • spark-worker-1/2/3: Worker nodes executing Spark jobs

Docker Compose configuration spins up the cluster:

version: "3.8"

services:
  spark-master:
    image: bitnami/spark:3.5.1
    container_name: spark-master
    environment:
      SPARK_MODE: master
    ports:
      - "9090:8080"
      - "7077:7077"
      - "4040:4040"
    volumes:
      - ./jobs:/opt/bitnami/spark/jobs
    networks:
      - spark-network
    restart: always

  spark-worker-1:
    image: bitnami/spark:3.5.1
    environment:
      SPARK_MODE: worker
      SPARK_MASTER_URL: spark://spark-master:7077
    networks:
      - spark-network
    restart: always
# ...additional worker nodes similar to worker-1

networks:
  spark-network:
    driver: bridge

Run the cluster:

docker compose up -d

AWS Services

Amazon S3

  • Stores raw unstructured files (JSON, TXT)
  • Acts as a data lake for processed data
  • Output stored as Parquet files for analytics

AWS Glue

  • Fully managed ETL solution
  • Uses a crawler to scan S3 and infer schema
  • Stores metadata in the Glue Data Catalog
  • Creates tables (Bronze layer) for structured data
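
The crawler setup is done in the AWS console, but it can also be scripted. The sketch below assembles the parameters for boto3's `glue.create_crawler()` call; the bucket, IAM role, and database names are placeholders, not values from this repo.

```python
# Hypothetical helper that assembles the parameters for a Glue crawler
# pointed at the processed Parquet output on S3. Bucket, role, and
# database names are placeholders.

def build_crawler_config(bucket: str, prefix: str,
                         role_arn: str, database: str) -> dict:
    """Return the keyword arguments for glue.create_crawler()."""
    return {
        "Name": f"{database}-parquet-crawler",
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/{prefix}"}]},
        # Re-crawl on demand; a Schedule cron expression could be added here.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }

# Usage (requires AWS credentials; sketched only):
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**build_crawler_config(
#       "my-bucket", "output/parquet/",
#       "arn:aws:iam::123456789012:role/GlueRole", "jobs_bronze"))
#   glue.start_crawler(Name="jobs_bronze-parquet-crawler")
```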

AWS Athena

  • Serverless SQL querying over S3
  • Queries processed, structured data
  • Integrated with Glue Data Catalog for table discovery

Project Setup

  1. Create the project directory and a virtual environment:
mkdir aws_spark_unstructured
cd aws_spark_unstructured
python -m venv venv
  2. Activate the virtual environment:
# Linux/macOS
source venv/bin/activate
# Windows
venv\Scripts\activate
  3. Install dependencies:
pip install pyspark boto3
  4. Set AWS credentials in config.py:
configuration = {
    'AWS_ACCESS_KEY': '--AWS ACCESS KEY--',
    'AWS_SECRET_KEY': '--AWS SECRET KEY--'
}
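
For Spark to read and write `s3a://` paths, the keys from config.py are typically passed to the session as Hadoop properties. The property names below are the standard hadoop-aws ones; the helper itself is a sketch of ours, not code from this repo.

```python
# Hypothetical helper translating the config.py dictionary into the
# Hadoop properties that Spark's s3a connector expects.

def s3a_options(configuration: dict) -> dict:
    return {
        "spark.hadoop.fs.s3a.access.key": configuration["AWS_ACCESS_KEY"],
        "spark.hadoop.fs.s3a.secret.key": configuration["AWS_SECRET_KEY"],
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
    }

# Usage when building the session (sketched):
#   builder = SparkSession.builder.appName("unstructured-stream")
#   for key, value in s3a_options(configuration).items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
```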

Spark Job Overview

File: main.py

  • Reads TXT and JSON files as streams

  • Applies UDFs + Regex to extract fields like:

    • File name, Position, Class Code
    • Salary, Start/End dates
    • Requirements, Duties, Notes
    • Selection, Experience length, Education length
    • Application location
  • Unions the text and JSON streams into a single DataFrame

  • Writes output to S3 in Parquet format with checkpointing


UDFs and Regex Extraction

UDF (User-Defined Function):

  • Custom Spark function for complex parsing
  • Handles unstructured data fields not natively supported by Spark

Regex is used to extract structured information:

  • Salary: ₹50,000 - ₹70,000
  • Date: Posted: 2023-08-01
  • Company Name, Location, Job requirements, etc.
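
In plain Python, the kind of parsing these UDFs perform looks roughly like the sketch below. The patterns match the illustrative formats above ("₹50,000 - ₹70,000", "Posted: 2023-08-01"); the real job posting files may use different formats.

```python
import re

# Sketch of the regex extraction inside the UDFs, using the example
# formats from this README. Pattern details are assumptions.
SALARY_RE = re.compile(r"₹([\d,]+)\s*-\s*₹([\d,]+)")
DATE_RE = re.compile(r"Posted:\s*(\d{4}-\d{2}-\d{2})")

def extract_salary(text: str):
    """Return (min, max) salary as integers, or None if absent."""
    m = SALARY_RE.search(text)
    if not m:
        return None
    return tuple(int(g.replace(",", "")) for g in m.groups())

def extract_posted_date(text: str):
    """Return the posting date as a YYYY-MM-DD string, or None."""
    m = DATE_RE.search(text)
    return m.group(1) if m else None
```

Functions like these are then registered as Spark UDFs, e.g. `udf(extract_posted_date, StringType())`, and applied column-wise to the streaming DataFrames.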

Pipeline Flow:

  1. Text and JSON read as streams
  2. Apply UDFs + Regex to extract structured columns
  3. Union all sources into a single DataFrame
  4. Write continuously to S3 Parquet files

Running the Pipeline

Locally

spark-submit main.py

Inside Docker Cluster

docker exec -it spark-master spark-submit --master spark://spark-master:7077 jobs/main.py
  • The job reads new files from input_text/ and input_json/ folders
  • Processes and streams them to S3
  • Monitor logs for schema and streaming output

Conclusion

This project integrates Apache Spark, AWS S3, Glue, and Athena to create a real-time streaming pipeline for unstructured data:

  • Continuous ingestion of TXT, CSV, JSON
  • Structured transformation using UDFs + Regex
  • Storage in Parquet on S3 (Bronze layer)
  • Automatic schema discovery with Glue
  • Serverless querying via Athena

Next steps:

  • Transform Bronze layer into Silver/Gold layers
  • Load processed data into Redshift
  • Connect to BI tools like QuickSight or Power BI
