This project demonstrates a real-time streaming pipeline for processing unstructured data using Apache Spark, AWS S3, AWS Glue, and AWS Athena. It continuously consumes TXT, CSV, and JSON files, converts them into a structured format, and stores them in Parquet on S3 for analytics.
## Table of Contents

- Project Overview
- Architecture
- AWS Services
- Project Setup
- Spark Job Overview
- UDFs and Regex Extraction
- Running the Pipeline
- Conclusion
## Project Overview

Unstructured data cannot be organized into the rows and columns of a relational database. Examples include:
- Text files, JSON, CSV
- Spreadsheets
- Logs and job postings
Challenges of unstructured data:
- Requires more storage
- Harder to manage and protect with legacy solutions
This project builds a scalable, fault-tolerant pipeline to process unstructured data in real-time and store it in a structured format.
## Architecture

The project uses a Spark master-worker cluster running on Docker:
- `spark-master`: master node that manages the cluster
- `spark-worker-1/2/3`: worker nodes that execute Spark jobs
The following Docker Compose configuration spins up the cluster:

```yaml
version: "3.8"

services:
  spark-master:
    image: bitnami/spark:3.5.1
    container_name: spark-master
    environment:
      SPARK_MODE: master
    ports:
      - "9090:8080"
      - "7077:7077"
      - "4040:4040"
    volumes:
      - ./jobs:/opt/bitnami/spark/jobs
    networks:
      - spark-network
    restart: always

  spark-worker-1:
    image: bitnami/spark:3.5.1
    environment:
      SPARK_MODE: worker
      SPARK_MASTER_URL: spark://spark-master:7077
    networks:
      - spark-network
    restart: always

  # ...additional worker nodes similar to spark-worker-1

networks:
  spark-network:
    driver: bridge
```

Run the cluster:

```bash
docker compose up -d
```

## AWS Services

### AWS S3

- Stores raw unstructured files (JSON, TXT)
- Acts as a data lake for processed data
- Output stored as Parquet files for analytics
### AWS Glue

- Fully managed ETL service
- Uses a crawler to scan S3 and infer schema
- Stores metadata in the Glue Data Catalog
- Creates tables (Bronze layer) for structured data
### AWS Athena

- Serverless SQL querying over data in S3
- Queries processed, structured data
- Integrated with Glue Data Catalog for table discovery
## Project Setup

- Create the project directory and a virtual environment:

```bash
mkdir aws_spark_unstructured
cd aws_spark_unstructured
python -m venv venv
```

- Activate the virtual environment:

```bash
# Linux/macOS
source venv/bin/activate
# Windows
venv\Scripts\activate
```

- Install dependencies:

```bash
pip install pyspark boto3
```

- Set AWS credentials in `config.py`:

```python
configuration = {
    'AWS_ACCESS_KEY': '--AWS ACCESS KEY--',
    'AWS_SECRET_KEY': '--AWS SECRET KEY--'
}
```

## Spark Job Overview

File: `main.py`
- Reads TXT and JSON files as streams
- Applies UDFs and regex to extract fields such as:
  - File name, Position, Class Code
  - Salary, Start/End dates
  - Requirements, Duties, Notes
  - Selection, Experience length, Education length
  - Application location
- Unions the text and JSON streams into a single DataFrame
- Writes output to S3 in Parquet format with checkpointing
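Before the Parquet sink can write to the bucket, the credentials from `config.py` have to reach Spark's S3A connector. A minimal sketch of that wiring, using the standard `hadoop-aws` option names; the `s3a_conf` helper itself is hypothetical, not confirmed by the project:

```python
# Map the config.py dictionary onto Spark's S3A settings. The option keys are
# standard hadoop-aws configuration names; wiring them through a helper like
# this is an assumption about main.py.
def s3a_conf(cfg):
    return {
        "spark.hadoop.fs.s3a.access.key": cfg["AWS_ACCESS_KEY"],
        "spark.hadoop.fs.s3a.secret.key": cfg["AWS_SECRET_KEY"],
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.auth.SimpleAWSCredentialsProvider",
    }

# Usage inside main.py (sketch):
# builder = SparkSession.builder.appName("UnstructuredStream")
# for key, value in s3a_conf(configuration).items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```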
## UDFs and Regex Extraction

UDF (User-Defined Function):

- A custom Spark function for complex parsing
- Handles unstructured fields that Spark cannot parse natively

Regex is used to extract structured information, for example:

- Salary: `₹50,000 - ₹70,000`
- Date: `Posted: 2023-08-01`
- Company name, location, job requirements, etc.
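The project's actual patterns are not shown in this README; the extractors below are illustrative stand-ins that handle the two examples above. In `main.py`, each function would be registered with `pyspark.sql.functions.udf` before being applied to a column.

```python
import re
from datetime import date

# Illustrative patterns only -- the project's real regexes are not published.
# \D* skips any currency symbol (₹, $, ...) before the second figure.
SALARY_RE = re.compile(r"(\d[\d,]*)\s*-\s*\D*(\d[\d,]*)")
DATE_RE = re.compile(r"Posted:\s*(\d{4})-(\d{2})-(\d{2})")

def extract_salary(text):
    """Return (low, high) salary as integers, or None if no range is found."""
    m = SALARY_RE.search(text or "")
    if not m:
        return None
    low, high = (int(g.replace(",", "")) for g in m.groups())
    return low, high

def extract_posted_date(text):
    """Return the posting date, or None if the 'Posted:' marker is absent."""
    m = DATE_RE.search(text or "")
    return date(*map(int, m.groups())) if m else None
```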
Pipeline Flow:
- Text and JSON read as streams
- Apply UDFs + Regex to extract structured columns
- Union all sources into a single DataFrame
- Write continuously to S3 Parquet files
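The four steps above can be sketched as a single job. The folder names, bucket URI, and the single `parse_salary` field are assumptions for illustration; the real `main.py` extracts many more columns and defines a fuller JSON schema:

```python
import re

SALARY_RE = re.compile(r"(\d[\d,]*)\s*-\s*\D*(\d[\d,]*)")

def parse_salary(text):
    """Pure helper later wrapped as a Spark UDF."""
    m = SALARY_RE.search(text or "")
    return [int(g.replace(",", "")) for g in m.groups()] if m else None

def main():
    # PySpark imports live inside main() so the helper above can be reused
    # (and unit-tested) without a Spark installation.
    from pyspark.sql import SparkSession, functions as F, types as T

    spark = SparkSession.builder.appName("UnstructuredStream").getOrCreate()
    salary_udf = F.udf(parse_salary, T.ArrayType(T.IntegerType()))

    # 1. Read text and JSON folders as streams (streaming JSON needs a schema).
    txt = spark.readStream.text("input_text/").withColumnRenamed("value", "raw")
    json_schema = T.StructType([T.StructField("raw", T.StringType())])
    js = spark.readStream.schema(json_schema).json("input_json/")

    # 2. Apply the UDF to derive structured columns.
    txt = txt.withColumn("salary", salary_udf("raw"))
    js = js.withColumn("salary", salary_udf("raw"))

    # 3. Union both sources into one DataFrame.
    unioned = txt.unionByName(js)

    # 4. Continuously write Parquet to S3 with checkpointing.
    (unioned.writeStream
        .format("parquet")
        .option("path", "s3a://your-bucket/bronze/")            # assumed bucket
        .option("checkpointLocation", "s3a://your-bucket/checkpoints/")
        .outputMode("append")
        .start()
        .awaitTermination())

if __name__ == "__main__":
    main()
```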
## Running the Pipeline

Run locally:

```bash
spark-submit main.py
```

Or inside the Docker cluster:

```bash
docker exec -it spark-master spark-submit --master spark://spark-master:7077 jobs/main.py
```

- The job picks up new files from the `input_text/` and `input_json/` folders
- Processes and streams them to S3
- Monitor the logs for schema and streaming output
## Conclusion

This project integrates Apache Spark, AWS S3, Glue, and Athena to create a real-time streaming pipeline for unstructured data:
- Continuous ingestion of TXT, CSV, JSON
- Structured transformation using UDFs + Regex
- Storage in Parquet on S3 (Bronze layer)
- Automatic schema discovery with Glue
- Serverless querying via Athena
Next steps:
- Transform Bronze layer into Silver/Gold layers
- Load processed data into Redshift
- Connect to BI tools like QuickSight or Power BI