A Spark-based data processing application on Amazon EMR demonstrating distributed computing, data transformations, and aggregations on historical stock data.
The application runs on a managed AWS EMR cluster and processes historical stock data for Starbucks Corporation (SBUX), demonstrating practical experience with Apache Spark, AWS cloud infrastructure, and data transformation workflows.
Technical Highlights:
- Spark application deployed on managed EMR cluster
- Data transformations using PySpark DataFrames and SQL
- Aggregations and statistical analysis with window functions (see the sketch after this list)
- Parquet output format for optimized storage
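The window-function analysis mentioned above typically takes the form of rolling statistics over the price series. A minimal sketch, assuming a 7-row trailing window over the df_enhanced DataFrame built in the transformation code further down (the actual script may use different window definitions):

from pyspark.sql import Window
from pyspark.sql import functions as F

# Trailing 7-row window ordered by trading date
# (no partitionBy, since the data covers a single ticker)
rolling_week = Window.orderBy("date").rowsBetween(-6, 0)

df_rolling = df_enhanced.withColumn(
    "close_7d_moving_avg", F.avg("close_price").over(rolling_week)
)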
The pipeline leverages AWS managed services for scalable, cost-effective data processing:
- Amazon EMR: Managed Spark cluster (emr-7.10.0) with YARN resource management
- Amazon S3: Distributed storage with Parquet format for optimized query performance
- Apache Spark: Adaptive query execution with optimized configurations (see the session sketch after this list)
- Amazon VPC: Secure network isolation with custom routing
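The Spark-side tuning referenced above centers on adaptive query execution. A minimal sketch of the session setup, assuming these are the configuration keys applied (the deployed job may set additional properties):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("StockAnalysis")
    # Let Spark re-optimize shuffle partitioning and join strategies at runtime
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)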
Compute: AWS EMR • Apache Spark 3.5.1 • PySpark
Storage: Amazon S3 • Parquet
Infrastructure: VPC • EC2 • YARN
# Imports used by the transformation script
from pyspark.sql.functions import (
    avg, col, date_format, desc, regexp_replace, to_date, when
)
from pyspark.sql.types import DoubleType

# Load CSV from S3
# data_source and output_uri are passed in as job arguments (see the submission command below)
df = spark.read.option("header", "true").csv(data_source)

# Transform and clean data
df_cleaned = df.select(
    to_date(col("Date"), "MM/dd/yyyy").alias("date"),
    regexp_replace(col("Close/Last"), "\\$", "").cast(DoubleType()).alias("close_price"),
    regexp_replace(col("Open"), "\\$", "").cast(DoubleType()).alias("open_price"),
    col("Volume").cast("integer").alias("volume")
).filter(col("date").isNotNull())

# Feature engineering
df_enhanced = df_cleaned.select(
    "*",
    ((col("close_price") - col("open_price")) / col("open_price") * 100).alias("daily_change_pct"),
    when(col("volume") > 10000000, "High Volume").otherwise("Low Volume").alias("volume_category")
)

# Aggregations
monthly_summary = df_enhanced.groupBy(
    date_format(col("date"), "yyyy-MM").alias("month")
).agg(
    avg("close_price").alias("avg_close"),
    avg("volume").alias("avg_volume"),
    avg("daily_change_pct").alias("avg_change_pct")
).orderBy(desc("month"))

Monthly Aggregations:
+-------+------------------+------------------+------------------+
| month| avg_close| avg_volume| avg_change_pct|
+-------+------------------+------------------+------------------+
|2025-10| 81.58263157894738|9385421.4473684...|0.2156842105263...|
|2025-09| 85.06095238095237|8827095.571428571|-0.008571428571...|
|2025-08| 89.46238095238096|7962885.047619048| 0.055238095238...|
|2025-07| 90.32454545454546|7234995.090909091|0.0604545454545...|
|2025-06| 91.35904761904762|8234066.619047619|0.0319047619047...|
+-------+------------------+------------------+------------------+
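The aggregated results are written back to S3 in Parquet format. A minimal sketch, assuming the output_uri argument passed at job submission (partitioning and write options in the actual script may differ):

# Persist the monthly summary as Parquet for efficient downstream queries
monthly_summary.write.mode("overwrite").parquet(output_uri)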
Statistics Computed:
Computing basic statistics...
Average price: $84.56
Max price: $102.89
Min price: $71.51
Average volume: 8,543,291
High volatility days (>3% change): 47
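A minimal sketch of how these figures could be derived from df_enhanced (main.py may compute them differently):

from pyspark.sql import functions as F

# Aggregate price and volume statistics in a single pass
stats = df_enhanced.agg(
    F.avg("close_price").alias("avg_price"),
    F.max("close_price").alias("max_price"),
    F.min("close_price").alias("min_price"),
    F.avg("volume").alias("avg_volume"),
).first()

# Days where the open-to-close move exceeded 3% in either direction
high_volatility_days = df_enhanced.filter(F.abs(F.col("daily_change_pct")) > 3).count()

print(f"Average price: ${stats['avg_price']:.2f}")
print(f"Max price: ${stats['max_price']:.2f}")
print(f"Min price: ${stats['min_price']:.2f}")
print(f"Average volume: {stats['avg_volume']:,.0f}")
print(f"High volatility days (>3% change): {high_volatility_days}")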
Screenshot: YARN ResourceManager showing successful Spark application executions
# Submit Spark job to EMR cluster
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name="StockAnalysis",\
Args=[s3://bucket/main.py,\
--data_source,s3://bucket/input/sbux-historical-data.csv,\
--output_uri,s3://bucket/output/]
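The same step can also be submitted programmatically. A minimal boto3 sketch, assuming the cluster ID and S3 paths from the CLI example above (the region is a placeholder):

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is a placeholder

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "StockAnalysis",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # runs spark-submit on the cluster
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                "s3://bucket/main.py",
                "--data_source", "s3://bucket/input/sbux-historical-data.csv",
                "--output_uri", "s3://bucket/output/",
            ],
        },
    }],
)
print("Submitted step:", response["StepIds"][0])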
EMR Cluster:
- 1 primary + 2 core nodes (m5.xlarge)
- Spark 3.5.1, Hadoop 3.4.1
- Custom VPC with managed scaling
aws-emr-demo/
├── main.py # PySpark transformation script
├── sbux-historical-data.csv # Sample SBUX stock data
└── README.md
- Apache Spark: DataFrame API, SQL aggregations, transformations
- AWS EMR: Cluster management, job submission, monitoring
- Data Processing: Type conversions, feature engineering, aggregations
- Cloud Infrastructure: S3 integration, VPC configuration, YARN
