AWS EMR Spark Data Processing Demo

A Spark-based data processing application on Amazon EMR demonstrating distributed computing, data transformations, and aggregations on historical stock data.

Project Overview

A Spark application running on AWS EMR that processes historical stock data for Starbucks Corporation (SBUX), demonstrating practical experience with Apache Spark, AWS cloud infrastructure, and data transformation workflows.

Technical Highlights:

  • Spark application deployed on managed EMR cluster
  • Data transformations using PySpark DataFrames and SQL
  • Aggregations and statistical analysis with window functions
  • Parquet output format for optimized storage

Architecture

The pipeline leverages AWS managed services for scalable, cost-effective data processing:

  • Amazon EMR: Managed Spark cluster (emr-7.10.0) with YARN resource management
  • Amazon S3: Distributed storage with Parquet format for optimized query performance
  • Apache Spark: Adaptive query execution with optimized configurations (see the session sketch after this list)
  • Amazon VPC: Secure network isolation with custom routing
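
The exact Spark configuration isn't included in the repo; below is a minimal sketch of a session with adaptive query execution enabled. The appName and settings are assumptions, and EMR 7.x enables AQE by default in Spark 3.x:

from pyspark.sql import SparkSession

# Hypothetical session setup; the AQE flags are shown explicitly for
# illustration, though EMR enables spark.sql.adaptive.enabled by default.
spark = (
    SparkSession.builder
    .appName("StockAnalysis")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)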

Tech Stack

Compute: AWS EMR • Apache Spark 3.5.1 • PySpark
Storage: Amazon S3 • Parquet
Infrastructure: VPC • EC2 • YARN

Implementation Details

Data Processing

from pyspark.sql.functions import (
    avg, col, date_format, desc, regexp_replace, to_date, when
)
from pyspark.sql.types import DoubleType

# Load CSV from S3 (the spark session and data_source argument are set up in main.py)
df = spark.read.option("header", "true").csv(data_source)

# Transform and clean data
df_cleaned = df.select(
    to_date(col("Date"), "MM/dd/yyyy").alias("date"),
    regexp_replace(col("Close/Last"), "\\$", "").cast(DoubleType()).alias("close_price"),
    regexp_replace(col("Open"), "\\$", "").cast(DoubleType()).alias("open_price"),
    col("Volume").cast("integer").alias("volume")
).filter(col("date").isNotNull())

# Feature engineering
df_enhanced = df_cleaned.select(
    "*",
    ((col("close_price") - col("open_price")) / col("open_price") * 100).alias("daily_change_pct"),
    when(col("volume") > 10000000, "High Volume").otherwise("Low Volume").alias("volume_category")
)

# Aggregations
monthly_summary = df_enhanced.groupBy(
    date_format(col("date"), "yyyy-MM").alias("month")
).agg(
    avg("close_price").alias("avg_close"),
    avg("volume").alias("avg_volume"),
    avg("daily_change_pct").alias("avg_change_pct")
).orderBy(desc("month"))
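
The window-function analysis and Parquet write called out in the highlights aren't part of the excerpt above; a plausible continuation, reusing the names from that snippet (the 7-day window size and overwrite mode are assumptions):

from pyspark.sql import Window
from pyspark.sql.functions import avg, col

# 7-day moving average of the closing price; a global ordered window is
# fine for a single-stock demo, though it funnels all rows to one partition
ma_window = Window.orderBy("date").rowsBetween(-6, 0)
df_with_ma = df_enhanced.withColumn(
    "close_7d_avg", avg(col("close_price")).over(ma_window)
)

# Persist the aggregates to S3 as Parquet for optimized storage
monthly_summary.write.mode("overwrite").parquet(output_uri)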

Sample Output

Monthly Aggregations:

+-------+------------------+------------------+------------------+
|  month|         avg_close|        avg_volume|    avg_change_pct|
+-------+------------------+------------------+------------------+
|2025-10| 81.58263157894738|9385421.4473684...|0.2156842105263...|
|2025-09| 85.06095238095237|8827095.571428571|-0.008571428571...|
|2025-08| 89.46238095238096|7962885.047619048| 0.055238095238...|
|2025-07| 90.32454545454546|7234995.090909091|0.0604545454545...|
|2025-06| 91.35904761904762|8234066.619047619|0.0319047619047...|
+-------+------------------+------------------+------------------+

Statistics Computed:

Computing basic statistics...
Average price: $84.56
Max price: $102.89
Min price: $71.51
Average volume: 8,543,291
High volatility days (>3% change): 47
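
These figures follow from plain DataFrame aggregations; a minimal sketch against df_enhanced, using the 3% threshold from the output above:

from pyspark.sql.functions import abs as abs_, avg, col, max as max_, min as min_

stats = df_enhanced.agg(
    avg("close_price").alias("avg_price"),
    max_("close_price").alias("max_price"),
    min_("close_price").alias("min_price"),
    avg("volume").alias("avg_volume"),
).first()

high_volatility_days = df_enhanced.filter(abs_(col("daily_change_pct")) > 3).count()

print(f"Average price: ${stats['avg_price']:.2f}")
print(f"High volatility days (>3% change): {high_volatility_days}")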

Pipeline Execution

Successful Spark Jobs

[Screenshot] YARN ResourceManager showing successful Spark application executions

Deployment

# Submit Spark job to EMR cluster (quote the --steps value so the
# comma-separated shorthand reaches the CLI as a single argument)
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=Spark,Name=StockAnalysis,Args=[s3://bucket/main.py,--data_source,s3://bucket/input/sbux-historical-data.csv,--output_uri,s3://bucket/output/]'
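
The same step can also be submitted programmatically; a minimal boto3 sketch using the placeholder cluster ID and bucket paths from the command above:

import boto3

emr = boto3.client("emr")

# Spark steps on EMR are executed via command-runner.jar and spark-submit
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "StockAnalysis",
        "ActionOnFailure": "CONTINUE",  # assumption; not specified in the CLI example
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "s3://bucket/main.py",
                "--data_source", "s3://bucket/input/sbux-historical-data.csv",
                "--output_uri", "s3://bucket/output/",
            ],
        },
    }],
)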

EMR Cluster:

  • 1 primary + 2 core nodes (m5.xlarge)
  • Spark 3.5.1, Hadoop 3.4.1
  • Custom VPC with managed scaling (see the provisioning sketch below)
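
Provisioning a matching cluster might look roughly like this boto3 sketch (the cluster name and IAM roles are assumptions; VPC and managed-scaling settings are omitted for brevity):

import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="spark-stock-demo",  # hypothetical name
    ReleaseLabel="emr-7.10.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,  # 1 primary + 2 core nodes
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # default EMR roles, assumed
    ServiceRole="EMR_DefaultRole",
)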

Project Structure

aws-emr-demo/
├── main.py                   # PySpark transformation script
├── sbux-historical-data.csv  # Sample SBUX stock data
└── README.md
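
main.py presumably wires the --data_source and --output_uri step arguments into the processing shown above; a minimal entry-point sketch (only the argument names are taken from the deployment command):

import argparse

from pyspark.sql import SparkSession

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_source", help="S3 URI of the input CSV")
    parser.add_argument("--output_uri", help="S3 URI for the Parquet output")
    args = parser.parse_args()

    spark = SparkSession.builder.appName("StockAnalysis").getOrCreate()
    # ... transformations, aggregations, and Parquet write shown above ...
    spark.stop()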

Skills Demonstrated

Apache Spark: DataFrame API, SQL aggregations, transformations
AWS EMR: Cluster management, job submission, monitoring
Data Processing: Type conversions, feature engineering, aggregations
Cloud Infrastructure: S3 integration, VPC configuration, YARN
