A Spark-based data processing application on Amazon EMR demonstrating distributed computing, data transformations, and aggregations on historical stock data.
The application runs on a managed AWS EMR cluster and processes historical stock data for Starbucks Corporation (SBUX), demonstrating practical experience with Apache Spark, AWS cloud infrastructure, and data transformation workflows.
Technical Highlights:
- Spark application deployed on managed EMR cluster
- Data transformations using PySpark DataFrames and SQL
- Aggregations and statistical analysis with window functions (see the sketch after this list)
- Parquet output format for optimized storage
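The window-function analysis mentioned above typically takes the form of rolling statistics over the price series. A minimal sketch, assuming a 7-row trailing window over the df_enhanced DataFrame built in the transformation code further down (the actual script may use different window definitions):

from pyspark.sql import Window
from pyspark.sql import functions as F

# Trailing 7-row window ordered by trading date
# (no partitionBy, since the data covers a single ticker)
rolling_week = Window.orderBy("date").rowsBetween(-6, 0)

df_rolling = df_enhanced.withColumn(
    "close_7d_moving_avg", F.avg("close_price").over(rolling_week)
)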
The pipeline leverages AWS managed services for scalable, cost-effective data processing:
- Amazon EMR: Managed Spark cluster (emr-7.10.0) with YARN resource management
- Amazon S3: Distributed storage with Parquet format for optimized query performance
- Apache Spark: Adaptive query execution with optimized configurations (see the session sketch after this list)
- Amazon VPC: Secure network isolation with custom routing
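The Spark-side tuning referenced above centers on adaptive query execution. A minimal sketch of the session setup, assuming these are the configuration keys applied (the deployed job may set additional properties):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("StockAnalysis")
    # Let Spark re-optimize shuffle partitioning and join strategies at runtime
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)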
Compute: AWS EMR • Apache Spark 3.5.1 • PySpark
Storage: Amazon S3 • Parquet
Infrastructure: VPC • EC2 • YARN
# Imports used by the transformation script
from pyspark.sql.functions import (
    avg, col, date_format, desc, regexp_replace, to_date, when
)
from pyspark.sql.types import DoubleType

# Load CSV from S3
# data_source and output_uri are passed in as job arguments (see the submission command below)
df = spark.read.option("header", "true").csv(data_source)

# Transform and clean data
df_cleaned = df.select(
    to_date(col("Date"), "MM/dd/yyyy").alias("date"),
    regexp_replace(col("Close/Last"), "\\$", "").cast(DoubleType()).alias("close_price"),
    regexp_replace(col("Open"), "\\$", "").cast(DoubleType()).alias("open_price"),
    col("Volume").cast("integer").alias("volume")
).filter(col("date").isNotNull())

# Feature engineering
df_enhanced = df_cleaned.select(
    "*",
    ((col("close_price") - col("open_price")) / col("open_price") * 100).alias("daily_change_pct"),
    when(col("volume") > 10000000, "High Volume").otherwise("Low Volume").alias("volume_category")
)

# Aggregations
monthly_summary = df_enhanced.groupBy(
    date_format(col("date"), "yyyy-MM").alias("month")
).agg(
    avg("close_price").alias("avg_close"),
    avg("volume").alias("avg_volume"),
    avg("daily_change_pct").alias("avg_change_pct")
).orderBy(desc("month"))

Monthly Aggregations:
+-------+------------------+------------------+------------------+
| month| avg_close| avg_volume| avg_change_pct|
+-------+------------------+------------------+------------------+
|2025-10| 81.58263157894738|9385421.4473684...|0.2156842105263...|
|2025-09| 85.06095238095237|8827095.571428571|-0.008571428571...|
|2025-08| 89.46238095238096|7962885.047619048| 0.055238095238...|
|2025-07| 90.32454545454546|7234995.090909091|0.0604545454545...|
|2025-06| 91.35904761904762|8234066.619047619|0.0319047619047...|
+-------+------------------+------------------+------------------+
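The aggregated results are written back to S3 in Parquet format. A minimal sketch, assuming the output_uri argument passed at job submission (partitioning and write options in the actual script may differ):

# Persist the monthly summary as Parquet for efficient downstream queries
monthly_summary.write.mode("overwrite").parquet(output_uri)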
Statistics Computed:
Computing basic statistics...
Average price: $84.56
Max price: $102.89
Min price: $71.51
Average volume: 8,543,291
High volatility days (>3% change): 47
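A minimal sketch of how these figures could be derived from df_enhanced (main.py may compute them differently):

from pyspark.sql import functions as F

# Aggregate price and volume statistics in a single pass
stats = df_enhanced.agg(
    F.avg("close_price").alias("avg_price"),
    F.max("close_price").alias("max_price"),
    F.min("close_price").alias("min_price"),
    F.avg("volume").alias("avg_volume"),
).first()

# Days where the open-to-close move exceeded 3% in either direction
high_volatility_days = df_enhanced.filter(F.abs(F.col("daily_change_pct")) > 3).count()

print(f"Average price: ${stats['avg_price']:.2f}")
print(f"Max price: ${stats['max_price']:.2f}")
print(f"Min price: ${stats['min_price']:.2f}")
print(f"Average volume: {stats['avg_volume']:,.0f}")
print(f"High volatility days (>3% change): {high_volatility_days}")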
Screenshot: YARN ResourceManager showing successful Spark application executions
# Submit Spark job to EMR cluster
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name="StockAnalysis",\
Args=[s3://bucket/main.py,\
--data_source,s3://bucket/input/sbux-historical-data.csv,\
--output_uri,s3://bucket/output/]
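The same step can also be submitted programmatically. A minimal boto3 sketch, assuming the cluster ID and S3 paths from the CLI example above (the region is a placeholder):

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is a placeholder

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "StockAnalysis",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # runs spark-submit on the cluster
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                "s3://bucket/main.py",
                "--data_source", "s3://bucket/input/sbux-historical-data.csv",
                "--output_uri", "s3://bucket/output/",
            ],
        },
    }],
)
print("Submitted step:", response["StepIds"][0])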
EMR Cluster:
- 1 primary + 2 core nodes (m5.xlarge)
- Spark 3.5.1, Hadoop 3.4.1
- Custom VPC with managed scaling
aws-emr-demo/
├── main.py # PySpark transformation script
├── sbux-historical-data.csv # Sample SBUX stock data
└── README.md
- Apache Spark: DataFrame API, SQL aggregations, transformations
- AWS EMR: Cluster management, job submission, monitoring
- Data Processing: Type conversions, feature engineering, aggregations
- Cloud Infrastructure: S3 integration, VPC configuration, YARN
