A distributed data engineering platform for climate analysis using MapReduce on AWS EMR, featuring a FastAPI backend and modern React frontend for interactive data visualization.
AWS Infrastructure:
- EMR Cluster: 3-node Hadoop cluster for MapReduce processing
  - Job 1: Monthly Average Temperatures
  - Job 2: Extreme Temperature Classification
  - Job 3: Temperature-Precipitation Correlation
- S3 Bucket: weatheria-climate-data for data storage
- Backend EC2: FastAPI server at http://3.88.63.182:8000
- Frontend EC2: React application at http://54.85.123.135:5173
- Overview
- Architecture
- Prerequisites
- Quick Start
- Project Structure
- MapReduce Jobs
- API Documentation
- Frontend Application
- AWS EMR Deployment
- Results and Findings
- Troubleshooting
- References
Weatheria is a complete ETL (Extract, Transform, Load) pipeline for weather data analysis, built for distributed processing at scale. The system analyzes 3 years of daily weather data (2022-2024) from Medellin, Colombia using MapReduce on AWS EMR, exposing results through a REST API and visualizing them in an interactive web dashboard.
Key Features:
- Data Collection: 1,095 daily records from Open-Meteo Historical Weather API
- Distributed Processing: MapReduce jobs on AWS EMR cluster (3 nodes)
- REST API: FastAPI backend with automatic documentation
- Interactive Visualization: React + TypeScript frontend with real-time charts
- Cloud-Native: Fully deployed on AWS (EMR, S3, EC2)
- Python 3.11+ - MapReduce jobs and API backend
- Node.js 18+ - Frontend development
- pip - Python package manager
- npm - Node.js package manager
- AWS Account with EMR permissions (AWS Academy account supported)
- AWS CLI configured with credentials
- S3 Bucket for data storage
- Git Bash (Windows) or standard terminal (Linux/Mac)
git clone https://github.com/Youngermaster/Weatheria.git
cd Weatheria
Install Python dependencies and start the API server:
# Install dependencies
pip install -r requirements.txt
# Start FastAPI server
cd src/api
python main.py
The API will be available at http://localhost:8000
Interactive documentation: http://localhost:8000/docs
Install Node.js dependencies and start the development server:
# Navigate to frontend directory
cd weatheria-frontend
# Install dependencies
npm install
# Start development server
npm run dev
The frontend will be available at http://localhost:5173
For production deployment on AWS EMR, follow the detailed guide in DEPLOYMENT.md.
Quick deployment summary:
# 1. Download weather data
python scripts/download_data.py
# 2. Setup S3 bucket and upload data
bash scripts/aws/setup_s3.sh
# 3. Create EMR cluster
bash scripts/aws/create_emr_cluster.sh
# 4. Submit MapReduce jobs
bash scripts/aws/submit_emr_jobs_mrjob.sh monthly
bash scripts/aws/submit_emr_jobs_mrjob.sh extreme
bash scripts/aws/submit_emr_jobs_mrjob.sh correlation
# 5. Download results
bash scripts/aws/download_results.sh
# 6. Terminate cluster (IMPORTANT - avoid charges)
bash scripts/aws/terminate_emr_cluster.sh
weatheria/
├── data/
│   ├── raw/                              # Raw weather data from API
│   │   └── medellin_weather_2022-2024.csv
│   └── processed/                        # Processed MapReduce outputs
├── src/
│   ├── mapreduce/                        # MapReduce job implementations
│   │   ├── monthly_avg_temp.py           # Monthly temperature analysis
│   │   ├── extreme_temps.py              # Temperature classification
│   │   └── temp_precipitation.py         # Correlation analysis
│   └── api/                              # FastAPI backend
│       ├── main.py                       # Application entry point
│       ├── config.py                     # Configuration settings
│       ├── models/
│       │   └── schemas.py                # Pydantic data models
│       └── routers/
│           ├── monthly.py                # Monthly averages endpoint
│           ├── extremes.py               # Extreme temperatures endpoint
│           └── correlation.py            # Correlation endpoint
├── weatheria-frontend/                   # React TypeScript application
│   ├── src/
│   │   ├── services/
│   │   │   └── api.ts                    # Backend API client (Axios)
│   │   ├── pages/
│   │   │   ├── Dashboard.tsx             # Main dashboard with charts
│   │   │   ├── MonthlyAnalysis.tsx       # Monthly temperature analysis
│   │   │   ├── ExtremeAnalysis.tsx       # Temperature distribution
│   │   │   ├── PrecipitationAnalysis.tsx
│   │   │   └── About.tsx
│   │   ├── components/
│   │   │   ├── DashboardLayout.tsx
│   │   │   ├── StatCard.tsx
│   │   │   └── ui/                       # shadcn/ui components
│   │   ├── types/
│   │   │   └── index.ts                  # TypeScript interfaces
│   │   └── lib/
│   │       └── utils.ts                  # Utility functions
│   ├── package.json
│   ├── vite.config.ts
│   └── tailwind.config.js
├── scripts/
│   ├── download_data.py                  # Data collection from Open-Meteo API
│   └── aws/                              # AWS deployment automation
│       ├── setup_s3.sh
│       ├── create_emr_cluster.sh
│       ├── submit_emr_jobs_mrjob.sh
│       ├── download_results.sh
│       └── terminate_emr_cluster.sh
├── output/                               # MapReduce results (CSV files)
│   ├── monthly_avg_fixed.csv
│   ├── extreme_temps_fixed.csv
│   └── temp_precip_fixed.csv
├── requirements.txt                      # Python dependencies
├── DEPLOYMENT.md                         # Detailed AWS EMR deployment guide
└── README.md                             # This file
All MapReduce jobs are implemented using the MRJob Python framework and designed for AWS EMR execution.
Calculates average maximum and minimum temperatures per month across 3 years.
Algorithm:
# Map phase: Extract year-month and temperatures
(year-month) → (max_temp, min_temp, count)
# Reduce phase: Calculate averages
(year-month) → (avg_max_temp, avg_min_temp)
Input: Daily weather records (1,095 days)
Output: Monthly aggregates (36 months)
month,avg_max_temp,avg_min_temp
2022-01,25.85,14.30
2022-02,27.60,15.00
2022-03,28.45,15.80
Execution time on EMR: ~40 seconds
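The map and reduce phases of this job can be sketched without any framework. The following is a minimal plain-Python illustration, not the project's MRJob code; the function names (`map_record`, `reduce_month`, `monthly_averages`) and the `(date, max_temp, min_temp)` record layout are assumptions made for the example.

```python
from collections import defaultdict


def map_record(date, max_temp, min_temp):
    """Map phase: key each daily record by its year-month (YYYY-MM)."""
    return date[:7], (max_temp, min_temp, 1)


def reduce_month(values):
    """Reduce phase: sum the partial tuples, then divide by the day count."""
    total_max = sum(v[0] for v in values)
    total_min = sum(v[1] for v in values)
    count = sum(v[2] for v in values)
    return round(total_max / count, 2), round(total_min / count, 2)


def monthly_averages(records):
    """Group mapped values by key (the shuffle step) and reduce each group."""
    groups = defaultdict(list)
    for date, max_temp, min_temp in records:
        key, value = map_record(date, max_temp, min_temp)
        groups[key].append(value)
    return {month: reduce_month(vals) for month, vals in sorted(groups.items())}


# Illustrative values only, not taken from the dataset
sample = [
    ("2022-01-01", 25.0, 14.0),
    ("2022-01-02", 27.0, 15.0),
    ("2022-02-01", 28.0, 16.0),
]
result = monthly_averages(sample)
print(result)
```

Emitting a count alongside each temperature pair lets the reducer compute a correct average even if partial sums are combined before the final reduce.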
Classifies each day into one of four categories based on its maximum temperature.
Categories:
- Very Cool: max_temp < 20°C (cold days)
- Cool: 20°C ≤ max_temp < 27°C (pleasant days)
- Normal: 27°C ≤ max_temp < 30°C (typical tropical weather)
- Very Hot: max_temp ≥ 30°C (heat wave days)
Algorithm:
# Map phase: Classify each day
temperature → category
# Reduce phase: Count per category
category → total_count
Output: Count per category
category,count
very_cool,6
cool,380
normal,700
very_hot,23
Execution time on EMR: ~17 seconds
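The classification thresholds listed for this job translate directly into a small function. This is a plain-Python sketch of the logic rather than the project's MRJob implementation; `classify` and `count_categories` are hypothetical names.

```python
from collections import Counter


def classify(max_temp):
    """Map phase: assign a category from the day's maximum temperature (°C)."""
    if max_temp < 20:
        return "very_cool"
    if max_temp < 27:
        return "cool"
    if max_temp < 30:
        return "normal"
    return "very_hot"


def count_categories(max_temps):
    """Reduce phase: count how many days fall into each category."""
    return Counter(classify(t) for t in max_temps)


# Illustrative values chosen to sit on and around each boundary
counts = count_categories([19.5, 20.0, 26.9, 27.0, 29.9, 30.0])
print(dict(counts))
```

The boundary values 20, 27, and 30 belong to the warmer category, matching the `≤` and `≥` signs in the category definitions.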
Analyzes the relationship between temperature and precipitation by month.
Algorithm:
# Map phase: Extract monthly data
(year-month) → (temperature, precipitation)
# Reduce phase: Calculate correlation coefficient
(year-month) → correlation
Output: Monthly correlation coefficients
month,correlation
2022-01,-0.31
2022-02,0.14
2022-03,-0.22
Interpretation:
- Negative correlation: Higher temperatures → Lower precipitation
- Positive correlation: Higher temperatures → Higher precipitation
- Values range from -1.0 to +1.0
Execution time on EMR: ~29 seconds
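The README does not say which correlation coefficient the job computes. Assuming it is the Pearson coefficient, the reduce step for a single month could look roughly like this; `pearson` is a hypothetical helper, not the project's code.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # A constant series has no defined correlation
    return cov / sqrt(var_x * var_y)


# Illustrative values only: precipitation falls linearly as temperature rises
temps = [24.0, 26.0, 28.0, 30.0]
rain = [12.0, 9.0, 6.0, 3.0]
r = pearson(temps, rain)
print(r)
```

Returning `None` for constant input avoids a division by zero in months where, for example, no precipitation was recorded at all.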
The FastAPI backend provides RESTful endpoints for accessing processed climate data.
Base URL: http://localhost:8000
GET /monthly-avg
Returns monthly average temperatures (max and min) for 36 months.
Response:
[
{
"month": "2022-01",
"avg_max_temp": 25.85,
"avg_min_temp": 14.30
}
]
GET /extreme-temps
Returns the number of days per temperature category.
Response:
[
{
"category": "very_cool",
"count": 6
},
{
"category": "cool",
"count": 380
},
{
"category": "normal",
"count": 700
},
{
"category": "very_hot",
"count": 23
}
]
GET /temp-precipitation
Returns the monthly correlation between temperature and precipitation.
Response:
[
{
"month": "2022-01",
"correlation": -0.31
}
]
GET /stats
Returns general dataset statistics.
GET /health
API health check endpoint.
GET /download/{type}
Download raw CSV results. Types: monthly-avg, extreme-temps, temp-precipitation
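The JSON responses mirror the columns of the MapReduce CSV outputs. As a rough sketch of that conversion using only the standard library, the following turns CSV text into the `/monthly-avg` payload shape; `csv_to_monthly_payload` is a hypothetical name and this is not the project's router code.

```python
import csv
import io
import json


def csv_to_monthly_payload(csv_text):
    """Parse monthly-average CSV text into a list of JSON-ready dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "month": row["month"],
            "avg_max_temp": float(row["avg_max_temp"]),
            "avg_min_temp": float(row["avg_min_temp"]),
        }
        for row in reader
    ]


# Two rows copied from the sample job output
csv_text = (
    "month,avg_max_temp,avg_min_temp\n"
    "2022-01,25.85,14.30\n"
    "2022-02,27.60,15.00\n"
)
payload = csv_to_monthly_payload(csv_text)
print(json.dumps(payload, indent=2))
```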
FastAPI automatically generates interactive API documentation:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
The API allows cross-origin requests from all origins for development. For production, configure specific origins in src/api/config.py.
Modern React + TypeScript single-page application with interactive data visualizations.
- React 19.2.0 - UI framework
- TypeScript 5.7+ - Type safety
- Vite 7.2.4 - Build tool and dev server
- Recharts 3.4.1 - Data visualization library
- Tailwind CSS - Utility-first CSS framework
- Axios 1.13.2 - HTTP client
- Lucide React - Icon library
- shadcn/ui - UI component library
Main overview with 4 visualization cards:
- Temperature Trends: Line chart showing monthly max/min temperatures over 3 years
- Temperature Distribution: Pie chart showing distribution of temperature categories
- Precipitation Patterns: Bar chart of monthly precipitation
- Correlation Analysis: Scatter plot of temperature vs. precipitation
Detailed monthly temperature analysis:
- Area chart with max/min temperature ranges
- Monthly statistics table
- Temperature trend identification
Temperature category distribution:
- Bar chart showing count per category
- Pie chart showing percentage distribution
- Category definitions and insights
Temperature-precipitation correlation:
- Scatter plots by month
- Correlation coefficient visualization
- Seasonal pattern identification
Project information and methodology
The frontend uses Axios to communicate with the FastAPI backend:
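// src/services/api.ts
import axios from 'axios';

// Base URL of the FastAPI backend
const API_BASE_URL = 'http://localhost:8000';
const client = axios.create({ baseURL: API_BASE_URL });

export const weatheriaApi = {
  getMonthlyAverages: () => client.get('/monthly-avg'),
  getExtremeTemperatures: () => client.get('/extreme-temps'),
  getTemperaturePrecipitation: () => client.get('/temp-precipitation'),
  getStats: () => client.get('/stats'),
};
Data Flow: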
- User navigates to page
- React component calls API service method
- Axios fetches data from backend
- Data is typed with TypeScript interfaces
- Recharts renders interactive visualization
- User interacts with charts (hover, zoom, filter)
# Install dependencies
cd weatheria-frontend
npm install
# Start dev server with hot reload
npm run dev
# Build for production
npm run build
# Preview production build
npm run preview
The project was successfully deployed on AWS Elastic MapReduce for distributed data processing.
Cluster Configuration:
- Cluster ID: j-3FG55B8H77VI3
- EMR Release: 6.10.0 (Hadoop 3.3.3, Python 3.9)
- Instance Type: m5.xlarge (4 vCPU, 16 GB RAM)
- Instance Count: 3 nodes (1 master, 2 core)
- Region: us-east-1 (N. Virginia)
- S3 Bucket: weatheria-climate-data
Processing Results:
- Total Processing Time: ~2 minutes
- Job 1 - Monthly Avg: 40 seconds
- Job 2 - Extreme Temps: 17 seconds
- Job 3 - Correlation: 29 seconds
Cost Analysis:
- Instance Cost: $0.50/hour per m5.xlarge node
- Total Cluster Cost: $1.50/hour (3 nodes)
- Actual Runtime: ~0.5 hours
- Total Cost: ~$0.75 (less than $1 for complete processing)
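The cost figures above follow from a single multiplication. A small sketch of the arithmetic, using the rates quoted in this README (`estimate_cost` is a hypothetical helper, and actual AWS pricing may differ):

```python
def estimate_cost(hourly_rate_per_node, node_count, runtime_hours):
    """Return (cluster hourly rate, total cost) in USD."""
    cluster_rate = hourly_rate_per_node * node_count
    return cluster_rate, cluster_rate * runtime_hours


# Values quoted in the cost analysis: $0.50/hour per node, 3 nodes, ~0.5 hours
cluster_rate, total_cost = estimate_cost(0.50, 3, 0.5)
print(f"${cluster_rate:.2f}/hour, total ${total_cost:.2f}")
```

Because the cluster bills for as long as it runs, the same function shows how quickly an idle cluster adds up: eight forgotten hours at this rate would cost `estimate_cost(0.50, 3, 8)[1]` dollars.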
- AWS account with EMR permissions
- AWS CLI installed and configured
- Python 3.11+ with required packages
- S3 bucket created
python scripts/download_data.py
Downloads 1,095 daily records from the Open-Meteo API.
bash scripts/aws/setup_s3.sh
Creates the bucket structure and uploads data and scripts to S3.
bash scripts/aws/create_emr_cluster.sh
Provisions a 3-node EMR cluster. Note the cluster ID from the output.
# Submit all three jobs
bash scripts/aws/submit_emr_jobs_mrjob.sh monthly
bash scripts/aws/submit_emr_jobs_mrjob.sh extreme
bash scripts/aws/submit_emr_jobs_mrjob.sh correlation
# Monitor job progress
aws emr describe-step --cluster-id j-XXXXXXXXXXXXX --step-id s-XXXXXXXXXXXXX
bash scripts/aws/download_results.sh
Syncs results from S3 to the local output/ directory.
bash scripts/aws/terminate_emr_cluster.sh
Terminates the cluster to avoid ongoing charges.
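The `aws emr describe-step` monitoring command shown above prints a JSON document whose `Step.Status.State` field holds the step's current state. A minimal sketch of reading that field and deciding whether the step has finished; the helper names are hypothetical and the sample JSON is abbreviated to the fields used here.

```python
import json

# States after which an EMR step will not change again
TERMINAL_STATES = {"COMPLETED", "CANCELLED", "FAILED", "INTERRUPTED"}


def step_state(describe_step_json):
    """Extract the step state from `aws emr describe-step` JSON output."""
    return json.loads(describe_step_json)["Step"]["Status"]["State"]


def is_finished(state):
    """True once the step has reached a terminal state."""
    return state in TERMINAL_STATES


# Abbreviated sample output; a real response contains more fields
sample_output = '{"Step": {"Id": "s-XXXXXXXXXXXXX", "Status": {"State": "RUNNING"}}}'
state = step_state(sample_output)
print(state, is_finished(state))
```

A polling script could call the CLI in a loop and stop once `is_finished` returns `True`, downloading results only when the state is `COMPLETED`.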
DEPLOYMENT.md contains comprehensive deployment instructions, including:
- AWS CLI installation (Windows/Linux/Mac)
- AWS credentials configuration (AWS Academy accounts)
- MRJob configuration
- Manual AWS console setup
- Troubleshooting common issues
See DEPLOYMENT.md for the complete guide.
If using AWS Academy:
- Get temporary credentials from AWS Details
- Include aws_session_token in the credentials file
- Credentials expire after ~3 hours; refresh as needed
- Use default region: us-east-1
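With temporary credentials, the credentials file needs three keys rather than the usual two. The sketch below renders such a profile with the standard library; `render_credentials` is a hypothetical helper and the key values are placeholders, not real credentials.

```python
import configparser
import io


def render_credentials(access_key, secret_key, session_token, profile="default"):
    """Return the text of an AWS credentials profile including a session token."""
    # Interpolation is disabled so that '%' in a value is written verbatim
    config = configparser.ConfigParser(interpolation=None)
    config[profile] = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        "aws_session_token": session_token,
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Placeholder values; paste the real ones from AWS Details
text = render_credentials("ACCESS_KEY_PLACEHOLDER", "SECRET_PLACEHOLDER", "TOKEN_PLACEHOLDER")
print(text)
```

The printed text can be saved as `~/.aws/credentials`; when the session expires, regenerate it with the new values.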
- Records: 1,095 daily observations
- Period: January 1, 2022 - December 31, 2024
- Location: Medellin, Colombia (6.25°N, 75.56°W)
- Source: Open-Meteo Historical Weather API
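For orientation, a request for this dataset could be assembled as follows. This is a sketch under assumptions: the archive endpoint, the daily variable names, and the timezone are taken from Open-Meteo's public API conventions rather than from this project, and `scripts/download_data.py` may build its request differently.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Open-Meteo historical (archive) API
ARCHIVE_ENDPOINT = "https://archive-api.open-meteo.com/v1/archive"


def build_archive_url(latitude, longitude, start_date, end_date):
    """Build a request URL for daily max/min temperature and precipitation."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": start_date,
        "end_date": end_date,
        # Assumed daily variables matching the fields used by the jobs
        "daily": "temperature_2m_max,temperature_2m_min,precipitation_sum",
        "timezone": "America/Bogota",
    }
    # Keep commas and slashes readable instead of percent-encoding them
    return f"{ARCHIVE_ENDPOINT}?{urlencode(params, safe=',/')}"


# West longitude is negative: 75.56°W becomes -75.56
url = build_archive_url(6.25, -75.56, "2022-01-01", "2024-12-31")
print(url)
```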
Warmest Period:
- Month: May 2022
- Average Max Temperature: 29.15°C
- Insight: Peak of dry season
Coolest Period:
- Month: November 2022
- Average Min Temperature: 14.13°C
- Insight: Rainy season trough
Overall Temperature Range:
- Maximum Temperatures: 24.6°C - 29.15°C (monthly averages)
- Minimum Temperatures: 14.13°C - 16.75°C (monthly averages)
- Characteristic: Stable tropical climate with minimal seasonal variation
| Category | Days | Percentage | Description |
|---|---|---|---|
| Very Cool (< 20°C) | 6 | <1% | Rare cold fronts |
| Cool (20-27°C) | 380 | 35% | Pleasant weather |
| Normal (27-30°C) | 700 | 64% | Typical tropical |
| Very Hot (≥ 30°C) | 23 | 2% | Heat waves |
Key Insight: Medellin exhibits remarkable temperature stability, with 64% of days experiencing "normal" tropical temperatures (27-30°C). Extreme temperature events are rare.
Correlation Range: -0.64 to +0.14
Pattern Analysis:
- Negative Correlation Months: Temperature inversely related to precipitation (hotter = drier)
- Positive Correlation Months: Temperature directly related to precipitation (hotter = wetter)
- Overall Trend: Generally weak correlation, indicating complex climate dynamics
Example Months:
- January 2022: -0.31 (dry season pattern)
- February 2022: +0.14 (transition period)
- October 2023: -0.52 (strong inverse relationship)
- Stable Tropical Climate: Medellin exhibits minimal seasonal temperature variation compared to temperate regions.
- Elevation Effect: At 1,495m elevation, Medellin experiences cooler temperatures than typical equatorial locations.
- Bimodal Rainfall Pattern: Weak temperature-precipitation correlations suggest complex interactions between Pacific/Caribbean weather systems.
- Heat Resilience: Only 2% of days exceed 30°C, indicating natural climate moderation.
Solution:
# Check available subnets
aws ec2 describe-subnets --filters "Name=vpc-id,Values=vpc-xxxxxxxx"
# Use subnet from output in cluster creation
aws emr create-cluster --ec2-attributes SubnetId=subnet-xxxxxxxx ...
Solution:
# Verify AWS credentials
aws sts get-caller-identity
# Check S3 bucket permissions
aws s3api get-bucket-acl --bucket weatheria-climate-data
# Update bucket policy if needed
aws s3api put-bucket-policy --bucket weatheria-climate-data --policy file://policy.json
Solution:
# Add bootstrap action to install dependencies
aws emr create-cluster \
  --bootstrap-actions Path=s3://weatheria-climate-data/scripts/bootstrap.sh
Solution:
- Return to AWS Academy > Learner Lab
- Click "AWS Details" > "Show" credentials
- Update ~/.aws/credentials with the new token
- Retry the AWS command
Solution:
# Check if CSV files exist
ls output/monthly_avg_fixed.csv output/extreme_temps_fixed.csv output/temp_precip_fixed.csv
# Verify CSV encoding (should be UTF-8)
file output/monthly_avg_fixed.csv
# Re-download results if corrupted
bash scripts/aws/download_results.sh
Solution:
Edit src/api/config.py:
allow_origins = ["http://localhost:5173"]  # Specify frontend URL
Solutions:
- Verify backend is running:
curl http://localhost:8000/health
- Check API base URL in src/services/api.ts:
const API_BASE_URL = 'http://localhost:8000';
- Check browser console for CORS errors - see Backend API Issues above
Solution:
# Clear npm cache
npm cache clean --force
# Delete node_modules and lock file
rm -rf node_modules package-lock.json
# Reinstall
npm install
Solution:
- Open browser console (F12) - check for errors
- Verify data format matches TypeScript interfaces
- Ensure Recharts is installed:
npm list recharts
Solution:
# In scripts/download_data.py, increase the timeout and add retry logic
import requests
from time import sleep

for attempt in range(3):
    try:
        # Timeout increased from 30 to 60 seconds
        response = requests.get(url, timeout=60)
        break
    except requests.Timeout:
        if attempt < 2:
            sleep(10)  # Wait before the next attempt
        else:
            raise  # Give up after the third timeout
- Check logs: EMR cluster logs in S3 (s3://weatheria-climate-data/logs/)
- Monitor costs: Use AWS Cost Explorer to track spending
- Terminate clusters: Always terminate EMR clusters after use
- Use Git: Commit changes frequently during development
- Test locally: Run MapReduce jobs with small data samples before EMR deployment
- Open-Meteo Historical Weather API: https://open-meteo.com/
- Documentation: https://open-meteo.com/en/docs/historical-weather-api
- Apache Hadoop: https://hadoop.apache.org/
- MRJob Documentation: https://mrjob.readthedocs.io/
- AWS EMR Guide: https://docs.aws.amazon.com/emr/
- FastAPI Documentation: https://fastapi.tiangolo.com/
- React Documentation: https://react.dev/
- Recharts Documentation: https://recharts.org/
- White, T. (2015). Hadoop: The Definitive Guide (4th ed.). O'Reilly Media.
- Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 107-113.
