🐍 DataPulse: Mexico Air Quality Mini-ETL

DataPulse is a hands-on data engineering project built to practice real ETL workflows using open environmental datasets from Mexico. It extracts hourly air quality measurements from the official SEDEMA dataset, transforms and aggregates them into weekly summaries, and generates automatic reports with plots and key insights.

🌎 Project Overview

This project demonstrates how to design a lightweight, reproducible ETL pipeline in Python — focused on data collection, cleaning, transformation, and visualization. It uses real data published by SEDEMA (Mexico City Air Quality Monitoring System – SIMAT).

Each execution:

Loads the official CSV dataset (hourly pollutants data).
Cleans metadata headers and inconsistent columns.
Aggregates all records by ISO week and pollutant.
Generates a weekly Markdown report with charts.

💡 The goal: automate real-world data processing and reporting using open data from Mexico.

⚙️ Tech Stack

Category	Tools
Language	Python 3.12
Libraries	`pandas`, `matplotlib`, `requests`, `certifi`
Data Source	SEDEMA Air Quality – Hourly Pollutants (Open Data)
Output	CSV summaries + Markdown reports with plots

📂 Repository Structure

datapulse/
 ├── src/
 │   ├── main.py          # Orchestrates the ETL & report generation
 │   ├── etl.py           # Extraction, cleaning & weekly aggregation
 │   └── visualize.py     # Plotting utilities (matplotlib)
 ├── data/
 │   ├── raw/             # Raw downloaded datasets
 │   └── processed/       # Cleaned and aggregated CSV outputs
 ├── reports/             # Weekly Markdown reports + charts
 ├── .gitignore
 ├── README.md
 └── requirements.txt

🚀 How to Run

Clone the repo:

git clone https://github.com/yourusername/mini_etl_con_datos_mexico.git
cd mini_etl_con_datos_mexico

Set up a virtual environment:

python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

Install dependencies:
```
pip install -r requirements.txt
```
Run the ETL:
```
python src/main.py
```
The script will process the latest dataset (or your local copy) and create:
- A cleaned weekly summary under data/processed/
- A Markdown report with plots under reports/

📊 Example Output

year_week,contaminante,valor_promedio,valor_min,valor_max,mediciones
2024-W10,PM10,46.2,3.0,180.0,2200
2024-W10,PM2.5,22.5,1.0,74.0,2300
2024-W10,O3,27.1,0.0,117.0,4500

Each row represents one pollutant’s average concentration for an ISO week. Markdown reports include weekly summaries and matplotlib visualizations.

💡 Key Learnings

Building modular ETL pipelines (extract → transform → load → visualize)
Handling messy real-world CSVs (metadata headers, encoding, separators)
Time-based aggregation with Pandas (daily/weekly summaries)
Automating simple analytical reports

👩‍💻 Author

Estefanía Marmolejo Sandoval – AI Engineer & Data Enthusiast Passionate about applied AI, data engineering, and open-source analytics. 💼 LinkedIn • 🐙 GitHub

📜 License

This project is open-source under the MIT License.

⚠️ Data courtesy of SEDEMA (Mexico City’s Air Quality Monitoring System – SIMAT). Processed for educational purposes only.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

🐍 DataPulse: Mexico Air Quality Mini-ETL

🌎 Project Overview

⚙️ Tech Stack

📂 Repository Structure

🚀 How to Run

📊 Example Output

💡 Key Learnings

👩‍💻 Author

📜 License

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
src		src
README.md		README.md
requirements.txt		requirements.txt

estefaniams-han/mini_etl_con_datos_mexico

Folders and files

Latest commit

History

Repository files navigation

🐍 DataPulse: Mexico Air Quality Mini-ETL

🌎 Project Overview

⚙️ Tech Stack

📂 Repository Structure

🚀 How to Run

📊 Example Output

💡 Key Learnings

👩‍💻 Author

📜 License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages