mtholahan/springboard-projects

📊 Springboard Data Engineering Portfolio

Welcome! I'm a data engineer trained via the Springboard Data Engineering Bootcamp with hands-on projects across Azure, SQL, Python, Airflow, and more.

I build robust, scalable data pipelines and solutions that bring order to complex data environments and are built for production use and performance.


🚀 Project Timeline

The table below is auto-generated from my SQL Server progress tracker (tblMiniProjectProgress) via a custom Python workflow.

| Project | Description | Repository Link | Last Update |
| --- | --- | --- | --- |
| Guided Capstone Project | This guided capstone builds an end-to-end data engineering pipeline for high-frequency equity market data. It designs a relational schema for trade and quote records, ingests daily CSV and JSON files into Spark, and performs batch ETL operations with deduplication and partitioning. The pipeline computes analytical metrics, such as trade indicators, 30-minute moving averages, and bid/ask price movements, and stores results in cloud-based data layers for market trend analysis. | GitHub Repo | 11/11/2025 |
| Unguided Capstone Project | This unguided capstone investigates how the diversity of movie soundtrack genres correlates with audience reception and popularity. Data from The Movie Database (TMDb) and Discogs APIs is integrated to create a unified dataset linking films to their soundtracks. The project uses Python, SQL, and Spark-based ETL pipelines to extract, transform, and analyze relationships between genre variety, release era, and popularity metrics. | GitHub Repo | 11/08/2025 |
| Kafka Mini Project | Built a streaming fraud detection system with Apache Kafka and Python. Deployed a Kafka cluster via Docker Compose, implemented a transaction generator and fraud detector using kafka-python, and routed suspicious transactions to separate topics for real-time monitoring. Demonstrates event streaming, producers, consumers, and containerization. | GitHub Repo | 09/11/2025 |
| Apache Airflow Log Analyzer Mini Project | Built Apache Airflow DAGs to automate Yahoo Finance stock data ingestion, storage, and querying, then extended with a Python log analyzer to monitor execution errors. Demonstrates orchestration, scheduling, operator use, and pipeline monitoring. | GitHub Repo | 08/31/2025 |
| Apache Spark Optimization Mini Project | Optimized PySpark jobs by analyzing query execution plans and rewriting transformations for efficiency. Applied techniques such as reducing shuffles, tuning partitions, selecting efficient operators, and choosing optimal data formats. Demonstrates performance tuning for large-scale Spark ETL workloads using Python and PySpark. | GitHub Repo | 08/08/2025 |
| Apache Spark Post Sales Redesign Mini Project | Redesigned a Hadoop MapReduce post-sales reporting system using Spark. Processed automobile incident data to add make/year attributes and aggregate accidents by vehicle. Implemented RDD transformations, groupByKey, and reduceByKey to generate reports efficiently, highlighting Spark’s performance advantage over MapReduce. | GitHub Repo | 08/05/2025 |
| Azure Synapse Analytics Mini Project | Built a data pipeline in Azure Synapse Analytics to load product data from Azure Data Lake into a dedicated SQL pool. Implemented a data flow with inserts and upserts, handling schema drift and Type 1 SCD updates, and orchestrated ingestion using Synapse Studio pipelines. | GitHub Repo | 07/18/2025 |
| Azure Databricks Mini Project | Implemented a PySpark mini-project in Azure Databricks to ingest, query, and transform datasets. Built solutions using PySpark DataFrame syntax rather than SparkSQL, demonstrating data ingestion, transformations, and query patterns within notebooks submitted as part of the Springboard boot camp. | GitHub Repo | 07/16/2025 |
| MySQL Python Data Pipeline Mini Project | Developed a Python and SQL data pipeline for an event ticketing system. Designed a MySQL table schema, ingested CSV sales data via Python connectors, and implemented queries to analyze ticket popularity and sales trends, showcasing ETL and database integration skills. | GitHub Repo | 07/14/2025 |
| PostgreSQL Tuning Mini Project | Optimized PostgreSQL queries on a computer science publications dataset. Created tables, ingested CSVs, and wrote queries to analyze conferences, authors, and publication trends. Improved performance by designing indexes, refining join/filter logic, and evaluating execution plans with EXPLAIN, demonstrating query tuning and indexing strategies. | GitHub Repo | 03/21/2025 |
| Advanced MySQL Query Tuning Mini Project | Analyzed EuroCup 2016 data with advanced SQL queries. Imported CSV datasets into MySQL, designed schema with match, player, and referee details, and implemented queries covering match outcomes, penalty shootouts, player stats, bookings, substitutions, and referee activity to explore tournament dynamics. | GitHub Repo | 03/08/2025 |
| Python OOP Mini Project | Implemented a simplified banking system in Python using OOP principles. Modeled customers, accounts, employees, and services such as loans and credit cards. Applied PEP-8 style, logging, and exception handling, with UML-based design and a command-line interface for deposits, withdrawals, and account management. | GitHub Repo | 02/13/2025 |
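
The timeline table above is described as generated from a progress tracker by a Python workflow. The actual script and the schema of tblMiniProjectProgress are not shown in this README, so the following is only a minimal sketch of how such rows might be rendered. The function name `render_table`, the `COLUMNS` list, and the sample rows are assumptions for illustration; a real workflow would presumably fetch the rows from SQL Server first.

```python
from datetime import datetime

# Hypothetical column names, taken from the table header in this README;
# the real tblMiniProjectProgress schema may differ.
COLUMNS = ["Project", "Description", "Repository Link", "Last Update"]


def render_table(rows):
    """Render rows (dicts keyed by COLUMNS) as a pipe table, newest first."""

    def last_update(row):
        # Dates in the table use MM/DD/YYYY.
        return datetime.strptime(row["Last Update"], "%m/%d/%Y")

    ordered = sorted(rows, key=last_update, reverse=True)
    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "| " + " | ".join("---" for _ in COLUMNS) + " |",
    ]
    for row in ordered:
        # Escape pipes so free-text descriptions cannot break the layout.
        cells = [str(row[col]).replace("|", "\\|") for col in COLUMNS]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Sample rows with shortened descriptions, standing in for a database query.
sample_rows = [
    {
        "Project": "Python OOP Mini Project",
        "Description": "Simplified banking system in Python using OOP principles.",
        "Repository Link": "GitHub Repo",
        "Last Update": "02/13/2025",
    },
    {
        "Project": "Kafka Mini Project",
        "Description": "Streaming fraud detection with Apache Kafka and Python.",
        "Repository Link": "GitHub Repo",
        "Last Update": "09/11/2025",
    },
]

print(render_table(sample_rows))
```

Sorting on the parsed date rather than the raw string keeps the newest project on top even when years differ, which a plain string sort of MM/DD/YYYY values would not guarantee.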

🏷️ Tags

#SQL #Azure #Airflow #Spark #Kafka #DataPipeline #ETL #DataEngineering #Monitoring #Streaming #Automation


📚 Bootcamp Summary

  • 📅 35+ weeks of guided, project-based curriculum
  • ✏️ 10 mini-projects + 1 guided and 1 unguided capstone
  • 🌐 Focus: cloud computing, big data, orchestration, performance optimization
  • ✅ Verified by mentor checkpoints and progress metrics

🛠️ Skills & Tools

🧰 Core Stack

Spark Python PySpark Azure SQL Databricks

🛠️ Supporting Tools

Airflow Docker Kafka MySQL Jupyter PostgreSQL

Additional Tags

Accident-Reporting Aggregation Api-Integration Automobile Banking Batch-Processing
Big-Data Bootcamp Cli Cloud Consumers Csv
Dag Data-Analysis Data-Engineering Data-Ingestion Data-Lake Data-Modeling
Data-Orchestration Data-Pipeline Data-Visualization Database Design-Patterns Discogs
Docker-Compose Etl Etl-Pipeline Eurocup Event-Driven Exception-Handling
Finance Financial-Data Football Fraud-Detection Indexing Json
Logging Mapreduce Monitoring Movie-Analytics Oop Optimization
Parquet Partitioning Performance Performance-Tuning Producers Publications
Queries Query-Optimization Rdd Research-Papers Soccer Soundtrack-Analysis
Sports Springboard Stock-Market Streaming Synapse Tmdb
UML

Tools used in real projects: data pipelines, cloud orchestration, SQL optimization, and dashboarding.


📬 Let’s Connect

📧 Reach me on LinkedIn
🧠 Ask me about boot camp time tracking, SQL optimization, or orchestration frameworks!

Generated automatically via Python on 11-11-2025 18:23:50

About

A meta-repo for my Springboard data engineering boot camp projects.
