I'm a Cloud Data Engineer building scalable, reliable, and cost-efficient cloud data platforms.
I specialize in turning raw, messy, multi-source data into trusted analytics layers and ML-ready pipelines
through a mix of modern ELT, streaming systems, and strong distributed systems fundamentals.
MSCS @ Northeastern University (2022–2024)
Focus: Cloud-Native Data Engineering
Connect: GitHub: wyang10 • LinkedIn: linkedin.com/in/awhy
- Focused on building cloud-native, event-driven data systems on AWS and GCP.
- Experienced in delivering data platforms and analytics pipelines with built-in data quality and schema governance.
- Strong in reliability engineering (idempotency, DLQ/replay, observability), IaC (Terraform), Kubernetes, and CI/CD.
Data Engineer – LumiereX (Jan 2025 – Present)
- Built event-driven, serverless ELT ingestion on AWS (S3, API Gateway, Lambda, SQS, Glue, Step Functions).
- Improved data quality layers and optimized Spark jobs for cost and performance.
- Implemented reliability-engineering patterns (idempotency, DLQ/replay, observability); see the idempotent ingestion sketch below.
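A minimal sketch of the idempotent ingestion pattern described above, assuming an SQS-triggered Lambda and a hypothetical `ingest-dedup` DynamoDB table with a TTL attribute; names, bucket, and table layout are illustrative, not the production code.

```python
import json
import os
import time

import boto3

# Hypothetical resources; names are illustrative, not the production setup.
dynamodb = boto3.client("dynamodb")
DEDUP_TABLE = os.environ.get("DEDUP_TABLE", "ingest-dedup")
DEDUP_TTL_SECONDS = 7 * 24 * 3600  # keep dedup keys for one week


def already_processed(object_key: str) -> bool:
    """Record the S3 object key with a conditional write; a failed condition
    means the key was seen before and the event can be skipped."""
    try:
        dynamodb.put_item(
            TableName=DEDUP_TABLE,
            Item={
                "pk": {"S": object_key},
                "ttl": {"N": str(int(time.time()) + DEDUP_TTL_SECONDS)},
            },
            ConditionExpression="attribute_not_exists(pk)",
        )
        return False
    except dynamodb.exceptions.ConditionalCheckFailedException:
        return True


def process_object(key: str) -> None:
    """Placeholder for the real transform/load step."""
    print(f"processing object: {key}")


def handler(event, context):
    """SQS-triggered Lambda; each SQS record wraps an S3 event notification.
    Raising on failure lets SQS retry and eventually route to the DLQ."""
    for record in event["Records"]:
        s3_event = json.loads(record["body"])
        for s3_record in s3_event.get("Records", []):
            key = s3_record["s3"]["object"]["key"]
            if already_processed(key):
                continue  # duplicate delivery; safe to skip
            process_object(key)
```

The conditional write makes retries and duplicate SQS deliveries safe: only the first writer for a given object key proceeds to the downstream ELT step.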
Software Engineer Intern – VisionX (Jan 2024 – Jul 2024)
- Contributed to a Kafka → Flink streaming pipeline enabling real-time ML scoring on IoT sensor data.
- Focused on schema governance, ingestion reliability, and validation checks (a minimal validation sketch follows this list).
- Containerized Flink jobs with Docker and deployed them to Kubernetes.
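A minimal sketch of the kind of validation gate used at ingestion, assuming JSON-encoded sensor events and the `jsonschema` library; the schema fields and dead-letter handling are illustrative, not the governed production schemas.

```python
import json

from jsonschema import Draft7Validator

# Illustrative schema for one IoT sensor reading; real schemas were governed centrally.
SENSOR_SCHEMA = {
    "type": "object",
    "required": ["device_id", "ts", "temperature"],
    "properties": {
        "device_id": {"type": "string"},
        "ts": {"type": "integer"},          # epoch milliseconds
        "temperature": {"type": "number"},
    },
}
VALIDATOR = Draft7Validator(SENSOR_SCHEMA)


def validate_event(raw: bytes):
    """Parse and validate one Kafka message; invalid events would be routed
    to a dead-letter topic instead of the scoring pipeline."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return False, None
    errors = list(VALIDATOR.iter_errors(event))
    return len(errors) == 0, event


if __name__ == "__main__":
    ok, event = validate_event(b'{"device_id": "dev-1", "ts": 1718000000000, "temperature": 21.5}')
    print(ok, event)
```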
- Orchestration: EventBridge → Step Functions → Glue job, with an optional Great Expectations gate.
- Catalog / Query: Glue Data Catalog + crawler + Athena tables over the silver-layer Parquet data.
- Replay / Recovery: replay and dlq-redrive scripts for backfill and poison-message recovery (a redrive sketch follows this list).
- Idempotency: DynamoDB with TTL for object-level dedup, plus an optional GSI for auditing.
- CI/CD: GitHub Actions pipelines (Lambda build+deploy, Terraform plan+apply).
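A minimal sketch of what the dlq-redrive script can look like, assuming hypothetical `ingest-dlq` and `ingest-queue` queue names; the real script would also cap redrive attempts and log what it moves.

```python
import boto3

# Hypothetical queue names; real URLs would come from Terraform outputs or SSM.
sqs = boto3.client("sqs")
DLQ_URL = sqs.get_queue_url(QueueName="ingest-dlq")["QueueUrl"]
MAIN_URL = sqs.get_queue_url(QueueName="ingest-queue")["QueueUrl"]


def redrive(batch_size: int = 10) -> int:
    """Move messages from the DLQ back to the main queue for reprocessing."""
    moved = 0
    while True:
        resp = sqs.receive_message(
            QueueUrl=DLQ_URL,
            MaxNumberOfMessages=batch_size,
            WaitTimeSeconds=1,
        )
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            sqs.send_message(QueueUrl=MAIN_URL, MessageBody=msg["Body"])
            # Delete from the DLQ only after the send succeeds.
            sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
            moved += 1
    return moved


if __name__ == "__main__":
    print(f"redrove {redrive()} messages")
```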
End-to-End, Reproducible ML Pipeline
- Engineered a modular, production-style ML system for predicting in-hospital mortality.
- Goes from raw CSV → cleaned features → baseline models → a reproducible CLI pipeline, with optional SMOTE to address severe class imbalance (see the sketch below).
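A minimal sketch of the core training step, assuming scikit-learn and imbalanced-learn with an illustrative `mortality.csv` and a binary `died` label; the real pipeline wraps this in feature cleaning, configuration, and a CLI.

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative input; assumes numeric features after the cleaning step.
df = pd.read_csv("mortality.csv")
X, y = df.drop(columns=["died"]), df["died"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample only the training split so the test set stays representative.
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("ROC-AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```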
- A production-ready ELT & data quality framework using Airflow + dbt + Snowflake + Great Expectations + CI/CD.
- Automates data ingestion, transformation, testing, and lineage in a reproducible orchestration system (a minimal DAG sketch follows below).
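A minimal sketch of the orchestration shape, assuming Airflow 2.4+ with the dbt and Great Expectations CLIs available on the workers; the DAG id, checkpoint name, and ingestion script are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative DAG: ingest -> dbt run -> dbt test -> Great Expectations gate.
with DAG(
    dag_id="elt_quality_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_raw",
        bash_command="python ingest.py",  # hypothetical ingestion script
    )
    dbt_run = BashOperator(task_id="dbt_run", bash_command="dbt run")
    dbt_test = BashOperator(task_id="dbt_test", bash_command="dbt test")
    ge_check = BashOperator(
        task_id="great_expectations_gate",
        bash_command="great_expectations checkpoint run silver_suite",
    )

    ingest >> dbt_run >> dbt_test >> ge_check
```

Running dbt tests and the Great Expectations checkpoint as downstream tasks keeps bad data from being promoted while still surfacing failures in one place.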
- I design modular, observable pipelines that are easy to test, debug, and scale.
- I prioritize trade-offs that maximize team velocity, reliability, and cloud spend efficiency.
- I enjoy collaborating on data modeling, pipeline quality, and distributed system design.
Languages & Tools
Python (Pandas, PySpark) • SQL • Java • Bash
Cloud & Orchestration
GCP (BigQuery, Dataflow) • AWS (S3, EMR, Glue, Lambda, SQS, Step Functions, IAM)
GitHub Actions • Airflow • dbt • Docker • Kubernetes • Terraform
Big Data & Storage
Spark • Kafka • Flink • Databricks • Delta Lake
Snowflake • Parquet • SCD Type 2 • dimensional modeling
Data Quality & CI/CD
Great Expectations • dbt tests • automated lineage • monitoring

