Bulk Stash is a Docker rclone service that syncs or copies files between different storage services. For example, you can copy files between remote storage services, such as from Amazon S3 to Google Cloud Storage, or from your laptop to remote storage.
A fully incremental model that transforms raw web event data generated by the Snowplow JavaScript tracker into a series of derived tables at varying levels of aggregation.
Built Apache Airflow DAGs to automate Yahoo Finance stock data ingestion, storage, and querying, then extended with a Python log analyzer to monitor execution errors. Demonstrates orchestration, scheduling, operator use, and pipeline monitoring.
Near real-time data replication pipeline from MySQL to PostgreSQL for analytics, reporting, and downstream systems. Supports inserts, updates, deletes, and initial snapshot replication.
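As a rough illustration of what replicating inserts, updates, deletes, and an initial snapshot involves, here is a minimal sketch in Python. It is not taken from the repository: the event shape (`op`, `key`, `row`) and the function names `apply_change` and `replay` are hypothetical, and an in-memory dict stands in for the PostgreSQL target table.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_change(table: Dict[int, Row], event: Dict[str, Any]) -> None:
    """Apply one change event to an in-memory copy of the target table.

    The event shape is an assumption made for this sketch:
    {"op": "snapshot" | "insert" | "update" | "delete", "key": int, "row": {...}}
    """
    op = event["op"]
    key = event["key"]
    if op in ("snapshot", "insert", "update"):
        # Treat all three as an upsert so that re-delivered events are harmless.
        table[key] = dict(event["row"])
    elif op == "delete":
        # Deleting a key that is already gone is a no-op.
        table.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op!r}")


def replay(events: Iterable[Dict[str, Any]]) -> Dict[int, Row]:
    """Build the target table state by applying events in order."""
    table: Dict[int, Row] = {}
    for event in events:
        apply_change(table, event)
    return table


if __name__ == "__main__":
    events = [
        {"op": "snapshot", "key": 1, "row": {"name": "alice"}},
        {"op": "insert", "key": 2, "row": {"name": "bob"}},
        {"op": "update", "key": 1, "row": {"name": "alice b."}},
        {"op": "delete", "key": 2},
    ]
    print(replay(events))
```

Handling snapshot rows, inserts, and updates as the same upsert keeps the apply step idempotent, so an event that is delivered twice leaves the target unchanged.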
Uses dbt as an orchestration tool to process a static file and join two data sources. This repository can serve as a template for building a dbt pipeline with testing. See the two simple steps below for using the dbt pipeline to generate tables in BigQuery (GCP).
End-to-end data engineering pipeline for analyzing heart attack prediction in Indonesia. Automates data ingestion, transformation, and visualization using GCP (Terraform, BigQuery, Cloud Storage), Apache Airflow, dbt, and Python scripts. Provides actionable insights via Power BI dashboards.
GCP-based Regulatory Reporting Lakehouse, Tier-1 Swiss Bank (Simulated Case Study): a documentation-only repo illustrating a cloud-native data lakehouse architecture for regulatory reporting on Google Cloud Platform (GCS + BigQuery + Dataflow + Composer). Includes ADRs, runbooks, and compliance data contracts.
A modular, contract-driven HPC pipeline for converting SRA archives to FASTQ using a pinned SRA Toolkit, featuring strict preflight validation, an explicit SLURM execution ABI, parallel array processing, and restart-safe per-accession isolation.
Clinical cohort construction and metric derivation using DuckDB SQL. Explores healthcare encounter data, profiles keys and dates, identifies duplication patterns, and prepares a reproducible pipeline for building an analysis-ready cohort dataset.
A modular, contract-driven HPC pipeline for deterministic alignment of paired-end FASTQ data using BWA and samtools, producing sorted and indexed BAM files with strict SLURM execution, explicit ABI contracts, and comprehensive preflight validation.
A contract-driven HPC pipeline for FASTQ trimming using BBDuk or Trimmomatic, featuring strict preflight validation, explicit SLURM ABI propagation, deterministic tool installation, and restart-safe per-sample execution.