# data-pipeline

Here are 27 public repositories matching this topic...

openbridge / ob_bulkstash

Bulk Stash is a Docker rclone service that syncs or copies files between different storage services. For example, you can copy files between remote storage services such as Amazon S3 and Google Cloud Storage, or from your laptop to remote storage.

docker sync docker-image s3 google-cloud-storage google-cloud rclone docker-service amazon-web-services data-pipeline oracle-cloud storage-service sftp-synchronisation docker-rclone

Updated Sep 21, 2020
Shell

snowplow / dbt-snowplow-web

A fully incremental model that transforms raw web event data generated by the Snowplow JavaScript tracker into a series of derived tables at varying levels of aggregation.

analytics data-model dbt data-pipeline snowplow-analytics

Updated Apr 7, 2026
Shell

confluentinc / learn-kafka-courses

Learn the basics of Apache Kafka® from leaders in the Kafka community with these video courses covering the Kafka ecosystem and hands-on exercises.

kafka stream-processing apache-kafka kafka-streams data-pipelines data-pipeline ksqldb

Updated Aug 29, 2025
Shell

hellomaxime / data-platform-on-kubernetes

Open Source Data Platform on Kubernetes

kubernetes open-source platform data spark etl bigdata superset ml druid dbt data-pipeline

Updated Apr 22, 2024
Shell

OtmaneDaoudi / finnhub-data-streaming-pipline

Finnhub data streaming pipeline for real-time Bitcoin trades analysis.

cloud data-engineering spark-streaming data-pipeline etl-pipeline

Updated Aug 11, 2024
Shell

mtholahan / apache-airflow-mini-project

Built Apache Airflow DAGs to automate Yahoo Finance stock data ingestion, storage, and querying, then extended with a Python log analyzer to monitor execution errors. Demonstrates orchestration, scheduling, operator use, and pipeline monitoring.

python airflow monitoring etl logging data-engineering bootcamp springboard dag data-pipeline

Updated Nov 11, 2025
Shell

tcd93 / invoice-data-pipeline

A sample data pipeline for transforming invoice images and CSV files into beautiful numbers

python kubernetes airflow data-pipeline trino

Updated Feb 6, 2025
Shell

yo-resistor / immune-gene-viewer

🧬 A Streamlit-based web app for immune gene variant annotation with AWS integration (EC2, S3, DynamoDB, DVC)

python bash aws bioinformatics ec2 genomics dynamodb s3 healthcare cloud-computing bash-script reproducibility data-pipeline annotation-tool dvc streamlit

Updated Jun 13, 2025
Shell

mubtasimfuad / mysql-cdc-to-s3

Real-time MySQL Change Data Capture (CDC) to AWS S3 using Debezium + Kafka + Docker

mysql docker aws real-time s3 cdc data-pipeline debezium

Updated Jul 8, 2025
Shell

samnjenga / real-time-mysql-to-postgresql-cdc-pipeline

Near real-time data replication pipeline from MySQL to PostgreSQL for analytics, reporting, and downstream systems. Supports inserts, updates, deletes, and initial snapshot replication.

mysql streaming real-time kafka etl postgresql data-engineering event-driven data-integration low-latency kafka-connect elt cdc data-pipeline change-data-capture debezium

Updated Apr 21, 2026
Shell

aandriano931 / self-hosted-n8n-scraper-stack

Secure, self-hosted ETL automation pipeline. Built with n8n, PostgreSQL, and Docker Compose under Zero-Trust principles.

automation etl docker-compose postgresql secops iac infrastructure-as-code data-pipeline zero-trust n8n

Updated Feb 22, 2026
Shell

aaronginder / gdp-growth-project

Uses dbt as an orchestration tool to process a static file and join two data sources. This repository can serve as a template for creating a dbt pipeline with testing. See the two simple steps below for using the dbt pipeline to generate tables in BigQuery (GCP).

testing bigquery automation gcp data-engineering dbt data-pipeline

Updated Mar 12, 2026
Shell

JoshPola96 / heart-attack-data-pipeline

End-to-end data engineering pipeline for analyzing heart attack prediction in Indonesia. Automates data ingestion, transformation, and visualization using GCP (Terraform, BigQuery, Cloud Storage), Apache Airflow, dbt, and Python scripts. Provides actionable insights via Power BI dashboards.

python airflow terraform gcp data-engineering healthcare dbt powerbi data-pipeline

Updated Apr 20, 2025
Shell

sahilgundu / tier1-swiss-bank-regulatory-reporting-lakehouse-gcp

GCP-based Regulatory Reporting Lakehouse for a Tier-1 Swiss Bank (Simulated Case Study): a documentation-only repo illustrating a cloud-native data lakehouse architecture for regulatory reporting on Google Cloud Platform (GCS + BigQuery + Dataflow + Composer). Includes ADRs, runbooks, and compliance data contracts.

bigquery composer gcp pubsub data-engineering dataflow adr runbook data-pipeline lakehouse regulatory-reporting bfsi

Updated Nov 19, 2025
Shell

romanbaptista / sra-convert

A modular, contract-driven HPC pipeline for converting SRA archives to FASTQ using a pinned SRA Toolkit, featuring strict preflight validation, an explicit SLURM execution ABI, parallel array processing, and restart-safe per-accession isolation.

bash workflow bioinformatics genomics hpc reproducible-research convert ngs slurm fastq data-pipeline sra sra-toolkit sequencing-data fasterq-dump

Updated Jun 2, 2026
Shell

kchemorion / nifi

Apache NiFi configuration and workflow templates

etl apache-nifi data-pipeline

Updated Jun 26, 2024
Shell

SeanPCompton / clinical-cohort-analysis-sql

Clinical cohort construction and metric derivation using DuckDB SQL. Explores healthcare encounter data, profiles keys and dates, identifies duplication patterns, and prepares a reproducible pipeline for building an analysis-ready cohort dataset.

csv sql deduplication data-pipeline normalization duckdb

Updated Feb 3, 2026
Shell

ali-ezz / enterprise-airline-attendance-pipeline

Enterprise airline attendance big data pipeline with Hadoop MapReduce, Docker, synthetic dataset generation, and validation.

python java docker big-data hadoop analytics mapreduce data-pipeline

Updated May 6, 2026
Shell

romanbaptista / fastq-align

A modular, contract-driven HPC pipeline for deterministic alignment of paired-end FASTQ data using BWA and samtools, producing sorted and indexed BAM files with strict SLURM execution, explicit ABI contracts, and comprehensive preflight validation.

bash bioinformatics genomics hpc reproducible-research ngs sequencing slurm alignment bam wgs fastq samtools bwa data-pipeline

Updated May 22, 2026
Shell

romanbaptista / fastq-trim

A contract-driven HPC pipeline for FASTQ trimming using BBDUK or Trimmomatic, featuring strict preflight validation, explicit SLURM ABI propagation, deterministic tool installation, and restart-safe per-sample execution.

bash workflow bioinformatics genomics hpc reproducible-research sequencing slurm trimming fastq data-pipeline adapter-trimming trimmomatic hpc-workflow bbduk

Updated Jun 2, 2026
Shell
