Awesome ML Data Pipelines

A curated list of open-source frameworks, engines, and platforms for building production ML data pipelines — orchestration, processing, versioning, feature storage, and everything in between.

Maintained by Backblaze.

Related Lists

Workflow Orchestration

General-purpose orchestrators for scheduling and running data and ML jobs.

Apache Airflow – The de facto open-source workflow orchestrator. Python-defined DAGs, thousands of operators, huge ecosystem. Docs | SDK: Python (pip install apache-airflow)
Prefect – Python-first orchestration with dynamic flows, hybrid cloud execution, and strong observability. Docs | SDK: Python (pip install prefect)
Argo Workflows – Kubernetes-native workflow engine using container-per-step DAGs. Popular as a Kubeflow Pipelines backend. Docs
Dagster – Asset-oriented orchestrator with strong typing, data-asset lineage, and dev/prod parity. Docs | SDK: Python (pip install dagster)
Kedro – Opinionated Python framework for modular, reproducible data-science code. Pluggable runners for Airflow, Dagster, Databricks. Docs | SDK: Python (pip install kedro)
Flyte – Kubernetes-native workflow engine focused on typed, reproducible ML pipelines. Graduated LF AI & Data project. Docs | SDK: Python (pip install flytekit)
Kestra – Declarative, event-driven workflow orchestrator. Workflows defined in YAML, 1300+ plugins, supports Python/Bash/Go/Node.js tasks. Built for data, AI, and infrastructure pipelines at scale. Docs
Maestro – Netflix's horizontally scalable workflow-as-a-service orchestrator for data and ML pipelines. Supports acyclic and cyclic workflows, foreach loops, subworkflows, and millions of daily job executions.
Mage AI – Notebook-style data pipeline builder supporting Python, SQL, and R. Handles batch, streaming, and dbt transformations with built-in scheduling and observability. Docs | SDK: Python (pip install mage-ai)
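
Most of the orchestrators above model a pipeline as a directed acyclic graph (DAG) of tasks and run each task only after its upstream dependencies have finished. As a rough illustration of that idea only, and not the API of any tool listed here, the sketch below orders a hypothetical three-step pipeline with Python's standard-library graphlib. The task names, the PIPELINE mapping, and the run_pipeline helper are all assumptions made up for this example.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a task, each value is the set of
# upstream tasks that must finish before it can run.
PIPELINE = {
    "extract": set(),
    "transform": {"extract"},
    "train": {"transform"},
}


def run_pipeline(dag, tasks):
    """Call every task in dependency order and return the execution order."""
    order = []
    for name in TopologicalSorter(dag).static_order():
        tasks[name]()  # a real orchestrator would schedule, retry, and log here
        order.append(name)
    return order


if __name__ == "__main__":
    # Placeholder callables stand in for real extract/transform/train steps.
    noop_tasks = {name: (lambda: None) for name in PIPELINE}
    print(run_pipeline(PIPELINE, noop_tasks))
```

Real orchestrators add scheduling, retries, distributed execution, and observability on top of this ordering step, which is where the tools above differ most.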

ML Pipeline Frameworks

Frameworks purpose-built for reproducible ML training and experiment pipelines.

MLflow – Open-source platform for experiment tracking, model registry, packaging, and deployment. Supports arbitrary artifact stores. Docs | SDK: Python (pip install mlflow), Java, R
Metaflow – Human-friendly ML framework created at Netflix. Native AWS, Kubernetes, and Argo backends. Docs | SDK: Python (pip install metaflow)
ZenML – MLOps framework standardising production ML pipelines across orchestrators, experiment trackers, and model registries. Docs | SDK: Python (pip install zenml)
Kubeflow Pipelines – Container-based ML workflow platform running on Kubernetes. Pipelines authored with the KFP SDK can also run on Vertex AI Pipelines. Docs
Apache Hamilton – Lightweight Python framework for defining modular, testable dataflows as DAGs of regular functions. Runs in scripts, notebooks, Airflow, and FastAPI. Docs | SDK: Python (pip install sf-hamilton)
MLRun – Open-source MLOps orchestration platform. Automates data preparation, model training, deployment, and monitoring with built-in lineage tracking across multi-cloud and on-prem infrastructure. Docs | SDK: Python (pip install mlrun)

Data Versioning and Lineage

Git-like versioning, lineage tracking, and reproducibility for data and models.

DVC – Git-based data and model versioning with pluggable remote storage (including S3-compatible backends). Docs | SDK: Python (pip install dvc)
Pachyderm – Data-versioned pipelines on Kubernetes. Auto-incremental reprocessing when upstream data changes. Docs
lakeFS – Git-like branching/versioning for data lakes over S3-compatible object storage. Docs
OpenLineage – Open standard for collecting lineage metadata from data pipelines. Integrations for Airflow, Spark, dbt, Flink, and more. Docs
Cleanlab – Data-centric AI library that automatically detects label errors, outliers, near-duplicates, and class overlap in ML training datasets using any model's predicted probabilities. Docs | SDK: Python (pip install cleanlab)
DataChain – Python library for versioning, querying, and transforming unstructured ML datasets (images, video, audio, docs) over S3-compatible object storage. Every file and transformation is automatically lineage-tracked. Docs | SDK: Python (pip install datachain)
Elementary – dbt-native data observability CLI. Runs anomaly detection tests and schema change alerts inside dbt, generates a lineage-aware observability report, and pushes alerts to Slack or Teams. Docs | SDK: Python (pip install elementary-data)
Evidently AI – Open-source framework for evaluating, testing, and monitoring ML and LLM pipelines. 100+ built-in metrics covering drift, quality, and performance. Docs | SDK: Python (pip install evidently)
Great Expectations – Data quality framework using "Expectations" to define, validate, and document data contracts inside ML pipelines. Docs | SDK: Python (pip install great-expectations)
Marquez – LF AI & Data metadata service for collecting and visualizing data lineage across pipelines. Reference implementation of the OpenLineage standard; REST API with a lineage graph UI. Docs
NannyML – Post-deployment ML monitoring library. Estimates model performance without ground-truth labels, detects data and concept drift, and traces root causes to specific features. Docs | SDK: Python (pip install nannyml)
OpenMetadata – Unified metadata platform for data discovery, lineage, and observability. 120+ connectors for Airflow, Spark, dbt, MLflow. Column-level lineage, data quality, and governance in one self-hostable service. Docs
Oxen – Git-like version control for ML datasets. Handles millions of files and terabytes of data with fast indexing for images, audio, video, and Parquet. Docs | SDK: Python (pip install oxenai)
Pandera – Statistical data validation library for pandas, Polars, and PySpark DataFrames. Define schemas with type hints or object API; validates column types, ranges, and custom checks. Docs | SDK: Python (pip install pandera)
Project Nessie – Transactional catalog for data lakes with Git-like branching and tagging semantics. Works with Iceberg tables across Spark, Trino, and Flink. Docs
PyDeequ – Python API for Deequ, AWS's Spark-based data quality library. Defines unit tests for data, computes metrics, suggests constraints, and persists quality results. Docs | SDK: Python (pip install pydeequ)
Soda Core – Data-contract verification engine. Defines quality checks in YAML, validates schema and data values against contracts, and integrates with Airflow, dbt, and Spark pipelines. Docs | SDK: Python (pip install soda-core)
Splink – Probabilistic record linkage and entity resolution library. Deduplicates and links datasets without unique identifiers using unsupervised learning; runs on DuckDB, Spark, and AWS Athena backends. Docs | SDK: Python (pip install splink)
TensorFlow Data Validation – Library for computing data statistics, inferring schemas, and detecting anomalies in training and serving data for TFX ML pipelines. Docs | SDK: Python (pip install tensorflow-data-validation)
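
"Git-like" versioning rests on content addressing: Git names each object by a hash of its content, so identical data always gets the same ID and any change produces a new one. The sketch below applies that general idea to a small in-memory dataset using only the standard library. It is an assumption-laden illustration, not how DVC, lakeFS, Oxen, or any other tool above actually stores data; the snapshot_id helper and the file names are hypothetical.

```python
import hashlib


def snapshot_id(files):
    """Return a deterministic ID for a mapping of path -> bytes content."""
    digest = hashlib.sha256()
    for path in sorted(files):  # sort so insertion order does not change the ID
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator between the path and its content hash
        digest.update(hashlib.sha256(files[path]).digest())
    return digest.hexdigest()


if __name__ == "__main__":
    v1 = {"train.csv": b"id,label\n1,cat\n"}
    v2 = {"train.csv": b"id,label\n1,dog\n"}
    print(snapshot_id(v1) == snapshot_id(dict(v1)))  # identical content, same ID
    print(snapshot_id(v1) == snapshot_id(v2))        # one label changed, new ID
```

Because the ID depends only on paths and bytes, two snapshots can be compared without reading either dataset in full a second time.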

Feature Stores

Online/offline feature serving for training and inference.

Feast – Open-source feature store with pluggable offline stores (e.g. Parquet on S3), online stores (e.g. Redis or DynamoDB), and registries. Docs | SDK: Python (pip install feast)
Hopsworks – End-to-end ML platform with a built-in feature store. Supports time-travel on Hudi/Iceberg offline tables. Docs
Featureform – Virtual feature store that orchestrates existing data infrastructure. Define, version, and serve ML features via a declarative Python API without replacing current systems. Docs | SDK: Python (pip install featureform)
Tecton – Enterprise feature platform from the creators of Michelangelo at Uber. Real-time and batch feature engineering. Docs

Data Processing Engines

Distributed compute for transforming, joining, and aggregating data at scale.

Apache Spark – Unified analytics engine for large-scale data processing. PySpark is the default distributed compute for many ML shops. Docs
Ray – Unified framework for distributed Python. Ray Data, Ray Train, Ray Tune, and Ray Serve cover the ML pipeline end-to-end. Docs | SDK: Python (pip install ray)
Polars – Rust-backed DataFrame library with a lazy query engine. Often 5–10x faster than pandas on single-node workloads. Docs | SDK: Python (pip install polars), Rust
Dask – Parallel computing for Python. Scales NumPy, pandas, and scikit-learn to clusters with familiar APIs. Docs | SDK: Python (pip install dask)
dbt Core – SQL-first transformation framework for analytics and ML feature tables. Deep integration with warehouses and lakehouses. Docs | SDK: Python (pip install dbt-core)
Apache Beam – Unified programming model for batch and streaming data processing. Runs on Flink, Spark, Dataflow, and more. Docs
Apache DataFusion – Extensible query engine written in Rust using Apache Arrow in-memory format. Embeddable, multi-threaded, vectorized execution with SQL and DataFrame APIs; Python bindings via datafusion-python. Docs | SDK: Python (pip install datafusion), Rust
Ibis – Portable Python dataframe library with a unified API across 20+ backends including DuckDB, Polars, BigQuery, Snowflake, and Spark. Write transformation logic once and run it on any supported engine. Docs | SDK: Python (pip install ibis-framework)
SQLMesh – SQL-first data transformation framework backward-compatible with dbt. Adds virtual dev environments, column-level lineage, and automatic incremental backfills. Docs | SDK: Python (pip install sqlmesh)

Streaming and Ingest

Message brokers, CDC, and streaming frameworks that feed ML pipelines.

Apache Kafka – Industry-standard distributed event-streaming platform. Backbone of streaming ML feature pipelines. Docs
Apache Flink – Stateful stream processing at scale. Exactly-once semantics, event-time windowing, and rich SQL support. Docs
Debezium – Distributed CDC platform. Streams database changes into Kafka for downstream ML and analytics. Docs
Redpanda – Kafka-API-compatible streaming platform written in C++. No JVM, no ZooKeeper, simpler ops. Docs
Apache NiFi – Data-flow management with a drag-and-drop UI. Rich processor library for ingesting from heterogeneous sources. Docs
Airbyte – ELT data integration platform with 600+ connectors. Pulls from APIs, databases, and files into data lakes and lakehouses for ML pipeline ingestion. Docs | SDK: Python (pip install airbyte)
Apache Fluss – Streaming storage for real-time analytics and AI. Apache-incubating project that integrates with Apache Flink and Iceberg to create sub-second-fresh streaming lakehouses. Docs
dlt (data load tool) – Lightweight Python library for loading data from APIs, databases, and files into structured datasets. Auto-infers schemas and normalises nested JSON; pluggable destinations include DuckDB, BigQuery, and S3. Docs | SDK: Python (pip install dlt)
Meltano – Declarative, code-first ELT engine built on Singer taps and targets. 500+ connectors, git-managed configuration, and native integrations with dbt, Airflow, and Dagster. Docs | SDK: Python (pip install meltano)
Pathway – Python streaming ETL framework backed by a Rust engine. Unified batch/streaming API with stateful windowing, exactly-once guarantees, and connectors for Kafka, PostgreSQL, and S3. Docs | SDK: Python (pip install pathway)
Quix Streams – Python library for building real-time data pipelines on Apache Kafka. Streaming DataFrame API with stateful operations, windowing, and exactly-once guarantees. Docs | SDK: Python (pip install quixstreams)
RisingWave – Distributed SQL streaming database, PostgreSQL-compatible. Continuously ingests events, maintains materialized views, and serves features at sub-100ms freshness. Docs
River – Python library for online machine learning on streaming data. Learns incrementally from one sample at a time without storing past data; supports classification, regression, clustering, drift detection, and anomaly detection. Docs | SDK: Python (pip install river)
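
Several entries above process one event at a time and keep only a small running state rather than the full history. A minimal sketch of that pattern is Welford's algorithm for a running mean and variance, shown below with the standard library only. The RunningStats class is a hypothetical name for this example and is not part of River, Quix Streams, or any other library listed here.

```python
class RunningStats:
    """Welford's algorithm: mean and variance updated one sample at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new sample into the state; no past samples are stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)
        return self

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two samples have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


if __name__ == "__main__":
    stats = RunningStats()
    for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stands in for a stream of events
        stats.update(value)
    print(stats.n, stats.mean, stats.variance)
```

The state is three numbers regardless of how many events arrive, which is the property that makes this style of computation suitable for unbounded streams.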

Data Labeling

Annotation platforms for supervised training and human-in-the-loop workflows.

Label Studio – Open-source labeling platform supporting text, image, audio, video, time-series, and more. Docs | SDK: Python (pip install label-studio-sdk)
CVAT – Computer vision annotation tool with strong video and 3D support. Originally from Intel. Docs
Doccano – Open-source text annotation tool for NER, text classification, and sequence-to-sequence tasks.
Argilla – Collaboration platform for AI engineers and domain experts. Part of Hugging Face since 2024. Focused on LLM data curation. Docs | SDK: Python (pip install argilla)
FiftyOne – Open-source dataset curation and visualization platform. Browse, filter, and curate labeled image, video, and 3D datasets; detect duplicates and label errors; run AI-assisted annotation and model evaluation. Docs | SDK: Python (pip install fiftyone)
LabelU – Multimodal annotation toolbox supporting 2D bounding boxes, segmentation, keypoints, polylines, and AI-assisted labeling for image, audio, and video data. SDK: Python (pip install labelu)
X-AnyLabeling – AI-assisted annotation tool integrating SAM, YOLO, and other vision models. Supports bounding boxes, polygons, segmentation, keypoints, and video annotation with GPU/TensorRT acceleration. Docs

Storage Formats and Lakehouses

Open table formats, columnar formats, and lakehouse engines.

DuckDB – In-process analytical database. Reads Parquet/CSV directly from S3-compatible stores. Excellent for feature exploration. Docs
Apache Arrow – Language-independent columnar memory format. Powers pandas, Polars, DuckDB, and cross-process zero-copy data sharing. Docs
Delta Lake – ACID-compliant lakehouse format over Parquet. Time travel, schema enforcement, and unified batch/streaming. Docs
Apache Iceberg – Open table format for huge analytic tables. Partition evolution, hidden partitioning, and engine-agnostic design. Docs
Apache Hudi – Lakehouse format with incremental processing, record-level indexing, and streaming ingestion. Docs
Apache Parquet – Columnar storage format optimized for analytics and ML workloads. De facto standard for feature tables on object storage. Docs
Apache Polaris – Open-source REST catalog for Apache Iceberg implementing the Iceberg REST API spec. Enables multi-engine interoperability across Spark, Flink, Trino, and Doris with fine-grained access control. Docs
Lance – Open lakehouse columnar format for multimodal AI data (images, video, audio, embeddings). Advertises 100x faster random access than Parquet, with a native vector index and data versioning. Compatible with pandas, Polars, DuckDB, and PyTorch. Docs | SDK: Python (pip install pylance), Rust
Unity Catalog – Open-source universal catalog for data and AI assets. Governs tables, files, functions, and ML models across Delta Lake, Iceberg, Hudi, and Parquet with fine-grained access control and multi-engine support. Docs

Templates and Example Projects

Reference implementations, demos, and starter projects.

Awesome MLOps (visenger) – Broader MLOps reference. Useful companion when you need end-to-end deployment and monitoring coverage beyond pipelines.

Contributing

Contributions are welcome. See CONTRIBUTING.md. One entry per PR — edit entries.yaml only and let the maintainers regenerate README.md.

Start building with Genblaze

Save on tokens by using the Genblaze SDK — Backblaze's open-source Python SDK for AI-generated video, audio, and images. It orchestrates multi-provider generation pipelines with built-in, tamper-evident provenance and native Backblaze B2 storage.

License

Released under CC0 1.0 Universal. You may copy, modify, and redistribute without attribution.

About Backblaze B2

Backblaze B2 Cloud Storage is S3-compatible object storage designed for AI and media workloads. This list is maintained as part of our work making B2 a convenient storage layer for AI workflows.
