A curated list of open-source frameworks, engines, and platforms for building production ML data pipelines — orchestration, processing, versioning, feature storage, and everything in between.
Maintained by Backblaze.
- Awesome Multimodal Data
- Awesome Agent Infrastructure
- Awesome Physical AI
- Awesome Image Generation
- Awesome Video Generation
- Awesome Audio Generation
- Workflow Orchestration
- ML Pipeline Frameworks
- Data Versioning and Lineage
- Feature Stores
- Data Processing Engines
- Streaming and Ingest
- Data Labeling
- Storage Formats and Lakehouses
- Templates and Example Projects
General-purpose orchestrators for scheduling and running data and ML jobs.
- Apache Airflow – The de facto open-source workflow orchestrator. Python-defined DAGs, thousands of operators, huge ecosystem. Docs | SDK: Python (pip install apache-airflow)
- Prefect – Python-first orchestration with dynamic flows, hybrid cloud execution, and strong observability. Docs | SDK: Python (pip install prefect)
- Argo Workflows – Kubernetes-native workflow engine using container-per-step DAGs. Popular as a Kubeflow Pipelines backend. Docs
- Dagster – Asset-oriented orchestrator with strong typing, data-asset lineage, and dev/prod parity. Docs | SDK: Python (pip install dagster)
- Kedro – Opinionated Python framework for modular, reproducible data-science code. Pluggable runners for Airflow, Dagster, Databricks. Docs | SDK: Python (pip install kedro)
- Flyte – Kubernetes-native workflow engine focused on typed, reproducible ML pipelines. Graduated LF AI & Data project. Docs | SDK: Python (pip install flytekit)
- Kestra – Declarative, event-driven workflow orchestrator. Workflows defined in YAML, 1300+ plugins, supports Python/Bash/Go/Node.js tasks. Built for data, AI, and infrastructure pipelines at scale. Docs
- Maestro – Netflix's horizontally scalable workflow-as-a-service orchestrator for data and ML pipelines. Supports acyclic and cyclic workflows, foreach loops, subworkflows, and millions of daily job executions.
- Mage AI – Notebook-style data pipeline builder supporting Python, SQL, and R. Handles batch, streaming, and dbt transformations with built-in scheduling and observability. Docs | SDK: Python (pip install mage-ai)
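Most of the orchestrators above share one core idea: tasks form a dependency graph, and each task runs only after its upstream tasks have finished. The sketch below illustrates that idea with the Python standard library only. It is not the API of any tool listed here, and the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "extract": set(),
    "validate": {"extract"},
    "featurize": {"validate"},
    "train": {"featurize"},
    "report": {"validate", "train"},
}


def run_pipeline(dependencies, tasks):
    """Run every task once, only after all of its upstream tasks have run."""
    order = []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()  # a real orchestrator would also schedule, retry, and log
        order.append(name)
    return order


# Placeholder task bodies; real tasks would read and write data.
TASKS = {name: (lambda name=name: print(f"running {name}")) for name in DEPENDENCIES}

executed = run_pipeline(DEPENDENCIES, TASKS)
```

Real orchestrators layer scheduling, retries, distributed execution, and observability on top of this ordering step.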
Frameworks purpose-built for reproducible ML training and experiment pipelines.
- MLflow – Open-source platform for experiment tracking, model registry, packaging, and deployment. Supports arbitrary artifact stores. Docs | SDK: Python (pip install mlflow), Java, R
- Metaflow – Human-friendly ML framework created at Netflix. Native AWS, Kubernetes, and Argo backends. Docs | SDK: Python (pip install metaflow)
- ZenML – MLOps framework standardising production ML pipelines across orchestrators, experiment trackers, and model registries. Docs | SDK: Python (pip install zenml)
- Kubeflow Pipelines – Container-based ML workflow platform running on Kubernetes. Pipelines written with the KFP SDK can also run on Vertex AI Pipelines. Docs
- Apache Hamilton – Lightweight Python framework for defining modular, testable dataflows as DAGs of regular functions. Runs in scripts, notebooks, Airflow, and FastAPI. Docs | SDK: Python (pip install sf-hamilton)
- MLRun – Open-source MLOps orchestration platform. Automates data preparation, model training, deployment, and monitoring with built-in lineage tracking across multi-cloud and on-prem infrastructure. Docs | SDK: Python (pip install mlrun)
Git-like versioning, lineage tracking, and reproducibility for data and models.
- DVC – Git-based data and model versioning with pluggable remote storage (including S3-compatible backends). Docs | SDK: Python (pip install dvc)
- Pachyderm – Data-versioned pipelines on Kubernetes. Auto-incremental reprocessing when upstream data changes. Docs
- lakeFS – Git-like branching/versioning for data lakes over S3-compatible object storage. Docs
- OpenLineage – Open standard for collecting lineage metadata from data pipelines. Integrations for Airflow, Spark, dbt, Flink, and more. Docs
- Cleanlab – Data-centric AI library that automatically detects label errors, outliers, near-duplicates, and class overlap in ML training datasets using any model's predicted probabilities. Docs | SDK: Python (pip install cleanlab)
- DataChain – Python library for versioning, querying, and transforming unstructured ML datasets (images, video, audio, docs) over S3-compatible object storage. Every file and transformation is automatically lineage-tracked. Docs | SDK: Python (pip install datachain)
- Elementary – dbt-native data observability CLI. Runs anomaly detection tests and schema change alerts inside dbt, generates a lineage-aware observability report, and pushes alerts to Slack or Teams. Docs | SDK: Python (pip install elementary-data)
- Evidently AI – Open-source framework for evaluating, testing, and monitoring ML and LLM pipelines. 100+ built-in metrics covering drift, quality, and performance. Docs | SDK: Python (pip install evidently)
- Great Expectations – Data quality framework using "Expectations" to define, validate, and document data contracts inside ML pipelines. Docs | SDK: Python (pip install great-expectations)
- Marquez – LF AI & Data metadata service for collecting and visualizing data lineage across pipelines. Reference implementation of the OpenLineage standard; REST API with a lineage graph UI. Docs
- NannyML – Post-deployment ML monitoring library. Estimates model performance without ground-truth labels, detects data and concept drift, and traces root causes to specific features. Docs | SDK: Python (pip install nannyml)
- OpenMetadata – Unified metadata platform for data discovery, lineage, and observability. 120+ connectors for Airflow, Spark, dbt, MLflow. Column-level lineage, data quality, and governance in one self-hostable service. Docs
- Oxen – Git-like version control for ML datasets. Handles millions of files and terabytes of data with fast indexing for images, audio, video, and Parquet. Docs | SDK: Python (pip install oxenai)
- Pandera – Statistical data validation library for pandas, Polars, and PySpark DataFrames. Define schemas with type hints or object API; validates column types, ranges, and custom checks. Docs | SDK: Python (pip install pandera)
- Project Nessie – Transactional catalog for data lakes with Git-like branching and tagging semantics. Works with Iceberg tables across Spark, Trino, and Flink. Docs
- PyDeequ – Python API for Deequ, AWS's Spark-based data quality library. Defines unit tests for data, computes metrics, suggests constraints, and persists quality results. Docs | SDK: Python (pip install pydeequ)
- Soda Core – Data-contract verification engine. Defines quality checks in YAML, validates schema and data values against contracts, and integrates with Airflow, dbt, and Spark pipelines. Docs | SDK: Python (pip install soda-core)
- Splink – Probabilistic record linkage and entity resolution library. Deduplicates and links datasets without unique identifiers using unsupervised learning; runs on DuckDB, Spark, and AWS Athena backends. Docs | SDK: Python (pip install splink)
- TensorFlow Data Validation – Library for computing data statistics, inferring schemas, and detecting anomalies in training and serving data for TFX ML pipelines. Docs | SDK: Python (pip install tensorflow-data-validation)
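"Git-like" data versioning generally rests on content addressing: each file is stored under a hash of its bytes, and a version is a manifest that maps paths to hashes. The sketch below shows that general technique with the standard library. It is an illustration only, not the storage layout of any tool in this section, and `snapshot`, `changed_paths`, and the sample files are hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a hex digest that identifies the exact bytes of a file."""
    return hashlib.sha256(data).hexdigest()


def snapshot(files, store):
    """Store each blob once under its hash and return a path -> hash manifest."""
    manifest = {}
    for path, data in files.items():
        digest = content_hash(data)
        store.setdefault(digest, data)  # unchanged files are not stored twice
        manifest[path] = digest
    return manifest


def changed_paths(old, new):
    """List paths whose content differs between two manifests."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))


store = {}  # stands in for an object-storage bucket
v1 = snapshot({"train.csv": b"a,b\n1,2\n", "labels.csv": b"id,y\n1,0\n"}, store)
v2 = snapshot({"train.csv": b"a,b\n1,2\n3,4\n", "labels.csv": b"id,y\n1,0\n"}, store)
diff = changed_paths(v1, v2)
```

Because `labels.csv` is identical in both versions, the store holds three blobs rather than four, and the diff reports only `train.csv`.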
Online/offline feature serving for training and inference.
- Feast – Open-source feature store with pluggable offline stores (Parquet on S3), online stores (Redis/DynamoDB), and registries. Docs | SDK: Python (pip install feast)
- Hopsworks – End-to-end ML platform with a built-in feature store. Supports time-travel on Hudi/Iceberg offline tables. Docs
- Featureform – Virtual feature store that orchestrates existing data infrastructure. Define, version, and serve ML features via a declarative Python API without replacing current systems. Docs | SDK: Python (pip install featureform)
- Tecton – Commercial enterprise feature platform from the creators of Michelangelo at Uber. Real-time and batch feature engineering. Docs
Distributed compute for transforming, joining, and aggregating data at scale.
- Apache Spark – Unified analytics engine for large-scale data processing. PySpark is the default distributed compute for many ML shops. Docs
- Ray – Unified framework for distributed Python. Ray Data, Ray Train, Ray Tune, and Ray Serve cover the ML pipeline end-to-end. Docs | SDK: Python (pip install ray)
- Polars – Rust-backed DataFrame library with a lazy query engine. Often 5–10x faster than pandas on single-node workloads. Docs | SDK: Python (pip install polars), Rust
- Dask – Parallel computing for Python. Scales NumPy, pandas, and scikit-learn to clusters with familiar APIs. Docs | SDK: Python (pip install dask)
- dbt Core – SQL-first transformation framework for analytics and ML feature tables. Deep integration with warehouses and lakehouses. Docs | SDK: Python (pip install dbt-core)
- Apache Beam – Unified programming model for batch and streaming data processing. Runs on Flink, Spark, Dataflow, and more. Docs
- Apache DataFusion – Extensible query engine written in Rust using Apache Arrow in-memory format. Embeddable, multi-threaded, vectorized execution with SQL and DataFrame APIs; Python bindings via datafusion-python. Docs | SDK: Python (pip install datafusion), Rust
- Ibis – Portable Python dataframe library with a unified API across 20+ backends including DuckDB, Polars, BigQuery, Snowflake, and Spark. Write transformation logic once and run it on any supported engine. Docs | SDK: Python (pip install ibis-framework)
- SQLMesh – SQL-first data transformation framework backward-compatible with dbt. Adds virtual dev environments, column-level lineage, and automatic incremental backfills. Docs | SDK: Python (pip install sqlmesh)
Message brokers, CDC, and streaming frameworks that feed ML pipelines.
- Apache Kafka – Industry-standard distributed event-streaming platform. Backbone of streaming ML feature pipelines. Docs
- Apache Flink – Stateful stream processing at scale. Exactly-once semantics, event-time windowing, and rich SQL support. Docs
- Debezium – Distributed CDC platform. Streams database changes into Kafka for downstream ML and analytics. Docs
- Redpanda – Kafka-API-compatible streaming platform written in C++. No JVM, no ZooKeeper, simpler ops. Docs
- Apache NiFi – Data-flow management with a drag-and-drop UI. Rich processor library for ingesting from heterogeneous sources. Docs
- Airbyte – ELT data integration platform with 600+ connectors. Pulls from APIs, databases, and files into data lakes and lakehouses for ML pipeline ingestion. Docs | SDK: Python (pip install airbyte)
- Apache Fluss – Streaming storage for real-time analytics and AI. Apache-incubating project that integrates with Apache Flink and Iceberg to create sub-second-fresh streaming lakehouses. Docs
- dlt (data load tool) – Lightweight Python library for loading data from APIs, databases, and files into structured datasets. Auto-infers schemas and normalises nested JSON; pluggable destinations include DuckDB, BigQuery, and S3. Docs | SDK: Python (pip install dlt)
- Meltano – Declarative, code-first ELT engine built on Singer taps and targets. 500+ connectors, git-managed configuration, and native integrations with dbt, Airflow, and Dagster. Docs | SDK: Python (pip install meltano)
- Pathway – Python streaming ETL framework backed by a Rust engine. Unified batch/streaming API with stateful windowing, exactly-once guarantees, and connectors for Kafka, PostgreSQL, and S3. Docs | SDK: Python (pip install pathway)
- Quix Streams – Python library for building real-time data pipelines on Apache Kafka. Streaming DataFrame API with stateful operations, windowing, and exactly-once guarantees. Docs | SDK: Python (pip install quixstreams)
- RisingWave – Distributed SQL streaming database, PostgreSQL-compatible. Continuously ingests events, maintains materialized views, and serves features at sub-100ms freshness. Docs
- River – Python library for online machine learning on streaming data. Learns incrementally from one sample at a time without storing past data; supports classification, regression, clustering, drift detection, and anomaly detection. Docs | SDK: Python (pip install river)
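Several entries above mention event-time windowing. As a rough illustration of the concept, the sketch below assigns events to fixed-size (tumbling) windows by their own timestamps, so a late-arriving event still lands in the window it belongs to. It uses only the standard library, is not the API of any listed engine, and the function name and sample events are hypothetical.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key), bucketing by event timestamp."""
    counts = Counter()
    for timestamp, key in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# (event_time_seconds, event_type); the last event arrives out of order.
events = [(0, "click"), (12, "click"), (59, "view"), (61, "click"), (5, "view")]
counts = tumbling_window_counts(events, window_seconds=60)
```

Production stream processors add watermarks, state backends, and delivery guarantees on top of this bucketing step.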
Annotation platforms for supervised training and human-in-the-loop workflows.
- Label Studio – Open-source labeling platform supporting text, image, audio, video, time-series, and more. Docs | SDK: Python (pip install label-studio-sdk)
- CVAT – Computer vision annotation tool with strong video and 3D support. Originally from Intel. Docs
- Doccano – Open-source text annotation tool for NER, text classification, and sequence-to-sequence tasks.
- Argilla – Collaboration platform for AI engineers and domain experts. Part of Hugging Face since 2024. Focused on LLM data curation. Docs | SDK: Python (pip install argilla)
- FiftyOne – Open-source dataset curation and visualization platform. Browse, filter, and curate labeled image, video, and 3D datasets; detect duplicates and label errors; run AI-assisted annotation and model evaluation. Docs | SDK: Python (pip install fiftyone)
- LabelU – Multimodal annotation toolbox supporting 2D bounding boxes, segmentation, keypoints, polylines, and AI-assisted labeling for image, audio, and video data. SDK: Python (pip install labelu)
- X-AnyLabeling – AI-assisted annotation tool integrating SAM, YOLO, and other vision models. Supports bounding boxes, polygons, segmentation, keypoints, and video annotation with GPU/TensorRT acceleration. Docs
Open table formats, columnar formats, and lakehouse engines.
- DuckDB – In-process analytical database. Reads Parquet/CSV directly from S3-compatible stores. Excellent for feature exploration. Docs
- Apache Arrow – Language-independent columnar memory format. Powers pandas, Polars, DuckDB, and cross-process zero-copy data sharing. Docs
- Delta Lake – ACID-compliant lakehouse format over Parquet. Time travel, schema enforcement, and unified batch/streaming. Docs
- Apache Iceberg – Open table format for huge analytic tables. Partition evolution, hidden partitioning, and engine-agnostic design. Docs
- Apache Hudi – Lakehouse format with incremental processing, record-level indexing, and streaming ingestion. Docs
- Apache Parquet – Columnar storage format optimized for analytics and ML workloads. De facto standard for feature tables on object storage. Docs
- Apache Polaris – Open-source REST catalog for Apache Iceberg implementing the Iceberg REST API spec. Enables multi-engine interoperability across Spark, Flink, Trino, and Doris with fine-grained access control. Docs
- Lance – Open lakehouse columnar format for multimodal AI data (images, video, audio, embeddings). The project reports up to 100x faster random access than Parquet, plus a native vector index and data versioning. Compatible with pandas, Polars, DuckDB, and PyTorch. Docs | SDK: Python (pip install pylance), Rust
- Unity Catalog – Open-source universal catalog for data and AI assets. Governs tables, files, functions, and ML models across Delta Lake, Iceberg, Hudi, and Parquet with fine-grained access control and multi-engine support. Docs
Reference implementations, demos, and starter projects.
- Awesome MLOps (visenger) – Broader MLOps reference. Useful companion when you need end-to-end deployment + monitoring beyond pipelines.
Contributions are welcome. See CONTRIBUTING.md. One entry per PR — edit entries.yaml only and let the maintainers regenerate README.md.
Save on tokens by using the Genblaze SDK — Backblaze's open-source Python SDK for AI-generated video, audio, and images. It orchestrates multi-provider generation pipelines with built-in, tamper-evident provenance and native Backblaze B2 storage.
Released under CC0 1.0 Universal. You may copy, modify, and redistribute without attribution.
Backblaze B2 Cloud Storage is S3-compatible object storage designed for AI and media workloads. This list is maintained as part of our work making B2 a convenient storage layer for AI workflows.