
DataSQRL


DataSQRL is a data automation framework for building reliable data pipelines, data APIs (REST, MCP, GraphQL), and data products in SQL using open-source technologies.

DataSQRL provides three key elements for AI-assisted data platform automation:

  1. World Model: DataSQRL builds a source-to-sink computational graph of the data processing, including schemas, connectors, and mappings, which provides a comprehensive world model to ground generative AI.
  2. Simulation: DataSQRL includes a runtime and testing framework to ensure data integrity and act as a simulator in iterative refinement loops with real-world feedback.
  3. Verification: Since the entire data pipeline is defined in SQL, it is easy to understand and verify. DataSQRL produces detailed execution plans and lineage graphs to assist automated and manual analysis.

DataSQRL generates the deployment artifacts to execute the entire pipeline on open-source technologies like PostgreSQL, Apache Kafka, Apache Flink, and Apache Iceberg on your existing infrastructure with Docker, Kubernetes, or cloud-managed services.

DataSQRL Pipeline Architecture

DataSQRL builds data pipelines that meet the following requirements:

  • 🛡️ Data Consistency Guarantees: Exactly-once processing, data consistency across all outputs, schema alignment, and data lineage tracking.
  • 🔒 Production-grade Reliability: Robust, highly available, scalable, secure, access-controlled, and observable data pipelines.
  • 🚀 Developer Workflow Integration: Local development, quick iteration with feedback, CI/CD support, and comprehensive testing framework.

To learn more about DataSQRL, check out the documentation.

Getting Started

To create a new data project with DataSQRL, use the init command in an empty folder.

docker run --rm -v $PWD:/build datasqrl/cmd init api messenger

(Use ${PWD} in PowerShell on Windows.)

This creates a new data API project called messenger with some sample data sources and a simple data processing script called messenger.sqrl.

Run the project with:

docker run -it --rm -p 8888:8888 -p 8081:8081 -v $PWD:/build datasqrl/cmd run messenger-prod-package.json

This launches the entire data pipeline for ingesting, processing, storing, and serving messages. You can access the API in your browser at http://localhost:8888/v1/graphiql/ and add messages with the following mutation:

mutation {
  Messages(event: {message: "Hello World"}) {
    message_time
  }
}

Query messages with:

{
  Messages {
    message
    message_time
  }
}

Alternatively, you can query messages through REST or MCP. Once you are done, terminate the pipeline with CTRL-C.
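The GraphiQL console issues standard GraphQL-over-HTTP requests, so any HTTP client can send the same query. A minimal sketch, assuming the GraphQL endpoint sits at /v1/graphql next to the GraphiQL console (the exact REST and MCP endpoint paths come from the generated configuration):

```shell
# Send the same query over plain HTTP. The /v1/graphql path is an
# assumption based on the GraphiQL URL above; check the generated
# configuration for the exact REST and MCP endpoint paths.
QUERY='{"query": "{ Messages { message message_time } }"}'
curl -s -X POST http://localhost:8888/v1/graphql \
  -H 'Content-Type: application/json' \
  -d "$QUERY" || echo "pipeline is not running"
```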

For additional data processing, edit the messenger.sqrl script, for example to aggregate messages:

TotalMessages := SELECT COUNT(*) as num_messages, MAX(message_time) as latest_timestamp
                 FROM Messages LIMIT 1;
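Assuming the generated API exposes the new table under the same name (as it does for the other tables defined in the script), it can then be queried like Messages:

```graphql
{
  TotalMessages {
    num_messages
    latest_timestamp
  }
}
```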

To run the test case, execute:

docker run -it --rm -v $PWD:/build datasqrl/cmd test messenger-test-package.json

To build the deployment assets for the data pipeline, execute:

docker run --rm -v $PWD:/build datasqrl/cmd compile messenger-prod-package.json

The build/deploy directory contains the Flink compiled plan, Kafka topic definitions, PostgreSQL schema and view definitions, server queries, MCP tool definitions, and GraphQL data model. Those assets can be deployed in containerized environments (e.g. via Kubernetes) or cloud-managed services.

Read the full Getting Started tutorial or check out the DataSQRL Examples repository for further examples, including MCP servers, data APIs, and Iceberg views.

Why DataSQRL?

AI-driven data platform automation is within reach. However, trustworthy automation requires more than generative AI. It requires a world model that understands your data landscape, enforces constraints, and provides the grounding and feedback loops needed for safe, reliable automation.

DataSQRL is an open-source world model for data platform automation. As a modular framework, it provides the building blocks to build a customized world model for your organization to give AI a set of guardrails that ensure generated solutions are safe, reliable, and perform well in production.

How DataSQRL Works

Example Data Processing DAG

DataSQRL is a modular compiler framework that (deterministically) automates much of the plumbing code in data pipelines. This significantly reduces the complexity of AI-assisted (i.e. probabilistic) automation and provides feedback through deep introspection of the pipeline code.

This allows you to generate data processing logic in SQL using any AI coding tools or agents. DataSQRL compiles the SQL into a data processing DAG (Directed Acyclic Graph) according to the provided configuration. The analyzer traverses the DAG to detect potential data inconsistencies and performance or scalability issues. The cost-based optimizer cuts the DAG into segments executed by different engines (e.g. Flink, Kafka, Postgres, Vert.x), generating the necessary physical plans, schemas, and connectors for a fully integrated, reliable, and consistent data pipeline. The compiled artifacts are fed back to the AI for iterative refinement to improve the solution incrementally.

In addition, the compiled deployment assets can be executed locally in Docker, Kubernetes, or by a managed cloud service. DataSQRL comes with a testing framework for simulation of the data pipeline. This provides real-world feedback on the results and operational characteristics that are included in the iterative refinement feedback loop.

DataSQRL gives you full visibility and control over the generated data pipeline. Since the entire pipeline is implemented in SQL, it is easy to understand and verify manually.

DataSQRL uses proven open-source technologies to execute the generated deployment assets. You can use your existing infrastructure or cloud services at runtime; DataSQRL is only used at compile time.

DataSQRL has a rich function library and provides connectors for many popular data systems (Kafka, Iceberg, Postgres, and many more). In addition, DataSQRL is an extensible framework, and you can add custom functions, source/sink connectors, and entire execution engines.

Read an in-depth explanation of DataSQRL or view the full documentation to learn more.

Contributing


Our goal is to automate data platforms by building a world model that provides the necessary guardrails and feedback. We believe anyone who can read SQL should be empowered to build complex data systems that are robust and reliable. Your feedback is invaluable in achieving this goal. Let us know what works and what doesn't by filing GitHub issues or starting discussions.

We welcome code contributions. For more details, check out CONTRIBUTING.md.
