Documentation for 0.6 release #42

Draft · wants to merge 2 commits into base: main
1 change: 1 addition & 0 deletions docs/architectures/intro.md
@@ -0,0 +1 @@
# Data Architectures
115 changes: 43 additions & 72 deletions docs/getting-started/concepts/datasqrl.md
@@ -4,91 +4,62 @@ title: "What is DataSQRL?"

# What is DataSQRL?


DataSQRL is an open-source compiler and build tool for implementing data products as data pipelines. A [data product](/docs/reference/concepts/data-product) processes, transforms, or analyzes data from one or multiple sources (user input, databases, data streams, API calls, file storage, etc.) and exposes the result as raw data, in a database, or through an API. <br />
DataSQRL eliminates most of the laborious integration code required to stitch multiple technologies together into data pipelines.

Building a data product with DataSQRL takes 3 steps:

1. **Implement SQL script:** You combine, transform, and analyze the input data using SQL.
2. **Expose Data (optional):** You define how to expose the transformed data in the API or database.
3. **Compile Data Pipeline:** DataSQRL compiles the SQL script and output specification into a fully integrated data pipeline. The compiled data pipeline ingests raw data, processes it according to the transformations and analyses defined in the SQL script, and serves the resulting data through the specified API or database.
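
As a sketch of step 1, a minimal SQRL script might look like the following (the `datasqrl.example.Orders` import and the table and column names are hypothetical, chosen for illustration):

```sql
-- Step 1: import a data source and transform it with SQL
IMPORT datasqrl.example.Orders;  -- hypothetical source package

-- Aggregate order totals per customer using the := assignment operator
CustomerSpending := SELECT customerid, SUM(total) AS spending
                    FROM Orders
                    GROUP BY customerid;
```

Step 2 would optionally add an API specification selecting which tables to expose, and step 3 runs the compiler over both artifacts.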
DataSQRL is a flexible data development framework for building various types of streaming data architectures, like data pipelines, event-driven microservices, and Kappa architectures. It provides the basic structure, common patterns, and a set of tools for streamlining the development of [data products](/docs/reference/concepts/data-product).

DataSQRL integrates any combination of the following technologies:
* **Apache Flink:** a distributed and stateful stream processing engine.
* **Apache Kafka:** a distributed streaming platform.
* **PostgreSQL:** a reliable open-source relational database system.
* **Apache Iceberg:** an open table format for large analytic datasets.
* **Snowflake:** a scalable cloud data warehousing platform.
* **RedPanda:** a Kafka-compatible streaming data platform.
* **Yugabyte:** a distributed open-source relational database.
* **Vert.x:** a reactive server framework for building data APIs.

You define the data processing in SQL (with support for custom functions in Java, Scala, and soon Python) and DataSQRL generates the glue code, schemas, and mappings to automatically connect and configure these components into a coherent data architecture. DataSQRL also generates Docker Compose templates for local execution or deployment to Kubernetes or cloud-managed services.

<img src="docs/img/datasqrl_architectures.jpg" alt="The architectures that DataSQRL supports " width="100%"/>

DataSQRL supports multiple types of data architectures as shown above. Learn more about the [10 types of data architectures](../when-datasqrl) you can build with DataSQRL.

## DataSQRL Features

* 🔗 **System Integration:** Combine various data technologies into streamlined data architectures.
* ☯️ **Declarative + Imperative:** Define the data flow in SQL and specific data transformations in Java, Scala, or soon Python.
* 🧪 **Testing Framework:** Automated snapshot testing.
* 🔄 **Data Flow Optimization:** Optimize data flow between systems through data mapping, partitioning, and indexing for scalability and performance.
* ✔️ **Consistent:** Ensure at-least or exactly-once data processing for consistent results across the entire system.
* 📦 **Dependency management:** Manage data sources and sinks with versioning and repository.
* 📊 **GraphQL Schema Generator:** Expose processed data through a GraphQL API with subscription support for headless data services. (REST coming soon)
* 🤖 **Integrated AI:** Support for vector data type, vector embeddings, LLM invocation, and ML model inference.
* { } **JSON Support:** Native JSON data type and JSON schema discovery.
* 🔍 **Visualization Tools:** Inspect and debug data architectures visually.
* 🪵 **Logging framework:** for observability and debugging.
* 🚀 **Deployment Profiles:** Automate the deployment of data architectures through configuration.

In a nutshell, DataSQRL is an abstraction layer that takes care of the nitty-gritties of building efficient data pipelines and gives developers an easy-to-use tool to build data products.

Follow the [quickstart tutorial](../../quickstart) to build a data product in a few minutes and see how DataSQRL works in practice.

## How DataSQRL Works

<img src="/img/reference/compiledPipeline.svg" alt="Compiled DataSQRL data pipeline >" width="60%"/>

DataSQRL compiles the SQL script and output specification into a data pipeline that uses data technologies like [Apache Kafka](https://kafka.apache.org/), [Apache Flink](https://flink.apache.org/), or [Postgres](https://postgresql.org/).

DataSQRL has a pluggable engine architecture which allows it to support various stream processors, databases, data warehouses, data streams, and API servers. Feel free to contribute your favorite data technology as a DataSQRL engine to the open-source project, wink wink.

DataSQRL can generate data pipelines with multiple topologies. Take a look at the [types of data products](/docs/reference/concepts/data-product#types) that DataSQRL can build. You can further customize those pipeline topologies in the DataSQRL [package configuration](/docs/reference/sqrl/datasqrl-spec/) which defines the data technologies at each stage of the resulting data pipeline.
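
For illustration, a package configuration that selects an engine for each stage of the pipeline could look roughly like this (the exact keys are defined in the package configuration reference; treat the structure below as a hypothetical sketch, not the authoritative schema):

```json
{
  "engines": {
    "streams": { "type": "flink" },
    "log": { "type": "kafka" },
    "database": { "type": "postgres" },
    "server": { "type": "vertx" }
  }
}
```

Swapping an engine type here (e.g. `postgres` for another supported database) changes the compiled pipeline without changing the SQL script.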

DataSQRL compiles executables for each engine in the pipeline which can be deployed on the data technologies and cloud services you already use.
In addition, DataSQRL provides development tooling that makes it easy to run and test data pipelines locally to speed up the development cycle.
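
As a sketch of that local workflow (the command names follow the DataSQRL CLI, but the flags, file names, and output paths here are assumptions, not guaranteed):

```bash
# Compile the SQRL script (and optional API spec) into deployment artifacts
datasqrl compile myscript.sqrl myapi.graphqls

# Run the compiled pipeline locally via the generated Docker Compose template
(cd build/deploy && docker compose up)
```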

## What DataSQRL Does

Okay, you get the idea of a compiler that produces integrated data pipelines. But what exactly does DataSQRL do for you? Glad you asked.

<img src="/img/index/howDataSQRLWorksPipeline.svg" alt="DataSQRL Compilation >" width="50%"/>

To produce fully integrated data pipelines, the DataSQRL compiler:
* resolves data imports to data source connectors and generates input schemas for the stream ingestion,
* synchronizes data schemas and data management across all engines in the data pipeline,
* aligns timestamps and watermarks across the engines,
* orchestrates optimal data flow between engines,
* translates the SQL script to the respective engine for execution,
* and generates an API server that implements the given API specification.

To produce high-performance data pipelines that respond to new input data in realtime and provide low-latency, high-throughput APIs to many concurrent users, DataSQRL optimizes the compiled data pipeline by:
* partitioning the data flow and co-locating data where possible.
* pruning the execution graph and consolidating repetitive computations.
* determining when to pre-compute data transformations in the streaming engine to reduce response latencies versus computing result sets at request time in the database or server to avoid data staleness and combinatorial explosion in pre-computed results.
* determining the optimal set of index structures to install in the database.

In other words, DataSQRL can save you a lot of time and allows you to focus on what matters: implementing the logic and API of your data product.

## Learn More

- Read the [quickstart tutorial](../../quickstart) to get a feel for DataSQRL while building an entire data product in 10 minutes.
- Find out [Why DataSQRL Exists](../why-datasqrl) and what benefits it provides.
- [Compare DataSQRL](../../concepts/when-datasqrl) to other data technologies and see when to use it.
- Learn more about the [DataSQRL Optimizer](/docs/reference/sqrl/learn/#datasqrl-optimizer) and how the DataSQRL compiler generates efficient data pipelines.

<!--
### More

<img src="/img/getting-started/tutorial/nutshop.jpg" alt="Nut Shop Tutorial >|" width="40%"/>

**STEP 1:** Read the [Quickstart](../quickstart) to build a metrics monitoring data product in 10 minutes.

**STEP 2:** Follow one or more of the [DataSQRL tutorials](../tutorials/overview) to learn how to implement various use cases and how to apply the features DataSQRL provides.

**STEP 3:** Build your own data product with DataSQRL. Take a problem from work or grab some data you've been interested in and give it a go.
DataSQRL extends ANSI SQL with additional features designed for data development:

Need more information? Take a look at the [reference documentation](/docs/reference/introduction) for everything you'd ever wanted to know about DataSQRL and then some. <br />
Got stuck? No worries, the [DataSQRL community](/community) is here to help. Seriously, reach out - we don't bite!
* **IMPORT/EXPORT statements**: Integrate data sources and export data to sinks.
* **Assignment Operator (:=)**: Define incremental table structures.
* **Stream Processing SQL**: Enhanced SQL statements for stream processing.
* **Nested Structures**: Natively support nested data structures like JSON.

## Understanding the Big Picture
<img src="docs/img/dag_example.png" alt="Example DataSQRL DAG >" width="50%" />

There are a million technologies out there so why should you spend your time on DataSQRL? If you want to understand how DataSQRL fits into the bigger picture and whether it's worth your time, here are some resources to get you started.
DataSQRL translates these SQL scripts into a data processing DAG (Directed Acyclic Graph) as visualized above, linking source and sink definitions. The cost-based optimizer cuts the DAG into segments executed by different engines (e.g. Flink, Kafka, Postgres, Vert.x), generating the necessary physical plans, schemas, and connectors for a fully integrated and streamlined data architecture. This "plan" can be instantiated by deployment profiles, such as Docker Compose templates for local execution.

<img src="/img/index/undraw_questions_sqrl.svg" alt="DataSQRL allows you to build with data >" width="40%"/>
Check out the [reference documentation](/docs/reference/sqrl/sqrl-spec) for a deep-dive on all things DataSQRL.

DataSQRL is a compiler, optimizer, and build tool for data pipelines and event-driven microservices. To implement a data product in DataSQRL, you implement the data processing in SQL and (optionally) define the API of your data product in GraphQL schema. DataSQRL compiles those two artifacts into an optimized data pipeline that ingests, processes, stores, queries, and serves data through a responsive API in realtime.
## Why DataSQRL?

DataSQRL solves the [data plumbing](../concepts/why-datasqrl#dataplumbing) issue that plagues most data product implementations. It eliminates integration code, schema mappings, physical data modeling, data flow orchestration, and other low-level implementation details that take a lot of time and effort. DataSQRL enables you to implement the entire data pipeline in one piece of code and compiles all the executables you need to deploy the pipeline. In other words, DataSQRL saves you a ton of time, money, and headache.
Data engineers spend considerable time integrating various tools and technologies, ensuring performance, scalability, robustness, and observability. DataSQRL automates these tasks, making it easier to implement, test, debug, observe, deploy, and maintain data products. Like a web development framework, but for data.

DataSQRL supports various pipeline topologies and has a pluggable engine architecture that allows DataSQRL to compile to proven technologies like Apache Kafka, Apache Flink, and Postgres. That means you are not relying on DataSQRL in production but can use the technologies and cloud services you already trust. DataSQRL compiles data pipelines that are resilient, fast, and scalable by using an optimizer that determines the most efficient data pipeline for a configured architecture.
Our goal is to eliminate the data engineering busywork, so you can focus on building and iterating on data products. Learn more about [Why DataSQRL Exists](../why-datasqrl) and what benefits it provides.

* [**What is DataSQRL?**](../concepts/datasqrl): DataSQRL compiles optimized data pipelines. [Learn more](../concepts/datasqrl) about DataSQRL and how it works.
* [**Why Use DataSQRL?**](../concepts/why-datasqrl): DataSQRL eliminates data plumbing enabling you to ship data products quickly with less effort. [Learn more](../concepts/why-datasqrl) about the benefits of DataSQRL.
* [**When Should I Use DataSQRL?**](../concepts/when-datasqrl): DataSQRL empowers your team to build efficient data products successfully. [Find out](../concepts/when-datasqrl) when and when not to use DataSQRL.

Want to know more? Start with the [reference documentation](/docs/reference/introduction) to learn everything there is to know about DataSQRL. <br />
-->
2 changes: 2 additions & 0 deletions docs/getting-started/concepts/when-datasqrl.md
@@ -4,6 +4,8 @@ title: "When to use DataSQRL"

# When Should I Use DataSQRL?

replace with 10 architectures

DataSQRL is an intelligent compiler for data pipelines that eliminates data plumbing so you
can build efficient data products faster, cheaper, and better.
