The data pipeline topology, compiler options, dependencies, connector templates and more are configured through the DataSQRL package.json configuration file.
You pass a configuration file to the compiler command via the -c or --config flag. You can specify a single configuration file or multiple files. If multiple files are specified, they are merged in the order they are given: fields, including arrays, are replaced and objects are merged. If no configuration file is explicitly specified, the compiler uses the package.json file in the local directory if it exists, or falls back to the default configuration.
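For example, assuming a base package.json and an illustrative override file containing only the snippet below, passing both files with -c merges them; because arrays are replaced rather than merged, the override's enabled-engines list replaces the default one while all other settings from the base file are kept:
{
  "enabled-engines": ["vertx", "postgres", "flink"]
}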
engines is a map of engine configurations by engine name that the compiler uses to instantiate the engines in the data pipeline. The DataSQRL compiler produces an integrated data pipeline against those engines. At a minimum, DataSQRL expects that a stream processing engine is configured.
Engines can be enabled with the enabled-engines property. The default set of engines is listed below:
{
  "enabled-engines": ["vertx", "postgres", "kafka", "flink"]
}
Apache Flink is the default stream processing engine.
The physical plan that DataSQRL generates for the Flink engine includes:
- FlinkSQL table descriptors for the sources and sinks
- FlinkSQL view definitions for the data processing
- A list of connector dependencies needed for the sources and sinks.
Flink reads data from and writes data to the other engines in the generated data pipeline. DataSQRL uses connector configuration templates to instantiate those connections. These templates, which connect Flink to other engines and external systems, are configured under the connectors property. Connector configurations use the Flink configuration options and are passed through to Flink without modification.
Variables that start with the sqrl prefix use the environment variable syntax but are template variables that the DataSQRL compiler instantiates. For example, ${sqrl:table-name} provides the table name for a connector that writes to a table.
{
  "engines" : {
    "flink" : {
      "connectors": {
        "postgres": {
          "connector": "jdbc-sqrl",
          "password": "${JDBC_PASSWORD}",
          "driver": "org.postgresql.Driver",
          "username": "${JDBC_USERNAME}",
          "url": "jdbc:postgresql://${JDBC_URL}",
          "table-name": "${sqrl:table-name}"
        },
        "kafka": {
          "connector" : "kafka",
          "format" : "flexible-json",
          "properties.bootstrap.servers": "${PROPERTIES_BOOTSTRAP_SERVERS}",
          "properties.group.id": "${PROPERTIES_GROUP_ID}",
          "scan.startup.mode" : "group-offsets",
          "properties.auto.offset.reset" : "earliest",
          "topic" : "${sqrl:original-table-name}"
        }
      }
    }
  }
}
Flink runtime configuration can be specified in the values configuration section:
{
  "values" : {
    "flink-config": {
      "table.exec.source.idle-timeout": "100 ms"
    }
  }
}
The configuration options are set on the Flink runtime when running or testing via the DataSQRL command.
Postgres is the default database engine.
The physical plan that DataSQRL generates for the Postgres engine includes:
- Table DDL statements for the physical tables.
- Index DDL statements for the index structures on those tables.
- View DDL statements for the logical tables. Views are only created when no server engine is enabled.
Vertx is the default server engine: a high-performance GraphQL server implemented in Vert.x. The GraphQL endpoint is configured through the GraphQL Schema.
The physical plan that DataSQRL generates for Vertx includes:
- The connection configuration for the database(s) and log engine
- A mapping of GraphQL endpoints to queries for execution against the database(s) and log engine.
Apache Kafka is the default log engine.
The physical plan that DataSQRL generates for Kafka includes:
- A list of topics with configuration and (optional) Avro schema.
Apache Iceberg is a table format that can be used as a database engine with DataSQRL.
The iceberg engine requires an enabled query engine to execute queries against it, as shown in the sketch after the list below.
The physical plan that DataSQRL generates for the Iceberg engine includes:
- Table DDL statements for the physical tables
- Catalog registration for registering the tables in the associated catalog, e.g. AWS Glue.
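A sketch of enabling Iceberg as the database engine together with a query engine; the engine identifiers iceberg and snowflake used here are assumptions, so check the engine documentation for the exact names supported by your version:
{
  "enabled-engines": ["vertx", "iceberg", "snowflake", "kafka", "flink"]
}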
Snowflake is a query engine that can be used in combination with a table format as a database in DataSQRL.
The physical plan that DataSQRL generates for the Snowflake engine includes:
- External table registration through catalog integration. The Snowflake connector currently supports AWS Glue.
- View definitions for the logical tables.
To define the catalog integration for Snowflake:
{
  "snowflake" : {
    "catalog-name": "MyCatalog",
    "external-volume": "iceberg_storage_vol"
  }
}
The compiler section of the configuration controls elements of the core compiler and DAG Planner.
{
  "compiler" : {
    "logger": "print",
    "explain": {
      "visual": true,
      "text": true,
      "extended": false
    },
    "output": {
      "add-uid": true,
      "table-suffix": "-blue"
    }
  }
}
- logger configures the logging framework used for logging statements like EXPORT MyTable TO logger.MyTable;. It is print by default, which logs to STDOUT. Set it to the configured log engine for logging output to be sent to that engine, e.g. "logger": "kafka". Set it to none to suppress logging output.
- explain configures how the DAG plan compiled by DataSQRL is presented in the build directory. If visual is true, a visual representation of the DAG is written to the pipeline_visual.html file, which you can open in any browser. If text is true, a textual representation of the DAG is written to the pipeline_explain.txt file. If extended is true, the DAG outputs include more information, like the relational plan, which may be very verbose.
- output configures how table sink names in FlinkSQL are generated. If add-uid is true, a unique identifier is appended to the name to ensure uniqueness. If you set this to false, you need to ensure that the table names are unique and that the same table is not written to multiple engines. The table-suffix is appended to the sink table names (before the uid) and is empty by default. It can be used to distinguish tables from different deployments.
:::warn This is changing and will be updated soon. :::
dependencies map import and export paths to local folders or remote repositories.
Dependency Aliasing:
Dependency declarations can be used to alias a local folder:
{
  "dependencies" : {
    "datasqrl.seedshop" : {
      "name": "datasqrl.seedshop.test"
    }
  }
}
When we write IMPORT datasqrl.seedshop.Orders, the datasqrl.seedshop path is aliased to the local folder datasqrl/seedshop/test.
This is useful for swapping out connectors between different environments or for testing without making changes to the SQRL script.
Repository Imports:
Dependencies can be used to import tables or functions from a remote repository:
{
  "dependencies" : {
    "sqrl-functions" : {
      "name": "sqrl-functions",
      "repository": "github.com/DataSQRL/sqrl-functions",
      "tag": "v0.6.0"
    }
  }
}
This dependency configuration clones the referenced repository at the tag v0.6.0 into the folder build/sqrl-functions.
We can then import the OpenAI functions as:
IMPORT sqrl-functions.openai.vector_embedding;
If the tag is omitted, it clones the current main branch.
The main SQRL script and (optional) GraphQL schema for the project can be configured in the project configuration under the script section:
{
  "script": {
    "main": "mainScript.sqrl",
    "graphql": "apiSchema.graphqls"
  }
}
If the script is configured in the configuration, it is not necessary to name it as a command argument. Script arguments passed on the command line take precedence over the configured values.
The package section of the configuration provides information about the package or script. The whole section is optional and is used primarily when hosting packages in a repository as dependencies.
{
  "package": {
    "name": "datasqrl.tutorials.Quickstart",
    "description": "A docker compose datasqrl profile",
    "homepage": "https://www.datasqrl.com/docs/getting-started/quickstart",
    "documentation": "Quickstart tutorial for datasqrl.com",
    "topics": ["tutorial"]
  }
}
Field Name | Description | Required? |
---|---|---|
name | Name of the package. The package name should start with the name of the individual or organization that provides the package. | Yes |
description | A description of the package. | No |
license | The license used by this package. | No |
documentation | Link that points to documentation for this package. | No |
homepage | Link that points to the homepage for this package. | No |
topics | An array of keywords or topics that label the contents of the package. | No |
Testing-related configuration is found in the test-runner section.
{
  "test-runner": {
    "delay-sec": 30
  }
}
- delay-sec: The number of seconds to wait between starting the processing of data and snapshotting the data.
The values section of the DataSQRL configuration allows you to specify configuration values that are passed through to the engines they pertain to.
The default deployment profile supports a flink-config section for injecting additional Flink runtime configuration. You can use this section of the configuration to specify any Flink configuration option.
{
  "values" : {
    "flink-config" : {
      "taskmanager.memory.network.max": "800m",
      "execution.checkpointing.mode" : "EXACTLY_ONCE",
      "execution.checkpointing.interval" : "1000ms"
    },
    "create-topics": ["mytopic"]
  }
}
For Flink, the values configuration settings take precedence over identical configuration settings in the compiled physical plans.
For the log engine, the create-topics option allows you to specify topics to create in the cluster prior to starting the pipeline. This is useful for testing.
To reference an environment variable in the configuration, use the standard environment variable syntax ${VAR}.
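For example, an environment variable can supply any configuration value; the variable name CHECKPOINT_INTERVAL below is an illustrative assumption:
{
  "values" : {
    "flink-config" : {
      "execution.checkpointing.interval" : "${CHECKPOINT_INTERVAL}"
    }
  }
}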