
DataSQRL Configuration

The data pipeline topology, compiler options, dependencies, connector templates and more are configured through the DataSQRL package.json configuration file.

You pass a configuration file to the compiler command via the -c or --config flag. You can specify a single configuration file or multiple files. If multiple files are specified, they are merged in the order given: fields (including arrays) are replaced and objects are merged. If no configuration file is specified explicitly, the compiler uses the package.json file in the local directory if it exists, or falls back to the default configuration.
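For example, to compile with a base configuration merged with an environment-specific override (a sketch: the datasqrl compile invocation, the package-prod.json override file, and myscript.sqrl are placeholders for your own setup):

datasqrl compile -c package.json -c package-prod.json myscript.sqrl

The second file is merged over the first, so any fields it defines replace those in package.json.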

Engines

engines is a map of engine configurations by engine name that the compiler uses to instantiate the engines in the data pipeline. The DataSQRL compiler produces an integrated data pipeline against those engines. At a minimum, DataSQRL expects that a stream processing engine is configured.

Engines can be enabled with the enabled-engines property. The default set of engines is listed below:

{
  "enabled-engines": ["vertx", "postgres", "kafka", "flink"]
}

Flink

Apache Flink is the default stream processing engine.

The physical plan that DataSQRL generates for the Flink engine includes:

  • FlinkSQL table descriptors for the sources and sinks
  • FlinkSQL view definitions for the data processing
  • A list of connector dependencies needed for the sources and sinks.

Flink reads data from and writes data to the engines in the generated data pipeline. DataSQRL uses connector configuration templates to instantiate those connections. These templates are configured under the connectors property.

Connectors that connect Flink to other engines and external systems are configured in the connectors property. Connector configurations use Flink's configuration options and are passed through to Flink without modification.

Variables with the sqrl prefix are template variables that the DataSQRL compiler instantiates. For example, ${sqrl:table-name} provides the table name for a connector that writes to a table.

{
  "engines" : {
    "flink" : {
      "connectors": {
        "postgres": {
          "connector": "jdbc-sqrl",
          "password": "${JDBC_PASSWORD}",
          "driver": "org.postgresql.Driver",
          "username": "${JDBC_USERNAME}",
          "url": "jdbc:postgresql://${JDBC_URL}",
          "table-name": "${sqrl:table-name}"
        },
        "kafka": {
          "connector" : "kafka",
          "format" : "flexible-json",
          "properties.bootstrap.servers": "${PROPERTIES_BOOTSTRAP_SERVERS}",
          "properties.group.id": "${PROPERTIES_GROUP_ID}",
          "scan.startup.mode" : "group-offsets",
          "properties.auto.offset.reset" : "earliest",
          "topic" : "${sqrl:original-table-name}"
        }
      }
    }
  }
}

Flink runtime configuration can be specified in the values configuration section:

{
  "values" : {
    "flink-config": {
      "table.exec.source.idle-timeout": "100 ms"
    }
  }
}

The configuration options are set on the Flink runtime when running or testing via the DataSQRL command.

Postgres

Postgres is the default database engine.

The physical plan that DataSQRL generates for the Postgres engine includes:

  • Table DDL statements for the physical tables.
  • Index DDL statements for the index structures on those tables.
  • View DDL statements for the logical tables. Views are only created when no server engine is enabled (see the example below).
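For example, a minimal sketch of an engine selection without a server engine, assuming that omitting vertx from enabled-engines is how the server engine is disabled; in this case Postgres views are created for the logical tables:

{
  "enabled-engines": ["postgres", "kafka", "flink"]
}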

Vertx

Vertx is the default server engine: a high-performance GraphQL server implemented with Vert.x. The GraphQL endpoint is configured through the GraphQL schema.

The physical plan that DataSQRL generates for Vertx includes:

  • The connection configuration for the database(s) and log engine
  • A mapping of GraphQL endpoints to queries for execution against the database(s) and log engine.

Kafka

Apache Kafka is the default log engine.

The physical plan that DataSQRL generates for Kafka includes:

  • A list of topics with configuration and (optional) Avro schema.

Iceberg

Apache Iceberg is a table format that can be used as a database engine with DataSQRL.

The Iceberg engine requires an enabled query engine to execute queries against it.
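For example, a sketch of an enabled-engines configuration that pairs Iceberg with Snowflake (described in the next section) as the query engine:

{
  "enabled-engines": ["vertx", "iceberg", "snowflake", "kafka", "flink"]
}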

The physical plan that DataSQRL generates for the Iceberg engine includes:

  • Table DDL statements for the physical tables
  • Catalog registration for registering the tables in the associated catalog, e.g. AWS Glue.

Snowflake

Snowflake is a query engine that can be used in combination with a table format as a database in DataSQRL.

The physical plan that DataSQRL generates for the Snowflake engine includes:

  • External table registration through catalog integration. The Snowflake connector currently supports AWS Glue.
  • View definitions for the logical tables.

To define the catalog integration for Snowflake:

{
  "snowflake" : {
    "catalog-name": "MyCatalog",
    "external-volume": "iceberg_storage_vol"
  }
}

Compiler

The compiler section of the configuration controls elements of the core compiler and DAG Planner.

{
  "compiler" : {
    "logger": "print",
    "explain": {
      "visual": true,
      "text": true,
      "extended": false
    },
    "output": {
      "add-uid": true,
      "table-suffix": "-blue"
    }    
  }
}
  • logger configures the logging framework used for logging statements like EXPORT MyTable TO logger.MyTable;. It is print by default which logs to STDOUT. Set it to the configured log engine for logging output to be sent to that engine, e.g. "logger": "kafka". Set it to none to suppress logging output.
  • explain configures how the DAG plan compiled by DataSQRL is presented in the build directory. If visual is true, a visual representation of the DAG is written to the pipeline_visual.html file which you can open in any browser. If text is true, a textual representation of the DAG is written to the pipeline_explain.txt file. If extended is true, the DAG outputs include more information like the relational plan which may be very verbose.
  • output configures how table sink names in FlinkSQL are generated. If add-uid is true, a unique identifier is appended to the name to ensure uniqueness. If you set this to false, you need to ensure that the table names are unique and that the same table is not written to multiple engines. The table-suffix is appended to the sink table names (before the uid) and is empty by default. This can be used to distinguish tables from different deployments.

Dependencies

:::warn This is changing and will be updated soon. :::

dependencies maps import and export paths to local folders or remote repositories.

Dependency Aliasing:

Dependency declarations can be used to alias a local folder:

{
  "dependencies" : {
    "datasqrl.seedshop" : {
      "name": "datasqrl.seedshop.test"
    }
  }
}

With this declaration, the statement IMPORT datasqrl.seedshop.Orders resolves the datasqrl.seedshop path to the local folder datasqrl/seedshop/test.

This is useful for swapping out connectors between different environments or for testing without making changes to the SQRL script.

Repository Imports:

Dependencies can be used to import tables or functions from a remote repository:

{
  "dependencies" : {
    "sqrl-functions" : {
      "name": "sqrl-functions",
      "repository": "github.com/DataSQRL/sqrl-functions",
      "tag": "v0.6.0"
    }
  }
}

This dependency configuration clones the referenced repository at the tag v0.6.0 into the folder build/sqrl-functions. We can then import the openai functions as:

IMPORT sqrl-functions.openai.vector_embedding;

If the tag is omitted, it clones the current main branch.

Script

The main SQRL script and (optional) GraphQL schema for the project can be configured in the project configuration under the script section:

 {
  "script": {
    "main": "mainScript.sqrl",
    "graphql": "apiSchema.graphqls"
  }
}

If the script is configured in the configuration file, it is not necessary to name it as a command argument. Script arguments passed on the command line take precedence over the configured values.

Package Information

The package section of the configuration provides information about the package or script. The whole section is optional and used primarily when hosting packages in a repository as dependencies.

{
  "package": {
    "name": "datasqrl.tutorials.Quickstart",
    "description": "A docker compose datasqrl profile",
    "homepage": "https://www.datasqrl.com/docs/getting-started/quickstart",
    "documentation": "Quickstart tutorial for datasqrl.com",
    "topics": ["tutorial"]
  }
}
| Field Name | Description | Required? |
| --- | --- | --- |
| name | Name of the package. The package name should start with the name of the individual or organization that provides the package. | Yes |
| description | A description of the package. | No |
| license | The license used by this package. | No |
| documentation | Link that points to the documentation for this package. | No |
| homepage | Link that points to the homepage for this package. | No |
| topics | An array of keywords or topics that label the contents of the package. | No |

Testing

Testing related configuration is found in the test-runner section.

{
  "test-runner": {
    "delay-sec": 30
  }
}
  • delay-sec: The number of seconds to wait between starting the processing of data and snapshotting the data.

Values

The values section of the DataSQRL configuration allows you to specify configuration values that are passed through to the engines they pertain to.

The default deployment profile supports a flink-config section for injecting additional Flink runtime configuration. You can use this section of the configuration to specify any Flink configuration option.

{
 "values" : {
    "flink-config" : {
      "taskmanager.memory.network.max": "800m",
      "execution.checkpointing.mode" : "EXACTLY_ONCE",
      "execution.checkpointing.interval" : "1000ms"
    },
    "create-topics": ["mytopic"]
   }
}

For Flink, the values configuration settings take precedence over identical configuration settings in the compiled physical plans.

For the log engine, the create-topics option allows you to specify topics to create in the cluster prior to starting the pipeline. This is useful for testing.

Environment Variables

To reference an environment variable in the configuration, use the standard environment variable syntax ${VAR}.
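For example, a minimal sketch that injects a hypothetical CHECKPOINTING_INTERVAL environment variable into the Flink runtime configuration via the values section:

{
  "values" : {
    "flink-config" : {
      "execution.checkpointing.interval" : "${CHECKPOINTING_INTERVAL}"
    }
  }
}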