Commit 279c698

Quickstart updates and changes to overall help flow. (#222)
* Quickstart updates and changes to overall help flow.
* Additions to guides, more TOC/flow updates, some editing.
* Updated per PR feedback.
1 parent 889fcc7 commit 279c698

20 files changed

Lines changed: 238 additions & 177 deletions

README.md

Lines changed: 13 additions & 9 deletions
Original file line number | Diff line number | Diff line change
@@ -1,31 +1,33 @@
1-
# Overview
1+
## What is SQLMesh?
22

33
SQLMesh is a next-generation SQL transformation platform. It provides you with powerful automation for versioning, backfilling, deployment, and testing — allowing you to focus on simply writing SQL.
44

55
SQLMesh is able to achieve all of this with minimal setup; there are no additional services or dependencies required to get started using SQLMesh other than a connection to your existing data warehouse or engine.
66

77
## Why SQLMesh?
88

9-
One of the main advantages over other transformation frameworks is that SQLMesh does not categorize incrementality as an "advanced" use case that should be avoided unless absolutely necessary. While other frameworks default to full refresh compute, the default for SQLMesh is to optimize for incremental compute, i.e. computing one day or hour at a time. This allows SQLMesh to be faster and more scalable than other frameworks, allowing you to take advantage of the cost and time savings of incrementality.
9+
One of the main advantages over other transformation frameworks is that SQLMesh does not categorize incrementality as an "advanced" use case that should be avoided unless absolutely necessary. While other frameworks default to full refresh compute, the default for SQLMesh is to optimize for incremental compute, i.e. computing one day or hour at a time. This allows SQLMesh to be faster and more scalable than other frameworks, allowing you to take advantage of the cost and time savings of incrementality.
1010

11-
SQLMesh also automates away complexity, so configuring models is no longer tricky due to complex macros that require understanding of the context for execution. Writing your data pipelines incrementally with SQLMesh not only saves you money and time, but keeps your systems maintainable, reliable, and accessible to all of your data practicioners.
11+
SQLMesh also automates away complexity, so configuring models is no longer tricky due to complex macros that require understanding of the context for execution. Writing your data pipelines incrementally with SQLMesh not only saves you money and time, but keeps your systems maintainable, reliable, and accessible to all of your data practitioners.
1212

1313
### Reduced cost
14-
As discussed above, incremental compute is significantly cheaper than full refresh compute.
14+
Incremental compute is significantly cheaper than full refresh compute.
1515

16-
For example, if you have one year of history but only receive new data on a daily basis, only processing that new data is ~365x cheaper than reprocessing one year each day. As your data grows, it's possible that refreshing your tables may take longer than a day, which means you would never be able to catch up!
16+
For example, if you have one year of history but only receive new data on a daily basis, just processing that new data is ~365x cheaper than reprocessing one year each day. As your data grows, it's possible that refreshing your tables may take longer than a day, which means you would never be able to catch up!
1717

1818
In addition, you may not be able to refresh particular tables all at once; they may need to be batched into smaller intervals. The cost of your data pipelines compounds as more dependent pipelines are created. Therefore, writing your data pipelines incrementally as much as possible can result in exponential savings.
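
The ~365x figure in the "Reduced cost" section can be checked with a few lines of arithmetic. This is a rough sketch that assumes cost scales linearly with the number of daily partitions processed, which is a simplifying assumption and not a SQLMesh measurement; the function names are illustrative only.

```python
# Assumption: cost is proportional to the number of day-partitions processed.
HISTORY_DAYS = 365  # one year of history, new data arrives daily


def full_refresh_cost(history_days: int) -> int:
    """Partitions processed by one daily run that rebuilds the whole history."""
    return history_days


def incremental_cost() -> int:
    """Partitions processed by one daily run that only handles the new day."""
    return 1


# Cost ratio for a single daily run.
per_run_ratio = full_refresh_cost(HISTORY_DAYS) / incremental_cost()

# Total partitions processed over a year of daily runs.
yearly_full = 365 * full_refresh_cost(HISTORY_DAYS)
yearly_incremental = 365 * incremental_cost()

print(per_run_ratio)       # 365.0
print(yearly_full)         # 133225
print(yearly_incremental)  # 365
```

Under this model the ratio holds per run, so every dependent pipeline that also rebuilds its full history multiplies the total accordingly.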
1919

2020
### Increased efficiency
2121
SQLMesh safely reuses physical tables across isolated environments. Some databases, such as Snowflake, have [zero-copy cloning](https://docs.snowflake.com/en/user-guide/tables-storage-considerations.html#label-cloning-tables) — but this is a manual process, and not widely supported.
2222

23-
SQLMesh is able to automatically reuse tables regardless of which data warehouse or engine you're using. This is achieved by storing fingerprints of your models and by employing [views](https://en.wikipedia.org/wiki/View_(SQL)) like pointers to physical locations. Therefore, spinning up a new development environment is fast and cheap; only models with incompatible changes need to be materialized, once again saving time and money.
23+
SQLMesh is able to automatically reuse tables regardless of which data warehouse or engine you're using. This is achieved by storing fingerprints of your models and by employing [views](https://en.wikipedia.org/wiki/View_(SQL)) like pointers to physical locations. Therefore, spinning up a new development environment is fast and cheap; only models with incompatible changes need to be materialized, saving time and money.
2424

2525
### Automation for everyone
26-
Creating maintainable and scalable data pipelines is extremely difficult, and is a task usually reserved for data engineers. As your data grows, the need for incremental compute becomes mandatory due to the cost and time constaints.
26+
Creating maintainable and scalable data pipelines is extremely difficult, and a task usually reserved for data engineers. As your data grows, the need for incremental compute becomes mandatory due to the cost and time constraints.
2727

28-
Incremental models have inherent state of which partitions have been computed. This makes managing the consistency and accuracy challenging (leaving no data leakages or gaps). Although a seasoned engineer may have the expertise or tooling to operate one of these tables, an analyst would not. In these organizations, analysts would either need to file a ticket and wait on data engineering resources, or bypass core data models by running their own custom jobs, which inevitably leads to an ungoverned data mess. SQLMesh democratizes the ability to write safe and scalable data pipelines to all data practitioners, regardless of technical ability.
28+
Incremental models carry inherent state: which partitions have already been computed. This makes it challenging to keep the data consistent and accurate (with no data leakage or gaps).
29+
30+
Although a seasoned engineer may have the expertise or tooling to operate one of these tables, an analyst would not. In these organizations, analysts would either need to file a ticket and wait on data engineering resources, or bypass core data models by running their own custom jobs, which inevitably leads to an ungoverned data mess. SQLMesh democratizes the ability to write safe and scalable data pipelines to all data practitioners, regardless of technical ability.
2931

3032
### Complexity made simple
3133
As more and more models and users depend on core tables, the complexity of making changes increases. You must ensure that all downstream data consumers are compatible and updated with any new changes.
@@ -35,7 +37,9 @@ Propagating a change throughout a complex graph of dependencies is difficult to
3537
### Collaboration and integration
3638
SQLMesh allows for data pipelines to be a collaborative experience. It both empowers less technical data users to contribute and enables them to collaborate with others who may be more familiar with data engineering. Development can be done in a fully isolated environment that can be accessed and validated by others.
3739

38-
SQLMesh provides information about changes and how they may affect your downstream consumers. This transparency, along with the ability to categorize changes, makes it more feasible for a less technically savvy user to make updates to core data pipelines. By integrating with our Continuous Integration/Continuous Delivery (CI/CD) flows, you can require approval for any changes before going to production, ensuring that the relevant data owners or experts can review and validate the changes.
40+
SQLMesh provides information about changes and how they may affect your downstream consumers. This transparency, along with the ability to categorize changes, makes it more feasible for a less technically savvy user to make updates to core data pipelines.
41+
42+
By integrating with our Continuous Integration/Continuous Delivery (CI/CD) flows, you can require approval for any changes before going to production, ensuring that the relevant data owners or experts can review and validate the changes.
3943

4044
### Testing and reliability
4145
SQLMesh supports both audits and tests. Although unit tests have been commonplace in the world of software engineering, they are relatively unknown in the data world. SQLMesh's data unit tests allow for stability and reliability, as data pipeline owners can ensure that changes to models don't change underlying logic. These tests can run quickly in CI, or locally without having to create full scale tables.

docs/api/overview.md

Lines changed: 8 additions & 4 deletions
@@ -1,11 +1,11 @@
1-
# Overview
1+
# API
22

3-
SQLMesh can be used with a [cli](cli.md), [notebook](notebook.md), or directly through [Python](python.md). Each interface aims to have parity in both functionality and arguments.
3+
SQLMesh can be used with a [CLI](cli.md), [notebook](notebook.md), or directly through [Python](python.md). Each interface aims to have parity in both functionality and arguments. The following is a list of available commands.
44

55
## plan
66
Plan is the main command of SQLMesh. It allows you to interactively create a migration plan, understand the downstream impact, and apply it. All changes to models and environments are materialized through plan.
77

8-
Read more about [plan](/concepts/plans).
8+
Read more about [plans](/concepts/plans).
99

1010
## evaluate
1111
Evaluate a model or snapshot (running its query against a DB/Engine). This method is used to test or iterate on models without side effects.
@@ -19,14 +19,18 @@ Given a SQL query, fetches a pandas dataframe.
1919
## test
2020
Runs all tests.
2121

22+
Read more about [testing](/guides/tests).
23+
2224
## audit
2325
Runs all audits.
2426

27+
Read more about [auditing](/guides/audits).
28+
2529
## format
2630
Formats all SQL model files in place.
2731

2832
## diff
29-
Shows the diff between the local model and a model in an evironment.
33+
Shows the diff between the local model and a model in an environment.
3034

3135
## dag
3236
Shows the [DAG](../glossary.md).

docs/api/python.md

Lines changed: 2 additions & 0 deletions
@@ -1 +1,3 @@
11
# Python
2+
3+
## TODO

docs/concepts/audits.md

Lines changed: 13 additions & 13 deletions
@@ -1,8 +1,10 @@
1-
# Audits
2-
Audits are one of the tools SQLMesh provides to validate your data. Along with tests, audits are a great way to ensure the quality of your data and to build trust in your data across your organization. A comprehensive suite of audits can identify data issues upstream, whether they are from your vendors or other teams. Audits also empower your data engineers and analysts to work with confidence by catching problems early as they work on new features or make updates to your models.
1+
# Auditing
2+
Audits are one of the tools SQLMesh provides to validate your models. Along with [tests](/concepts/tests), they are a great way to ensure the quality of your data and to build trust in it across your organization.
33

4-
## What exactly are audits?
5-
Audits are SQL queries that should not return any rows. In other words, they query for bad data, so returned rows would indicate something is wrong. In its simplest form, an audit is defined with the custom AUDIT expression along with a query as in the following example:
4+
A comprehensive suite of audits can identify data issues upstream, whether they are from your vendors or other teams. Audits also empower your data engineers and analysts to work with confidence by catching problems early as they work on new features or make updates to your models.
5+
6+
## Example audit
7+
In SQLMesh, audits are defined in `.sql` files in an `audit` directory in your SQLMesh project. Multiple audits can be defined in a single file, so you can organize them to your liking. Audits are SQL queries that should not return any rows; in other words, they query for bad data, so returned rows indicate that something is wrong. In its simplest form, an audit is defined with the custom AUDIT expression along with a query, as in the following example:
68

79
```sql
810
AUDIT (
@@ -15,15 +17,16 @@ WHERE ds BETWEEN @start_ds AND @end_ds AND
1517
price IS NULL
1618
```
1719

18-
In the example above, we defined an audit named `assert_item_price_is_not_null` on the model `sushi.items` to ensure that every sushi item has a price. If the query is in a different dialect than the rest of your project, you can specify it here as we did in the example, and SQLGlot will automatically understand how to execute the query. While the query can technically be on any model or even multiple models, the model specified in the audit definition tells SQLMesh when to run the audit during your pipeline's execution. If the query returns any records, it means there may be an issue that requires your attention.
20+
In the example, we defined an audit named `assert_item_price_is_not_null` on the model `sushi.items`, ensuring that every sushi item has a price.
1921

20-
Audits are defined in `.sql` files in an `audit` directory in your SQLMesh project. Multiple audits can be defined in a single file, so you can organize them to your liking.
22+
**Note:** If the query is in a different dialect than the rest of your project, you can specify it here as we did in the example, and SQLGlot will automatically understand how to execute the query.
2123

22-
## Running audits
24+
While the query can technically be on any model or even multiple models, the model specified in the audit definition tells SQLMesh when to run the audit during your pipeline's execution. If the query returns any records, it means there is a potential issue requiring your attention.
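
The "no rows means the audit passed" idea can be illustrated outside SQLMesh. The sketch below is not SQLMesh's audit runner: it uses Python's built-in `sqlite3` module as a stand-in for the warehouse, a simplified `items` table in place of `sushi.items`, and named parameters in place of the `@start_ds` / `@end_ds` macros. The helper `run_audit` is hypothetical.

```python
import sqlite3

# Stand-in warehouse: an in-memory SQLite database with a simplified table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, ds TEXT, price REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        (1, "2022-01-01", 4.5),
        (2, "2022-01-01", None),  # bad row: missing price
        (3, "2022-01-02", 6.0),
    ],
)

# The audit queries for bad data, so any returned row is a failure.
AUDIT_QUERY = """
SELECT id, ds, price
FROM items
WHERE ds BETWEEN :start_ds AND :end_ds
  AND price IS NULL
"""


def run_audit(connection, query, start_ds, end_ds):
    """Hypothetical helper: return offending rows; an empty list means the audit passed."""
    params = {"start_ds": start_ds, "end_ds": end_ds}
    return connection.execute(query, params).fetchall()


bad_rows = run_audit(conn, AUDIT_QUERY, "2022-01-01", "2022-01-02")
clean_rows = run_audit(conn, AUDIT_QUERY, "2022-01-02", "2022-01-02")

print("failed" if bad_rows else "passed", bad_rows)
print("failed" if clean_rows else "passed", clean_rows)
```

The first call covers the day with the missing price and returns one offending row; the second call covers only a clean day and returns nothing.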
2325

26+
## Run an audit
2427
### The CLI audit command
2528

26-
You can execute audits with the `sqlmesh audit` command, as in the following example:
29+
You can execute audits with the `sqlmesh audit` command as follows:
2730
```
2831
% sqlmesh --path project audit -start 2022-01-01 -end 2022-01-02
2932
Found 1 audit(s).
@@ -41,10 +44,8 @@ Done.
4144
When you apply a plan, SQLMesh will automatically run each model's audits. By default, SQLMesh will halt the pipeline when an audit fails in order to prevent potentially invalid data from propagating further downstream. This behavior can be changed for individual audits. Refer to [Non-blocking audits](#non-blocking-audits).
4245

4346
## Advanced usage
44-
4547
### Skipping audits
46-
47-
Audits can be skipped by setting the skip argument to true as in the following example:
48+
Audits can be skipped by setting the `skip` argument to `true` as in the following example:
4849

4950
```sql
5051
AUDIT (
@@ -58,8 +59,7 @@ WHERE ds BETWEEN @start_ds AND @end_ds AND
5859
```
5960

6061
### Non-blocking audits
61-
62-
By default, audits that fail will stop the execution of the pipeline in order to prevent bad data from propagating further downstream. An audit can be configured to notify you when it fails without blocking the execution of the pipeline, as in the following example:
62+
By default, audits that fail will stop the execution of the pipeline in order to prevent bad data from propagating further. An audit can be configured to notify you when it fails without blocking the execution of the pipeline, as in the following example:
6363

6464
```sql
6565
AUDIT (

docs/concepts/configs.md

Lines changed: 10 additions & 14 deletions
@@ -1,9 +1,9 @@
11
# Configs
2-
Configs define settings for things like engines (eg. Snowflake or Spark), schedulers (eg. Airflow or Dagster), and the SQL dialect. The config file is defined in config.py in the root directory of your SQLMesh project.
2+
Configs define settings for things like engines (such as Snowflake or Spark), schedulers (such as Airflow or Dagster), and the SQL dialect. The config file is defined in config.py in the root directory of your SQLMesh project.
33

44
## Settings
55
### connections
6-
A dictionary of supported connection and their configurations. The key represents a unique connection name. If there is only one connection, its configuration can be provided directly omitting the dictionary.
6+
A dictionary of supported connections and their configurations. The key represents a unique connection name. If there is only one connection, its configuration can be provided directly, omitting the dictionary.
77

88
```python
99
import duckdb
@@ -17,8 +17,7 @@ Config(
1717
```
1818

1919
### scheduler
20-
Identifies which scheduler backend to use. The scheduler backend is used for both storing metadata and executing [plans](/concepts/plans). By default, the `BuiltinSchedulerBackend` is used which uses the existing SQL engine to store metadata and has a simple scheduler. The `AirflowSchedulerBackend` should be used if you want to integrate with Airflow.
21-
20+
Identifies which scheduler backend to use. The scheduler backend is used both for storing metadata and executing [plans](/concepts/plans). By default, the `BuiltinSchedulerBackend` is used, which uses the existing SQL engine to store metadata and has a simple scheduler. The `AirflowSchedulerBackend` should be used if you want to integrate with Airflow.
2221

2322
```python
2423
from sqlmesh.core.config import AirflowSchedulerConfig, Config
@@ -27,16 +26,16 @@ Config(scheduler=AirflowSchedulerConfig())
2726
```
2827

2928
### notification_targets
30-
Notification targets are used to receive logging or updates as SQLMesh processes things. Notification targets can be used to implement things like integration with Github or Slack.
29+
Used to receive logging or updates as SQLMesh processes things. Notification targets can be used to implement things such as integration with GitHub or Slack.
3130

3231
### dialect
3332
The default SQL dialect of model queries. Default: same as engine dialect. The dialect is used if a [model](/concepts/models) does not define a dialect. Note that this dialect only specifies what the model is written as. At runtime, model queries will be transpiled to the correct engine dialect.
3433

3534
### physical_schema
36-
The default schema used to store materialized tables. By default this will store all physical tables managed by SQLMesh in the `sqlmesh` schema/db in your warehouse.
35+
The default schema used to store materialized tables. By default, this will store all physical tables managed by SQLMesh in the `sqlmesh` schema/db in your warehouse.
3736

3837
### snapshot_ttl
39-
Duration before unpromoted snapshots are removed. This is defined as a string with the default be `in 1 week`. Other [relative strings](https://dateparser.readthedocs.io/en/latest/) can be used liked `in 30 days`.
38+
Duration before unpromoted snapshots are removed. This is defined as a string with the default `in 1 week`. Other [relative strings](https://dateparser.readthedocs.io/en/latest/) can be used, such as `in 30 days`.
4039

4140
### time_column_format
4241
The default format to use for all model time columns. Defaults to %Y-%m-%d.
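
Because the time format uses Python format codes, the default `%Y-%m-%d` can be tried directly with the standard `datetime` module. This is a minimal sketch of how such a format string behaves in plain Python; it does not show how SQLMesh applies the setting internally, and the alternative format at the end is a hypothetical example value.

```python
from datetime import datetime

TIME_COLUMN_FORMAT = "%Y-%m-%d"  # the default stated above

ts = datetime(2022, 1, 2, 15, 30)

# Formatting keeps only the parts named in the format string.
formatted = ts.strftime(TIME_COLUMN_FORMAT)

# Parsing with the same format restores a datetime at midnight,
# because the time of day was never written out.
parsed = datetime.strptime(formatted, TIME_COLUMN_FORMAT)

# Hypothetical alternative: a compact format without separators.
compact = ts.strftime("%Y%m%d")

print(formatted)  # 2022-01-02
print(parsed)     # 2022-01-02 00:00:00
print(compact)    # 20220102
```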
@@ -47,9 +46,8 @@ This time format uses python format codes. https://docs.python.org/3/library/dat
4746
A list of users that can be used for approvals/notifications.
4847

4948
## Precedence
50-
5149
You can configure your project in multiple places, and SQLMesh will prioritize configurations according to
52-
the following order. From least to greatest precedence:
50+
the following order, from least to greatest precedence:
5351

5452
- A Config object defined in a config.py file at the root of your project:
5553

@@ -90,11 +88,10 @@ local_config = Config(
9088
... )
9189
```
9290

93-
## Using Config
94-
95-
The most common way to configure your SQLMesh project is with a `config.py` module at the root of the
91+
## Using Config objects
92+
The most common way to configure your SQLMesh project is with a `config.py` module at the root of your
9693
project. A SQLMesh Context will automatically look for Config objects there. You can have multiple
97-
Config objects defined, and then tell Context which one to use. For example, you can have different
94+
Config objects defined and tell Context which one to use. For example, you can have different
9895
Configs for local and production environments, Airflow, and Model tests.
9996

10097
Example config.py:
@@ -127,4 +124,3 @@ To use a Config, pass in its variable name to Context.
127124

128125
```
129126

130-
For more information about the Config class and its parameters, see `sqlmesh.core.config.Config`.
