|
| 1 | +# FAQ |
| 2 | + |
| 3 | +## General |
| 4 | + |
| 5 | +???+ question "What is SQLMesh?" |
| 6 | + SQLMesh is an open source data transformation framework that brings the best practices of DevOps to data teams. It enables data engineers, scientists, and analysts to efficiently run and deploy data transformations written in SQL or Python. |
| 7 | + |
| 8 | + It is created and maintained by Tobiko Data, a company founded by data leaders from Airbnb, Apple, and Netflix. |
| 9 | + |
| 10 | + Check out the [quickstart guide](./quick_start.md) to see it in action. |
| 11 | + |
| 12 | +??? question "What is SQLMesh used for?" |
| 13 | + SQLMesh is used to manage and execute data transformations - the process of converting raw data into a form useful for making business decisions. |
| 14 | + |
| 15 | +??? question "What problems does SQLMesh solve?" |
| 16 | + **Problem: organizing, maintaining, and changing data transformation code in SQL or Python** |
| 17 | + |
| 18 | + Solutions: |
| 19 | + |
| 20 | + - Identify dependencies among data transformation models and determine the order in which they should run |
| 21 | + - Run data audits and unit tests to prevent unintended side effects from code changes |
| 22 | + - Implement best practices from the DevOps paradigm, such as development environments and continuous integration/continuous deployment (CI/CD) |
| 23 | + - Execute transformations written in one SQL dialect on an engine/database that runs a different SQL dialect (SQL transpilation) |
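As a rough illustration of the dependency-ordering idea in the first solution above, the sketch below uses Python's standard-library `graphlib` to order a small set of models. It is not SQLMesh's implementation, and the model names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical model names; each key maps to the set of models it reads from.
dependencies = {
    "raw_orders": set(),
    "raw_customers": set(),
    "stg_orders": {"raw_orders"},
    "stg_customers": {"raw_customers"},
    "customer_revenue": {"stg_orders", "stg_customers"},
}


def run_order(deps):
    """Return model names ordered so every model comes after its upstream models."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

Any ordering that places each model after the models it reads from is valid, so the exact order among unrelated models may vary.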
| 24 | + |
| 25 | + <br> |
| 26 | + |
| 27 | + **Problem: understanding a complex set of data transformations** |
| 28 | + |
| 29 | + Solutions: |
| 30 | + |
| 31 | + - Determine and display the flow of data through data transformation models |
| 32 | + - Trace which columns in a table contribute to a column in another table (column-level lineage) |
| 33 | + |
| 34 | + <br> |
| 35 | + |
| 36 | + **Problem: inefficient, unnecessarily expensive data transformations** |
| 37 | + |
| 38 | + Solutions: |
| 39 | + |
| 40 | + - Understand the impacts of a code change on the codebase and underlying data tables *without running the code* |
| 41 | + - Efficiently deploy code changes by only running the transformations impacted by the changes |
| 42 | + - Safely promote transformations executed in a development environment to production so computations aren’t needlessly re-executed |
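The idea of running only the transformations impacted by a change can be sketched as a graph traversal: start from the changed model and collect everything downstream of it. This is a simplified illustration under assumed model names, not a description of how SQLMesh computes impact internally.

```python
from collections import deque

# Hypothetical project: each key maps to the set of models it reads from.
upstream = {
    "stg_orders": {"raw_orders"},
    "stg_customers": {"raw_customers"},
    "customer_revenue": {"stg_orders", "stg_customers"},
    "order_counts": {"stg_orders"},
}


def impacted_models(changed, upstream):
    """Return the changed model plus every model downstream of it."""
    # Invert the mapping so each model points to its direct consumers.
    downstream = {}
    for model, parents in upstream.items():
        for parent in parents:
            downstream.setdefault(parent, set()).add(model)

    # Breadth-first walk from the changed model through its consumers.
    impacted = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, ()):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted


print(sorted(impacted_models("stg_customers", upstream)))
```

In this sketch, a change to `stg_customers` leaves `stg_orders` and `order_counts` untouched, so they would not need to be re-run.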
| 43 | + |
| 44 | + <br> |
| 45 | + |
| 46 | + **Problem: complex business requirements and data transformations** |
| 47 | + |
| 48 | + Solutions: |
| 49 | + |
| 50 | + - Easily and safely implement incremental data loading |
| 51 | + - Perform complex data transformations or operations with Python models (e.g., machine learning models, geocoding) |
| 52 | + |
| 53 | + <br> |
| 54 | + |
| 55 | + ...and more! |
| 56 | + |
| 57 | +??? question "What is semantic understanding of SQL?" |
| 58 | + Semantic understanding is the result of analyzing SQL code to determine what it does at a granular level. SQLMesh uses the free, open-source Python library [SQLGlot](https://github.com/tobymao/sqlglot) to parse the SQL code and build the semantic understanding. |
| 59 | + |
| 60 | + Semantic understanding enables SQLMesh features such as transpilation (running code written in one SQL dialect on an engine that uses another dialect) and protection against incremental loading queries duplicating data. |
| 61 | + |
| 62 | +## Getting started |
| 63 | + |
| 64 | +??? question "How do I install SQLMesh?" |
| 65 | + SQLMesh is a Python library. After ensuring you have [an appropriate Python runtime](./prerequisites.md), install it [with `pip`](./installation.md). |
| 66 | + |
| 67 | +??? question "How do I use SQLMesh?" |
| 68 | + SQLMesh has three interfaces: [command line](./reference/cli.md), [Jupyter or Databricks notebook](./reference/notebook.md), and a graphical user interface. |
| 69 | + |
| 70 | + The [quickstart guide](./quick_start.md) demonstrates an example project in each of the interfaces. |
| 71 | + |
| 72 | +## Databases/Engines |
| 73 | + |
| 74 | +??? question "What databases/engines does SQLMesh work with?" |
| 75 | + SQLMesh works with BigQuery, Databricks, DuckDB, PostgreSQL, GCP PostgreSQL, Redshift, Snowflake, and Spark. See [this page](./integrations/engines.md) for more information. |
| 76 | + |
| 77 | +??? question "When would you use different databases for executing data transformations and storing state information?" |
| 78 | + SQLMesh requires storing information about projects and when their transformations were run. By default, it stores this information in the same database where the models run. |
| 79 | + |
| 80 | + Unlike data transformations, storing state information requires database transactions. Some databases, like BigQuery, aren’t optimized for executing transactions, so storing state information in them can slow down your project. If this occurs, you can store state information in a different database, such as PostgreSQL, that executes transactions more efficiently. |
| 81 | + |
| 82 | +## How is this different from dbt? |
| 83 | + |
| 84 | +??? question "Terminology differences?" |
| 85 | + - dbt “materializations” are analogous to [`model kinds` in SQLMesh](./concepts/models/model_kinds.md) |
| 86 | + - dbt seeds are a [model kind in SQLMesh](./concepts/models/model_kinds.md#seed) |
| 87 | + - dbt’s “tests” are called [`audits` in SQLMesh](./concepts/audits.md) because they are auditing the contents of *data* that already exists. [SQLMesh `tests`](./concepts/tests.md) are equivalent to “unit tests” in software engineering - they evaluate the correctness of *code* based on known inputs and outputs. |
| 88 | + - `dbt build` is analogous to [`sqlmesh run`](./reference/cli.md#run) |
| 89 | + |
| 90 | +??? question "Workflow differences?" |
| 91 | + **dbt workflow** |
| 92 | + |
| 93 | + - Configure your project and set up one database connection target for each environment you will use during development |
| 94 | + - Create, configure, and modify models, seeds, tests, and other project components |
| 95 | + - Execute `dbt build` (or its constituent parts `dbt run`, `dbt seed`, etc.) to evaluate and test the project components |
| 96 | + - Execute `dbt build` (or its constituent parts `dbt run`, `dbt seed`, etc.) on a schedule to ingest and transform new data |
| 97 | + |
| 98 | + **SQLMesh workflow** |
| 99 | + |
| 100 | + - Configure your project and set up a project database (using DuckDB locally or a database connection) |
| 101 | + - Create, configure, and modify models, audits, tests, and other project components |
| 102 | + - Execute `sqlmesh plan [environment name]` to: |
| 103 | + - Generate a summary of the differences between your project files and the environment and whether each change is `breaking`. The `plan` includes a list of the actions needed to implement the changes and automatically runs the project's unit `test`s. |
| 104 | + - Optionally apply the plan to implement the actions and run the project's `audit`s. |
| 105 | + - Execute `sqlmesh run` on a schedule to ingest and transform new data |
| 106 | + |
| 107 | +??? question "Differences in running models?" |
| 108 | + dbt projects are executed with the commands `dbt run` (models only) or `dbt build` (models, tests, snapshots). |
| 109 | + |
| 110 | + In SQLMesh, the execution depends on whether the project’s contents have been modified since the last execution: |
| 111 | + |
| 112 | + - If they have been modified, the `sqlmesh plan` command both: |
| 113 | + 1. Generates a summary of the actions that will occur to implement the code changes and |
| 114 | + 2. Prompts the user to "apply" the plan and execute those actions. |
| 115 | + - If they have not been modified, the [`sqlmesh run`](./reference/cli.md#run) command will evaluate the project models and run the audits. SQLMesh determines which project models should be executed based on their [`cron` configuration parameter](./concepts/models/overview.md#cron). |
| 116 | + |
| 117 | + For example, if a model’s `cron` is `daily`, then `sqlmesh run` will execute the model at most once per day. The first time you issue `sqlmesh run` on a given day, the model will execute; if you issue `sqlmesh run` again that day, nothing will happen because the model is not due to run again until tomorrow. |
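The once-per-day behavior can be pictured with a small sketch. It assumes a daily schedule whose boundary is midnight and a hypothetical `is_due` helper; SQLMesh's actual scheduler is more involved, so treat this only as an illustration of the rule.

```python
from datetime import datetime


def is_due(last_run, now):
    """Return True if a daily boundary (midnight) has passed since the last run."""
    if last_run is None:
        # The model has never run, so it is due.
        return True
    # Most recent midnight at or before `now`.
    latest_boundary = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return last_run < latest_boundary


first_run = datetime(2023, 6, 1, 9, 0)

# A second run later the same day does nothing.
print(is_due(first_run, datetime(2023, 6, 1, 15, 0)))

# A run on the following day executes the model again.
print(is_due(first_run, datetime(2023, 6, 2, 0, 5)))
```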
| 118 | + |
| 119 | +??? question "Differences in state management?" |
| 120 | + **dbt** |
| 121 | + |
| 122 | + By default, dbt runs/builds are independent and have no knowledge of previous runs/builds. This knowledge is called “state” (as in “the state of things”). |
| 123 | + |
| 124 | + dbt has the ability to store/maintain state with the `state` selector method and the `defer` feature. dbt stores state information in `artifacts` like the manifest JSON file and reads the files at runtime. |
| 125 | + |
| 126 | + The dbt documentation [“Caveats to state comparison” page](https://docs.getdbt.com/reference/node-selection/state-comparison-caveats) comments on those features: “The state: selection method is a powerful feature, with a lot of underlying complexity.” |
| 127 | + |
| 128 | + **SQLMesh** |
| 129 | + |
| 130 | + SQLMesh always maintains state about the project structure, contents, and past runs. State information enables powerful SQLMesh features like virtual data environments and easy incremental loads. |
| 131 | + |
| 132 | + State information is stored by default - you do not need to take any action to maintain or to use it when executing models. As the dbt caveats page says, state information is powerful but complex. SQLMesh handles that complexity for you so you don't need to learn about or understand the underlying mechanics. |
| 133 | + |
| 134 | + SQLMesh stores state information in database tables. By default, it stores this information in the same [database/connection where your project models run](./reference/configuration.md#gateways). You can specify a [different database/connection](./reference/configuration.md#state-connection) if you would prefer to store state information somewhere else. |
| 135 | + |
| 136 | + SQLMesh adds information to the state tables via transactions, and some databases like BigQuery are not optimized to execute transactions. Changing the state connection to another database like PostgreSQL can alleviate performance issues you may encounter due to state transactions. |