
Commit 5ea6738

Merge branch 'master' into feat/stats-for-dbt
2 parents f37d903 + 101a488

File tree

12 files changed: +207 -130 lines

README.md

Lines changed: 86 additions & 61 deletions
@@ -1,93 +1,118 @@
-<p align="center">
-  <img alt="Datafold" src="https://user-images.githubusercontent.com/1799931/196497110-d3de1113-a97f-4322-b531-026d859b867a.png" width="50%" />
+<p align="left">
+  <a href="https://datafold.com/"><img alt="Datafold" src="https://user-images.githubusercontent.com/1799931/196497110-d3de1113-a97f-4322-b531-026d859b867a.png" width="30%" /></a>
 </p>

-<h1 align="center">
-  data-diff
+<h1 align="left">
+  data-diff: compare datasets fast, within or across SQL databases
 </h1>

-<h2 align="center">
-  Develop dbt models faster by testing as you code.
-</h2>
-<h4 align="center">
-  See how every change to dbt code affects the data produced in the modified model and downstream.
-</h4>
 <br>

-## What is `data-diff`?
+# How it works

-data-diff is an open source package that you can use to see the impact of your dbt code changes on your dbt models as you code.
+When comparing the data, `data-diff` utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison:

-<div align="center">
+## joindiff
+- Recommended for comparing data within the same database
+- Uses the outer join operation to diff the rows as efficiently as possible within the same database
+- Fully relies on the underlying database engine for computation
+- Requires both datasets to be queryable with a single SQL query
+- Time complexity approximates the JOIN operation and is largely independent of the number of differences in the dataset
+
+## hashdiff
+- Recommended for comparing datasets across different databases
+- Can also be helpful in diffing very large tables with few expected differences within the same database
+- Employs a divide-and-conquer algorithm based on hashing and binary search
+- Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
+- Time complexity approximates the COUNT(*) operation when there are few differences
+- Performance degrades when datasets have a large number of differences

-![development_testing_gif](https://user-images.githubusercontent.com/1799931/236354286-d1d044cf-2168-4128-8a21-8c8ca7fd494c.gif)
+More information about the algorithm and performance considerations can be found [here](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md)

-</div>
+# Get started

-<br>
-
-:eyes: **Watch 4-min demo video [here](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)**
-
-## Getting Started
+Install `data-diff` with specific database adapters, e.g.:

-**Install `data-diff`**
-
-Install `data-diff` with the command that is specific to the database you use with dbt.
-
-### Snowflake
 ```
-pip install data-diff 'data-diff[snowflake,dbt]' -U
+pip install data-diff 'data-diff[postgresql,snowflake]' -U
 ```

-### BigQuery
+Run `data-diff` with connection URIs. In the following example, we compare tables between PostgreSQL and Snowflake using the hashdiff algorithm:
 ```
-pip install data-diff 'data-diff[dbt]' google-cloud-bigquery -U
+data-diff \
+  postgresql://<username>:'<password>'@localhost:5432/<database> \
+  <table> \
+  "snowflake://<username>:<password>@<account>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
+  <TABLE> \
+  -k <primary key column> \
+  -c <columns to compare> \
+  -w <filter condition>
 ```

-### Redshift
-```
-pip install data-diff 'data-diff[redshift,dbt]' -U
-```
+Check out the [documentation](https://docs.datafold.com/reference/open_source/cli) for the full command reference.

-### Postgres
-```
-pip install data-diff 'data-diff[postgres,dbt]' -U
-```

-### Databricks
-```
-pip install data-diff 'data-diff[databricks,dbt]' -U
-```
+# Use cases

-### DuckDB
-```
-pip install data-diff 'data-diff[duckdb,dbt]' -U
-```
+## Data Migration & Replication Testing
+Compare source to target and check for discrepancies when moving data between systems:
+- Migrating to a new data warehouse (e.g., Oracle > Snowflake)
+- Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
+- Continuously replicating data from an OLTP DB to an OLAP DWH (e.g., MySQL > Redshift)

-**Update a few lines in your `dbt_project.yml`**.
-```
-#dbt_project.yml
-vars:
-  data_diff:
-    prod_database: my_database
-    prod_schema: my_default_schema
-```

-**Run your first data diff!**
+## Data Development Testing
+Test SQL code and preview changes by comparing development/staging environment data to production:
+1. Make a change to some SQL code
+2. Run the SQL code to create a new dataset
+3. Compare the dataset with its production version or another iteration
+
+<p align="left">
+  <img alt="dbt" src="https://seeklogo.com/images/D/dbt-logo-E4B0ED72A2-seeklogo.com.png" width="10%" />
+</p>
+
+`data-diff` integrates with dbt Core and dbt Cloud to seamlessly compare local development to production datasets.
+
+:eyes: **Watch the [4-min demo video](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)**
+
+**[Get started with data-diff & dbt](https://docs.datafold.com/development_testing/open_source)**
+
+Reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) for advice and support.
+
+# Supported databases

-```
-dbt run && data-diff --dbt
-```

-We recommend you get started by walking through [our simple setup instructions](https://docs.datafold.com/development_testing/open_source) which contain examples and details.
+| Database        | Status | Connection string |
+|-----------------|--------|-------------------|
+| PostgreSQL >=10 | 💚     | `postgresql://<user>:<password>@<host>:5432/<database>` |
+| MySQL           | 💚     | `mysql://<user>:<password>@<hostname>:3306/<database>` |
+| Snowflake       | 💚     | `"snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]"` |
+| BigQuery        | 💚     | `bigquery://<project>/<dataset>` |
+| Redshift        | 💚     | `redshift://<username>:<password>@<hostname>:5439/<database>` |
+| Oracle          | 💛     | `oracle://<username>:<password>@<hostname>/database` |
+| Presto          | 💛     | `presto://<username>:<password>@<hostname>:8080/<database>` |
+| Databricks      | 💛     | `databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema>` |
+| Trino           | 💛     | `trino://<username>:<password>@<hostname>:8080/<database>` |
+| Clickhouse      | 💛     | `clickhouse://<username>:<password>@<hostname>:9000/<database>` |
+| Vertica         | 💛     | `vertica://<username>:<password>@<hostname>:5433/<database>` |
+| DuckDB          | 💛     | |
+| ElasticSearch   | 📝     | |
+| Planetscale     | 📝     | |
+| Pinot           | 📝     | |
+| Druid           | 📝     | |
+| Kafka           | 📝     | |
+| SQLite          | 📝     | |

-Please reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) if you have any trouble whatsoever getting started!
+* 💚: Implemented and thoroughly tested.
+* 💛: Implemented, but not thoroughly tested yet.
+* ⏳: Implementation in progress.
+* 📝: Implementation planned. Contributions welcome.

-<br><br>
+Your database not listed here?

-### Diffing between databases
+- Contribute a [new database adapter](https://github.com/datafold/data-diff/blob/master/docs/new-database-driver-guide.rst) – we accept pull requests!
+- [Get in touch](https://www.datafold.com/demo) about enterprise support and adding new adapters and features

-Check out our [documentation](https://docs.datafold.com/reference/open_source/cli) if you're looking to compare data across databases (for example, between Postgres and Snowflake).

 <br>

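The hashdiff strategy the new README describes can be illustrated with a toy sketch. This is a simplified in-memory model with a hypothetical `hashdiff` helper; the real tool computes segment hashes with aggregate SQL queries inside each database rather than pulling rows locally:

```python
import hashlib

def _hash(rows):
    # Hash a contiguous segment of (key, value) rows; in the real tool this
    # is a single checksum query pushed down to each database.
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode())
    return h.hexdigest()

def hashdiff(a, b, lo, hi, threshold=4):
    """Recursively locate differing rows between two key-aligned row lists.

    Segments whose hashes match are skipped with a single comparison, which
    is why the cost approximates COUNT(*) when differences are rare.
    """
    if _hash(a[lo:hi]) == _hash(b[lo:hi]):
        return []  # identical segment: no download needed
    if hi - lo <= threshold:
        # Small differing segment: fetch rows and compare directly
        return list(set(a[lo:hi]) ^ set(b[lo:hi]))
    mid = (lo + hi) // 2  # bisect and recurse into both halves
    return hashdiff(a, b, lo, mid, threshold) + hashdiff(a, b, mid, hi, threshold)

rows_a = [(i, f"v{i}") for i in range(16)]
rows_b = [(i, "CHANGED" if i == 9 else f"v{i}") for i in range(16)]
diff = hashdiff(rows_a, rows_b, 0, len(rows_a))
```

With a single changed row, only the segments on the path to key 9 are hashed and compared; the untouched half of the table costs one hash comparison in total.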
data_diff/__main__.py

Lines changed: 4 additions & 3 deletions
@@ -33,7 +33,7 @@
     "-": "red",
 }

-set_entrypoint_name("CLI")
+set_entrypoint_name(os.getenv("DATAFOLD_TRIGGERED_BY", "CLI"))


 def _get_log_handlers(is_dbt: Optional[bool] = False) -> Dict[str, logging.Handler]:
@@ -48,8 +48,8 @@ def _get_log_handlers(is_dbt: Optional[bool] = False) -> Dict[str, logging.Handler]:
     rich_handler.setLevel(logging.WARN)
     handlers["rich_handler"] = rich_handler

-    # only use log_status_handler in a terminal
-    if rich_handler.console.is_terminal and is_dbt:
+    # only use log_status_handler in an interactive terminal session
+    if rich_handler.console.is_interactive and is_dbt:
         log_status_handler = LogStatusHandler()
         log_status_handler.setFormatter(logging.Formatter(log_format_status, datefmt=date_format))
         log_status_handler.setLevel(logging.DEBUG)
@@ -318,6 +318,7 @@ def main(conf, run, **kw):
             state=state,
             where_flag=kw["where"],
             stats_flag=kw["stats"],
+            columns_flag=kw["columns"],
         )
     else:
         return _data_diff(
data_diff/cloud/datafold_api.py

Lines changed: 8 additions & 6 deletions
@@ -7,7 +7,7 @@
 import pydantic
 import requests

-from data_diff.errors import DataDiffDatasourceIdNotFoundError
+from data_diff.errors import DataDiffCloudDiffFailed, DataDiffCloudDiffTimedOut, DataDiffDatasourceIdNotFoundError

 from ..utils import getLogger

@@ -248,8 +248,8 @@ def create_data_diff(self, payload: TCloudApiDataDiff) -> int:
     def poll_data_diff_results(self, diff_id: int) -> TCloudApiDataDiffSummaryResult:
         summary_results = None
         start_time = time.monotonic()
-        sleep_interval = 5  # starts at 5 sec
-        max_sleep_interval = 30
+        sleep_interval = 3
+        max_sleep_interval = 20
         max_wait_time = 300

         diff_url = f"{self.host}/datadiffs/{diff_id}/overview"
@@ -260,13 +260,15 @@ def poll_data_diff_results(self, diff_id: int) -> TCloudApiDataDiffSummaryResult:
             if response_json["status"] == "success":
                 summary_results = response_json
             elif response_json["status"] == "failed":
-                raise Exception(f"Diff failed: {str(response_json)}")
+                raise DataDiffCloudDiffFailed(f"Diff failed: {str(response_json)}")

             if time.monotonic() - start_time > max_wait_time:
-                raise Exception(f"Timed out waiting for diff results. Please, go to the UI for details: {diff_url}")
+                raise DataDiffCloudDiffTimedOut(
+                    f"Timed out waiting for diff results. Please, go to the UI for details: {diff_url}"
+                )

             time.sleep(sleep_interval)
-            sleep_interval = min(sleep_interval * 2, max_sleep_interval)
+            sleep_interval = min(sleep_interval + 1, max_sleep_interval)

         return TCloudApiDataDiffSummaryResult.from_orm(summary_results)
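The polling change above swaps an exponential backoff (5 s doubling up to a 30 s cap) for a gentler linear one (3 s growing by 1 s up to a 20 s cap). A small sketch with a hypothetical `poll_schedule` helper shows how each policy spends the 300-second wait budget:

```python
def poll_schedule(start, step_fn, cap, budget=300):
    """Return the sleep intervals a poller would use until the budget runs out."""
    waits, total, interval = [], 0, start
    while total + interval <= budget:
        waits.append(interval)
        total += interval
        interval = min(step_fn(interval), cap)
    return waits

# Before this commit: exponential backoff capped at 30 s
old = poll_schedule(5, lambda i: i * 2, 30)
# After: linear backoff capped at 20 s
new = poll_schedule(3, lambda i: i + 1, 20)
```

The linear schedule fits 22 polls into the budget versus 11 for the exponential one, so short-running cloud diffs are reported with less added latency, at the cost of a few extra status requests.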

data_diff/dbt.py

Lines changed: 29 additions & 13 deletions
@@ -23,6 +23,7 @@
 from .format import jsonify, jsonify_error
 from .tracking import (
     bool_ask_for_email,
+    bool_notify_about_extension,
     create_email_signup_event_json,
     set_entrypoint_name,
     set_dbt_user_id,
@@ -48,6 +49,7 @@

 logger = getLogger(__name__)
 CLOUD_DOC_URL = "https://docs.datafold.com/development_testing/cloud"
+EXTENSION_INSTALL_URL = "https://get.datafold.com/datafold-vs-code-install"


 class TDiffVars(pydantic.BaseModel):
@@ -73,10 +75,11 @@ def dbt_diff(
     log_status_handler: Optional[LogStatusHandler] = None,
     where_flag: Optional[str] = None,
     stats_flag: bool = False,
+    columns_flag: Optional[Tuple[str]] = None,
 ) -> None:
     print_version_info()
     diff_threads = []
-    set_entrypoint_name("CLI-dbt")
+    set_entrypoint_name(os.getenv("DATAFOLD_TRIGGERED_BY", "CLI-dbt"))
     dbt_parser = DbtParser(profiles_dir_override, project_dir_override, state)
     models = dbt_parser.get_models(dbt_selection)
     config = dbt_parser.get_datadiff_config()
@@ -112,7 +115,7 @@ def dbt_diff(
         if log_status_handler:
             log_status_handler.set_prefix(f"Diffing {model.alias} \n")

-        diff_vars = _get_diff_vars(dbt_parser, config, model, where_flag, stats_flag)
+        diff_vars = _get_diff_vars(dbt_parser, config, model, where_flag, stats_flag, columns_flag)

         # we won't always have a prod path when using state
         # when the model DNE in prod manifest, skip the model diff
@@ -156,32 +159,36 @@ def dbt_diff(
     for thread in diff_threads:
         thread.join()

+    _extension_notification()
+

 def _get_diff_vars(
     dbt_parser: "DbtParser",
     config: TDatadiffConfig,
     model,
     where_flag: Optional[str] = None,
     stats_flag: bool = False,
+    columns_flag: Optional[Tuple[str]] = None,
 ) -> TDiffVars:
+    cli_columns = list(columns_flag) if columns_flag else []
     dev_database = model.database
     dev_schema = model.schema_
-
+    dev_alias = prod_alias = model.alias
     primary_keys = dbt_parser.get_pk_from_model(model, dbt_parser.unique_columns, "primary-key")

     # prod path is constructed via configuration or the prod manifest via --state
     if dbt_parser.prod_manifest_obj:
-        prod_database, prod_schema = _get_prod_path_from_manifest(model, dbt_parser.prod_manifest_obj)
+        prod_database, prod_schema, prod_alias = _get_prod_path_from_manifest(model, dbt_parser.prod_manifest_obj)
     else:
         prod_database, prod_schema = _get_prod_path_from_config(config, model, dev_database, dev_schema)

     if dbt_parser.requires_upper:
-        dev_qualified_list = [x.upper() for x in [dev_database, dev_schema, model.alias] if x]
-        prod_qualified_list = [x.upper() for x in [prod_database, prod_schema, model.alias] if x]
+        dev_qualified_list = [x.upper() for x in [dev_database, dev_schema, dev_alias] if x]
+        prod_qualified_list = [x.upper() for x in [prod_database, prod_schema, prod_alias] if x]
         primary_keys = [x.upper() for x in primary_keys]
     else:
-        dev_qualified_list = [x for x in [dev_database, dev_schema, model.alias] if x]
-        prod_qualified_list = [x for x in [prod_database, prod_schema, model.alias] if x]
+        dev_qualified_list = [x for x in [dev_database, dev_schema, dev_alias] if x]
+        prod_qualified_list = [x for x in [prod_database, prod_schema, prod_alias] if x]

     datadiff_model_config = dbt_parser.get_datadiff_model_config(model.meta)

@@ -192,10 +199,10 @@ def _get_diff_vars(
         primary_keys=primary_keys,
         connection=dbt_parser.connection,
         threads=dbt_parser.threads,
-        # --where takes precedence over any model level config
+        # cli flags take precedence over any model level config
         where_filter=where_flag or datadiff_model_config.where_filter,
-        include_columns=datadiff_model_config.include_columns,
-        exclude_columns=datadiff_model_config.exclude_columns,
+        include_columns=cli_columns or datadiff_model_config.include_columns,
+        exclude_columns=[] if cli_columns else datadiff_model_config.exclude_columns,
         stats_flag=stats_flag,
     )

@@ -229,14 +236,16 @@ def _get_prod_path_from_config(config, model, dev_database, dev_schema) -> Tuple
     return prod_database, prod_schema


-def _get_prod_path_from_manifest(model, prod_manifest) -> Union[Tuple[str, str], Tuple[None, None]]:
+def _get_prod_path_from_manifest(model, prod_manifest) -> Union[Tuple[str, str, str], Tuple[None, None, None]]:
     prod_database = None
     prod_schema = None
+    prod_alias = None
     prod_model = prod_manifest.nodes.get(model.unique_id, None)
     if prod_model:
         prod_database = prod_model.database
         prod_schema = prod_model.schema_
-    return prod_database, prod_schema
+        prod_alias = prod_model.alias
+    return prod_database, prod_schema, prod_alias


 def _local_diff(diff_vars: TDiffVars, json_output: bool = False) -> None:
@@ -517,3 +526,10 @@ def _email_signup() -> None:
     if email:
         event_json = create_email_signup_event_json(email)
         run_as_daemon(send_event_json, event_json)
+
+
+def _extension_notification() -> None:
+    if bool_notify_about_extension():
+        rich.print(
+            f"\n\nHaving a good time diffing? :heart_eyes:\nMake sure to check out the free [bold]Datafold VS Code extension[/bold] for a more seamless diff experience:\n{EXTENSION_INSTALL_URL}"
+        )
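The include/exclude precedence introduced in `_get_diff_vars` above can be captured in a standalone sketch: a `--columns` value passed on the CLI overrides both the model-level `include_columns` and `exclude_columns` config. The helper name here is hypothetical:

```python
from typing import List, Optional, Tuple

def resolve_columns(
    columns_flag: Optional[Tuple[str, ...]],
    config_include: List[str],
    config_exclude: List[str],
) -> Tuple[List[str], List[str]]:
    """Mirror the precedence rule from this commit: a CLI columns flag
    replaces the model-level include list AND clears the exclude list."""
    cli_columns = list(columns_flag) if columns_flag else []
    include = cli_columns or config_include
    exclude = [] if cli_columns else config_exclude
    return include, exclude

# Model config alone applies when no CLI flag is given...
assert resolve_columns(None, ["amount"], ["updated_at"]) == (["amount"], ["updated_at"])
# ...but the CLI flag wins outright, clearing the exclude list too
assert resolve_columns(("id", "status"), ["amount"], ["updated_at"]) == (["id", "status"], [])
```

Clearing `exclude_columns` when CLI columns are given avoids a surprising interaction where a model-level exclusion silently filters a column the user explicitly asked to compare.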

0 commit comments