
Commit 2bb19a8

Merge branch 'master' into fix-key-columns-handling
2 parents 2a78561 + 101a488 commit 2bb19a8

9 files changed

+121
-102
lines changed

README.md

Lines changed: 86 additions & 61 deletions
Original file line number | Diff line number | Diff line change
@@ -1,93 +1,118 @@
1-
<p align="center">
2-
<img alt="Datafold" src="https://user-images.githubusercontent.com/1799931/196497110-d3de1113-a97f-4322-b531-026d859b867a.png" width="50%" />
1+
<p align="left">
2+
<a href="https://datafold.com/"><img alt="Datafold" src="https://user-images.githubusercontent.com/1799931/196497110-d3de1113-a97f-4322-b531-026d859b867a.png" width="30%" /></a>
33
</p>
44

5-
<h1 align="center">
6-
data-diff
5+
<h1 align="left">
6+
data-diff: compare datasets fast, within or across SQL databases
77
</h1>
88

9-
<h2 align="center">
10-
Develop dbt models faster by testing as you code.
11-
</h2>
12-
<h4 align="center">
13-
See how every change to dbt code affects the data produced in the modified model and downstream.
14-
</h4>
159
<br>
1610

17-
## What is `data-diff`?
11+
# How it works
1812

19-
data-diff is an open source package that you can use to see the impact of your dbt code changes on your dbt models as you code.
13+
When comparing the data, `data-diff` utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison:
2014

21-
<div align="center">
15+
## joindiff
16+
- Recommended for comparing data within the same database
17+
- Uses the outer join operation to diff the rows as efficiently as possible within the same database
18+
- Fully relies on the underlying database engine for computation
19+
- Requires both datasets to be queryable with a single SQL query
20+
- Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset
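
The joindiff idea described above can be sketched with the standard-library `sqlite3` module. This is an illustration only, not the `data-diff` implementation: the function name `join_diff`, the tables `prod` and `dev`, and the status labels are all hypothetical, and the full outer join is emulated with two `LEFT JOIN`s so the query runs on older SQLite builds.

```python
import sqlite3


def join_diff(conn, table_a, table_b, key, columns):
    """Return (key, status) rows for keys that differ between two tables.

    Hypothetical helper: identifiers are interpolated directly into the SQL,
    so it must only be called with trusted table and column names.
    """
    # SQLite's IS operator is a NULL-safe equality test.
    same = " AND ".join(f"a.{c} IS b.{c}" for c in columns)
    sql = f"""
        SELECT a.{key} AS k,
               CASE WHEN b.{key} IS NULL THEN 'only_in_a' ELSE 'changed' END AS status
        FROM {table_a} a LEFT JOIN {table_b} b ON a.{key} = b.{key}
        WHERE b.{key} IS NULL OR NOT ({same})
        UNION ALL
        SELECT b.{key}, 'only_in_b'
        FROM {table_b} b LEFT JOIN {table_a} a ON a.{key} = b.{key}
        WHERE a.{key} IS NULL
        ORDER BY k
    """
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prod (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dev  (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO prod VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO dev  VALUES (1, 'a'), (2, 'B'), (4, 'd');
""")
result = join_diff(conn, "prod", "dev", "id", ["name"])
print(result)
```

All comparison work happens inside the database engine in a single query, which is why this mode requires both datasets to be reachable from one connection.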
21+
22+
## hashdiff
23+
- Recommended for comparing datasets across different databases
24+
- Can also be helpful in diffing very large tables with few expected differences within the same database
25+
- Employs a divide-and-conquer algorithm based on hashing and binary search
26+
- Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
27+
- Time complexity approximates COUNT(*) operation when there are few differences
28+
- Performance degrades when datasets have a large number of differences
2229

23-
![development_testing_gif](https://user-images.githubusercontent.com/1799931/236354286-d1d044cf-2168-4128-8a21-8c8ca7fd494c.gif)
30+
More information about the algorithm and performance considerations can be found [here](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md)
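
As a rough illustration of the hashdiff bullets above, the following sketch bisects an integer key range and compares per-segment checksums, descending only into segments whose checksums differ. It is a simplified in-memory model under stated assumptions, not the `data-diff` algorithm itself: the names `checksum`, `hash_diff`, and `leaf_size` are hypothetical, rows are `(key, value)` tuples sorted by key, and the real tool computes checksums inside each database rather than in Python.

```python
import hashlib


def checksum(rows):
    """Hash a list of (key, value) rows into one digest (order-sensitive)."""
    digest = hashlib.md5()
    for row in rows:
        digest.update(repr(row).encode())
    return digest.hexdigest()


def hash_diff(rows_a, rows_b, lo, hi, leaf_size=4):
    """Return ('-', row) for rows only in A and ('+', row) for rows only in B,
    restricted to keys in the half-open range [lo, hi)."""
    seg_a = [r for r in rows_a if lo <= r[0] < hi]
    seg_b = [r for r in rows_b if lo <= r[0] < hi]
    # Matching checksums: the whole segment is treated as identical and skipped.
    if checksum(seg_a) == checksum(seg_b):
        return []
    # Small enough segment: compare the rows directly.
    if hi - lo <= leaf_size:
        only_a = [("-", r) for r in seg_a if r not in seg_b]
        only_b = [("+", r) for r in seg_b if r not in seg_a]
        return only_a + only_b
    # Otherwise split the key range in half and recurse into both halves.
    mid = (lo + hi) // 2
    return (hash_diff(rows_a, rows_b, lo, mid, leaf_size)
            + hash_diff(rows_a, rows_b, mid, hi, leaf_size))


table_a = [(i, f"v{i}") for i in range(1, 101)]
# Table B: key 42 has a changed value, key 77 is missing.
table_b = [(k, "changed" if k == 42 else v) for k, v in table_a if k != 77]
diff = hash_diff(table_a, table_b, 1, 101)
print(diff)
```

With few differences, most segments are dismissed after a single checksum comparison. With many differences, nearly every segment must be split down to the leaves, which is the performance degradation the last bullet refers to.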
2431

25-
</div>
32+
# Get started
2633

27-
<br>
28-
29-
:eyes: **Watch 4-min demo video [here](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)**
30-
31-
## Getting Started
34+
Install `data-diff` with specific database adapters, e.g.:
3235

33-
**Install `data-diff`**
34-
35-
Install `data-diff` with the command that is specific to the database you use with dbt.
36-
37-
### Snowflake
3836
```
39-
pip install data-diff 'data-diff[snowflake,dbt]' -U
37+
pip install data-diff 'data-diff[postgresql,snowflake]' -U
4038
```
4139

42-
### BigQuery
40+
Run `data-diff` with connection URIs. In the following example, we compare tables between PostgreSQL and Snowflake using the hashdiff algorithm:
4341
```
44-
pip install data-diff 'data-diff[dbt]' google-cloud-bigquery -U
42+
data-diff \
43+
postgresql://<username>:'<password>'@localhost:5432/<database> \
44+
<table> \
45+
"snowflake://<username>:<password>@<account>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
46+
<TABLE> \
47+
-k <primary key column> \
48+
-c <columns to compare> \
49+
-w <filter condition>
4550
```
4651

47-
### Redshift
48-
```
49-
pip install data-diff 'data-diff[redshift,dbt]' -U
50-
```
52+
Check out [documentation](https://docs.datafold.com/reference/open_source/cli) for the full command reference.
5153

52-
### Postgres
53-
```
54-
pip install data-diff 'data-diff[postgres,dbt]' -U
55-
```
5654

57-
### Databricks
58-
```
59-
pip install data-diff 'data-diff[databricks,dbt]' -U
60-
```
55+
# Use cases
6156

62-
### DuckDB
63-
```
64-
pip install data-diff 'data-diff[duckdb,dbt]' -U
65-
```
57+
## Data Migration & Replication Testing
58+
Compare source to target and check for discrepancies when moving data between systems:
59+
- Migrating to a new data warehouse (e.g., Oracle > Snowflake)
60+
- Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
61+
- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)
6662

67-
**Update a few lines in your `dbt_project.yml`**.
68-
```
69-
#dbt_project.yml
70-
vars:
71-
data_diff:
72-
prod_database: my_database
73-
prod_schema: my_default_schema
74-
```
7563

76-
**Run your first data diff!**
64+
## Data Development Testing
65+
Test SQL code and preview changes by comparing development/staging environment data to production:
66+
1. Make a change to some SQL code
67+
2. Run the SQL code to create a new dataset
68+
3. Compare the dataset with its production version or another iteration
69+
70+
<p align="left">
71+
<img alt="dbt" src="https://seeklogo.com/images/D/dbt-logo-E4B0ED72A2-seeklogo.com.png" width="10%" />
72+
</p>
73+
74+
`data-diff` integrates with dbt Core and dbt Cloud to seamlessly compare local development to production datasets.
75+
76+
:eyes: **Watch [4-min demo video](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)**
77+
78+
**[Get started with data-diff & dbt](https://docs.datafold.com/development_testing/open_source)**
79+
80+
Reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) for advice and support
81+
82+
# Supported databases
7783

78-
```
79-
dbt run && data-diff --dbt
80-
```
8184

82-
We recommend you get started by walking through [our simple setup instructions](https://docs.datafold.com/development_testing/open_source) which contain examples and details.
85+
| Database | Status | Connection string |
86+
|---------------|-------------------------------------------------------------------------------------------------------------------------------------|--------|
87+
| PostgreSQL >=10 | 💚 | `postgresql://<user>:<password>@<host>:5432/<database>` |
88+
| MySQL | 💚 | `mysql://<user>:<password>@<hostname>:3306/<database>` |
89+
| Snowflake | 💚 | `"snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]"` |
90+
| BigQuery | 💚 | `bigquery://<project>/<dataset>` |
91+
| Redshift | 💚 | `redshift://<username>:<password>@<hostname>:5439/<database>` |
92+
| Oracle | 💛 | `oracle://<username>:<password>@<hostname>/<database>` |
93+
| Presto | 💛 | `presto://<username>:<password>@<hostname>:8080/<database>` |
94+
| Databricks | 💛 | `databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema>` |
95+
| Trino | 💛 | `trino://<username>:<password>@<hostname>:8080/<database>` |
96+
| Clickhouse | 💛 | `clickhouse://<username>:<password>@<hostname>:9000/<database>` |
97+
| Vertica | 💛 | `vertica://<username>:<password>@<hostname>:5433/<database>` |
98+
| DuckDB | 💛 | |
99+
| ElasticSearch | 📝 | |
100+
| Planetscale | 📝 | |
101+
| Pinot | 📝 | |
102+
| Druid | 📝 | |
103+
| Kafka | 📝 | |
104+
| SQLite | 📝 | |
83105

84-
Please reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) if you have any trouble whatsoever getting started!
106+
* 💚: Implemented and thoroughly tested.
107+
* 💛: Implemented, but not thoroughly tested yet.
108+
* ⏳: Implementation in progress.
109+
* 📝: Implementation planned. Contributions welcome.
85110

86-
<br><br>
111+
Your database not listed here?
87112

88-
### Diffing between databases
113+
- Contribute a [new database adapter](https://github.com/datafold/data-diff/blob/master/docs/new-database-driver-guide.rst) – we accept pull requests!
114+
- [Get in touch](https://www.datafold.com/demo) about enterprise support and adding new adapters and features
89115

90-
Check out our [documentation](https://docs.datafold.com/reference/open_source/cli) if you're looking to compare data across databases (for example, between Postgres and Snowflake).
91116

92117
<br>
93118

data_diff/__main__.py

Lines changed: 2 additions & 2 deletions
@@ -48,8 +48,8 @@ def _get_log_handlers(is_dbt: Optional[bool] = False) -> Dict[str, logging.Handl
4848
rich_handler.setLevel(logging.WARN)
4949
handlers["rich_handler"] = rich_handler
5050

51-
# only use log_status_handler in a terminal
52-
if rich_handler.console.is_terminal and is_dbt:
51+
# only use log_status_handler in an interactive terminal session
52+
if rich_handler.console.is_interactive and is_dbt:
5353
log_status_handler = LogStatusHandler()
5454
log_status_handler.setFormatter(logging.Formatter(log_format_status, datefmt=date_format))
5555
log_status_handler.setLevel(logging.DEBUG)

data_diff/dbt.py

Lines changed: 11 additions & 0 deletions
@@ -23,6 +23,7 @@
2323
from .format import jsonify, jsonify_error
2424
from .tracking import (
2525
bool_ask_for_email,
26+
bool_notify_about_extension,
2627
create_email_signup_event_json,
2728
set_entrypoint_name,
2829
set_dbt_user_id,
@@ -48,6 +49,7 @@
4849

4950
logger = getLogger(__name__)
5051
CLOUD_DOC_URL = "https://docs.datafold.com/development_testing/cloud"
52+
EXTENSION_INSTALL_URL = "https://get.datafold.com/datafold-vs-code-install"
5153

5254

5355
class TDiffVars(pydantic.BaseModel):
@@ -155,6 +157,8 @@ def dbt_diff(
155157
for thread in diff_threads:
156158
thread.join()
157159

160+
_extension_notification()
161+
158162

159163
def _get_diff_vars(
160164
dbt_parser: "DbtParser",
@@ -517,3 +521,10 @@ def _email_signup() -> None:
517521
if email:
518522
event_json = create_email_signup_event_json(email)
519523
run_as_daemon(send_event_json, event_json)
524+
525+
526+
def _extension_notification() -> None:
527+
if bool_notify_about_extension():
528+
rich.print(
529+
f"\n\nHaving a good time diffing? :heart_eyes-emoji:\nMake sure to check out the free [bold]Datafold VS Code extension[/bold] for a more seamless diff experience:\n{EXTENSION_INSTALL_URL}"
530+
)

data_diff/dbt_parser.py

Lines changed: 1 addition & 1 deletion
@@ -62,7 +62,7 @@ def try_set_dbt_flags():
6262
PROJECT_FILE = "dbt_project.yml"
6363
PROFILES_FILE = "profiles.yml"
6464
LOWER_DBT_V = "1.0.0"
65-
UPPER_DBT_V = "1.6.0"
65+
UPPER_DBT_V = "1.7.0"
6666

6767

6868
# https://github.com/dbt-labs/dbt-core/blob/c952d44ec5c2506995fbad75320acbae49125d3d/core/dbt/cli/resolvers.py#L6
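
The hunk above only changes the `UPPER_DBT_V` constant and does not show how the bounds are consumed. As a hedged sketch, assuming the lower bound is inclusive and the upper bound exclusive, a range check could look like the following. The helpers `parse_version` and `is_supported` are hypothetical and handle only plain `X.Y.Z` strings.

```python
LOWER_DBT_V = "1.0.0"
UPPER_DBT_V = "1.7.0"


def parse_version(text: str) -> tuple:
    """Turn '1.6.3' into (1, 6, 3). Pre-release suffixes are not handled."""
    return tuple(int(part) for part in text.split("."))


def is_supported(dbt_version: str) -> bool:
    # Assumed semantics: LOWER_DBT_V <= version < UPPER_DBT_V.
    # Tuples compare numerically, so '1.10.0' sorts after '1.7.0'.
    return parse_version(LOWER_DBT_V) <= parse_version(dbt_version) < parse_version(UPPER_DBT_V)


print(is_supported("1.6.3"), is_supported("1.7.0"))
```

Comparing parsed tuples rather than raw strings avoids the lexicographic trap where `"1.10.0" < "1.7.0"`.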

data_diff/tracking.py

Lines changed: 12 additions & 0 deletions
@@ -11,6 +11,7 @@
1111
import urllib.request
1212
from uuid import uuid4
1313
import toml
14+
from rich import get_console
1415

1516
from .version import __version__
1617

@@ -61,6 +62,17 @@ def bool_ask_for_email() -> bool:
6162
return False
6263

6364

65+
def bool_notify_about_extension() -> bool:
66+
profile = _load_profile()
67+
console = get_console()
68+
if "notified_about_extension" not in profile and console.is_interactive:
69+
profile["notified_about_extension"] = ""
70+
with open(DEFAULT_PROFILE, "w") as conf:
71+
toml.dump(profile, conf)
72+
return True
73+
return False
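
The "notify once" pattern added in `bool_notify_about_extension` can be illustrated with a self-contained sketch. To stay within the standard library it stores the profile as JSON instead of TOML and takes the interactivity check as a parameter instead of asking a `rich` console. The function `should_notify_once` and the file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def should_notify_once(profile_path: Path, flag: str, interactive: bool) -> bool:
    """Return True the first time it is called in an interactive session,
    and persist a marker so later calls return False."""
    profile = json.loads(profile_path.read_text()) if profile_path.exists() else {}
    if flag in profile or not interactive:
        return False
    # Record that the notification was shown before reporting True.
    profile[flag] = ""
    profile_path.write_text(json.dumps(profile))
    return True


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "profile.json"
    first = should_notify_once(path, "notified_about_extension", interactive=True)
    second = should_notify_once(path, "notified_about_extension", interactive=True)
    # Non-interactive sessions never notify and never write the marker.
    other = Path(tmp) / "other.json"
    skipped = should_notify_once(other, "notified_about_extension", interactive=False)
    marker_written = other.exists()

print(first, second, skipped, marker_written)
```

Because the marker is written only when the message is actually shown, a non-interactive run (for example in CI) does not use up the one-time notification.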
74+
75+
6476
g_tracking_enabled = True
6577
g_anonymous_id = None
6678

data_diff/version.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
1-
__version__ = "0.7.14"
1+
__version__ = "0.8.1"

docs/supported-databases.md

Lines changed: 0 additions & 29 deletions
This file was deleted.

poetry.lock

Lines changed: 5 additions & 5 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

pyproject.toml

Lines changed: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
11
[tool.poetry]
22
name = "data-diff"
3-
version = "0.7.14"
3+
version = "0.8.1"
44
description = "Command-line tool and Python library to efficiently diff rows across two different databases."
55
authors = ["Datafold <data-diff@datafold.com>"]
66
license = "MIT"
@@ -37,7 +37,7 @@ trino = {version="^0.314.0", optional=true}
3737
presto-python-client = {version="*", optional=true}
3838
clickhouse-driver = {version="*", optional=true}
3939
duckdb = {version="^0.7.0", optional=true}
40-
dbt-artifacts-parser = {version="^0.3.0"}
40+
dbt-artifacts-parser = {version="^0.4.0"}
4141
dbt-core = {version="^1.0.0"}
4242
keyring = "*"
4343
tabulate = "^0.9.0"
@@ -59,7 +59,7 @@ presto-python-client = "*"
5959
clickhouse-driver = "*"
6060
vertica-python = "*"
6161
duckdb = "^0.7.0"
62-
dbt-artifacts-parser = "^0.3.0"
62+
dbt-artifacts-parser = "^0.4.0"
6363
dbt-core = "^1.0.0"
6464
# google-cloud-bigquery = "*"
6565
# databricks-sql-connector = "*"
