|
1 |
| -<p align="center"> |
2 |
| - <img alt="Datafold" src="https://user-images.githubusercontent.com/1799931/196497110-d3de1113-a97f-4322-b531-026d859b867a.png" width="50%" /> |
| 1 | +<p align="left"> |
| 2 | + <a href="https://datafold.com/"><img alt="Datafold" src="https://user-images.githubusercontent.com/1799931/196497110-d3de1113-a97f-4322-b531-026d859b867a.png" width="30%" /></a> |
3 | 3 | </p>
|
4 | 4 |
|
5 |
| -<h1 align="center"> |
6 |
| -data-diff |
| 5 | +<h1 align="left"> |
| 6 | +data-diff: compare datasets fast, within or across SQL databases |
7 | 7 | </h1>
|
8 | 8 |
|
9 |
| -<h2 align="center"> |
10 |
| -Develop dbt models faster by testing as you code. |
11 |
| -</h2> |
12 |
| -<h4 align="center"> |
13 |
| -See how every change to dbt code affects the data produced in the modified model and downstream. |
14 |
| -</h4> |
15 | 9 | <br>
|
16 | 10 |
|
17 |
| -## What is `data-diff`? |
| 11 | +# How it works |
18 | 12 |
|
19 |
| -data-diff is an open source package that you can use to see the impact of your dbt code changes on your dbt models as you code. |
| 13 | +When comparing the data, `data-diff` utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison: |
20 | 14 |
|
21 |
| -<div align="center"> |
| 15 | +## joindiff |
| 16 | +- Recommended for comparing data within the same database |
| 17 | +- Uses the outer join operation to diff the rows as efficiently as possible within the same database |
| 18 | +- Fully relies on the underlying database engine for computation |
| 19 | +- Requires both datasets to be queryable with a single SQL query |
| 20 | +- Time complexity approximates that of a JOIN operation and is largely independent of the number of differences in the dataset |
| 21 | + |
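The outer-join idea behind joindiff can be illustrated with a minimal sketch. This is not `data-diff`'s actual query: the table names (`orders_prod`, `orders_dev`), the columns, and the `join_diff` helper are hypothetical, and SQLite is used only so the example runs without any setup. The full outer join is emulated with two left joins so the sketch also works on older SQLite versions.

```python
import sqlite3

# Hypothetical tables standing in for two datasets in the same database.
SETUP_SQL = """
    CREATE TABLE orders_prod (id INTEGER PRIMARY KEY, amount INTEGER);
    CREATE TABLE orders_dev  (id INTEGER PRIMARY KEY, amount INTEGER);
    INSERT INTO orders_prod VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO orders_dev  VALUES (1, 10), (2, 25), (4, 40);
"""

# Emulated full outer join: rows missing on either side, or present on both
# sides with different values. `IS NOT` is SQLite's null-safe inequality.
DIFF_SQL = """
    SELECT p.id, p.amount, d.amount
    FROM orders_prod AS p LEFT JOIN orders_dev AS d ON p.id = d.id
    WHERE d.id IS NULL OR p.amount IS NOT d.amount
    UNION ALL
    SELECT d.id, NULL, d.amount
    FROM orders_dev AS d LEFT JOIN orders_prod AS p ON p.id = d.id
    WHERE p.id IS NULL
    ORDER BY 1
"""


def join_diff(conn):
    """Return (id, prod_amount, dev_amount) for every differing row."""
    return conn.execute(DIFF_SQL).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SETUP_SQL)
diff_rows = join_diff(conn)
for row in diff_rows:
    print(row)
```

The database does all the work in a single query, which is why this mode requires both datasets to be reachable from one SQL statement.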
| 22 | +## hashdiff |
| 23 | +- Recommended for comparing datasets across different databases |
| 24 | +- Can also be helpful in diffing very large tables with few expected differences within the same database |
| 25 | +- Employs a divide-and-conquer algorithm based on hashing and binary search |
| 26 | +- Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake |
| 27 | +- Time complexity approximates that of a COUNT(*) operation when there are few differences |
| 28 | +- Performance degrades when datasets have a large number of differences |
22 | 29 |
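The divide-and-conquer approach described for hashdiff can be sketched in a few lines. This is a simplified illustration, not the library's implementation: the in-memory dicts stand in for two tables, `checksum`, `segment`, and `hashdiff` are hypothetical helper names, and the MD5-over-rows checksum is an assumption made for the example. In the real tool, the per-segment checksums are computed by the databases themselves.

```python
import hashlib


def checksum(rows):
    """Hash a key-ordered list of (key, value) rows into one digest."""
    digest = hashlib.md5()
    for key, value in rows:
        digest.update(f"{key}|{value}\n".encode())
    return digest.hexdigest()


def segment(table, lo, hi):
    """Rows whose key falls in [lo, hi), ordered by key."""
    return sorted((k, v) for k, v in table.items() if lo <= k < hi)


def hashdiff(a, b, lo, hi, threshold=2):
    """Return ('-', row) for rows only in a and ('+', row) for rows only in b."""
    rows_a, rows_b = segment(a, lo, hi), segment(b, lo, hi)
    if checksum(rows_a) == checksum(rows_b):
        return []  # identical segment: no need to look at individual rows
    if hi - lo <= threshold or max(len(rows_a), len(rows_b)) <= threshold:
        # Segment is small enough: compare the rows directly.
        set_a, set_b = set(rows_a), set(rows_b)
        return [("-", r) for r in sorted(set_a - set_b)] + [
            ("+", r) for r in sorted(set_b - set_a)
        ]
    mid = (lo + hi) // 2  # split the key range in half and recurse
    return hashdiff(a, b, lo, mid, threshold) + hashdiff(a, b, mid, hi, threshold)


source = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e", 6: "f", 7: "g", 8: "h"}
target = {1: "a", 3: "c", 4: "d", 5: "e", 6: "F", 7: "g", 8: "h"}
differences = hashdiff(source, target, lo=1, hi=10)
print(differences)
```

Matching segments are discarded after a single checksum comparison, so the cost stays low when few rows differ and grows as more segments have to be split and inspected.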
|
23 |
| - |
| 30 | +More information about the algorithm and performance considerations can be found in the [technical explanation](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md). |
24 | 31 |
|
25 |
| -</div> |
| 32 | +# Get started |
26 | 33 |
|
27 |
| -<br> |
28 |
| - |
29 |
| -:eyes: **Watch 4-min demo video [here](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)** |
30 |
| - |
31 |
| -## Getting Started |
| 34 | +Install `data-diff` with specific database adapters, e.g.: |
32 | 35 |
|
33 |
| -**Install `data-diff`** |
34 |
| - |
35 |
| -Install `data-diff` with the command that is specific to the database you use with dbt. |
36 |
| - |
37 |
| -### Snowflake |
38 | 36 | ```
|
39 |
| -pip install data-diff 'data-diff[snowflake,dbt]' -U |
| 37 | +pip install data-diff 'data-diff[postgresql,snowflake]' -U |
40 | 38 | ```
|
41 | 39 |
|
42 |
| -### BigQuery |
| 40 | +Run `data-diff` with connection URIs. In the following example, we compare tables between PostgreSQL and Snowflake using the hashdiff algorithm: |
43 | 41 | ```
|
44 |
| -pip install data-diff 'data-diff[dbt]' google-cloud-bigquery -U |
| 42 | +data-diff \ |
| 43 | + postgresql://<username>:'<password>'@localhost:5432/<database> \ |
| 44 | + <table> \ |
| 45 | + "snowflake://<username>:<password>@<account>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \ |
| 46 | + <TABLE> \ |
| 47 | + -k <primary key column> \ |
| 48 | + -c <columns to compare> \ |
| 49 | + -w <filter condition> |
45 | 50 | ```
|
46 | 51 |
|
47 |
| -### Redshift |
48 |
| -``` |
49 |
| -pip install data-diff 'data-diff[redshift,dbt]' -U |
50 |
| -``` |
| 52 | +Check out the [documentation](https://docs.datafold.com/reference/open_source/cli) for the full command reference. |
51 | 53 |
|
52 |
| -### Postgres |
53 |
| -``` |
54 |
| -pip install data-diff 'data-diff[postgres,dbt]' -U |
55 |
| -``` |
56 | 54 |
|
57 |
| -### Databricks |
58 |
| -``` |
59 |
| -pip install data-diff 'data-diff[databricks,dbt]' -U |
60 |
| -``` |
| 55 | +# Use cases |
61 | 56 |
|
62 |
| -### DuckDB |
63 |
| -``` |
64 |
| -pip install data-diff 'data-diff[duckdb,dbt]' -U |
65 |
| -``` |
| 57 | +## Data Migration & Replication Testing |
| 58 | +Compare source to target and check for discrepancies when moving data between systems: |
| 59 | +- Migrating to a new data warehouse (e.g., Oracle > Snowflake) |
| 60 | +- Converting SQL to a new transformation framework (e.g., stored procedures > dbt) |
| 61 | +- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift) |
66 | 62 |
|
67 |
| -**Update a few lines in your `dbt_project.yml`**. |
68 |
| -``` |
69 |
| -#dbt_project.yml |
70 |
| -vars: |
71 |
| - data_diff: |
72 |
| - prod_database: my_database |
73 |
| - prod_schema: my_default_schema |
74 |
| -``` |
75 | 63 |
|
76 |
| -**Run your first data diff!** |
| 64 | +## Data Development Testing |
| 65 | +Test SQL code and preview changes by comparing development/staging environment data to production: |
| 66 | +1. Make a change to some SQL code |
| 67 | +2. Run the SQL code to create a new dataset |
| 68 | +3. Compare the dataset with its production version or another iteration |
| 69 | + |
| 70 | + <p align="left"> |
| 71 | + <img alt="dbt" src="https://seeklogo.com/images/D/dbt-logo-E4B0ED72A2-seeklogo.com.png" width="10%" /> |
| 72 | + </p> |
| 73 | + |
| 74 | +`data-diff` integrates with dbt Core and dbt Cloud to seamlessly compare local development to production datasets. |
| 75 | + |
| 76 | +:eyes: **Watch [4-min demo video](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)** |
| 77 | + |
| 78 | +**[Get started with data-diff & dbt](https://docs.datafold.com/development_testing/open_source)** |
| 79 | + |
| 80 | +Reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) for advice and support. |
| 81 | + |
| 82 | +# Supported databases |
77 | 83 |
|
78 |
| -``` |
79 |
| -dbt run && data-diff --dbt |
80 |
| -``` |
81 | 84 |
|
82 |
| -We recommend you get started by walking through [our simple setup instructions](https://docs.datafold.com/development_testing/open_source) which contain examples and details. |
| 85 | +| Database | Status | Connection string | |
| 86 | +|---------------|-------------------------------------------------------------------------------------------------------------------------------------|--------| |
| 87 | +| PostgreSQL >=10 | 💚 | `postgresql://<user>:<password>@<host>:5432/<database>` | |
| 88 | +| MySQL | 💚 | `mysql://<user>:<password>@<hostname>:3306/<database>` | |
| 89 | +| Snowflake | 💚 | `"snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]"` | |
| 90 | +| BigQuery | 💚 | `bigquery://<project>/<dataset>` | |
| 91 | +| Redshift | 💚 | `redshift://<username>:<password>@<hostname>:5439/<database>` | |
| 92 | +| Oracle | 💛 | `oracle://<username>:<password>@<hostname>/<database>` | |
| 93 | +| Presto | 💛 | `presto://<username>:<password>@<hostname>:8080/<database>` | |
| 94 | +| Databricks | 💛 | `databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema>` | |
| 95 | +| Trino | 💛 | `trino://<username>:<password>@<hostname>:8080/<database>` | |
| 96 | +| Clickhouse | 💛 | `clickhouse://<username>:<password>@<hostname>:9000/<database>` | |
| 97 | +| Vertica | 💛 | `vertica://<username>:<password>@<hostname>:5433/<database>` | |
| 98 | +| DuckDB | 💛 | | |
| 99 | +| ElasticSearch | 📝 | | |
| 100 | +| Planetscale | 📝 | | |
| 101 | +| Pinot | 📝 | | |
| 102 | +| Druid | 📝 | | |
| 103 | +| Kafka | 📝 | | |
| 104 | +| SQLite | 📝 | | |
83 | 105 |
|
84 |
| -Please reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) if you have any trouble whatsoever getting started! |
| 106 | +* 💚: Implemented and thoroughly tested. |
| 107 | +* 💛: Implemented, but not thoroughly tested yet. |
| 108 | +* ⏳: Implementation in progress. |
| 109 | +* 📝: Implementation planned. Contributions welcome. |
85 | 110 |
|
86 |
| -<br><br> |
| 111 | +Is your database not listed here? |
87 | 112 |
|
88 |
| -### Diffing between databases |
| 113 | +- Contribute a [new database adapter](https://github.com/datafold/data-diff/blob/master/docs/new-database-driver-guide.rst) – we accept pull requests! |
| 114 | +- [Get in touch](https://www.datafold.com/demo) about enterprise support and adding new adapters and features |
89 | 115 |
|
90 |
| -Check out our [documentation](https://docs.datafold.com/reference/open_source/cli) if you're looking to compare data across databases (for example, between Postgres and Snowflake). |
91 | 116 |
|
92 | 117 | <br>
|
93 | 118 |
|
|