This repository was archived by the owner on May 17, 2024. It is now read-only.

Update README to be ergonomic and excite new users #816

Merged
merged 11 commits into from
Jan 5, 2024
146 changes: 86 additions & 60 deletions README.md
@@ -9,79 +9,61 @@ data-diff: Compare datasets fast, within or across SQL databases
</h2>
<br>

> [Make sure to join us at our virtual hands-on lab series where our team walks through live how to get set-up with it!](https://www.datafold.com/virtual-hands-on-lab)
> [Join our live virtual lab series to learn how to set it up!](https://www.datafold.com/virtual-hands-on-lab)

# Use Cases
# What's a Data Diff?
A data diff is the value-level comparison between two tables—used to identify critical changes to your data and guarantee data quality.
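
To make "value-level comparison" concrete, here is a minimal sketch in plain Python. It is not how `data-diff` is implemented; the in-memory `prod` and `dev` tables and the `value_level_diff` helper are hypothetical names used only for illustration.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical in-memory table: primary key -> tuple of the remaining column values
Table = Dict[int, Tuple[str, ...]]


def value_level_diff(a: Table, b: Table) -> Iterator[Tuple[str, tuple]]:
    """Yield ('-', row) for rows only in `a` and ('+', row) for rows only in `b`.

    A key present in both tables with different values is reported twice:
    once as '-' (values in `a`) and once as '+' (values in `b`).
    """
    for key in sorted(a.keys() | b.keys()):
        row_a, row_b = a.get(key), b.get(key)
        if row_a == row_b:
            continue  # identical rows are not part of the diff
        if row_a is not None:
            yield "-", (key, *row_a)
        if row_b is not None:
            yield "+", (key, *row_b)


prod = {1: ("completed",), 2: ("shipped",), 3: ("placed",)}
dev = {1: ("returned",), 2: ("shipped",), 4: ("placed",)}

diff = list(value_level_diff(prod, dev))
for sign, row in diff:
    print(sign, row)
```

Row 2 is identical in both tables, so it does not appear in the result; only changed, missing, and extra rows do.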

There is a lot you can do with data-diff: you can test SQL code by comparing development or staging environment data to production, or compare source and target data to identify discrepancies when moving data between databases.

## Data Migration & Replication Testing
Compare source to target and check for discrepancies when moving data between systems:
- Migrating to a new data warehouse (e.g., Oracle > Snowflake)
- Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)
# Use Cases

### Data Migration & Replication Testing
data-diff is a powerful tool for comparing data when you're moving it between systems. Use it to ensure data accuracy and identify discrepancies during tasks like:
- **Migrating** to a new data warehouse (e.g., Oracle -> Snowflake)
- **Converting SQL** to a new transformation framework (e.g., stored procedures -> dbt)
- Continuously **replicating data** from an OLTP database to OLAP data warehouse (e.g., MySQL -> Redshift)

## Data Development Testing
Test SQL code and preview changes by comparing development/staging environment data to production:
1. Make a change to some SQL code
### Data Development Testing
When developing SQL code, data-diff helps you validate and preview changes by comparing data between development/staging environments and production. Here's how it works:
1. Make a change to your SQL code
2. Run the SQL code to create a new dataset
3. Compare the dataset with its production version or another iteration
3. Compare this dataset with its production version or other iterations

# dbt Integration
<p align="left">
<img alt="dbt" src="https://seeklogo.com/images/D/dbt-logo-E4B0ED72A2-seeklogo.com.png" width="10%" />
</p>

<details>
<summary> data-diff integrates with dbt Core to seamlessly compare local development to production datasets
</p>

</summary>
data-diff integrates with [dbt Core](https://github.com/dbt-labs/dbt-core) to seamlessly compare local development to production datasets.

![data-development-testing](docs/development_testing.png)
Learn more about how data-diff works with dbt:
* Read our docs to get started with [data-diff & dbt](https://docs.datafold.com/development_testing/cli) or :eyes: **watch the [4-min demo video](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)**
* dbt Cloud users should check out [Datafold's out-of-the-box deployment testing integration](https://www.datafold.com/data-deployment-testing)
* Get support from the dbt Community Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU)

</details>

> [dbt Cloud users should check out Datafold's out-of-the-box deployment testing integration](https://www.datafold.com/data-deployment-testing)
# Getting Started

:eyes: **Watch [4-min demo video](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)**
### ⚡ Validating dbt model changes between dev and prod
Looking to use data-diff in dbt development?

**[Get started with data-diff & dbt](https://docs.datafold.com/development_testing/open_source)**

Reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) for advice and support


# How it works
Development testing with Datafold enables you to see the impact of dbt code changes on data as you write the code, whether in your IDE or CLI.

When comparing the data, `data-diff` utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison:
Head over to [our `data-diff` + `dbt` documentation](https://docs.datafold.com/development_testing/cli) to get started with a development testing workflow!

## `joindiff`
- Recommended for comparing data within the same database
- Uses the outer join operation to diff the rows as efficiently as possible within the same database
- Fully relies on the underlying database engine for computation
- Requires both datasets to be queryable with a single SQL query
- Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset
### 🔀 Compare data tables between databases
1. Install `data-diff` with adapters

## `hashdiff`
- Recommended for comparing datasets across different databases
- Can also be helpful in diffing very large tables with few expected differences within the same database
- Employs a divide-and-conquer algorithm based on hashing and binary search
- Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
- Time complexity approximates COUNT(*) operation when there are few differences
- Performance degrades when datasets have a large number of differences

More information about the algorithm and performance considerations can be found [here](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md)

# Get started

## Validating dbt model changes between dev and prod
⚡ Looking to use `data-diff` in dbt development? Head over to [our `data-diff` + `dbt` documentation](https://docs.datafold.com/development_testing/how_it_works) to get started!

## Compare data tables between databases
🔀 To compare data between databases, install `data-diff` with specific database adapters, e.g.:
To compare data between databases, install `data-diff` with specific database adapters. For example, install it for PostgreSQL and Snowflake like this:

```
pip install data-diff 'data-diff[postgresql,snowflake]' -U
```

Run `data-diff` with connection URIs. In the following example, we compare tables between PostgreSQL and Snowflake using the hashdiff algorithm:
2. Run `data-diff` with connection URIs

For example, compare tables between PostgreSQL and Snowflake using the hashdiff algorithm:

```bash
data-diff \
@@ -93,8 +75,9 @@ data-diff \
-c <columns to compare> \
-w <filter condition>
```
3. Set up your configuration

Run `data-diff` with a `toml` configuration file. In the following example, we compare tables between MotherDuck(hosted DuckDB) and Snowflake using the hashdiff algorithm:
You can use a `toml` configuration file to run your `data-diff` job. In this example, we compare tables between MotherDuck (hosted DuckDB) and Snowflake using the hashdiff algorithm:

```toml
## DATABASE CONNECTION ##
@@ -103,7 +86,6 @@ Run `data-diff` with a `toml` configuration file. In the following example, we c
# filepath = "datafold_demo.duckdb" # local duckdb file example
# filepath = "md:" # default motherduck connection example
filepath = "md:datafold_demo?motherduck_token=${motherduck_token}" # API token recommended for motherduck connection
database = "datafold_demo"

[database.snowflake_connection]
driver = "snowflake"
@@ -132,8 +114,12 @@ Run `data-diff` with a `toml` configuration file. In the following example, we c

verbose = false
```
4. Run your `data-diff` job

Make sure to export relevant environment variables as needed. For example, we compare data based on the earlier configuration:

```bash

# export relevant environment variables, example below
export motherduck_token=<MOTHERDUCK_TOKEN>

@@ -148,11 +134,13 @@ data-diff --conf datadiff.toml \
+ 1, returned
```

Check out [documentation](https://docs.datafold.com/reference/open_source/cli) for the full command reference.
5. Review the output

After running your data-diff job, review the output to identify and analyze differences in your data.
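
If you want to summarize the output programmatically, a small parser can help. The sketch below assumes each diff line starts with `+` or `-` followed by comma-separated values, as in the `+ 1, returned` line above. The `-` lines in the sample and the `parse_diff_lines` helper are assumptions made for illustration, not documented `data-diff` behavior.

```python
from collections import Counter
from typing import List, Tuple

# Sample text: only "+ 1, returned" comes from the example above;
# the "-" lines are assumed for illustration.
sample_output = """\
- 1, completed
+ 1, returned
- 3, placed
"""


def parse_diff_lines(text: str) -> List[Tuple[str, List[str]]]:
    """Split each '+'/'-' line into (sign, [values]); skip anything else."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] not in "+-":
            continue
        # Naive split: values that themselves contain commas are not handled
        values = [value.strip() for value in line[1:].split(",")]
        rows.append((line[0], values))
    return rows


parsed = parse_diff_lines(sample_output)
summary = Counter(sign for sign, _ in parsed)
print(f"rows only in source: {summary['-']}, rows only in target: {summary['+']}")
```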

# Supported databases
Check out the [documentation](https://docs.datafold.com/reference/open_source/cli) for the full command reference.

# Supported databases

| Database | Status | Connection string |
|---------------|-------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
@@ -161,8 +149,8 @@ Check out [documentation](https://docs.datafold.com/reference/open_source/cli) f
| Snowflake | 🟢 | `"snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]"` |
| BigQuery | 🟢 | `bigquery://<project>/<dataset>` |
| Redshift | 🟢 | `redshift://<username>:<password>@<hostname>:5439/<database>` |
| DuckDB | 🟢 | `duckdb://<dbname>@<filepath>` |
| MotherDuck | 🟢 | `duckdb://<dbname>@<filepath>` |
| DuckDB | 🟢 | `duckdb://<filepath>` |
| MotherDuck | 🟢 | `duckdb://<filepath>` |
| Oracle | 🟡 | `oracle://<username>:<password>@<hostname>/service_or_sid` |
| Presto | 🟡 | `presto://<username>:<password>@<hostname>:8080/<database>` |
| Databricks | 🟡 | `databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema>` |
@@ -172,8 +160,7 @@ Check out [documentation](https://docs.datafold.com/reference/open_source/cli) f
| ElasticSearch | 📝 | |
| Planetscale | 📝 | |
| Pinot | 📝 | |
| Druid | 📝 | |
| Kafka | 📝 | |
| Druid | 📝 | |
| SQLite | 📝 | |

* 🟢: Implemented and thoroughly tested.
@@ -189,9 +176,48 @@ Your database not listed here?

<br>

# How it works

`data-diff` efficiently compares data using two modes:

**joindiff**: Ideal for comparing data within the same database. It uses outer joins to compare rows, relies on the database engine for computation, and its run time is largely independent of the number of differences.

**hashdiff**: Recommended for comparing datasets across different databases, or very large tables with few expected differences. It uses a divide-and-conquer algorithm based on hashing and binary search, and can diff data across distinct database engines.

<details>
<summary>Click here to learn more about joindiff and hashdiff</summary>

### `joindiff`
* Recommended for comparing data within the same database
* Uses the outer join operation to diff the rows as efficiently as possible within the same database
* Fully relies on the underlying database engine for computation
* Requires both datasets to be queryable with a single SQL query
* Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset
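
The outer-join idea can be sketched with Python's built-in `sqlite3` module. This is an illustration of the technique only, not the query `data-diff` generates; the `orders_prod` and `orders_dev` tables are hypothetical, and the full outer join is emulated with two `LEFT JOIN`s so that it also runs on older SQLite builds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders_prod (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE orders_dev  (id INTEGER PRIMARY KEY, status TEXT);
    INSERT INTO orders_prod VALUES (1, 'completed'), (2, 'shipped'), (3, 'placed');
    INSERT INTO orders_dev  VALUES (1, 'returned'),  (2, 'shipped'), (4, 'placed');
    """
)

# Branch 1: rows missing from dev, or present in both with a different status.
# Branch 2: rows missing from prod.
# `IS NOT` is SQLite's NULL-safe inequality operator.
JOIN_DIFF_SQL = """
SELECT COALESCE(p.id, d.id) AS id, p.status AS prod_status, d.status AS dev_status
FROM orders_prod AS p
LEFT JOIN orders_dev AS d ON p.id = d.id
WHERE d.id IS NULL OR p.status IS NOT d.status
UNION ALL
SELECT COALESCE(p.id, d.id), p.status, d.status
FROM orders_dev AS d
LEFT JOIN orders_prod AS p ON p.id = d.id
WHERE p.id IS NULL
ORDER BY 1
"""

rows = conn.execute(JOIN_DIFF_SQL).fetchall()
for row in rows:
    print(row)
```

The whole comparison runs as a single query inside the database engine, which is why both tables must be reachable from one connection.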

### `hashdiff`
* Recommended for comparing datasets across different databases
* Can also be helpful in diffing very large tables with few expected differences within the same database
* Employs a divide-and-conquer algorithm based on hashing and binary search
* Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
* Time complexity approximates COUNT(*) operation when there are few differences
* Performance degrades when datasets have a large number of differences
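
The divide-and-conquer idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not `data-diff`'s actual implementation: the tables are in-memory dictionaries, `segment_checksum` stands in for an aggregate query that each database would run on its own side, and the `threshold` parameter is a hypothetical name for the segment size below which rows are compared directly.

```python
import hashlib
from typing import Dict, List, Tuple

Table = Dict[int, str]  # hypothetical stand-in for a table: key -> value
Diff = List[Tuple[str, Tuple[int, str]]]


def segment_checksum(table: Table, lo: int, hi: int) -> Tuple[int, str]:
    """Return (row count, hash) for keys in [lo, hi)."""
    digest = hashlib.md5()
    count = 0
    for key in sorted(k for k in table if lo <= k < hi):
        digest.update(f"{key}|{table[key]};".encode())
        count += 1
    return count, digest.hexdigest()


def hash_diff(a: Table, b: Table, lo: int, hi: int, threshold: int = 4) -> Diff:
    """Compare key range [lo, hi) by checksum, splitting only where checksums differ."""
    summary_a = segment_checksum(a, lo, hi)
    summary_b = segment_checksum(b, lo, hi)
    if summary_a == summary_b:
        return []  # identical segment: no rows need to be fetched
    if max(summary_a[0], summary_b[0]) <= threshold or hi - lo <= 1:
        # Small segment: fetch the rows and compare them locally
        rows_a = {(k, v) for k, v in a.items() if lo <= k < hi}
        rows_b = {(k, v) for k, v in b.items() if lo <= k < hi}
        removed = [("-", row) for row in sorted(rows_a - rows_b)]
        added = [("+", row) for row in sorted(rows_b - rows_a)]
        return removed + added
    mid = (lo + hi) // 2
    return hash_diff(a, b, lo, mid, threshold) + hash_diff(a, b, mid, hi, threshold)


source = {k: f"v{k}" for k in range(1000)}
target = dict(source)
target[500] = "changed"  # one modified row
del target[900]          # one missing row

result = hash_diff(source, target, 0, 1000)
for sign, row in result:
    print(sign, row)
```

Only segments whose checksums differ are split further, so with few differences most of the table is never transferred. With many differences almost every segment has to be split, which matches the performance caveat above.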

</details>
<br>

For details on the algorithm and its performance characteristics, see the [technical explanation](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md), or head to our docs to [learn more about how Datafold diffs data](https://docs.datafold.com/data_diff/how-datafold-diffs-data).


# data-diff OSS & Datafold Cloud
data-diff is an open source utility for running stateless diffs on your local computer, for a great single-player experience.

Scale up with [Datafold Cloud](https://www.datafold.com/) to make data diffing a company-wide experience: supercharge your data diffing CLI workflow (e.g., `data-diff --dbt --cloud`) and run diffs manually in the UI. Datafold Cloud also includes [column-level lineage](https://www.datafold.com/column-level-lineage), [CI testing](https://docs.datafold.com/deployment_testing/how_it_works/), and diff history.

## Contributors

We thank everyone who contributed so far!
We thank everyone who contributed so far!

We'd love to see your face here: [Contributing Instructions](CONTRIBUTING.md)

<a href="https://github.com/datafold/data-diff/graphs/contributors">
<img src="https://contributors-img.web.app/image?repo=datafold/data-diff" />