This repository was archived by the owner on May 17, 2024. It is now read-only.
README.md: 13 additions & 15 deletions
@@ -5,11 +5,11 @@
 # **data-diff**

 ## What is `data-diff`?
-data-diff is a **free, open-source tool** that enables data professionals to detect differences in values between any two tables. It's fast, easy to use, and reliable. Even at massive scale.
+data-diff is a **free, open-source tool** that enables data professionals to detect differences in values between any two tables.

 ## Documentation

-[**🗎 Documentation website**](https://docs.datafold.com/os_diff/about) - our detailed documentation has everything you need to start diffing.
+[**🗎 Documentation**](https://docs.datafold.com/guides/os_data_diff) - our detailed documentation has everything you need to start diffing.

 ### Databases we support

@@ -27,7 +27,7 @@ data-diff is a **free, open-source tool** that enables data professionals to det
 - DuckDB >=0.6
 - SQLite (coming soon)

-For their corresponding connection strings, check out our [detailed table](https://docs.datafold.com/os_diff/databases_we_support).
+For their corresponding connection strings, check out our [detailed table](https://github.com/datafold/data-diff/blob/master/docs/supported-databases.md).

 #### Looking for a database not on the list?
 If a database is not on the list, we'd still love to support it. [Please open an issue](https://github.com/datafold/data-diff/issues) to discuss it, or vote on existing requests to push them up our todo list.
@@ -92,7 +92,7 @@ Once you've installed `data-diff`, you can run it from the command line.
-Be sure to read [the docs](https://docs.datafold.com/os_diff/how_to_use/how_to_use_with_command_line) for detailed instructions on how to build one of these commands depending on your database setup.
+Be sure to read [the docs](https://docs.datafold.com/reference/open_source/cli) for detailed instructions on how to build one of these commands depending on your database setup.

 #### Code Example: Diff Tables Between Databases
 Here's an example command for your copy/pasting, taken from the screenshot above when we diffed data between Snowflake and Postgres.
@@ -110,8 +110,6 @@ data-diff \

 #### Code Example: Diff Tables Within a Database

-Here's a code example from [the video](https://www.loom.com/share/682e4b7d74e84eb4824b983311f0a3b2), where we compare data between two Snowflake tables within one database.
@@ -130,22 +128,19 @@ In both code examples, I've used `<>` carrots to represent values that **should

 We know that in some cases, the data-diff command can become long and dense. And maybe you're new to the command line.

-* We're here to help [on slack](https://locallyoptimistic.slack.com/archives/C03HUNGQV0S) if you have ANY questions as you use `data-diff` in your workflow.
+* We're here to help [on slack](https://getdbt.slack.com/archives/C03D25A92UU) if you have ANY questions as you use `data-diff` in your workflow.
 * You can also post a question in [GitHub Discussions](https://github.com/datafold/data-diff/discussions).

-
-To get a Slack invite - [click here](https://locallyoptimistic.com/community/)
-
 ## How to Use

-* [How to use from the shell (or: command-line)](https://docs.datafold.com/os_diff/how_to_use/how_to_use_with_command_line)
-* [How to use from Python](https://docs.datafold.com/os_diff/how_to_use/how_to_use_with_python)
-* [How to use with TOML configuration file](https://docs.datafold.com/os_diff/how_to_use/how_to_use_with_toml)
-* [Usage Analytics & Data Privacy](https://docs.datafold.com/os_diff/usage_analytics_data_privacy)
+* [Examples with dbt, joindiff, and hashdiff](https://docs.datafold.com/reference/open_source/cli#examples)
+* [Examples with Python](https://data-diff.readthedocs.io/en/latest/python-api.html)
+* [How to use with TOML configuration file](https://docs.datafold.com/reference/open_source/cli#toml-config-file)

 ## How to Contribute
 * Feel free to open an issue or contribute to the project by working on an existing issue.
 * Please read the [contributing guidelines](https://github.com/datafold/data-diff/blob/master/CONTRIBUTING.md) to get started.
+* To add a new database driver, check out [docs](https://github.com/datafold/data-diff/blob/master/docs/new-database-driver-guide.rst).

 Big thanks to everyone who contributed so far:

@@ -155,7 +150,10 @@ Big thanks to everyone who contributed so far:

 ## Technical Explanation

-Check out this [technical explanation](https://docs.datafold.com/os_diff/technical_explanation) of how data-diff works.
+Check out this [technical explanation](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md) of how data-diff works.
+
+## Analytics
+* [Usage Analytics & Data Privacy](https://github.com/datafold/data-diff/blob/master/docs/usage_analytics.md)
+- **Inspect differences between branches**. Make sure your code results in only expected changes.
+- **Validate stability of critical downstream tables**. When refactoring a data pipeline, rest assured that the changes you make to upstream models have not impacted critical downstream models depended on by users and systems.
+- **Conduct better code reviews**. No matter how thoughtfully you review the code, run a diff to ensure that you don't accidentally approve an error.
+
+## hashdiff
+- **Verify data migrations**. Verify that all data was copied when doing a critical data migration. For example, migrating from Heroku PostgreSQL to Amazon RDS.
+- **Verify data pipelines**. Moving data from a relational database to a warehouse/data lake with Fivetran, Airbyte, Debezium, or some other pipeline.
+- **Maintain data integrity SLOs**. You can create and monitor your SLO of e.g. 99.999% data integrity, and alert your team when data is missing.
+- **Debug complex data pipelines**. Data can get lost in pipelines that may span a half-dozen systems. data-diff helps you efficiently track down where a row got lost without needing to individually inspect intermediate datastores.
+- **Detect hard deletes for an `updated_at`-based pipeline**. If you're copying data to your warehouse based on an `updated_at`-style column, data-diff can find any hard-deletes that you may have missed.
+- **Make your replication self-healing**. You can use data-diff to self-heal by using the diff output to write/update rows in the target database.
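
The self-healing and hard-delete use cases above can be sketched in a few lines of Python. This is a minimal illustration, not data-diff's own code. It assumes the diff arrives as `(sign, row)` pairs, where `"-"` marks a row found only in the source table and `"+"` a row found only in the target, and that the first column of each row is the primary key. `plan_repairs` is a hypothetical helper name.

```python
from typing import Iterable, List, Tuple

Row = Tuple  # one row as a tuple of column values; row[0] is assumed to be the primary key


def plan_repairs(diff: Iterable[Tuple[str, Row]]) -> Tuple[List[Row], List[Row]]:
    """Split a diff into rows to upsert into the target and rows to delete from it.

    Assumption: "-" means the row exists only in the source,
    "+" means the row exists only in the target.
    """
    source_rows = {}
    target_rows = {}
    for sign, row in diff:
        if sign == "-":
            source_rows[row[0]] = row
        elif sign == "+":
            target_rows[row[0]] = row
        else:
            raise ValueError(f"unexpected diff sign: {sign!r}")

    # Rows missing from the target, or present there with different values.
    to_upsert = list(source_rows.values())
    # Rows that exist only in the target: likely hard-deleted at the source.
    to_delete = [row for key, row in target_rows.items() if key not in source_rows]
    return to_upsert, to_delete


sample_diff = [
    ("-", (1, "alice")),    # key 1 differs: source version
    ("+", (1, "alicia")),   # key 1 differs: target version
    ("-", (2, "bob")),      # key 2 is missing from the target
    ("+", (3, "carol")),    # key 3 was hard-deleted at the source
]
to_upsert, to_delete = plan_repairs(sample_diff)
```

Keeping the planning step separate from the database writes makes it possible to review or log the repair plan before anything is changed in the target.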
+data-diff collects anonymous usage data to help our team improve the tool and to apply development efforts to where our users need them most.
+
+We capture two events: one when the data-diff run starts, and one when it is finished. No user data or potentially sensitive information is or ever will be collected. The captured data is limited to:
+
+- Operating System and Python version
+- Types of databases used (postgresql, mysql, etc.)
+- Sizes of tables diffed, run time, and diff row count (numbers only)
+- Error message, if any, truncated to the first 20 characters.
+- A persistent UUID to identify the session, stored in `~/.datadiff.toml`
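
To make the list above concrete, here is a rough sketch of what such an event could look like. It is an illustration only: `build_run_event` and every field name are hypothetical and do not reflect data-diff's real payload or implementation. The only details taken from the text are the kinds of fields and the 20-character cap on error messages.

```python
import platform
from typing import List, Optional

MAX_ERROR_CHARS = 20  # the text says errors are truncated to the first 20 characters


def build_run_event(
    event: str,
    session_id: str,
    db_types: List[str],
    error: Optional[str] = None,
) -> dict:
    """Assemble an anonymous usage event (hypothetical field names)."""
    return {
        "event": event,                        # e.g. "start" or "end"
        "session_id": session_id,              # persistent UUID, no user identity
        "os": platform.system(),
        "python_version": platform.python_version(),
        "db_types": sorted(db_types),          # database types only, no connection details
        "error": error[:MAX_ERROR_CHARS] if error else None,
    }


start_event = build_run_event("start", "0000-demo-session", ["postgresql", "mysql"])
end_event = build_run_event(
    "end",
    "0000-demo-session",
    ["postgresql", "mysql"],
    error="connection refused by host db.internal.example",
)
```

Truncating the error message keeps hostnames, table names, and similar details that often appear later in an error string out of the event.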
+
+To disable, use one of the following methods:
+
+* **CLI**: use the `--no-tracking` flag.
+* **Config file**: set `no_tracking = true` (for example, under `[run.default]`)
+* **Python API**:
+```python
+import data_diff
+# Invoke the following before making any API calls