This repository was archived by the owner on May 17, 2024. It is now read-only.

fix up links in readme #493

Merged
merged 5 commits into from
Apr 12, 2023
28 changes: 13 additions & 15 deletions README.md
@@ -5,11 +5,11 @@
# **data-diff**

## What is `data-diff`?
data-diff is a **free, open-source tool** that enables data professionals to detect differences in values between any two tables. It's fast, easy to use, and reliable. Even at massive scale.
data-diff is a **free, open-source tool** that enables data professionals to detect differences in values between any two tables.

## Documentation

[**🗎 Documentation website**](https://docs.datafold.com/os_diff/about) - our detailed documentation has everything you need to start diffing.
[**🗎 Documentation**](https://docs.datafold.com/guides/os_data_diff) - our detailed documentation has everything you need to start diffing.

### Databases we support

@@ -27,7 +27,7 @@ data-diff is a **free, open-source tool** that enables data professionals to det
- DuckDB >=0.6
- SQLite (coming soon)

For their corresponding connection strings, check out our [detailed table](https://docs.datafold.com/os_diff/databases_we_support).
For their corresponding connection strings, check out our [detailed table](https://github.com/datafold/data-diff/blob/master/docs/supported-databases.md).

#### Looking for a database not on the list?
If a database is not on the list, we'd still love to support it. [Please open an issue](https://github.com/datafold/data-diff/issues) to discuss it, or vote on existing requests to push them up our todo list.
@@ -92,7 +92,7 @@ Once you've installed `data-diff`, you can run it from the command line.
data-diff DB1_URI TABLE1_NAME DB2_URI TABLE2_NAME [OPTIONS]
```

Be sure to read [the docs](https://docs.datafold.com/os_diff/how_to_use/how_to_use_with_command_line) for detailed instructions how to build one of these commands depending on your database setup.
Be sure to read [the docs](https://docs.datafold.com/reference/open_source/cli) for detailed instructions on how to build one of these commands depending on your database setup.
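
If you assemble these commands from a script, the positional order shown above (`DB1_URI TABLE1_NAME DB2_URI TABLE2_NAME [OPTIONS]`) can be captured in a small helper. This is only a sketch: `build_data_diff_command` is a hypothetical name, not part of data-diff, and the URIs in the usage lines are placeholders. It relies on Python's standard `shlex` module to quote URIs that contain shell-special characters such as `?` and `&`.

```python
import shlex
from typing import Sequence


def build_data_diff_command(
    db1_uri: str,
    table1: str,
    db2_uri: str,
    table2: str,
    options: Sequence[str] = (),
) -> str:
    """Return a shell-safe data-diff command line (hypothetical helper).

    Argument order follows the usage line:
    data-diff DB1_URI TABLE1_NAME DB2_URI TABLE2_NAME [OPTIONS]
    """
    argv = ["data-diff", db1_uri, table1, db2_uri, table2, *options]
    # shlex.join quotes every argument that needs it, so characters like
    # '?' and '&' in a connection string are not interpreted by the shell.
    return shlex.join(argv)


# Placeholder URIs and table names, for illustration only.
command = build_data_diff_command(
    "postgresql://user:secret@localhost/shop",
    "orders",
    "snowflake://user:secret@account/DB/SCHEMA?warehouse=WH&role=R",
    "ORDERS",
)
print(command)
```

Quoting matters most for the connection strings: an unquoted `&` would send the command to the background in most shells.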

#### Code Example: Diff Tables Between Databases
Here's an example command for your copy/pasting, taken from the screenshot above when we diffed data between Snowflake and Postgres.
@@ -110,8 +110,6 @@ data-diff \

#### Code Example: Diff Tables Within a Database

Here's a code example from [the video](https://www.loom.com/share/682e4b7d74e84eb4824b983311f0a3b2), where we compare data between two Snowflake tables within one database.

```
data-diff \
  "snowflake://<username>:<password>@<ACCOUNT>/<DATABASE>/<SCHEMA_1>?warehouse=<WAREHOUSE>&role=<ROLE>" <TABLE_1> \
@@ -130,22 +128,19 @@ In both code examples, I've used `<>` carrots to represent values that **should

We know that in some cases, the data-diff command can become long and dense. And maybe you're new to the command line.

* We're here to help [on slack](https://locallyoptimistic.slack.com/archives/C03HUNGQV0S) if you have ANY questions as you use `data-diff` in your workflow.
* We're here to help [on slack](https://getdbt.slack.com/archives/C03D25A92UU) if you have ANY questions as you use `data-diff` in your workflow.
* You can also post a question in [GitHub Discussions](https://github.com/datafold/data-diff/discussions).


To get a Slack invite - [click here](https://locallyoptimistic.com/community/)

## How to Use

* [How to use from the shell (or: command-line)](https://docs.datafold.com/os_diff/how_to_use/how_to_use_with_command_line)
* [How to use from Python](https://docs.datafold.com/os_diff/how_to_use/how_to_use_with_python)
* [How to use with TOML configuration file](https://docs.datafold.com/os_diff/how_to_use/how_to_use_with_toml)
* [Usage Analytics & Data Privacy](https://docs.datafold.com/os_diff/usage_analytics_data_privacy)
* [Examples with dbt, joindiff, and hashdiff](https://docs.datafold.com/reference/open_source/cli#examples)
* [Examples with Python](https://data-diff.readthedocs.io/en/latest/python-api.html)
* [How to use with TOML configuration file](https://docs.datafold.com/reference/open_source/cli#toml-config-file)

## How to Contribute
* Feel free to open an issue or contribute to the project by working on an existing issue.
* Please read the [contributing guidelines](https://github.com/datafold/data-diff/blob/master/CONTRIBUTING.md) to get started.
* To add a new database driver, check out [docs](https://github.com/datafold/data-diff/blob/master/docs/new-database-driver-guide.rst).

Big thanks to everyone who contributed so far:

@@ -155,7 +150,10 @@ Big thanks to everyone who contributed so far:

## Technical Explanation

Check out this [technical explanation](https://docs.datafold.com/os_diff/technical_explanation) of how data-diff works.
Check out this [technical explanation](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md) of how data-diff works.

## Analytics
* [Usage Analytics & Data Privacy](https://github.com/datafold/data-diff/blob/master/docs/usage_analytics.md)

## License

14 changes: 14 additions & 0 deletions docs/common_use_cases.md
@@ -0,0 +1,14 @@
# Common Use Cases

## joindiff
- **Inspect differences between branches**. Make sure your code results in only expected changes.
- **Validate stability of critical downstream tables**. When refactoring a data pipeline, confirm that the changes you make to upstream models have not impacted the critical downstream models that users and systems depend on.
- **Conduct better code reviews**. No matter how thoughtfully you review the code, run a diff to ensure that you don't accidentally approve an error.

## hashdiff
- **Verify data migrations**. Verify that all data was copied when doing a critical data migration. For example, migrating from Heroku PostgreSQL to Amazon RDS.
- **Verify data pipelines**. Check that data arrives intact when you move it from a relational database to a warehouse/data lake with Fivetran, Airbyte, Debezium, or some other pipeline.
- **Maintain data integrity SLOs**. You can create and monitor your SLO of e.g. 99.999% data integrity, and alert your team when data is missing.
- **Debug complex data pipelines**. Data can get lost in pipelines that may span a half-dozen systems. data-diff helps you efficiently track down where a row got lost without needing to individually inspect intermediate datastores.
- **Detect hard deletes for an `updated_at`-based pipeline**. If you're copying data to your warehouse based on an `updated_at`-style column, data-diff can find any hard-deletes that you may have missed.
- **Make your replication self-healing**. You can use data-diff to self-heal by using the diff output to write/update rows in the target database.
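
As a rough illustration of the self-healing idea, the sketch below turns diff output into a repair plan for the target table. It makes several assumptions that are not taken from this document: each diff row is a `(sign, row)` pair, `"-"` marks a row found only in the source table, `"+"` marks a row found only in the target table, and the key is the first column. `plan_repairs` is a hypothetical helper, not a data-diff function, and actually applying the plan (the SQL writes) is left out.

```python
from typing import Dict, Hashable, Iterable, List, Sequence, Tuple

# Assumed shape of one diff row: ("-" or "+", row values).
DiffRow = Tuple[str, Sequence]


def plan_repairs(
    diff_rows: Iterable[DiffRow], key_index: int = 0
) -> Tuple[List[tuple], List[Hashable]]:
    """Split diff rows into rows to upsert and keys to delete (hypothetical).

    Rows marked "-" (source only) must be written to the target.
    Rows marked "+" (target only) whose key never appears on the source
    side are stale and should be deleted from the target.
    """
    source_rows: Dict[Hashable, tuple] = {}
    target_rows: Dict[Hashable, tuple] = {}
    for sign, row in diff_rows:
        key = row[key_index]
        if sign == "-":
            source_rows[key] = tuple(row)
        elif sign == "+":
            target_rows[key] = tuple(row)
        else:
            raise ValueError(f"unexpected diff sign: {sign!r}")

    upserts = [source_rows[key] for key in sorted(source_rows)]
    deletes = sorted(key for key in target_rows if key not in source_rows)
    return upserts, deletes


# Row 1 changed, row 2 is missing from the target, row 3 is stale.
example_diff = [
    ("-", (1, "alice")),
    ("+", (1, "alicia")),
    ("-", (2, "bob")),
    ("+", (3, "carol")),
]
upserts, deletes = plan_repairs(example_diff)
print(upserts)
print(deletes)
```

A changed row shows up on both sides of the diff, so it lands in the upsert list only; the delete list holds keys that exist in the target alone.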
22 changes: 22 additions & 0 deletions docs/usage_analytics.md
@@ -0,0 +1,22 @@
# Usage Analytics & Data Privacy

data-diff collects anonymous usage data to help our team improve the tool and to apply development efforts to where our users need them most.

We capture two events: one when the data-diff run starts, and one when it is finished. No user data or potentially sensitive information is or ever will be collected. The captured data is limited to:

- Operating System and Python version
- Types of databases used (postgresql, mysql, etc.)
- Sizes of tables diffed, run time, and diff row count (numbers only)
- Error message, if any, truncated to the first 20 characters.
- A persistent UUID to identify the session, stored in `~/.datadiff.toml`

To disable, use one of the following methods:

* **CLI**: use the `--no-tracking` flag.
* **Config file**: set `no_tracking = true` (for example, under `[run.default]`).
* **Python API**:
```python
import data_diff
# Invoke the following before making any API calls
data_diff.disable_tracking()
```
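
For the config-file method, a minimal fragment might look like the following. The section and key names come from the list above; placing the key under `[run.default]` is just the example location mentioned there, and the rest of your configuration file is assumed to stay as it is.

```toml
# Fragment of a data-diff TOML configuration file.
# Only the tracking setting is shown; other settings are omitted.
[run.default]
no_tracking = true
```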