Key features:
- Custom Key Column: Use
--key
to specify which column uniquely identifies rows. - Output Formats: Choose between human-readable (
readable
), JSON (json
), or save JSON to a file (jsonfile
). - Field Selection: Use
--fields
to compare only specific columns, or--ignorefields
to exclude columns. - Streaming Mode: For very large files, choose from multiple streaming modes (
--streaming=auto|normal|sortedkey|off
) to optimize performance. - Encoding: Specify file encoding with
--encoding
. - Show Unchanged: Use
--showunchanged
to display unchanged fields for changed rows. - List Fields: Use
--listfields
to print available columns and exit. - Performance Metrics: Automatically displays elapsed time for diff operations.
db-diff is the most advanced and user-friendly Python CLI tool and library for comparing CSV, TSV, and JSON database dumps. Designed specifically for database workflows, it delivers precise, human-readable or machine-readable diffs, supports custom key columns, handles massive files efficiently with streaming mode, and offers flexible field selection.
- Compare CSV, TSV, and JSON files for differences
- Human-readable and machine-readable (JSON) output
- JSON output to file with filename and path support
- Optional timestamp in output filenames
- Detects added, removed, and changed rows and columns
- Supports custom key columns for row identity
- Field inclusion/exclusion for focused diffs
- Streaming mode for very large files (memory efficient)
- Can output to terminal or JSON file
- Automatic delimiter and encoding detection
- Python 3.6+ compatible
pip install git+https://github.com/datsom1/db-diff.git
To upgrade to the latest version:
pip install --upgrade --force-reinstall git+https://github.com/datsom1/db-diff.git
- Python 3.6 or higher
- Dependencies are automatically installed with pip
Suppose you have two CSV files:
one.csv
Id,name,age
1,Cleo,4
2,Pancakes,2
two.csv
Id,name,age
1,Cleo,5
3,Bailey,1
Compare them using:
db-diff one.csv two.csv --key=Id
Sample output:
1 rows changed, 1 rows added, 1 rows removed
1 rows changed
Id: 1
age: "4" => "5"
1 rows added
Id: 3
name: Bailey
age: 1
1 rows removed
Id: 2
name: Pancakes
age: 2
db-diff
is a flexible CLI tool for comparing two data files (CSV, TSV, or JSON). It detects added, removed, and changed rows and columns, and can output results in a human-readable or machine-readable format.
Basic usage:
db-diff [OPTIONS] PREVIOUS CURRENT
PREVIOUS
andCURRENT
are the file paths to the two files you want to compare.- The tool auto-detects file format by extension, or you can specify with
--format
.
Key features:
- Custom Key Column: Use
--key
to specify which column uniquely identifies rows. - Output Formats: Choose between human-readable (
readable
), JSON (json
), or save JSON to a file (jsonfile
). - Field Selection: Use
--fields
to compare only specific columns, or--ignorefields
to exclude columns. - Streaming Mode Options: Choose between different streaming modes for CSV/TSV files:
--streaming=auto
(default): Automatically select the best mode based on file size and key ordering--streaming=normal
: Memory-efficient but slower processing--streaming=sortedkey
: Very fast streaming mode for files with sorted key columns--streaming=off
: Disable streaming regardless of file size
- Encoding: Specify file encoding with
--encoding
. - Show Unchanged: Use
--showunchanged
to display unchanged fields for changed rows. - List Fields: Use
--listfields
to print available columns and exit. - Timing: Use
--time
to display how long the diff operation took.
See all options:
db-diff --help
You can use db-diff
as a Python library for advanced or automated workflows. The library provides functions to load data, compare datasets, and render results.
from db_diff import load_csv, load_json
# Load CSV file, using a specific column as the key
with open("one.csv", encoding="utf-8") as f:
prev = load_csv(f, key="Id")
# Load JSON file, using a specific key
with open("two.json", encoding="utf-8") as f:
curr = load_json(f, key="Id")
from db_diff import compare
# Compare two datasets (dictionaries keyed by your chosen column)
diff = compare(prev, curr, show_unchanged=False)
show_unchanged
: If True, includes unchanged fields for changed rows.fields
: Pass a set of field names to only compare those fields.ignorefields
: Pass a set of field names to ignore during comparison.
# For normal streaming mode (memory efficient)
from db_diff import streaming_compare_csv
diff = streaming_compare_csv(
"one.csv",
"two.csv",
key="Id",
compare_columns={"Id", "name", "age"},
encoding="utf-8",
dialect="excel"
)
# For sorted key streaming mode (faster, but requires sorted keys)
from db_diff import streaming_compare_csv_sorted
diff = streaming_compare_csv_sorted(
"one.csv",
"two.csv",
key="Id",
compare_columns={"Id", "name", "age"},
encoding="utf-8",
dialect="excel"
)
# Check if a file's key column is sorted
from db_diff import is_key_column_sorted
is_sorted = is_key_column_sorted(
"one.csv",
key="Id",
encoding="utf-8",
dialect="excel"
)
The result of compare
or streaming_compare_csv
is a dictionary:
{
"added": [ ... ], # List of added rows (dicts)
"removed": [ ... ], # List of removed rows (dicts)
"changed": [ # List of changed rows
{
"key": "row_id",
"changes": {
"field1": ["old", "new"],
...
},
"unchanged": { ... } # (optional) if show_unchanged=True
},
...
],
"columns_added": [ ... ], # List of columns added
"columns_removed": [ ... ] # List of columns removed
}
from db_diff import human_text
print(human_text(diff, key="Id", current=curr))
from db_diff import load_csv, compare, human_text
with open("one.csv") as f1, open("two.csv") as f2:
prev = load_csv(f1, key="Id")
curr = load_csv(f2, key="Id")
diff = compare(prev, curr, show_unchanged=True)
print(human_text(diff, key="Id", current=curr))
See all available options with:
db-diff --help
A summary of key options:
Option | Description |
---|---|
--key TEXT |
Column to use as a unique ID for each row (default: first column header) |
--output TEXT |
Output format: readable , json , or jsonfile (default: readable) |
--outfilename FILE |
File to write JSON output to (used with --output=jsonfile ) |
--outfilepath DIR |
Directory to save the output file (used with --output=jsonfile ) |
--timestamp |
Add timestamp to filename (YYYY-MM-DD_HH-MM-SS_diffs_.json) |
--fields TEXT |
Comma-separated list of fields to compare (all others ignored) |
--ignorefields TEXT |
Comma-separated list of fields to ignore during comparison |
--showunchanged |
Show all fields for changed records, not just changed fields |
--format TEXT |
Explicitly specify input format: csv , tsv , or json (default: auto-detect) |
--encoding TEXT |
Input file encoding (default: utf-8) |
--streaming=MODE |
Streaming mode for CSV/TSV files. Options: auto (default), normal , sortedkey , off |
--listfields |
List available fields/columns in the input files and exit |
--verbose |
Enable verbose output with detailed logging |
--samplesize NUM |
Number of rows to sample when checking if key columns are sorted (default: 1000) |
--version |
Show the version and exit |
-h, --help |
Show help message and exit |
Show unchanged fields for changed rows:
db-diff one.csv two.csv --key=Id --showunchanged
Output as JSON:
db-diff one.csv two.csv --key=Id --output=json
Save JSON output to a file:
db-diff one.csv two.csv --key=Id --output=jsonfile --outfilename=diffs.json
Save JSON output with timestamp in filename:
db-diff one.csv two.csv --key=Id --output=jsonfile --timestamp
Compare only specific fields:
db-diff one.csv two.csv --key=Id --fields=Id,name
Ignore specific fields:
db-diff one.csv two.csv --key=Id --ignorefields=LastModifiedDate
Streaming mode options for large files:
# Auto mode (default) - automatically selects the best streaming mode
db-diff large1.csv large2.csv --key=Id --streaming=auto
# Normal streaming mode - memory efficient but slower
db-diff large1.csv large2.csv --key=Id --streaming=normal
# Sorted key streaming mode - very fast for files with sorted keys
db-diff large1.csv large2.csv --key=Id --streaming=sortedkey
# Disable streaming regardless of file size
db-diff large1.csv large2.csv --key=Id --streaming=off
# Adjust the sample size for key column sorting detection
db-diff large1.csv large2.csv --key=Id --samplesize=5000
The streaming modes work with CSV/TSV files only. The auto
mode will automatically detect if the key columns are sorted and select the most efficient streaming mode for your data.
Verbose output with detailed logging:
db-diff one.csv two.csv --key=Id --verbose
- Choose the appropriate streaming mode for your data:
auto
(default): Automatically selects the best mode based on file propertiessortedkey
: Fastest streaming mode, but requires key columns to be sorted in both filesnormal
: Memory-efficient but slower, good for very large unsorted filesoff
: Use when you want to process the entire files in memory regardless of size
- The tool automatically detects if key columns are sorted and selects the most efficient mode
- For very large files, the sorted key streaming mode can be 2-5x faster than normal streaming
- Streaming modes read files in chunks, so they use significantly less memory than loading entire files
Author: Thomas Coyle
Repository: https://github.com/datsom1/db-diff