Skip to content
/ csv-diff Public
forked from simonw/csv-diff

Python CLI tool and library for diffing CSV and JSON files

License

Notifications You must be signed in to change notification settings

yaiol/csv-diff

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

51 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

csv-diff

PyPI Changelog Tests License

Tool for viewing the difference between two CSV, TSV or JSON files. See Generating a commit log for San Francisco’s official list of trees (and the sf-tree-history repo commit log) for background information on this project.

Installation

pip install csv-diff

Usage

Consider two CSV files:

one.csv

id,name,age
1,Cleo,4
2,Pancakes,2

two.csv

id,name,age
1,Cleo,5
3,Bailey,1

csv-diff can show a human-readable summary of differences between the files:

$ csv-diff one.csv two.csv --key=id
1 row changed, 1 row added, 1 row removed

1 row changed

  Row 1
    age: "4" => "5"

1 row added

  id: 3
  name: Bailey
  age: 1

1 row removed

  id: 2
  name: Pancakes
  age: 2

The --key=id option means that the id column should be treated as the unique key, to identify which records have changed. To use a combination of columns as the key, separate them with a comma, e.g., --key=id1,id2.

The --ignore=col option means that the col column will be ignored during the comparison. To ignore multiple columns, separate them with a comma, e.g., --ignore=col1,col2.

The tool will automatically detect if your files are comma- or tab-separated. You can over-ride this automatic detection and force the tool to use a specific format using --iformat=tsv or --iformat=csv.

You can also feed it JSON files, provided they are a JSON array of objects where each object has the same keys. Use --format=json if your input files are JSON.

Use --show-unchanged to include full details of the unchanged values for rows with at least one change in the diff output:

% csv-diff one.csv two.csv --key=id --show-unchanged
1 row changed

  id: 1
    age: "4" => "5"

    Unchanged:
      name: "Cleo"

TSV output

You can use the --oformat tsv option to get a Tab-separated difference:

$ csv-diff one.csv two.csv --key=id --oformat json
Action	Type	Key	Field	Previous	Current
Modified	Summary		rows	1
Modified	Row	1	id
Modified	Field	1	age	4	5
Removed	Summary		rows	1
Removed	Row	2	id
Removed	Field	2	id	2
Removed	Field	2	name	Pancakes
Removed	Field	2	age	2
Added	Summary		rows	1
Added	Row	3	id
Added	Field	3	id	3
Added	Field	3	name	Bailey
Added	Field	3	age	1

XLSX output

You can use the --oformat xlsx option to create a xlsx file

$ csv-diff one.csv two.csv --key=id --oformat xlsx -o diff.xlsx

XLSX Modified XLSX Removed XLSX Added

JSON output

You can use the --oformat json option to get a machine-readable difference:

$ csv-diff one.csv two.csv --key=id --oformat json
{
  "Modified": [
    {
      "Key": "1",
      "Fields": {
        "age": [
          "4",
          "5"
        ]
      }
    }
  ],
  "Added": [
    {
      "Key": "3",
      "Fields": {
        "id": "3",
        "name": "Bailey",
        "age": "1"
      }
    }
  ],
  "Removed": [
    {
      "Key": "2",
      "Fields": {
        "id": "2",
        "name": "Pancakes",
        "age": "2"
      }
    }
  ],
  "Columns Added": [],
  "Columns Removed": []
}

Adding templated extras

You can specify additional keys to be displayed in the human-readable format using the --extra option:

--extra name "Python format string with {id} for variables"

For example, to output a link to https://news.ycombinator.com/latest?id={id} for each item with an ID, you could use this:

csv-diff one.csv two.csv --key=id \
  --extra latest "https://news.ycombinator.com/latest?id={id}"

These extras display something like this:

1 row changed

  id: 41459472
    points: "24" => "25"
    numComments: "5" => "6"
  extras:
    latest: https://news.ycombinator.com/latest?id=41459472

As a Python library

You can also import the Python library into your own code like so:

from csv_diff import load_csv, compare
diff = compare(
    load_csv(open("one.csv"), key="id"),
    load_csv(open("two.csv"), key="id")
)

diff will now contain the same data structure as the output in the --json example above.

If the columns in the CSV have changed, those added or removed columns will be ignored when calculating changes made to specific rows.

As a Docker container

Build the image

$ docker build -t csvdiff .

Run the container

$ docker run --rm -v $(pwd):/files csvdiff

Suppose current directory contains two csv files : one.csv two.csv

$ docker run --rm -v $(pwd):/files csvdiff one.csv two.csv

Alternatives

  • csvdiff is a "fast diff tool for comparing CSV files" - you may get better results from this than from csv-diff against larger files.

About

Python CLI tool and library for diffing CSV and JSON files

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 99.7%
  • Dockerfile 0.3%