A Python-based utility to compare two CSV files and find differences between them based on user-specified columns. This tool is ideal for identifying mismatched or missing rows in datasets.
- Cleans column data by stripping whitespace and removing newlines.
- Allows users to specify columns to compare.
- Displays differences in rows between two CSV files.
- Saves the differences into a new CSV file for easy access.
- Python 3.x
- pandas library
To install the required dependencies:
pip install -r requirements.txt
git clone <repository_url>
cd dataset-diff-finder
python main.py
- Enter the path to the first CSV file (e.g.,
path/to/data-set-1.csv
). - Enter the path to the second CSV file (e.g.,
path/to/data-set-2.csv
). - View the available columns in both files and specify the columns to compare (comma-separated, e.g.,
state,city
). - Enter the path to save the differences CSV file (e.g.,
path/to/diffs.csv
).
data-set-1.csv:
state,city,pop
NY,NYC,8.8
CA,LA,3.8
IL,CHI,2.5
data-set-2.csv:
state,city,pop
NY,NYC,8.8
CA,LA,3.8
Enter the path to the first CSV file: path/to/data-set-1.csv
Enter the path to the second CSV file: path/to/data-set-2.csv
Columns in File 1: ['state', 'city', 'pop']
Columns in File 2: ['state', 'city', 'pop']
Enter the column names to compare (comma-separated): state,city
Enter the path to save the differences CSV file: path/to/diffs.csv
diffs.csv:
state,city,pop
IL,CHI,2.5
dataset-diff-finder/
├── .venv/ # Virtual environment (optional)
├── tests/ # Directory for test files
├── venv/ # Virtual environment (if used)
├── .gitignore # Git ignored files
├── main.py # Main Python script
├── README.md # Project documentation
├── requirements.txt # Python dependencies
- Ensure that both CSV files have the same structure (matching columns) for accurate comparisons.
- Only rows that differ between the two files will be saved in the output file.