Skip to content

An application to compare data from CSV files. Either to be used form the CLI or as importable library. Handles many edge cases, returns detailed feedback.

License

Notifications You must be signed in to change notification settings

rbuerki/compare_data_from_the_command_line

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

83 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Compare DataFrames

(Side project, Summer 2020, actual version: 0.3.0)

Intro

This application compares tabular data from two files that are loaded into Pandas DataFrames. It is especially helpful when a full comparison with Pandas' built-in df.equals(df) function is not possible or not usefull (e.g. because only subsets of the dataframes should be compared or because column names differ etc.). In such cases the app provides interactive step-by-step preprocessing and tries to handle many edge cases that I have encountered in my daily work.

The goal is to prepare the data such that a final boolean matching using Pandas' df.ne(df) function is as usefull as possible.

Features

Contrary to the project's title, the package can now be used either as:

  1. an importable library, for example to be used within a jupyter notebook
  2. a handy stand-alone CLI tool for super-quick data checks

The accepted input formats are:

  • Two CSV files
  • Two XLSX files
  • Two Pandas DataFrames

(Both have to be the same, you can not use different formats.)

Results are:

  • Standard-out process report and summary with the count of differing values per column (CLI and library versions)
  • Option to save a boolean dataframe to excel, indicating the exact locations of these differing values (CLI and library versions)
  • (Library version only) Return of 3 dataframes: The boolean 'df_diff' and the final states of the two processed input tables

Special features for processing are (same for both versions):

  • Possiblity to define specific load parameters for each file that will be passed to Pandas' read_csv or read_excel functions
  • Possiblity to enforce the same column names if these differ but the width of the 2 dataframes is the same
  • Handling of different shapes by finding matching subsets in the columns / indexes for the comparison
  • As far as possible: Handling of different dtypes as long as they are not of object type

Data prerequisites

  • Index values and index names of the two tables have to be consistent respective to the values they represent. Else a comparison is not possible. (One consequence: If for example new datapoints are added to a table, they have to be assigned to new index values while the existing index values are not allowed to change. Else they can not be compared with an earlier version of the table.)
  • If you pass a specific column name to be used as index (you can do this with the load_params, see example below), make sure it exists in both dataframes (if not, use the load_params to rename the columns). Also make sure the index columns have no duplicate values.

Usage

Command line Version

compare_df [options] [path_1] [path_2]

Available options are:

Prefix Description
-l_1, --load_params_1 Load params for file at path_1
-l_2, --load_params_2 Load params for file at path_2

Note: The optional load params have to be passed as single key-value-pairs in string format, each of them separatly for the respective dataframe. You can pass all the args that are accepted by pandas.read_csv or alternatively pandas.read_excel.

A full example could look as follows:

compare_df "data/file_manual.csv" "data/file_auto.csv" -l_1 "engine"="python" -l_1 "sep"=";" -l_1 "index_col"="customer_ID" -l_2 "encoding"="UTF-8" -l_2 "sep"=";" -l_2 "index_col"="customer_ID"

Library Version

>>> import compare_df
>>> df_diff, df_1, df_2 = compare_df.main(
    'path_1',
    'path_2',
    ['load_params_1'],
    ['load_params_2'],
)

Note: Contrary to the CLI version the optional load params are passed as dicts with key-value-pairs in string format. Again, you can pass all the args that are accepted by pandas.read_csv or alternatively pandas.read_excel.

A full example of calling the main() function could look as follows:

df_diff, df_1, df_2 = compare_df.main(
    "data/file_manual.csv",
    "data/file_auto.csv",
    load_params_1={"engine": "python", "sep": ";", "index_col": "customer_ID"},
    load_params_2={"encoding": "UTF-8", "sep": ";", "index_col": "customer_ID"},
)

Installation

Quick & dirty:

  • Clone or fork this repo to your machine
  • Activate a virtual envirenment of your choice (make sure you have pip installed)
  • Open a terminal, navigate to the repo's top-level folder (where setup.py is located)
  • Run the following command:
pip install .

Dependencies: Python >= 3.6, Pandas, xlsx_writer and openpyxl.

(Note: The last of them is used as default option for reading XLSX files. You could also pass another package in the load_params if desired.)

Aknowledgements / Resources

This project was essentially a little playground for experimenting with test driven development, working with a CLI and making a locally installable package (in development mode). The following resources got me started:

TODO

Possible new features for future versions:

  • Enable proper installation, add build / dist
  • Add a simple GUI (using a separate setuptools entry point)
  • Make an executable with PyInstaller (see here)

About

An application to compare data from CSV files. Either to be used form the CLI or as importable library. Handles many edge cases, returns detailed feedback.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published