This repository was archived by the owner on Feb 3, 2025. It is now read-only.
Merged
189 changes: 100 additions & 89 deletions README.md

# Database Analysis Toolkit

## Overview

The **Database Analysis Toolkit** is a Python-based tool for comprehensive analysis of large datasets, currently focused on geospatial analysis and fuzzy matching. It supports several data formats and provides configurable options to tailor the analysis to specific needs, which makes it particularly useful for data engineers, data scientists, and analysts who work with large datasets and need advanced data-processing capabilities.

## Features

- **Geospatial Analysis**: Calculate distances between geographical coordinates using the Haversine formula and identify clusters within a specified threshold.
- **Fuzzy Matching**: Identify and group similar records within a dataset based on configurable matching criteria.
- **Multiple File Formats**: Load and process data from CSV, Excel, JSON, Parquet, and Feather files.
- **Customizable**: Configure the analysis through a YAML file or command-line arguments to adjust it to your needs.
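
As a rough illustration of the first two features, here is a minimal, self-contained sketch of a Haversine distance and a string-similarity score. It uses the standard library's `difflib` in place of `rapidfuzz` for portability, and the function names and thresholds are illustrative, not the toolkit's actual API:

```python
import math
from difflib import SequenceMatcher

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def similarity(a, b):
    """Similarity ratio in [0, 1]; the toolkit itself uses rapidfuzz for this."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Two near-duplicate rows from the sample data: coordinates metres apart,
# addresses that differ only by a typo, so both scores flag a likely match.
print(haversine_km(41.57376778, -169.5627133, 41.57376772, -169.5627130))  # well under 1 km
print(similarity("650-7555 Pharetra. Ave", "650-7555 Phatra. Ave"))        # close to 1.0
```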

## Project Structure

```plaintext
.
├── config/
│   └── config.yaml              # Configuration file for the analysis
├── data/
│   └── input_file.csv           # Input data files (CSV, Excel, JSON, Parquet, Feather)
├── env/
│   ├── linux/environment.yml    # Conda environment file for Linux
│   └── win/environment.yml      # Conda environment file for Windows
├── logs/
│   └── logfile.log              # Log file storing all logging information
├── modules/
│   ├── data_loader.py           # Module for loading data from various formats
│   ├── fuzzy_matching.py        # Module for performing fuzzy matching
│   └── geospatial_analysis.py   # Module for performing geospatial analysis
├── results/
│   └── output_file.csv          # Output files generated by the analysis
├── util/
│   └── util.py                  # Utility functions for saving files and other tasks
├── database-analysis.py         # Main script to run the analysis
└── README.md                    # Project documentation
```

## Installation

### Prerequisites

- **Conda**: Ensure you have Conda installed. You can install it from [here](https://docs.conda.io/en/latest/miniconda.html).
- **Python 3.11 or later**: The project is compatible with Python 3.11 and above.

### Setting Up the Environment

1. **Clone the Repository**

```bash
git clone https://github.com/umarhunter/database-analysis.git
```

2. **Enter the Repo**
```bash
cd database-analysis
```

3. **Create the Conda Environment**

To create the Conda environment with all necessary dependencies, use the following command (the environment files live under `env/linux/` and `env/win/`; use the one for your platform):

```bash
conda env create -f environment.yml
```

4. **Activate the Environment**

```bash
conda activate database-analysis-env
```

### Manual Installation

If you prefer to install the dependencies without Conda, you can install them using `pip`:

```bash
pip install pandas rapidfuzz haversine pyyaml
```
## Configuration

The toolkit uses a YAML configuration file (`config/config.yaml`) to define various parameters for the analysis, such as:

- **Input and Output Files**: Specify paths for input data and output results.
- **Analysis Options**: Enable or disable geospatial analysis and fuzzy matching.
- **Sorting and Thresholds**: Define columns for sorting and thresholds for matching.

### Example Configuration

Here’s a sample `config.yaml` file:

```yaml
input_file: "data/input.csv"
output_file: "results/output.csv"
sort_by_columns:
  - "first_name"
  - "last_name"
geospatial_analysis: True
geospatial_columns:
  - "latitude"
  - "longitude"
geospatial_threshold: 0.005
fuzzy_matching: True
fuzzy_columns:
  - "address"
fuzzy_threshold: 0.8
```
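
The configuration file can be read with PyYAML. The sketch below shows one way file values and command-line overrides might be combined; `load_config` is a hypothetical helper written for illustration, not necessarily how `database-analysis.py` implements it:

```python
import yaml  # PyYAML, listed in the pip install above

def load_config(path, cli_overrides=None):
    """Load a YAML config file, letting command-line values take precedence.

    cli_overrides maps option names to values; None means "not passed on the
    command line", so the file's value is kept.
    """
    with open(path) as f:
        config = yaml.safe_load(f)
    for key, value in (cli_overrides or {}).items():
        if value is not None:
            config[key] = value
    return config
```

With this precedence, a value such as `fuzzy_threshold` passed on the command line wins over the one in `config.yaml`, while unset options fall back to the file.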

## Usage

### Running the Analysis

To perform the analysis using the configuration file:
```bash
python database-analysis.py --config config/config.yaml
```

You can also override specific configuration values using command-line arguments:
```bash
python database-analysis.py --input_file data/input.csv --output_file results/output.csv --geospatial_analysis True --fuzzy_matching True
```
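
Note that `argparse` receives `True` and `False` from the command line as plain strings. The converter below is a common pattern for turning them into real booleans; whether the actual script parses its flags this way is an assumption:

```python
import argparse

def str2bool(value):
    """Interpret 'True'/'False'-style command-line strings as booleans."""
    if isinstance(value, bool):
        return value
    if value.lower() in ("true", "yes", "1"):
        return True
    if value.lower() in ("false", "no", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")

parser = argparse.ArgumentParser()
parser.add_argument("--geospatial_analysis", type=str2bool, default=None)
parser.add_argument("--fuzzy_matching", type=str2bool, default=None)

args = parser.parse_args(["--geospatial_analysis", "True", "--fuzzy_matching", "False"])
print(args.geospatial_analysis, args.fuzzy_matching)  # True False
```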

### Supported File Formats
- CSV (`.csv`)
- Excel (`.xlsx`)
- JSON (`.json`)
- Parquet (`.parquet`)
- Feather (`.feather`)
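
A dispatch table keyed on the file extension is one way a module like `modules/data_loader.py` could support the formats listed in the features above. The sketch below, built on pandas readers, is an assumption about the implementation, not the module's actual code:

```python
import os
import pandas as pd

# Map each supported extension to the matching pandas reader.
READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
    ".feather": pd.read_feather,
}

def load_data(path):
    """Load a supported file into a DataFrame, dispatching on its extension."""
    ext = os.path.splitext(path)[1].lower()
    try:
        reader = READERS[ext]
    except KeyError:
        raise ValueError(f"Unsupported file format: {ext}")
    return reader(path)
```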

### Logging
All logging information is saved in the `logs/logfile.log` file. The log file includes details about data loading, the execution of geospatial analysis, fuzzy matching, and any errors encountered during processing.
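
A minimal sketch of how such a log file can be set up with Python's standard `logging` module; the format string and logger name are illustrative assumptions, not the toolkit's actual configuration:

```python
import logging
import os

os.makedirs("logs", exist_ok=True)  # basicConfig will not create the directory

logging.basicConfig(
    filename="logs/logfile.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    force=True,  # reset any handlers configured earlier in the process
)

logger = logging.getLogger("database-analysis")
logger.info("data loading finished")
logger.warning("example warning record")
```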

## Contributing

We welcome contributions to the Database Analysis Toolkit! If you would like to contribute:

1. Fork the repository.
2. Create a new branch (`git checkout -b feature/YourFeature`).
3. Make your changes and commit them (`git commit -m 'Add some feature'`).
4. Push to the branch (`git push origin feature/YourFeature`).
5. Open a Pull Request.

## License

This project is licensed under the GNU License. See the [LICENSE](LICENSE) file for details.

## Acknowledgements

This toolkit leverages Python libraries such as `pandas`, `rapidfuzz`, and `haversine` to perform data analysis. We thank the open-source community for their continuous support and contributions.
4 changes: 2 additions & 2 deletions config/config.yaml
```yaml
input_file_name: "database.csv"
output_file_name: "results.csv"
sort_by_columns: ["first_name", "last_name"]
geospatial_analysis: "True"
geospatial_columns: ["lat", "lon"]
```
6 changes: 3 additions & 3 deletions data/database.csv
```csv
guid,altid,first_name,last_name,address,zip,region,lat,lon
986FC58A-3B13-7DEE-ECDA-95E2388AE5FD,1D265437-1314-4EA7-2B32-26D30A495809,Daryl,Valenzuela,650-7555 Pharetra. Ave,665818,Brussels Hoofdstedelijk Gewest,41.57376778,-169.5627133
986FC58A-3B13-7DEE-ECDA-95E2388AE5FD,2167E809-4F99-EA36-A306-177CA6B8CBD5,Daryl,Valenzuela,650-7555 Phatra. Ave,665818,Brussels Hoofdstedelijk Gewest,41.57376772,-169.5627130
986FC58A-3B13-7DEE-ECDA-95E2388AE5FD,CEE49EDD-5337-EA8D-E5D9-695EA794675E,Daryl,Valenzuela,650-7555 Phatra,665818,Brussels Hoofdstedelijk Gewest,41.57376769,-169.5627131
986FC58A-3B13-7DEE-ECDA-95E2388AE5FD,2BA41BA5-BD91-2C6E-BDBA-7AEBEBCA4A8B,Daryl,Valenzuela,650-7555 Phatra Ave,665818,Brussels Hoofdstedelijk Gewest,41.57376768,-169.5627136
C1DFC1C5-E1AE-6C3C-B896-6AC906D78B75,B88D594D-D82B-FF96-6A13-BE3F5EA8C09E,Harrison,Bradshaw,"P.O. Box 672, 3567 Lorem, St.",13301,La Libertad,-37.1363712,-164.7661257
B0F8617E-B92E-3D28-6C45-D33CAC621FF3,CE71E9BE-2D11-74A7-9F53-DD6A7C4A2188,Caesar,Matthews,9348 Ultricies Rd.,76358,Limburg,-63.38889011,-140.5019733
422A904A-65C4-B8E0-5D98-B4A4E3C15DC0,3CC12277-AE43-EC0A-709D-ECFD34C12C92,Erica,Barker,Ap #169-7969 Commodo Ave,641319,Munster,14.90508831,77.67420406
```