This repository was archived by the owner on Feb 3, 2025. It is now read-only.
Merged
189 changes: 100 additions & 89 deletions README.md

# Database Analysis Toolkit

## Overview

The **Database Analysis Toolkit** is a Python-based tool for comprehensive analysis of large datasets, currently focused on geospatial analysis and fuzzy matching. It supports several data formats and provides configurable options to tailor the analysis to specific needs, which makes it particularly useful for data engineers, data scientists, and analysts who work with large datasets and need advanced data-processing capabilities.

## Features

- **Geospatial Analysis**: Calculate distances between geographical coordinates using the Haversine formula and identify clusters within a specified threshold.
- **Fuzzy Matching**: Identify and group similar records within a dataset based on configurable matching criteria.
- **Multiple File Formats**: Load and process data from CSV, Excel, JSON, Parquet, and Feather files.
- **Customizable**: Configure the analysis through a YAML file or command-line arguments to adjust it to your needs.
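
As a rough illustration of the first two features, here is a minimal, self-contained sketch of a Haversine distance and a string-similarity score. It uses the standard library's `difflib` in place of `rapidfuzz` for portability, and the function names and thresholds are illustrative, not the toolkit's actual API:

```python
import math
from difflib import SequenceMatcher

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def similarity(a, b):
    """Similarity ratio in [0, 1]; the toolkit itself uses rapidfuzz for this."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Two near-duplicate rows from the sample data: coordinates metres apart,
# addresses that differ only by a typo, so both scores flag a likely match.
print(haversine_km(41.57376778, -169.5627133, 41.57376772, -169.5627130))  # well under 1 km
print(similarity("650-7555 Pharetra. Ave", "650-7555 Phatra. Ave"))        # close to 1.0
```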

## Project Structure

```plaintext
.
├── config/
│   └── config.yaml              # Configuration file for the analysis
├── data/
│   └── input_file.csv           # Input data files (CSV, Excel, JSON, Parquet, Feather)
├── env/
│   ├── linux/environment.yml    # Conda environment file for Linux
│   └── win/environment.yml      # Conda environment file for Windows
├── logs/
│   └── logfile.log              # Log file storing all logging information
├── modules/
│   ├── data_loader.py           # Module for loading data from various formats
│   ├── fuzzy_matching.py        # Module for performing fuzzy matching
│   └── geospatial_analysis.py   # Module for performing geospatial analysis
├── results/
│   └── output_file.csv          # Output files generated by the analysis
├── util/
│   └── util.py                  # Utility functions for saving files and other tasks
├── database-analysis.py         # Main script to run the analysis
└── README.md                    # Project documentation
```

## Installation

### Prerequisites

- **Conda**: Ensure you have Conda installed. You can install it from [here](https://docs.conda.io/en/latest/miniconda.html).
- **Python 3.11 or later**: The project is compatible with Python 3.11 and above.

### Setting Up the Environment

1. **Clone the Repository**

```bash
git clone https://github.com/umarhunter/database-analysis.git
```

2. **Enter the Repo**
```bash
cd database-analysis
```

3. **Create the Conda Environment**

To create the Conda environment with all necessary dependencies, use the following command (the environment files live under `env/linux/` and `env/win/`; use the one for your platform):

```bash
conda env create -f environment.yml
```

4. **Activate the Environment**

```bash
conda activate database-analysis-env
```

### Manual Installation

If you prefer to install the dependencies without Conda, you can install them using `pip`:

```bash
pip install pandas rapidfuzz haversine pyyaml
```
## Configuration

The toolkit uses a YAML configuration file (`config/config.yaml`) to define various parameters for the analysis, such as:

- **Input and Output Files**: Specify paths for input data and output results.
- **Analysis Options**: Enable or disable geospatial analysis and fuzzy matching.
- **Sorting and Thresholds**: Define columns for sorting and thresholds for matching.

### Example Configuration

Here’s a sample `config.yaml` file:

```yaml
input_file: "data/input.csv"
output_file: "results/output.csv"
sort_by_columns:
  - "first_name"
  - "last_name"
geospatial_analysis: True
geospatial_columns:
  - "latitude"
  - "longitude"
geospatial_threshold: 0.005
fuzzy_matching: True
fuzzy_columns:
  - "address"
fuzzy_threshold: 0.8
```
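
The configuration file can be read with PyYAML. The sketch below shows one way file values and command-line overrides might be combined; `load_config` is a hypothetical helper written for illustration, not necessarily how `database-analysis.py` implements it:

```python
import yaml  # PyYAML, listed in the pip install above

def load_config(path, cli_overrides=None):
    """Load a YAML config file, letting command-line values take precedence.

    cli_overrides maps option names to values; None means "not passed on the
    command line", so the file's value is kept.
    """
    with open(path) as f:
        config = yaml.safe_load(f)
    for key, value in (cli_overrides or {}).items():
        if value is not None:
            config[key] = value
    return config
```

With this precedence, a value such as `fuzzy_threshold` passed on the command line wins over the one in `config.yaml`, while unset options fall back to the file.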

## Usage

### Running the Analysis

To perform the analysis using the configuration file:
```bash
python database-analysis.py --config config/config.yaml
```

You can also override specific configuration values using command-line arguments:
```bash
python database-analysis.py --input_file data/input.csv --output_file results/output.csv --geospatial_analysis True --fuzzy_matching True
```
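
Note that `argparse` receives `True` and `False` from the command line as plain strings. The converter below is a common pattern for turning them into real booleans; whether the actual script parses its flags this way is an assumption:

```python
import argparse

def str2bool(value):
    """Interpret 'True'/'False'-style command-line strings as booleans."""
    if isinstance(value, bool):
        return value
    if value.lower() in ("true", "yes", "1"):
        return True
    if value.lower() in ("false", "no", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")

parser = argparse.ArgumentParser()
parser.add_argument("--geospatial_analysis", type=str2bool, default=None)
parser.add_argument("--fuzzy_matching", type=str2bool, default=None)

args = parser.parse_args(["--geospatial_analysis", "True", "--fuzzy_matching", "False"])
print(args.geospatial_analysis, args.fuzzy_matching)  # True False
```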

### Supported File Formats
- CSV (`.csv`)
- Excel (`.xlsx`)
- JSON (`.json`)
- Parquet (`.parquet`)
- Feather (`.feather`)
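
A dispatch table keyed on the file extension is one way a module like `modules/data_loader.py` could support the formats listed in the features above. The sketch below, built on pandas readers, is an assumption about the implementation, not the module's actual code:

```python
import os
import pandas as pd

# Map each supported extension to the matching pandas reader.
READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
    ".feather": pd.read_feather,
}

def load_data(path):
    """Load a supported file into a DataFrame, dispatching on its extension."""
    ext = os.path.splitext(path)[1].lower()
    try:
        reader = READERS[ext]
    except KeyError:
        raise ValueError(f"Unsupported file format: {ext}")
    return reader(path)
```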

### Logging
All logging information is saved in the `logs/logfile.log` file. The log file includes details about data loading, the execution of geospatial analysis, fuzzy matching, and any errors encountered during processing.
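
A minimal sketch of how such a log file can be set up with Python's standard `logging` module; the format string and logger name are illustrative assumptions, not the toolkit's actual configuration:

```python
import logging
import os

os.makedirs("logs", exist_ok=True)  # basicConfig will not create the directory

logging.basicConfig(
    filename="logs/logfile.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    force=True,  # reset any handlers configured earlier in the process
)

logger = logging.getLogger("database-analysis")
logger.info("data loading finished")
logger.warning("example warning record")
```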

## Contributing

We welcome contributions to the Database Analysis Toolkit! If you would like to contribute:

1. Fork the repository.
2. Create a new branch (`git checkout -b feature/YourFeature`).
3. Make your changes and commit them (`git commit -m 'Add some feature'`).
4. Push to the branch (`git push origin feature/YourFeature`).
5. Open a Pull Request.

## License

This project is licensed under the GNU License. See the [LICENSE](LICENSE) file for details.

## Acknowledgements

This toolkit leverages Python libraries such as `pandas`, `rapidfuzz`, and `haversine` to perform data analysis. We thank the open-source community for their continuous support and contributions.
4 changes: 2 additions & 2 deletions config/config.yaml
```yaml
input_file_name: "database.csv"
output_file_name: "results.csv"
sort_by_columns: ["first_name", "last_name"]
geospatial_analysis: "True"
geospatial_columns: ["lat", "lon"]
```
6 changes: 3 additions & 3 deletions data/database.csv
```csv
guid,altid,first_name,last_name,address,zip,region,lat,lon
986FC58A-3B13-7DEE-ECDA-95E2388AE5FD,1D265437-1314-4EA7-2B32-26D30A495809,Daryl,Valenzuela,650-7555 Pharetra. Ave,665818,Brussels Hoofdstedelijk Gewest,41.57376778,-169.5627133
986FC58A-3B13-7DEE-ECDA-95E2388AE5FD,2167E809-4F99-EA36-A306-177CA6B8CBD5,Daryl,Valenzuela,650-7555 Phatra. Ave,665818,Brussels Hoofdstedelijk Gewest,41.57376772,-169.5627130
986FC58A-3B13-7DEE-ECDA-95E2388AE5FD,CEE49EDD-5337-EA8D-E5D9-695EA794675E,Daryl,Valenzuela,650-7555 Phatra,665818,Brussels Hoofdstedelijk Gewest,41.57376769,-169.5627131
986FC58A-3B13-7DEE-ECDA-95E2388AE5FD,2BA41BA5-BD91-2C6E-BDBA-7AEBEBCA4A8B,Daryl,Valenzuela,650-7555 Phatra Ave,665818,Brussels Hoofdstedelijk Gewest,41.57376768,-169.5627136
C1DFC1C5-E1AE-6C3C-B896-6AC906D78B75,B88D594D-D82B-FF96-6A13-BE3F5EA8C09E,Harrison,Bradshaw,"P.O. Box 672, 3567 Lorem, St.",13301,La Libertad,-37.1363712,-164.7661257
B0F8617E-B92E-3D28-6C45-D33CAC621FF3,CE71E9BE-2D11-74A7-9F53-DD6A7C4A2188,Caesar,Matthews,9348 Ultricies Rd.,76358,Limburg,-63.38889011,-140.5019733
422A904A-65C4-B8E0-5D98-B4A4E3C15DC0,3CC12277-AE43-EC0A-709D-ECFD34C12C92,Erica,Barker,Ap #169-7969 Commodo Ave,641319,Munster,14.90508831,77.67420406
```