This repository was archived by the owner on Feb 3, 2025. It is now read-only.

Commit 5c3f085

Merge pull request #2
Feature 1.0
2 parents ccd67d8 + 2279d12 commit 5c3f085

16 files changed (+338, -610 lines changed)

README.md

Lines changed: 100 additions & 89 deletions
@@ -1,126 +1,137 @@

-# Database Analysis
+# Database Analysis Toolkit

-## Overview
-
-Database-Analysis is a Python Jupyter notebook designed to ensure data integrity by identifying inconsistencies between two flat files. One of these files serves as the database for an corporation/organization . The tool logs any inconsistencies found, facilitating easy identification and correction of data issues. In addition, it provides some analytics that may be of use.
+The **Database Analysis Toolkit** is a Python-based tool designed to perform comprehensive data analysis on large datasets, currently focusing on geospatial analysis and fuzzy matching. The toolkit supports various data formats and provides configurable options to tailor the analysis to specific needs. This use-case is particularly useful for data engineers, data scientists, and analysts working with large datasets and looking to perform advanced data processing tasks.

 ## Features

-- **Data Integrity Checks**: Compares two flat files and logs inconsistencies.
-- **Detailed Logging**: Generates a comprehensive log of all inconsistencies found.
-- **User-Friendly Interface**: Easy-to-use Jupyter notebook interface.
-- **Customizable**: Easily adaptable for different data formats and validation rules.
+- **Geospatial Analysis**: Calculate distances between geographical coordinates using the Haversine formula and identify clusters within a specified threshold.
+- **Fuzzy Matching**: Identify and group similar records within a dataset based on configurable matching criteria.
+- **Support for Multiple File Formats**: Easily load and process data from CSV, Excel, JSON, Parquet, and Feather files.
+- **Customizable**: Configurable through a YAML file or command-line arguments, allowing users to adjust the analysis according to their needs.
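
The geospatial feature in the list above rests on the Haversine formula. The following is a minimal sketch of that calculation using only the standard library, not the `haversine` package the toolkit installs. The function names, the kilometre unit, and the mean Earth radius constant are assumptions made for illustration; the toolkit's own `geospatial_analysis.py` may differ.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed constant: mean Earth radius in kilometres


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 so rounding error near antipodal points cannot break asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def pairs_within(points, threshold_km):
    """Return index pairs (i, j), i < j, whose points lie within threshold_km."""
    return [
        (i, j)
        for i in range(len(points))
        for j in range(i + 1, len(points))
        if haversine_km(*points[i], *points[j]) <= threshold_km
    ]
```

The pairwise scan is quadratic in the number of points, which is fine for a sketch but would likely need a spatial index on large datasets.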

-## Installation
+## Project Structure

-Follow these steps to set up your environment and run the Jupyter notebook:
+```plaintext
+.
+├── config/
+│ └── config.yaml # Configuration file for the analysis
+├── data/
+│ └── input_file.csv # Input data files (CSV, Excel, JSON, Parquet, Feather)
+├── env/
+│ ├── linux/environment.yml # Conda environment file for Linux
+│ └── win/environment.yml # Conda environment file for Windows
+├── logs/
+│ └── logfile.log # Log file storing all logging information
+├── modules/
+│ ├── data_loader.py # Module for loading data from various formats
+│ ├── fuzzy_matching.py # Module for performing fuzzy matching
+│ └── geospatial_analysis.py # Module for performing geospatial analysis
+├── results/
+│ └── output_file.csv # Output files generated by the analysis
+├── util/
+│ └── util.py # Utility functions for saving files and other tasks
+├── database-analysis.py # Main script to run the analysis
+└── README.md # Project documentation
+```
+
+## Installation

 ### Prerequisites

-- Python 3.7 or later
-- Jupyter Notebook
-- Virtual Environment (recommended)
+- **Conda**: Ensure you have Conda installed. You can install it from [here](https://docs.conda.io/en/latest/miniconda.html).
+- **Python 3.11 or later**: The project is compatible with Python 3.11 and above.

 ### Setting Up the Environment

-1. **Clone the Repository**
-
-```bash
-git clone https://github.com/umarhunter/database-analysis.git
-```
-
-2. **Enter the Repo**
-```bash
-cd database-analysis
-```
-
-3. **Create a Virtual Environment**
-
-It's recommended to use a virtual environment to manage dependencies.
-
-```bash
-python3 -m venv env
-```
-
-4. **Activate the Virtual Environment**
-
-- On Windows:
-
-```bash
-.\env\Scripts\activate
-```
-
-- On macOS and Linux:
-
-```bash
-source env/bin/activate
-```
-
-5. **Install Dependencies**
-
-```bash
-pip install -r requirements.txt
-```
+To create the Conda environment with all necessary dependencies, use the following command:

+```bash
+conda env create -f environment.yml
+```

-### Setting Up Jupyter
-
-1. **Install Jupyter**
-
-If you don't already have Jupyter installed, you can install it using pip:
-
-```bash
-pip install notebook
-```
+Activate the environment:

-2. **Start Jupyter Notebook**
+```bash
+conda activate database-analysis-env
+```

-Navigate to the project directory and start Jupyter Notebook:
+### Manual Installation

-```bash
-jupyter notebook
-```
+If you prefer to install the dependencies manually or without Conda, you can install them using `pip`:

-3. **Open the Notebook**
+```bash
+pip install pandas rapidfuzz haversine pyyaml
+```

-In the Jupyter interface, open `database-analysis.ipynb`.
+## Configuration
+
+The toolkit uses a YAML configuration file (`config/config.yaml`) to define various parameters for the analysis, such as:
+
+- **Input and Output Files**: Specify paths for input data and output results.
+- **Analysis Options**: Enable or disable geospatial analysis and fuzzy matching.
+- **Sorting and Thresholds**: Define columns for sorting and thresholds for matching.
+
+### Example Configuration
+
+Here’s a sample `config.yaml` file:
+
+```yaml
+input_file: "data/input.csv"
+output_file: "results/output.csv"
+sort_by_columns:
+- "first_name"
+- "last_name"
+geospatial_analysis: True
+geospatial_columns:
+- "latitude"
+- "longitude"
+geospatial_threshold: 0.005
+fuzzy_matching: True
+fuzzy_columns:
+- "address"
+fuzzy_threshold: 0.8
+```

 ## Usage

-1. **Prepare Your Files**
-
-Ensure you have the two flat files ready. One file should be the reference database, and the other should be the data you want to compare against the database. Sample files have already been provided on your behalf (credit: ```generatedata.com```).
+### Running the Analysis

-2. **Run the Notebook**
+To perform the analysis using the configuration file:

-Follow the instructions within the notebook to load your files and execute the data consistency checks.
+```bash
+python database-analysis.py --config config/config.yaml
+```

-3. **Review the Logs**
+You can also override specific configurations using command-line arguments:

-The notebook will output a log file detailing any inconsistencies found between the two files. Review this log to identify and correct data issues.
+```bash
+python database-analysis.py --input_file data/input.csv --output_file results/output.csv --geospatial_analysis True --fuzzy_matching True
+```
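
One way the command-line overrides described above could be layered on top of the YAML file is sketched below. The flag names come from the README commands; the helper names (`str_to_bool`, `build_parser`, `merge_overrides`) and the inline `file_config` dict are hypothetical. The dict stands in for whatever `yaml.safe_load` would return for `config/config.yaml`. A custom boolean parser is used because `--geospatial_analysis True` passes a string, and `argparse` with `type=bool` would treat any non-empty string, including `"False"`, as true.

```python
import argparse


def str_to_bool(value):
    """Parse 'True'/'False'-style strings into real booleans."""
    text = str(value).strip().lower()
    if text in {"true", "1", "yes"}:
        return True
    if text in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser(description="Database analysis (sketch)")
    parser.add_argument("--config", default=None)
    parser.add_argument("--input_file", default=None)
    parser.add_argument("--output_file", default=None)
    parser.add_argument("--geospatial_analysis", type=str_to_bool, default=None)
    parser.add_argument("--fuzzy_matching", type=str_to_bool, default=None)
    return parser


def merge_overrides(config, args):
    """Return a copy of config in which every CLI value that was given wins."""
    merged = dict(config)
    for key, value in vars(args).items():
        if key != "config" and value is not None:
            merged[key] = value
    return merged


# Stand-in for the dict loaded from the YAML file (hypothetical values).
file_config = {"input_file": "data/input.csv", "geospatial_analysis": True, "fuzzy_matching": True}
args = build_parser().parse_args(["--output_file", "results/output.csv", "--fuzzy_matching", "False"])
final_config = merge_overrides(file_config, args)
```

Defaulting every flag to `None` lets the merge step distinguish "not given on the command line" from an explicit `False`.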

-## Project Structure
+### Supported File Formats

-```
-database-analysis/
-
-├── database-analysis.ipynb # Main Jupyter notebook
-├── requirements.txt # Project dependencies
-├── data/ # Directory to store your flat files
-│ ├── database.csv # Example reference file
-│ └── target.csv # Example target file
-└── logs/ # Directory to store log files
-```
+- CSV (`.csv`)
+- Excel (`.xlsx`)
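
Loading by file format is usually a dispatch on the file extension. The sketch below shows one possible shape for that; `READERS`, `reader_name_for`, and `load_data` are hypothetical names, not the contents of `modules/data_loader.py`. It covers the five formats named in the feature list as an assumption, although this section lists only CSV and Excel. The pandas readers it names (`read_csv`, `read_excel`, `read_json`, `read_parquet`, `read_feather`) are real, but the Excel, Parquet, and Feather readers need optional engines such as `openpyxl` or `pyarrow` to be installed.

```python
from pathlib import Path

# Maps a file extension to the name of the pandas reader that handles it.
READERS = {
    ".csv": "read_csv",
    ".xlsx": "read_excel",
    ".json": "read_json",
    ".parquet": "read_parquet",
    ".feather": "read_feather",
}


def reader_name_for(path):
    """Pick the pandas reader name for a path, based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file format: {suffix or '(none)'}") from None


def load_data(path):
    """Load a file into a DataFrame using the reader chosen by extension."""
    import pandas as pd  # imported lazily so the dispatch logic works without pandas

    return getattr(pd, reader_name_for(path))(path)
```

Failing early with a `ValueError` keeps an unsupported extension from surfacing later as a confusing parser error.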

-## Author
+### Logging

-This project is created and maintained by @umarhunter.
+All logging information is saved in the `logs/logfile.log` file. The log file includes details about data loading, the execution of geospatial analysis, fuzzy matching, and any errors encountered during processing.
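
A file logger of the kind described above can be set up with the standard `logging` module. This is a sketch under assumptions: the function name `setup_logging`, the logger name `database_analysis`, and the message format are illustrative, and only the `logs/logfile.log` path comes from the README.

```python
import logging
from pathlib import Path


def setup_logging(log_path="logs/logfile.log", level=logging.INFO):
    """Return a logger that appends to log_path, creating the directory if needed."""
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("database_analysis")  # hypothetical logger name
    logger.setLevel(level)
    if not logger.handlers:  # avoid stacking duplicate handlers on repeated calls
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A caller would then write something like `logger = setup_logging()` followed by `logger.info("Loaded %d rows", n)` at each processing step.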

 ## Contributing

-Contributions are welcome! Please fork the repository and create a pull request with your changes. I'll gladly look at errors and suggestions.
+We welcome contributions to the Database Analysis Toolkit! If you would like to contribute:
+
+1. Fork the repository.
+2. Create a new branch (`git checkout -b feature/YourFeature`).
+3. Make your changes and commit them (`git commit -m 'Add some feature'`).
+4. Push to the branch (`git push origin feature/YourFeature`).
+5. Open a Pull Request.

 ## License

-This project is licensed under the GNU License - see the [LICENSE](LICENSE) file for details.
+This project is licensed under the GNU License. See the `LICENSE` file for more details.
+
+## Acknowledgements
+
+This toolkit leverages Python libraries such as `pandas`, `rapidfuzz`, and `haversine` to perform data analysis. We thank the open-source community for their continuous support and contributions.

config/config.yaml

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
-input_file: "data/database.csv"
-output_file: "results/results.csv"
+input_file_name: "database.csv"
+output_file_name: "results.csv"
 sort_by_columns: ["first_name", "last_name"]
 geospatial_analysis: "True"
 geospatial_columns: ["lat", "lon"]
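
This change replaces full paths with bare file names, which suggests the script now supplies the directories itself. How it does so is not visible in this diff. The sketch below assumes the names are joined onto the `data/` and `results/` directories from the project structure; `DATA_DIR`, `RESULTS_DIR`, and `resolve_paths` are hypothetical names.

```python
from pathlib import Path

DATA_DIR = Path("data")        # assumed location of input files
RESULTS_DIR = Path("results")  # assumed location of output files


def resolve_paths(config):
    """Build full input/output paths from the bare file names in the config."""
    input_path = DATA_DIR / config["input_file_name"]
    output_path = RESULTS_DIR / config["output_file_name"]
    return input_path, output_path


paths = resolve_paths({"input_file_name": "database.csv", "output_file_name": "results.csv"})
```

Under that assumption the resolved paths match the values the old `input_file` and `output_file` keys held.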

data/database.csv

Lines changed: 3 additions & 3 deletions
@@ -1,8 +1,8 @@
 guid,altid,first_name,last_name,address,zip,region,lat,lon
 986FC58A-3B13-7DEE-ECDA-95E2388AE5FD,1D265437-1314-4EA7-2B32-26D30A495809,Daryl,Valenzuela,650-7555 Pharetra. Ave,665818,Brussels Hoofdstedelijk Gewest,41.57376778,-169.5627133
-082789E2-9D3A-DA84-2086-DEDEC83F46C2,2167E809-4F99-EA36-A306-177CA6B8CBD5,Daryl,Valenzuela,650-7555 Phatra. Ave,665818,Brussels Hoofdstedelijk Gewest,41.57376772,-169.5627130
-75261691-7A59-E6E3-D45F-D71DCDE9B64E,CEE49EDD-5337-EA8D-E5D9-695EA794675E,Daryl,Valenzuela,650-7555 Phatra,665818,Brussels Hoofdstedelijk Gewest,41.57376769,-169.5627131
-6F2AE798-B7B5-43C2-43D3-592B1109FA23,2BA41BA5-BD91-2C6E-BDBA-7AEBEBCA4A8B,Daryl,Valenzuela,650-7555 Phatra Ave,665818,Brussels Hoofdstedelijk Gewest,41.57376768,-169.5627136
+986FC58A-3B13-7DEE-ECDA-95E2388AE5FD,2167E809-4F99-EA36-A306-177CA6B8CBD5,Daryl,Valenzuela,650-7555 Phatra. Ave,665818,Brussels Hoofdstedelijk Gewest,41.57376772,-169.5627130
+986FC58A-3B13-7DEE-ECDA-95E2388AE5FD,CEE49EDD-5337-EA8D-E5D9-695EA794675E,Daryl,Valenzuela,650-7555 Phatra,665818,Brussels Hoofdstedelijk Gewest,41.57376769,-169.5627131
+986FC58A-3B13-7DEE-ECDA-95E2388AE5FD,2BA41BA5-BD91-2C6E-BDBA-7AEBEBCA4A8B,Daryl,Valenzuela,650-7555 Phatra Ave,665818,Brussels Hoofdstedelijk Gewest,41.57376768,-169.5627136
 C1DFC1C5-E1AE-6C3C-B896-6AC906D78B75,B88D594D-D82B-FF96-6A13-BE3F5EA8C09E,Harrison,Bradshaw,"P.O. Box 672, 3567 Lorem, St.",13301,La Libertad,-37.1363712,-164.7661257
 B0F8617E-B92E-3D28-6C45-D33CAC621FF3,CE71E9BE-2D11-74A7-9F53-DD6A7C4A2188,Caesar,Matthews,9348 Ultricies Rd.,76358,Limburg,-63.38889011,-140.5019733
 422A904A-65C4-B8E0-5D98-B4A4E3C15DC0,3CC12277-AE43-EC0A-709D-ECFD34C12C92,Erica,Barker,Ap #169-7969 Commodo Ave,641319,Munster,14.90508831,77.67420406
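
The near-duplicate address variants in these sample rows are the kind of input fuzzy matching is meant to group. The sketch below groups addresses taken from the rows above, using the standard library's `difflib.SequenceMatcher` as a stand-in for `rapidfuzz`, so its scores will not match the toolkit's exactly. The greedy grouping strategy and the names `similarity` and `group_similar` are assumptions for illustration. The 0.8 threshold mirrors `fuzzy_threshold` in the README's sample configuration; `rapidfuzz`'s `fuzz.ratio` reports scores from 0 to 100, and how the toolkit maps the threshold onto that scale is not shown in this commit.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] (difflib stand-in for rapidfuzz)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_similar(values, threshold=0.8):
    """Greedy grouping: a value joins the first group whose first member it resembles."""
    groups = []
    for value in values:
        for group in groups:
            if similarity(group[0], value) >= threshold:
                group.append(value)
                break
        else:  # no existing group was close enough
            groups.append([value])
    return groups


addresses = [
    "650-7555 Pharetra. Ave",
    "650-7555 Phatra. Ave",
    "650-7555 Phatra",
    "650-7555 Phatra Ave",
    "P.O. Box 672, 3567 Lorem, St.",
    "9348 Ultricies Rd.",
]
groups = group_similar(addresses)
```

Comparing each value only against a group's first member keeps the sketch simple, but the result can depend on input order.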
