Implementation of a top-down discretization algorithm with a greedy criterion, which maximizes the number of object pairs with different labels (classes) separated by the split.
The algorithm performs the following steps:
- Data loading: Input CSV file.
- Continuous attribute analysis:
  - Generate all candidate split points (midpoints between adjacent unique values).
  - For each split point, count the object pairs from different classes that it separates.
  - Select the split point with the maximum "separation gain."
- Recursion: Repeat the process in the resulting subintervals until the desired number of intervals (`n_bins`) is reached.
- Save the result: Create a new CSV file with the discretized values.
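The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `main.py`: the function names `candidate_splits`, `separated_pairs`, `best_split`, and `discretize` are hypothetical. It also assumes that values equal to a split point go to the left interval and that each round splits whichever interval offers the highest gain.

```python
from collections import Counter


def candidate_splits(values):
    """Midpoints between adjacent unique values (sorted)."""
    uniq = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]


def separated_pairs(values, labels, split):
    """Count pairs with different labels that end up on opposite sides."""
    left = Counter(l for v, l in zip(values, labels) if v <= split)
    right = Counter(l for v, l in zip(values, labels) if v > split)
    all_pairs = sum(left.values()) * sum(right.values())
    same_label_pairs = sum(left[c] * right[c] for c in left)
    return all_pairs - same_label_pairs


def best_split(values, labels):
    """Return (split, gain) with the highest gain, or None if no candidates."""
    best = None
    for split in candidate_splits(values):
        gain = separated_pairs(values, labels, split)
        if best is None or gain > best[1]:
            best = (split, gain)
    return best


def discretize(values, labels, n_bins):
    """Greedy top-down splitting; returns the sorted list of cut points."""
    intervals = [list(range(len(values)))]  # each interval is a list of row indices
    cuts = []
    while len(intervals) < n_bins:
        best = None
        for pos, idx in enumerate(intervals):
            found = best_split([values[i] for i in idx], [labels[i] for i in idx])
            if found and found[1] > 0 and (best is None or found[1] > best[1]):
                best = (found[0], found[1], pos)
        if best is None:  # no split separates any further pairs
            break
        split, _gain, pos = best
        idx = intervals.pop(pos)
        intervals.insert(pos, [i for i in idx if values[i] > split])
        intervals.insert(pos, [i for i in idx if values[i] <= split])
        cuts.append(split)
    return sorted(cuts)
```

Counting same-label pairs per class and subtracting them from the total avoids enumerating every pair explicitly, which keeps the gain computation linear in the interval size for each candidate split.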
distance,label
3.75,near
9.51,far
...
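For a file in this format, the final "save the result" step might look like the sketch below. It is an assumption, not the project's actual I/O code: `discretize_csv` is a hypothetical helper, the output is assumed to hold zero-based interval indices in place of the raw values, and the cut points in the usage note are borrowed from the sample log purely for illustration.

```python
import bisect
import csv
import io


def discretize_csv(src, dst, cuts, column="distance"):
    """Copy rows from src to dst, replacing `column` with an interval index.

    `cuts` must be sorted. A value equal to a cut point falls into the
    interval on the left of that cut.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row[column] = bisect.bisect_left(cuts, float(row[column]))
        writer.writerow(row)
```

With file-like objects, a call such as `discretize_csv(io.StringIO("distance,label\n3.75,near\n9.51,far\n"), out, [3.95, 7.8])` writes the rows `0,near` and `2,far` to `out`; real files opened with `open(..., newline="")` work the same way.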
python main.py
2025-05-03 15:52:44 - INFO - Selected split 3.95 with gain 12
2025-05-03 15:52:44 - INFO - Selected split 7.8 with gain 12
...
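Log lines in this shape can be produced with Python's standard `logging` module. The sketch below is a guess at a compatible configuration, not the contents of `utils.py`; `make_logger` and the logger name are hypothetical.

```python
import logging


def make_logger(name="discretization"):
    """Logger that prints 'YYYY-MM-DD HH:MM:SS - LEVEL - message' lines."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s - %(levelname)s - %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        ))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `make_logger().info("Selected split %s with gain %d", 3.95, 12)` then emits a line in the same format as the sample output.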
📂 Project
├── main.py # Main file to run the algorithm
├── utils.py # Helper functions: logging, I/O, timing
├── test_data.csv # Example small dataset
├── test_data_large.csv # Generated larger dataset for testing
├── diskretized_test_data.csv # Algorithm output
- Place your CSV file in the project folder.
- Ensure the `distance` and `label` columns exist (or adjust them in `main.py`).
- Run the algorithm: `python main.py`
- You can change the number of intervals (`n_bins`) in the `discretization_alg()` function.
Project created for educational purposes. You are free to modify and use it. 🌱