This repository contains an implementation of the K-Means clustering algorithm that combines Python and C for performance. It was built for a university course on data analysis in C and Python.
- Python Implementation: Handles data processing and initialization using K-Means++.
- C Extension: Optimized clustering computation using linked lists for efficient memory management.
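
To make the division of labor concrete, here is a minimal pure-Python sketch of the kind of iteration the C extension is described as performing. It illustrates standard Lloyd-style K-Means under assumed details (Euclidean distance, stopping once every centroid moves less than `epsilon`) and is not the repository's C code. The names `euclidean` and `lloyd_kmeans` are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def lloyd_kmeans(points, centroids, max_iter=300, epsilon=0.001):
    """Refine `centroids` with standard Lloyd iterations (illustrative sketch)."""
    centroids = [list(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: euclidean(p, centroids[i]),
            )
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    [sum(m[d] for m in members) / len(members)
                     for d in range(len(old))]
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(old)

        # Assumed stopping rule: the largest centroid shift is below epsilon.
        shift = max(euclidean(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < epsilon:
            break
    return centroids
```

Calling `lloyd_kmeans(points, initial_centroids)` returns the refined centroids as a list of coordinate lists. A C implementation would typically follow the same two steps while managing memory by hand.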
- kmeans_pp.py: Python implementation, including K-Means++ initialization.
- kmeansmodule.c: C extension implementing the core clustering logic.
- setup.py: Build script for compiling the C extension into a Python module.
- Python 3.x installed.
- A C compiler such as gcc.
Run the following command to compile the C module:

    python setup.py build_ext --inplace

This generates a shared library (mykmeanssp.*.so, or .pyd on Windows) that can be imported into Python.
Run the script with:

    python kmeans_pp.py <k> [<max_iter>] <epsilon> <input_file_1> <input_file_2>

- <k>: Number of clusters (integer > 1 and < N, where N is the number of points).
- <max_iter>: (Optional) Maximum number of iterations (default: 300, max: 1000).
- <epsilon>: Convergence threshold (float >= 0).
- <input_file_1>, <input_file_2>: CSV files containing the input data.
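
The argument rules above can be sketched as a small validation routine. This is an illustrative sketch, not the code in kmeans_pp.py: the function name `parse_args`, the error messages, and the lower bound of 1 on `max_iter` are assumptions. The `k < N` check is left out because it needs the loaded data.

```python
def parse_args(argv):
    """Parse [k, (max_iter), epsilon, file_1, file_2] per the usage rules."""
    if len(argv) == 5:
        k_raw, iter_raw, eps_raw, file_1, file_2 = argv
    elif len(argv) == 4:
        k_raw, eps_raw, file_1, file_2 = argv
        iter_raw = "300"  # documented default when max_iter is omitted
    else:
        raise ValueError("expected 4 or 5 arguments")

    # int() and float() raise ValueError themselves on malformed input.
    k = int(k_raw)
    max_iter = int(iter_raw)
    epsilon = float(eps_raw)

    if k <= 1:
        raise ValueError("k must be an integer greater than 1")
    # Upper bound of 1000 is documented; the lower bound of 1 is an assumption.
    if not 1 <= max_iter <= 1000:
        raise ValueError("max_iter must be between 1 and 1000")
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    return k, max_iter, epsilon, file_1, file_2
```

A script could call it as `parse_args(sys.argv[1:])` and, once the points are loaded, additionally reject any `k` that is not smaller than the number of points.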
For example:

    python kmeans_pp.py 3 300 0.001 data1.csv data2.csv

- The program prints the initial centroids' indices.
- The final cluster centroids are printed, with each centroid on a new line formatted to 4 decimal places.
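
The output described above could be produced along these lines. This is a hedged sketch: the comma separator and the helper name `format_output` are assumptions; only the index line and the 4-decimal formatting come from the description.

```python
def format_output(initial_indices, centroids):
    """Build the output text: one index line, then one centroid per line."""
    # First line: indices of the points chosen as initial centroids.
    lines = [",".join(str(i) for i in initial_indices)]
    # Remaining lines: each centroid with coordinates at 4 decimal places.
    for centroid in centroids:
        lines.append(",".join(f"{value:.4f}" for value in centroid))
    return "\n".join(lines)
```

Printing the returned string with `print(format_output(indices, centroids))` yields one line per centroid after the index line.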
- Input files must be in CSV format with numerical values.
- The implementation uses the K-Means++ initialization method for better convergence.
- The Python script interfaces with the optimized C extension for better performance.
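
For readers unfamiliar with K-Means++ initialization, mentioned in the notes above, the following is a minimal sketch of its standard formulation: the first centroid is drawn uniformly at random, and each later one is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. It uses only the standard library and may differ from kmeans_pp.py in details. The names `sq_dist` and `kmeans_pp_init` and the `seed` parameter are hypothetical, and the sketch assumes the data contains at least `k` distinct points.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, seed=0):
    """Return the indices of k initial centroids chosen by K-Means++."""
    rng = random.Random(seed)
    # First centroid: uniform random choice.
    chosen = [rng.randrange(len(points))]
    # Squared distance from each point to its nearest chosen centroid.
    dist_sq = [sq_dist(p, points[chosen[0]]) for p in points]
    while len(chosen) < k:
        # Draw the next index with probability proportional to dist_sq.
        # Already-chosen points have weight 0 and cannot be drawn again.
        nxt = rng.choices(range(len(points)), weights=dist_sq, k=1)[0]
        chosen.append(nxt)
        dist_sq = [
            min(d, sq_dist(p, points[nxt])) for d, p in zip(dist_sq, points)
        ]
    return chosen
```

The returned indices correspond to the "initial centroids' indices" that the program prints, and the matching points would then be handed to the clustering step as starting centroids.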
This project is released under the MIT License.