Parallel GPU implementations of K-means clustering using CUDA and Thrust.
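For orientation, a minimal sketch of the assignment step that the CUDA versions parallelize: one thread per point, scanning all k centroids for the nearest one by squared Euclidean distance. Names and the row-major layout are illustrative assumptions, not the repo's actual kernel signatures.

#include <float.h>

// Illustrative assignment kernel: one thread per point (hypothetical names).
__global__ void assign_labels(const float *points, const float *centroids,
                              int *labels, int n, int d, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float best = FLT_MAX;
    int best_c = 0;
    for (int c = 0; c < k; ++c) {
        // Squared Euclidean distance from point i to centroid c.
        float dist = 0.0f;
        for (int j = 0; j < d; ++j) {
            float diff = points[i * d + j] - centroids[c * d + j];
            dist += diff * diff;
        }
        if (dist < best) { best = dist; best_c = c; }
    }
    labels[i] = best_c;
}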
Build:
make clean && make

Example commands for different datasets:
2k dataset (n=2048, d=16, k=16):
./bin/kmeans -v 1 -k 16 -d 16 -i input/random-n2048-d16-c16.txt -m 150 -t 1e-5 -s 8675309

16k dataset (n=16384, d=24, k=16):
./bin/kmeans -v 1 -k 16 -d 24 -i input/random-n16384-d24-c16.txt -m 150 -t 1e-5 -s 8675309

64k dataset (n=65536, d=32, k=16):
./bin/kmeans -v 1 -k 16 -d 32 -i input/random-n65536-d32-c16.txt -m 150 -t 1e-5 -s 8675309

Arguments:
-v: version (0=CPU, 1=CUDA Basic, 2=CUDA Shared, 3=Thrust)
-k: number of clusters
-d: dimension of points
-i: input file
-m: max iterations
-t: convergence threshold
-s: random seed
-c: output centroids (optional; default: output assignments)
Run correctness tests:
./run_correctness_test.sh [version]
# Example: ./run_correctness_test.sh 1 # Test CUDA Basic
# Or test all: ./run_correctness_test.sh 0 1 2 3

Run performance tests:
./run_performance_test.sh

Generate all graphs:
./generate_all_graphs.sh

Or generate individual graphs:
python3 graph_speedup.py
python3 graph_efficiency.py
python3 graph_data_transfer.py

Generate report PDF:
./generate_pdf.sh

Performance Summary (64k dataset):
- CUDA Basic: 13.4x speedup, 63.8% memory efficiency, 0.29% compute efficiency
- CUDA Shared: 13.1x speedup, 62.5% memory efficiency, 0.28% compute efficiency
- Thrust: 5.7x speedup, 27.2% memory efficiency, 0.12% compute efficiency
Key Findings:
- Workload is memory-bound (not compute-bound)
- CUDA Basic outperforms Shared Memory (no benefit from caching small centroids; see the kernel sketch after this list)
- Data transfer overhead is negligible (<0.04% of runtime)
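To illustrate the Basic vs. Shared Memory comparison, a hedged sketch (illustrative names, same row-major layout as the kernel above) of the shared-memory variant: the block first stages the centroid table in shared memory, then scans it. With k=16 and d<=32 the table is at most 16*32*4 = 2 KB, small enough to stay cache-resident anyway, so the extra staging and __syncthreads() do not reduce the DRAM traffic from point reads that dominates this memory-bound workload.

#include <float.h>

// Illustrative shared-memory variant; launch with k*d*sizeof(float) bytes of
// dynamic shared memory, e.g.
// assign_labels_shared<<<blocks, threads, k*d*sizeof(float)>>>(...).
__global__ void assign_labels_shared(const float *points, const float *centroids,
                                     int *labels, int n, int d, int k) {
    extern __shared__ float s_centroids[];   // k * d floats
    // Cooperatively stage the centroid table in shared memory.
    for (int t = threadIdx.x; t < k * d; t += blockDim.x)
        s_centroids[t] = centroids[t];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float best = FLT_MAX;
    int best_c = 0;
    for (int c = 0; c < k; ++c) {
        // Same distance scan as the basic kernel, but reading staged centroids.
        float dist = 0.0f;
        for (int j = 0; j < d; ++j) {
            float diff = points[i * d + j] - s_centroids[c * d + j];
            dist += diff * diff;
        }
        if (dist < best) { best = dist; best_c = c; }
    }
    labels[i] = best_c;
}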
