Skip to content

lebinary/parallel-kmeans-cuda

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

68 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

K-means Clustering

Parallel GPU implementations of K-means clustering using CUDA and Thrust.

Build

make clean && make

Run

Example commands for different datasets:

2k dataset (n=2048, d=16, k=16):

./bin/kmeans -v 1 -k 16 -d 16 -i input/random-n2048-d16-c16.txt -m 150 -t 1e-5 -s 8675309

16k dataset (n=16384, d=24, k=16):

./bin/kmeans -v 1 -k 16 -d 24 -i input/random-n16384-d24-c16.txt -m 150 -t 1e-5 -s 8675309

64k dataset (n=65536, d=32, k=16):

./bin/kmeans -v 1 -k 16 -d 32 -i input/random-n65536-d32-c16.txt -m 150 -t 1e-5 -s 8675309

Arguments:

  • -v: version (0=CPU, 1=CUDA Basic, 2=CUDA Shared, 3=Thrust)
  • -k: number of clusters
  • -d: dimension of points
  • -i: input file
  • -m: max iterations
  • -t: convergence threshold
  • -s: random seed
  • -c: output centroids (optional, default: output assignments)

Testing and Analysis

Run correctness tests:

./run_correctness_test.sh [version]
# Example: ./run_correctness_test.sh 1  # Test CUDA Basic
# Or test all: ./run_correctness_test.sh 0 1 2 3

Run performance tests:

./run_performance_test.sh

Generate all graphs:

./generate_all_graphs.sh

Or generate individual graphs:

python3 graph_speedup.py
python3 graph_efficiency.py
python3 graph_data_transfer.py

Generate report PDF:

./generate_pdf.sh

Results

Speedup Plot

Performance Summary (64k dataset):

  • CUDA Basic: 13.4x speedup, 63.8% memory-efficient, 0.29% compute-efficient
  • CUDA Shared: 13.1x speedup, 62.5% memory-efficient, 0.28% compute-efficient
  • Thrust: 5.7x speedup, 27.2% memory-efficient, 0.12% compute-efficient

Key Findings:

  • Workload is memory-bound (not compute-bound)
  • CUDA Basic outperforms Shared Memory (no benefit from caching small centroids)
  • Data transfer overhead is negligible (<0.04% of runtime)

About

Implementing k-means clustering algorithm from scratch with CUDA

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published