High-performance GPU implementations of Gaussian Blur and Sobel Edge Detection with CUDA, featuring optimized memory usage, benchmarking, and visualization of speedups over CPU.

kolossi101/gpu-image-processing

GPU Performance Optimization for Image Processing Algorithms

Overview

This project demonstrates high-performance implementations of two fundamental image processing algorithms—Gaussian Blur and Sobel Edge Detection—using both CPU and GPU (CUDA) approaches. It benchmarks and visualizes the performance impact of various GPU optimizations, including separable convolution and shared memory tiling.

  • Gaussian Blur: Used for noise reduction and image smoothing.
  • Sobel Edge Detection: Used for detecting edges and boundaries in images.

The project provides:

  • CPU and multiple GPU implementations (naive and optimized)
  • Automated benchmarking and CSV result output
  • Performance visualization scripts and plots

Table of Contents

  • Features
  • Project Structure
  • Setup Instructions
    • Setting Up CUDA Environment
    • Installing OpenCV on Windows
  • Building and Running
  • Visualizing Performance
  • Results Summary
  • Analysis & Insights
  • Limitations & Future Work
  • References

Features

  • CPU Baseline: Sequential C++ implementations for both algorithms.
  • Naive GPU: Direct CUDA kernels with global memory access.
  • Optimized GPU:
    • Gaussian Blur: Separable convolution (reduces per-pixel work from O(k²) to O(k), i.e. 2k multiply-adds instead of k² for a k x k kernel).
    • Sobel Edge: Shared memory tiling to minimize redundant global memory reads.
  • Automated Benchmarking: Batch tests across kernel sizes and block dimensions.
  • CSV Output: Results saved for further analysis.
  • Plot Generation: Python script to visualize timing and speedup.
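
The separable-convolution idea listed above can be illustrated on the CPU. The following is a minimal C++ sketch, not the project's actual code: the function names (gaussian_kernel_1d, blur_separable, blur_direct), the row-major float image layout, and the clamp-to-edge border handling are all assumptions made for illustration. It runs a horizontal 1D pass followed by a vertical 1D pass and includes a direct 2D version for comparison.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: normalized 1D Gaussian weights for an odd kernel size k.
std::vector<float> gaussian_kernel_1d(int k, float sigma) {
    assert(k > 0 && k % 2 == 1 && sigma > 0.0f);
    const int r = k / 2;
    std::vector<float> w(k);
    float sum = 0.0f;
    for (int i = -r; i <= r; ++i) {
        w[i + r] = std::exp(-(i * i) / (2.0f * sigma * sigma));
        sum += w[i + r];
    }
    for (float& v : w) v /= sum;  // normalize so the weights sum to 1
    return w;
}

// Clamp-to-edge border handling (an assumption; the project may differ).
inline int clamp_index(int i, int n) { return i < 0 ? 0 : (i >= n ? n - 1 : i); }

// Separable blur: 2k multiply-adds per pixel (k horizontal + k vertical).
std::vector<float> blur_separable(const std::vector<float>& img, int width,
                                  int height, const std::vector<float>& w) {
    const int r = static_cast<int>(w.size()) / 2;
    std::vector<float> tmp(img.size()), out(img.size());
    for (int y = 0; y < height; ++y)          // horizontal pass: img -> tmp
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int i = -r; i <= r; ++i)
                acc += w[i + r] * img[y * width + clamp_index(x + i, width)];
            tmp[y * width + x] = acc;
        }
    for (int y = 0; y < height; ++y)          // vertical pass: tmp -> out
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int j = -r; j <= r; ++j)
                acc += w[j + r] * tmp[clamp_index(y + j, height) * width + x];
            out[y * width + x] = acc;
        }
    return out;
}

// Direct 2D blur with the outer-product kernel w[j] * w[i]:
// k * k multiply-adds per pixel, shown only for comparison.
std::vector<float> blur_direct(const std::vector<float>& img, int width,
                               int height, const std::vector<float>& w) {
    const int r = static_cast<int>(w.size()) / 2;
    std::vector<float> out(img.size());
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int j = -r; j <= r; ++j)
                for (int i = -r; i <= r; ++i)
                    acc += w[j + r] * w[i + r] *
                           img[clamp_index(y + j, height) * width +
                               clamp_index(x + i, width)];
            out[y * width + x] = acc;
        }
    return out;
}
```

Both functions compute the same result up to floating-point rounding, because a 2D Gaussian kernel is the outer product of two 1D Gaussians. In a GPU version each pass would typically be its own kernel launch, with one thread per output pixel replacing the two outer loops.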

Project Structure

project-root/
│
├── build_and_run.ps1 # Script to build and run the program
├── generate-plots.py # Script to visualize performance data
│
├── data/ # CSV output files
├── images/ # Processed image outputs
├── plots/ # Generated performance plots
├── build/ # Build folder
├── CMakeLists.txt
├── filters_cpu.cpp
├── filters_cpu.h
├── filters_gpu.cu
├── filters_gpu.h
├── image_io.cpp
├── image_io.h
├── main.cpp
├── instructions.md
└── README.md

Setup Instructions

Setting Up CUDA Environment

  1. Create a CUDA 12.4 Runtime project named gpu_image_processing.
  2. Add the project files listed above to it.

Installing OpenCV on Windows (for C++/CMake)

1. Download OpenCV

  • Go to the official OpenCV Releases page.
  • Download the latest Windows pack (opencv-4.x.x-windows.exe).

    Example: opencv-4.12.0-windows.exe

2. Extract the Archive

  • Run the .exe. It is a self-extracting archive: it only unpacks the files and does not install anything.
  • Choose a location, e.g. C:\opencv.

After extraction, you’ll have a folder like:

C:\opencv\build
├── x64
│   └── vc16
│       ├── bin # DLLs
│       ├── lib # Libraries (.lib)
│       └── ...
└── include # Headers

3. Set Environment Variables

  1. Open System Properties → Advanced → Environment Variables.
  2. Add a new system variable:
    OPENCV_DIR = C:\opencv\build
  3. Edit your Path variable and add C:\opencv\build\x64\vc16\bin so that Windows can find opencv_worldXXX.dll at runtime.

Building and Running

  1. Run the script
    Right-click on the build_and_run.ps1 file in File Explorer and select Run with PowerShell to start the program.

  2. Input prompts

    • Enter the image file name to process.
    • Select the processing device:
      • cpu
      • gpu
      • all
  3. Results

    • A printout of results will be displayed in the terminal.
    • Benchmark results will be saved as CSV files in the data/ folder.
    • Processed images will be saved in the images/ folder.

Visualizing Performance

To visualize performance results:

  1. Run the script:
    python generate-plots.py
  2. Generated plots will be saved in the plots/ folder.

Results Summary

Gaussian Blur

  • Direct GPU (2D convolution): Up to 600x speedup over CPU for moderate kernel sizes (5x5, 7x7).
  • Separable GPU: Consistently low execution times and 2000x–3000x speedup for large kernels (15x15).
  • Direct approach loses efficiency for large kernels due to memory bandwidth limits.

Sobel Edge Detection

  • Global memory GPU: Up to 600x speedup for balanced block sizes.
  • Shared memory GPU: Further reduces execution time for small/skewed blocks, but can lose efficiency for large blocks due to occupancy and bank conflicts.
  • Block shape and memory access patterns significantly affect performance.
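
For reference, the operation that both GPU variants compute can be sketched on the CPU. This is a minimal C++ illustration using the standard 3x3 Sobel kernels; the function name sobel_magnitude, the row-major float layout, and the choice to leave border pixels at zero are assumptions, not details taken from the project's source.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Gradient magnitude sqrt(Gx^2 + Gy^2) with the standard 3x3 Sobel kernels.
// Border pixels are left at 0 (an assumption; other border policies exist).
std::vector<float> sobel_magnitude(const std::vector<float>& img, int width,
                                   int height) {
    assert(width > 0 && height > 0);
    assert(img.size() == static_cast<std::size_t>(width) * height);
    static const int gx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    static const int gy[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};
    std::vector<float> out(img.size(), 0.0f);
    for (int y = 1; y < height - 1; ++y)
        for (int x = 1; x < width - 1; ++x) {
            float sx = 0.0f, sy = 0.0f;
            for (int j = -1; j <= 1; ++j)
                for (int i = -1; i <= 1; ++i) {
                    const float p = img[(y + j) * width + (x + i)];
                    sx += gx[j + 1][i + 1] * p;  // horizontal gradient
                    sy += gy[j + 1][i + 1] * p;  // vertical gradient
                }
            out[y * width + x] = std::sqrt(sx * sx + sy * sy);
        }
    return out;
}
```

Each output pixel depends only on its own 3x3 neighborhood, which is why the loop body maps naturally onto one GPU thread per pixel. Neighboring pixels read overlapping windows, which is the redundancy that shared memory tiling targets.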

Analysis & Insights

  • Separable convolution for Gaussian blur is critical for large kernels, reducing per-pixel complexity from O(k²) to O(k): 2k multiply-adds instead of k² (30 instead of 225 for a 15x15 kernel).
  • Shared memory tiling in Sobel edge detection minimizes redundant global memory reads, but optimal block size and shape are crucial to avoid bank conflicts and occupancy loss.
  • GPU acceleration provides dramatic speedups for both algorithms, but careful memory and thread management is required for best results.
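
The read reduction from shared memory tiling can be estimated with simple arithmetic. The C++ sketch below is a back-of-the-envelope model, not a measurement: the name tile_reads is hypothetical, it counts every tap of the stencil window (an implementation may skip zero-weight taps), and it ignores hardware caching, bank conflicts, and occupancy.

```cpp
#include <cassert>

// Global-memory loads issued by one thread block under two strategies.
struct TileReads {
    long naive;  // every thread reads its full (2r+1) x (2r+1) window
    long tiled;  // the block loads its tile plus an r-wide halo once
};

// block_x, block_y: thread block dimensions; radius: stencil radius
// (1 for the 3x3 Sobel kernels).
TileReads tile_reads(int block_x, int block_y, int radius) {
    assert(block_x > 0 && block_y > 0 && radius >= 0);
    const int k = 2 * radius + 1;
    const long naive = 1L * block_x * block_y * k * k;
    const long tiled = 1L * (block_x + 2 * radius) * (block_y + 2 * radius);
    return {naive, tiled};
}
```

Under this model a 16x16 block with radius 1 goes from 2304 loads to 324 (about 7.1x fewer), while a 32x4 block goes from 1152 to 204 (about 5.6x fewer). The halo is a larger fraction of the tile for elongated blocks, which is one reason block shape matters.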

Limitations & Future Work

  • Hardware: Tested only on an NVIDIA A10 GPU (Ampere, compute capability 8.6).
  • Scope: Focused on correct parallel implementation and tuning; did not explore multi-GPU or other architectures.
  • Complexity: CUDA programming challenges (e.g., shared memory, synchronization) limited further optimizations.

Future directions:

  • Test on other GPU architectures (e.g., Hopper, Ada Lovelace).
  • Implement higher-order or multidirectional Sobel filters.
  • Integrate CUDA kernels into deep learning pipelines (e.g., PyTorch custom ops).
  • Explore warp-level primitives and advanced memory prefetching.

References

  • Gonzalez, R. C., Woods, R. E. (2009). Digital Image Processing.
  • NVIDIA CUDA Best Practices Guide (2025).
  • Fisher, R. (2003). Gaussian Smoothing.
  • MoldStud (2025). Sobel Operator.
  • Harris, M. (2013). Optimizing Parallel Reduction in CUDA.
  • Li & Pang (2025). Medical Imaging Applications.
  • Podlozhnyuk, A. (2012). Image Convolution with CUDA.
  • Koutsantonis, D. (2021). CUDA Memory Optimization.
  • Additional references in project report.

Acknowledgements

This project was completed as part of DPS921 at Seneca Polytechnic.

For detailed methodology, results, and analysis, see the full project report (dps921-final-project-report-Nadiia-Geras.pdf).
