This project focuses on accelerating a neural network for the MNIST classification task using GPU programming with CUDA. We begin with a sequential CPU implementation (V1) and progressively optimize it on the GPU: a naive port (V2), a tuned CUDA version (V3), a tensor-core version (V4), and an OpenACC port (V5). The key goal is to gain hands-on experience in parallel computing, high-performance computing (HPC), and CUDA optimization.
```
├── src
│   ├── V1        # Baseline sequential implementation
│   ├── V2        # Naive GPU implementation
│   ├── V3        # Optimized GPU implementation with performance improvements
│   ├── V4        # Optimized GPU implementation utilizing tensor cores
│   ├── V5        # Optimized GPU implementation using OpenACC
├── data          # Contains the MNIST dataset
├── report        # Project report
├── slides        # Presentation slides
├── README.md     # Project documentation and instructions
```
Prerequisites:

- NVIDIA GPU with CUDA support
- CUDA Toolkit installed
- `nvcc` compiler available
- `make` utility installed
Navigate to the `src` directory and run:

```
make
```

This will compile the project and generate an executable located at `build/nn.exe`.
To execute the program, run:

```
make run
```

This will execute the compiled neural network and move profiling data if available.
To run the profiling and analysis targets:

```
make prof-run
make nsight-analyze
make speedup
```

These targets generate profiling data for performance analysis.
To remove all compiled files and reset the build directory:

```
make clean
```

Source files:

- `main.cu`: Entry point for the neural network execution.
- `neural_net.cu`: Core implementation of the neural network.
- `utils.cu`: Utility functions for matrix operations and timers.
- `mnist.cu`: MNIST dataset handling functions.
- `nn.h`: Header file defining neural network parameters.
- `utils.h`: Header file defining helper functions for matrix operations and timing.
- `speedup_analysis.c`: Compares all versions and reports a speedup analysis.
Each version of the project applies different optimization techniques:
**V1: Baseline sequential implementation**

- Sequential execution on the CPU.
- No parallelism or GPU acceleration (a minimal sketch of the core loop is shown below).
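As a point of reference, the following is a minimal sketch of the kind of sequential dense-layer computation the baseline performs. The function and parameter names are illustrative only and do not necessarily match the actual code in `neural_net.cu`; row-major float matrices are assumed.

```cpp
#include <cstddef>

// Hypothetical sketch: sequential dense-layer forward pass, C = A * B + bias,
// for row-major float matrices (M x K times K x N). Activation is applied elsewhere.
void dense_forward_cpu(const float* A, const float* B, const float* bias,
                       float* C, size_t M, size_t K, size_t N) {
    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; ++j) {
            float acc = bias[j];
            for (size_t k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];   // triple nested loop, no parallelism
            }
            C[i * N + j] = acc;
        }
    }
}
```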
**V2: Naive GPU implementation**

- Converts matrix operations to CUDA kernels.
- Parallel execution, but without further optimizations (see the naive kernel sketch below).
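A minimal sketch of what "naive" means here: one thread per output element, with every operand read straight from global memory. The kernel and parameter names are hypothetical, not the project's actual ones.

```cuda
// Hypothetical sketch: naive CUDA matrix multiply, C = A * B,
// row-major float matrices, one thread per output element.
__global__ void matmul_naive(const float* A, const float* B, float* C,
                             int M, int K, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) {
            acc += A[row * K + k] * B[k * N + col];   // uncached global loads each iteration
        }
        C[row * N + col] = acc;
    }
}

// Typical launch configuration:
// dim3 block(16, 16);
// dim3 grid((N + 15) / 16, (M + 15) / 16);
// matmul_naive<<<grid, block>>>(dA, dB, dC, M, K, N);
```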
**V3: Optimized GPU implementation**

- Optimized kernel launch configuration.
- Improved occupancy and memory usage.
- Reduced communication overhead.
- Efficient memory hierarchy utilization.
- Utilizes CUDA streams.
- Utilizes pinned (page-locked) host memory (illustrated in the sketch after this list).
- Initialization shifted to the kernel side.
- Combines multiple small kernels.
- Utilizes shared memory.
- Uses optimized compiler flags.
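The sketch below shows the streams-plus-pinned-memory pattern in isolation: pinned host buffers allow `cudaMemcpyAsync` to be truly asynchronous, and two streams let transfers for one batch overlap with kernel work for another. The kernel body, buffer layout, and batch handling are hypothetical and only stand in for the project's real pipeline.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Placeholder kernel standing in for the real per-batch work.
__global__ void forward_batch(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Hypothetical sketch: process n_batches batches, overlapping H2D copy,
// kernel execution, and D2H copy across two CUDA streams.
void run_batches(const float* src, float* dst, int n_batches, int batch_elems) {
    const int n_streams = 2;
    size_t bytes = (size_t)batch_elems * sizeof(float);

    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost(&h_in,  n_streams * bytes);   // pinned host buffers enable
    cudaMallocHost(&h_out, n_streams * bytes);   // asynchronous copies
    cudaMalloc(&d_in,  n_streams * bytes);
    cudaMalloc(&d_out, n_streams * bytes);

    cudaStream_t streams[n_streams];
    for (int s = 0; s < n_streams; ++s) cudaStreamCreate(&streams[s]);

    for (int b = 0; b < n_batches; ++b) {
        int s = b % n_streams;
        cudaStreamSynchronize(streams[s]);       // buffer slot s is free again
        if (b >= n_streams) {                    // drain the result that last used this slot
            std::memcpy(dst + (size_t)(b - n_streams) * batch_elems,
                        h_out + s * batch_elems, bytes);
        }
        std::memcpy(h_in + s * batch_elems, src + (size_t)b * batch_elems, bytes);

        cudaMemcpyAsync(d_in + s * batch_elems, h_in + s * batch_elems,
                        bytes, cudaMemcpyHostToDevice, streams[s]);
        forward_batch<<<(batch_elems + 255) / 256, 256, 0, streams[s]>>>(
            d_in + s * batch_elems, d_out + s * batch_elems, batch_elems);
        cudaMemcpyAsync(h_out + s * batch_elems, d_out + s * batch_elems,
                        bytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();
    for (int b = n_batches - n_streams; b < n_batches; ++b) {   // drain the tail
        if (b < 0) continue;
        std::memcpy(dst + (size_t)b * batch_elems,
                    h_out + (b % n_streams) * batch_elems, bytes);
    }

    for (int s = 0; s < n_streams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    cudaFree(d_in); cudaFree(d_out);
}
```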
**V4: Tensor-core GPU implementation**

- Utilizes Tensor Cores for matrix multiplications.
- Further speedup through specialized CUDA libraries (a WMMA-based sketch follows below).
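The sketch below illustrates the WMMA path to Tensor Cores (compute capability 7.0+): each warp accumulates one 16x16 tile of `C = A * B` in FP32 from half-precision inputs. It is a simplified, hypothetical example; the same effect can also be obtained through cuBLAS with tensor-op math, and the project's actual kernels may differ.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Hypothetical sketch: one warp per 16x16 output tile of C = A * B.
// A (MxK) and B (KxN) are half precision, row-major; M, N, K are multiples of 16.
__global__ void wmma_matmul(const half* A, const half* B, float* C,
                            int M, int N, int K) {
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int tiles_per_row = N / 16;
    int tile_row = warp_id / tiles_per_row;
    int tile_col = warp_id % tiles_per_row;
    if (tile_row * 16 >= M) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + tile_row * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + tile_col * 16, N);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // Tensor Core MMA
    }
    wmma::store_matrix_sync(C + tile_row * 16 * N + tile_col * 16, c_frag,
                            N, wmma::mem_row_major);
}

// Launch with (M/16)*(N/16) warps in total, e.g. blockDim.x = 128 (4 warps per block).
```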
**V5: OpenACC implementation**

- Directive-based parallelism.
- Quick porting and hardware abstraction (see the directive sketch below).
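For contrast with the hand-written CUDA versions, the sketch below expresses the same dense-layer loop with OpenACC directives. With an OpenACC-capable compiler (for example `nvc -acc`), the loops are offloaded to the GPU without explicit kernel or memory-transfer code. The function name and loop structure are hypothetical.

```c
// Hypothetical sketch: dense-layer forward pass offloaded via OpenACC directives.
void dense_forward_acc(const float* A, const float* B, float* C,
                       int M, int K, int N) {
    #pragma acc parallel loop collapse(2) copyin(A[0:M*K], B[0:K*N]) copyout(C[0:M*N])
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            #pragma acc loop seq
            for (int k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}
```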
Authors:

- Umer Farooq
- Muhammad Irtaza Khan