Skip to content

temporal-hpc/GPU-dynamic-parallelism

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GPU Dynamic Parallelism

This program serves as an experimental tool for comparing the efficiency of the GPU under heterogeneous parallel workloads. It uses the Mandelbrot Set as case study, offering three GPU-based approaches:

  • Exhaustive approach: The classic flat-kernel parallel pass
  • CUDA Dynamic Parallelism: Nvidia's recursive kernel approach
  • Adaptive Serial Kernels: An alternative iterative-kernel approach (two variants proposed)

Compile

➜  GPU-dynamic-parallelism git:(master) make
nvcc -O3 -arch=sm_75 -rdc=true -lcudadevrt -Xcompiler -fopenmp -lpng src/main.cu -o bin/mandelbrot -Dno -Dno -DBSY=32 -DBSX=32 -DREPEATS=1
➜  GPU-dynamic-parallelism git:(master) 

See the Makefile for additional options such as printing progress info (VERBOSE) or exporting the subdivision grid image (GRIDLINES).

Run

➜  GPU-dynamic-parallelism git:(master) ✗ bin/gpuDP

Execute as ./bin/gpuDP <Approach> <W> <H> <rmin> <rmax> <cmin> <cmax> <CA_MAXDWELL> <B> <g> <r> <MAX_DEPTH> <filename>
   Approach:
               0 - Exhaustive (classic one-pass approach)
               1 - Dynamic Parallelism (from Nvidia)
               2 - Adaptive Serial Kernels (one thread-block per region)
               3 - Adaptive Serial Kernels (multiple thread-blocks per region)
   -----------------------------------------------------------------------------------------------
   Parameters                  Example         Info
   -----------------------------------------------------------------------------------------------
   W ------------------------- 1024            Width in pixels.
   H ------------------------- 1024            Height in pixels.
   rmin rmax ----------------- [-1.5, 0.5]     Real part range.
   cmin cmax ----------------- [-1.0, 1.0]     Complex part range.
   CA_MAXDWELL --------------- 512             Maximum numbers of iterarions per element.
   B ------------------------- 32              Region Size for stopping subdivision (powers of 2).
   g ------------------------- 32              Initial numbers of regions (powers of 2).
   r ------------------------- 4               Subdivision scheme (powers of 2).
   MAX_DEPTH ----------------- 5               Maximum recursion depth.
   filename  ----------------- none            Chosen filename, 'none' to skip file output.

Example:

➜  GPU-dynamic-parallelism git:(master) ✗ ./bin/gpuDP 2 $((2**10)) $((2**10)) -1.5 0.5 -1.0 1.0 512    1 2 2  1000 example
Grid 1024 x 1024 --> 0.00 GiB
[level  1] 0.000190 secs -->  P_{ 1} = 1.000000   (grid        1 x 2 x 2 =        4 --> 4 subdivided)
[level  2] 0.000200 secs -->  P_{ 2} = 0.875000   (grid        4 x 2 x 2 =       16 --> 14 subdivided)
[level  3] 0.000186 secs -->  P_{ 3} = 0.892857   (grid       14 x 2 x 2 =       56 --> 50 subdivided)
[level  4] 0.000360 secs -->  P_{ 4} = 0.850000   (grid       50 x 2 x 2 =      200 --> 170 subdivided)
[level  5] 0.000837 secs -->  P_{ 5} = 0.795588   (grid      170 x 2 x 2 =      680 --> 541 subdivided)
[level  6] 0.001945 secs -->  P_{ 6} = 0.776802   (grid      541 x 2 x 2 =     2164 --> 1681 subdivided)
[level  7] 0.004726 secs -->  P_{ 7} = 0.724569   (grid     1681 x 2 x 2 =     6724 --> 4872 subdivided)
[level  8] 0.013105 secs -->  P_{ 8} = 0.704023   (grid     4872 x 2 x 2 =    19488 --> 13720 subdivided)
<level  9> 0.036247 secs -->  P_{ 9} = 0.000000   (grid    13720 x 2 x 2 =    54880 --> 0 subdivided)
2, 32, 32, 1024, 1024, 512, 1000, 2, 1, 0.057932

Citation

@article{QUEZADA2023239,
title = {Modeling GPU Dynamic Parallelism for self similar density workloads},
journal = {Future Generation Computer Systems},
volume = {145},
pages = {239-253},
year = {2023},
issn = {0167-739X},
doi = {https://doi.org/10.1016/j.future.2023.03.046},
url = {https://www.sciencedirect.com/science/article/pii/S0167739X23001292},
author = {Felipe A. Quezada and Cristóbal A. Navarro and Miguel Romero and Cristhian Aguilera},
keywords = {GPU, Dynamic Parallelism, Subdivision, Heterogeneous workload, Kernel recursion overhead, Self similar density},
abstract = {Dynamic Parallelism (DP) is a GPU programming abstraction that can make parallel computation more efficient for problems that exhibit heterogeneous workloads. With DP, GPU threads can launch kernels with more threads, recursively, producing a subdivision effect where resources are focused on the regions that exhibit more parallel work. Doing an optimal subdivision process is not trivial, as the combination of different parameters play a relevant role in the final performance of DP. Also, the current programming abstraction of DP relies on kernel recursion, which has performance overhead. This work presents a new subdivision cost model for problems that exhibit self similar density (SSD) workloads, useful for finding efficient subdivision schemes. Also, a new subdivision implementation free of recursion overhead is presented, named Adaptive Serial Kernels (ASK). Using the Mandelbrot set as a case study, the cost model shows that optimal performance is achieved when using {g∼32,r∼2,B∼32} for the initial subdivision, recurrent subdivision and stopping size, respectively. Experimental results agree with the theoretical parameters, confirming the usability of the cost model. In terms of performance, the ASK approach runs up to ∼60% faster than DP in the Mandelbrot set, and up to 12× faster than a basic exhaustive implementation, whereas DP is up to 7.5× faster. In terms of energy efficiency, ASK is up to ∼2× and ∼20× more energy efficient than DP and the exhaustive approach, respectively. These results put the subdivision cost model and the ASK approach as useful tools for analyzing the potential improvement of subdivision based approaches and for developing more efficient GPU-based libraries or fine-tune specific codes in research teams.}
}

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •