This OpenCL model of the bitonic sorting algorithm is specifically optimized to be synthesized by SDAccel targeting Xilinx FPGAs. The original source code had been written by NVidia for its GPUs. Our optimizations improved the performance of the original source code, when run on an FPGA, by several orders of magnitude.
This software contains source code provided by NVIDIA Corporation. The original OpenCL source is distributed with the NVIDIA OpenCL examples repository. Extensive modifications and optimizations were performed by Mehdi Roozmeh, of Politecnico di Torino, Italy, in order to generate high performance RTL.
Sorting vectors is a fundamental algorithm used in a variety of high-performance data center applications. Among various possible version of sorting solutions, the bitonic algorithm is one of the fastest sorting networks. In general the term sorting network identifies a sorting algorithm where the sequence of comparisons is not data-dependent, thus making it suitable for hardware implementation. The bitonic sorting network consists of comparators, which are the basic block of any sorting network and sort a pair of values presents on inputs.
The figure shows a simple sorting network with five comparators and four inputs. Each comparator puts the higher value on the bottom output and the lower value on the top output. Two comparators on the left hand side and two in the middle can work in parallel, requiring three time steps.
The depth and number of comparators is a key parameter to evaluate the performance of a sorting network. The maximum number of comparators along any path is the depth of the sorting network. Assuming that all the comparisons at each level are done in parallel, the depth of the sorting network is equal to the number of stages and thus proportional to the total execution time. The bitonic merge sort network is one of the fastest comparison sorting networks, where the following formulas representing the depth and number of comparators:
D(N)= (log2 N.(log2 N+1)) / 2 ----> Depth of sorting network
C(N)= (N.log 2 N (log2N+1)) / 4 ----> Number of Comparator
The following figure illustrates a Bitonic Merge sort network with eight inputs (N=8). It operates in 3 stages, it has a depth of 6 steps and employs 24 comparators.
![sorting_network] (https://github.com/mediroozmeh/Bitonic-Sorting/blob/master/Figures/SORTINGNETWOR.jpg)
Divide and conquer is the principle of the merge sort algorithm. It is based on the notion of bitonic sequence, i.e. a sequence of N elements in which the first K elements are sorted in ascending order, and the last N-K elements are sorted in descending order (i.e. the K-th element acts as a divider between two sub-lists, each sorted in a different direction), or some circular shift of such an order.
Bitonic sort first divides the input into pairs and then sorts each pair into a bitonic sequence. It then merges (and sorts) two adjacent bitonic sequences, and repeats this process through all stages until the entire sequence is stored.
sdaccel.tcl : This tcl file is used to run software simulation, hardware emulation and synthesize the source code. Furthermore, synthesis constraints such as the maximum memory ports and the number of compute units for each kernel, are added to the design using this tcl file. Set to 1 the __debug__flag in this file to compile and run the kernels with gdb. Please note that this application does not work with SDAccel 2016.2 (it hangs when executing the clCreateProgramWithBinary call while performing HW emulation).
BitonicSort.cl : This file includes the three kernels which model bitonic sorting. The original code included a fourth kernel, to be used for small input arrays, which has not been optimized since it is not significant for performance.
hostcode.cpp: This file provides inputs to the kernels, executes them in the right sequence, and reads back the outputs. It also checks the correctness of the output.
param.h : This header file is shared between both host and FPGA code, and defines some parameters such as the sizes of global and local arrays, as described below. It contains two different parameter sets: one for a small data set, which compiles and simulates in under 5 minutes, and one for a large data set, which compiles in 15 minutes, and completes RTL simulation in about 4 hours. A larger work group size of course achieves better performance at the cost of more FPGA resources.
Key Parameters of the Bitonic Sorting Algorithm :
Parameter | Default Value | Description |
---|---|---|
DATA_SIZE | 4096 (128) | Total number of keys |
LOCAL_SIZE_LIMIT | 256 (32) | Twice the work group size |
DEBUG | undefined | If defined, print verbose output |
Off-chip memory access can be a serious bottleneck in any OpenCl application. SDAccel uses the async_work_group_copy OpenCL function to copy global to local memory and vice-versa using AXI bursts, thus improving the overall performance by taking advantage of the full bit width of the DDR3 interfaces.
SDAccel enables designers to take advantage of the parallel programming model of OpenCL by instantiating multiple work groups of same kernel separately and executing them in parallel. The FPGA parallel architecture can be exploited by this technique, due to improved overall memory bandwidth utilization and better coarse-grained level parallelism.
The use of specific high-level synthesis directives is necessary in order to achieve optimized RTL. In this OpenCl example we used unrolling, pipelining and memory partitioning in order to generate optimized RTL from essentially the same source code which is executed on the GPU.
SDAccel enables users to generate multiple RTL solutions from the same source code. We also executed the OpenCL code on two different GPU devices (GeForce GTX 960 and Quadro K4200). We present performance and energy consumption results. All the data refer to 4096 elements to be sorted and work group size 128 (i.e. LOCAL_SIZE_LIMIT 256).
Parameters/Devices | Virtex7 | GTX960 | K4200 |
---|---|---|---|
Total time (ms) | 17 | 16 | 25 |
Power(W) | 11 | 120 | 108 |
Energy(mj) | 187 | 1920 | 2700 |
LUT Utilization | 166740 (38 %) | - | - |
FF Utilization | 137210 (15 %) | - | - |
DSP Utilization | 160 (4.4 %) | - | - |
BRAMs Utilization | 1300 (44 %) | - | - |
Parameters/Devices | Virtex 7 | GTX960 | K4200 |
---|---|---|---|
Memory Bandwidth (GB/sec) | 34 | 112 | 173 |
Graphics Card Power (W) | - | 120 | 108 |
CUDA CORES | - | 1024 | 1344 |
#References [1] http://www.xilinx.com/support/documentation/sw_manuals/ug1207-sdaccel-performance-optimization.pdf
[2] Vukasin Rankovic, Anton Kos,"Performance of the Bitonic MergeSort Network on Dataflow Computer", Serbia, Belgrade, 2013
[3] http://www.xilinx.com/support/documentation/data_sheets/ds180_7Series_Overview.pdf
[5] https://www.micron.com/products/datasheets/f3a0d8c5-ce92-4641-b59a-6c0fb1826645
[6] http://www.xilinx.com/support/documentation/white_papers/wp375_HPC_Using_FPGAs.pdf
[7] http://www.xilinx.com/support/documentation/ip_documentation/ug586_7Series_MIS.pdf
[8] Qi Mu. Liqing Cui. Yufei Son, "The implementation and optimization of Bitonic sort algorithm based on CUDA" https://arxiv.org/pdf/1506.01446v1.pdf
[9] http://www.xilinx.com/support/documentation/white_papers/wp375_HPC_Using_FPGAs.pdf