Improve the performance of video compression using discrete Nvidia GPUs on standard x86 computers. Optimized memory usage with shared memory in CUDA. DCT/iDCT with ARM NEON to utilize SIMD. PCIe communication to send frames between computers.


codec63-gpu-opt

Implements transmission of frames from a client machine (Intel x86_64) to a server machine (NVIDIA Tegra Xavier, ARM64) and back, using a Dolphin PCIe adapter (PXH810) and the SISCI API. The server performs motion estimation and compensation with CUDA for GPU acceleration. DCT/iDCT is optimized with SIMD instructions, and a thread pool computes DCT/iDCT rows in parallel.
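The row-level DCT/iDCT parallelism mentioned above can be sketched as a plain pthreads work split. This is only an illustration of the strided row division: the project's actual thread pool keeps workers alive across frames, and `process_row`, `ROWS`, and `NTHREADS` here are illustrative assumptions, not the project's code.

```c
#include <pthread.h>

#define ROWS 64         /* illustrative number of 8-sample rows */
#define NTHREADS 4

struct row_job {
    float *rows;        /* ROWS x 8 samples, row-major */
    int first, step;    /* strided subset: first, first+step, ... */
};

/* Stand-in for one 1-D DCT/iDCT pass over a row of 8 samples. */
static void process_row(float *row)
{
    for (int i = 0; i < 8; ++i)
        row[i] *= 2.0f;
}

/* Each worker processes every NTHREADS-th row, so the rows of a
   frame slice are divided evenly across the pool. */
static void *row_worker(void *arg)
{
    struct row_job *job = arg;
    for (int r = job->first; r < ROWS; r += job->step)
        process_row(&job->rows[r * 8]);
    return NULL;
}
```

A strided split keeps the bookkeeping trivial (no work queue needed) at the cost of slightly less flexible load balancing than a task queue.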

Authors: Kjetil & Johannes

Features

  • Parallelized SAD computation for fast block matching.
  • Optimized memory usage with shared memory in CUDA.
  • Efficient reduction to determine the best motion vector.
  • DCT/iDCT with ARM NEON to utilize SIMD operations on CPU.
  • Thread pool for computing DCT and iDCT rows on multiple threads.
  • PCIe communication for transmitting raw input frames and encoded frames between machines.
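As a reference for the first feature, this is the scalar sum-of-absolute-differences metric that block matching minimizes; the CUDA kernel evaluates it for many candidate positions in parallel rather than one at a time. A minimal sketch, not the project's kernel code:

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between an 8x8 block of the current
   frame and a candidate 8x8 block of the reference frame. Motion
   estimation picks the candidate position with the smallest SAD. */
int sad_8x8(const uint8_t *block, const uint8_t *ref, int stride)
{
    int sad = 0;
    for (int v = 0; v < 8; ++v)
        for (int u = 0; u < 8; ++u)
            sad += abs(block[v * stride + u] - ref[v * stride + u]);
    return sad;
}
```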

Usage

Setup CUDA paths (for NVIDIA Tegra with CUDA 11.4):

PATH=$PATH:/usr/local/cuda-11.4/bin
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.4/lib64:/lib

export PATH
export LD_LIBRARY_PATH

Setup CUDA paths (for Intel x86_64 with CUDA 12.8):

PATH=$PATH:/usr/local/cuda-12.8/bin
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.8/lib64:/lib

export PATH
export LD_LIBRARY_PATH

Ensure the CUDA Toolkit is installed on your system. You also need a .yuv file to encode.

  1. Compile with the run script for Home Exam 3:
./run.sh --tegra tegra-[machine_number]

OR compile with CMake:
mkdir build
cd build
cmake ..
make
  2. Encode with:
./c63enc <FILE_TO_BE_ENCODED> output.c63
  3. Decode with:
./c63dec output.c63 final.yuv
  4. Test motion estimation with:
./c63pred output.c63 pred.yuv

Run testing with bench.sh (from the build folder):

../../script/bench.sh 

Results

Motion Estimation and Compensation on discrete GPU

| Optimization (threads) | Time (s) | Time per frame (ms) |
| --- | --- | --- |
| Baseline (host only) | 120.502 | 401.674 |
| Opt_1: Naive (8x8) | 28.791 | 95.968 |
| Opt_2: Parallel (32x32) | 22.254 | 74.179 |
| Opt_3: Parallel w/c (32x32) | 22.061 | 73.536 |
| Opt_4: Parallel w/c (16x16) | 21.182 | 70.606 |
| Opt_5: Parallel w/c (8x8) | 20.895 | 69.648 |

Table: Results of encoding all 300 frames of foreman.yuv. Each configuration was run 10 times; reported values are averages. All tests were conducted on x86-1 (Intel Core i5-4590, Quadro K2200 GPU, 8 GB RAM, compute capability 5.0).
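The parallel variants above evaluate many candidate SADs per CUDA block and then reduce them to the minimum to pick the best motion vector. The reduction pattern can be shown sequentially in plain C; in the kernel each `i` of the inner loop is a thread and each `stride` step is separated by a barrier. A sketch of the pattern only; the kernel's layout and names are assumptions:

```c
/* Tree reduction to the minimum SAD, carrying the candidate index
   along so the winning motion vector can be recovered. n must be a
   power of two; at each step the upper half is folded into the
   lower half, halving the active range. */
void reduce_min_sad(int *sad, int *idx, int n)
{
    for (int stride = n / 2; stride > 0; stride /= 2)
        for (int i = 0; i < stride; ++i)
            if (sad[i + stride] < sad[i]) {
                sad[i] = sad[i + stride];
                idx[i] = idx[i + stride];
            }
    /* After the loop, sad[0] is the minimum and idx[0] its index. */
}
```

In shared memory this takes log2(n) barrier-separated steps instead of a serial scan over all n candidates.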

DCT/iDCT on NVIDIA Tegra (SoC)

| Optimization | Time (s) | Time/Frame (ms) |
| --- | --- | --- |
| Baseline (Home Exam 1) | 21.642 | 70.866 |
| SoC DRAM shared | 24.631 | 82.103 |
| Scale vectorized (not pre) | 24.496 | 81.654 |
| Scale vectorized (pre) | 24.383 | 81.276 |
| Quant divide | 23.775 | 79.251 |
| Quant multiply | 23.826 | 79.421 |
| Dequant divide | 24.017 | 80.058 |
| Dequant multiply | 23.690 | 78.969 |
| DCT_2D | 19.362 | 64.541 |
| iDCT_2D | 15.875 | 52.917 |
| Quantize DCT threaded | 13.266 | 44.220 |
| Dequantize iDCT threaded | 8.769 | 29.229 |
| Fixed core set | 8.607 | 28.691 |
| With 6 cores | 7.528 | 25.094 |
| Dual FPU DCT_2D | 7.629 | 25.431 |
| Dual FPU iDCT_2D | 7.278 | 24.261 |
| Prefetch + unroll iDCT/DCT | 7.364 | 24.549 |
| Prefetch + unroll iDCT only | 7.154 | 23.847 |
| Thread pool | 6.532 | 21.773 |
| FP16 DCT_2D | 6.525 | 21.749 |
| FP16 iDCT_2D | 5.992 | 19.973 |
| Memory free/allocate fix | 4.574 | 15.249 |
| Release | 2.977 | 9.924 |

Table: Results of encoding all 300 frames of foreman.yuv. Each configuration was run 10 times; reported values are averages. Computed on an NVIDIA Tegra Xavier (32 GB, compute capability 7.2, ARMv8.2 NEON support).

SISCI Dolphin PCIe

| Optimization | Time (s) | Time/Frame (ms) |
| --- | --- | --- |
| Baseline (Home Exam 1) | 21.642 | 70.866 |
| Home Exam 2 | 2.977 | 9.924 |
| Three frames in pipeline | 16.64 | 55.4 |

Table: Results of encoding all 300 frames of foreman.yuv. Each configuration was run 10 times; reported values are averages.
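The "three frames in pipeline" configuration lets the client write the next raw frame over PCIe while the encoder is still working on the current one. The slot bookkeeping for such a pipeline can be sketched with two monotonic counters; the names and the counter approach here are illustrative assumptions, not the project's SISCI code:

```c
/* Three in-flight frame slots shared between the client (producer)
   and the encoder (consumer). The difference between the counters
   is the number of frames currently in the pipeline. */
#define SLOTS 3

struct pipeline {
    long written;   /* frames the client has written so far */
    long encoded;   /* frames the encoder has consumed so far */
};

int can_write(const struct pipeline *p)  { return p->written - p->encoded < SLOTS; }
int can_encode(const struct pipeline *p) { return p->written > p->encoded; }

/* Claim the next slot for writing / encoding; slots are reused
   round-robin, so transfer of frame n+1 overlaps encoding of n. */
int write_slot(struct pipeline *p)  { return (int)(p->written++ % SLOTS); }
int encode_slot(struct pipeline *p) { return (int)(p->encoded++ % SLOTS); }
```

In the real system the counters would be updated through the shared PCIe segment (or interrupts) so both machines agree on slot ownership.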

| Metric | Average (ms) | Std. dev. (ms) |
| --- | --- | --- |
| Encoder waiting for frames | 25.33 | 45.94 |
| Client read-to-written frame | 82.78 | 40.25 |

Table: Latency metrics for encoding all 300 frames of foreman.yuv.

| Metric | Average (ms) | Std. dev. (ms) |
| --- | --- | --- |
| Encoder waiting for frames | 0.733 | 19.196 |
| Client read-to-written frame | 404.828 | 38.476 |

Table: Latency metrics for encoding all 690 frames of tractor.yuv.
