codec63-gpu-opt

Implements transmission of frames from a client machine (Intel x86_64) to a server machine (NVIDIA Tegra Xavier, ARM64) and back, using a Dolphin PCIe adapter (PXH810) and the SISCI API. The server performs motion estimation and compensation with CUDA for GPU acceleration. DCT/iDCT is optimized with SIMD instructions and a thread pool that computes DCT/iDCT rows.

Authors: Kjetil & Johannes

Features

Parallelized SAD computation for fast block matching.
Optimized memory usage with shared memory in CUDA.
Efficient reduction to determine the best motion vector.
DCT/iDCT with ARM NEON to utilize SIMD operations on CPU.
Thread pool for computing DCT and iDCT rows on multiple threads.
PCIe communication for transmitting raw input frames and encoded frames between machines.
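
To make the block-matching features above concrete, here is a minimal CPU reference sketch of SAD-based full search for one 8x8 block. It is not the project's CUDA kernel: the names `sad_8x8`, `find_best_mv`, and `motion_vector` are hypothetical, and the frame layout (one 8-bit plane with `stride == width`) is an assumption. On the GPU, each candidate offset could be evaluated by a separate thread, followed by a reduction that keeps the smallest SAD.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK 8 /* assumed block size for this sketch */

/* Hypothetical result type: best offset and its SAD. */
typedef struct { int dx, dy, sad; } motion_vector;

/* Sum of absolute differences between two 8x8 blocks in planes
 * that share the same stride. */
static int sad_8x8(const uint8_t *cur, const uint8_t *ref, int stride)
{
    int sum = 0;
    for (int y = 0; y < BLOCK; ++y)
        for (int x = 0; x < BLOCK; ++x)
            sum += abs(cur[y * stride + x] - ref[y * stride + x]);
    return sum;
}

/* Exhaustive search over offsets in [-range, range] around the block at
 * (bx, by). Candidates that fall outside the reference frame are skipped.
 * The first candidate with the smallest SAD wins. */
static motion_vector find_best_mv(const uint8_t *cur, const uint8_t *ref,
                                  int width, int height,
                                  int bx, int by, int range)
{
    assert(cur && ref && range >= 0);
    motion_vector best = { 0, 0, INT_MAX };
    for (int dy = -range; dy <= range; ++dy) {
        for (int dx = -range; dx <= range; ++dx) {
            int rx = bx + dx, ry = by + dy;
            if (rx < 0 || ry < 0 || rx + BLOCK > width || ry + BLOCK > height)
                continue;
            int sad = sad_8x8(cur + by * width + bx,
                              ref + ry * width + rx, width);
            if (sad < best.sad) {
                best.dx = dx;
                best.dy = dy;
                best.sad = sad;
            }
        }
    }
    return best;
}
```

Calling `find_best_mv(cur, ref, width, height, bx, by, range)` for each block position yields one motion vector per block; the candidate offsets are independent of each other, which is what makes the search straightforward to parallelize.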

Usage

Set up CUDA paths (for NVIDIA Tegra with CUDA 11.4):

PATH=$PATH:/usr/local/cuda-11.4/bin
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.4/lib64:/lib

export PATH
export LD_LIBRARY_PATH

Set up CUDA paths (for Intel x86_64 with CUDA 12.8):

PATH=$PATH:/usr/local/cuda-12.8/bin
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.8/lib64:/lib

export PATH
export LD_LIBRARY_PATH

Ensure the CUDA Toolkit is installed on your system. You also need a .yuv file to encode.

Compile with the run script for Home Exam 3:

./run.sh --tegra tegra-[machine_number]

OR

Compile with CMake:

mkdir build
cd build
cmake ..
make

Encode with:

./c63enc <FILE_TO_BE_ENCODED> output.c63

Decode with:

./c63dec output.c63 final.yuv

Test motion estimation with:

./c63pred output.c63 pred.yuv

Run testing with bench.sh (from the build folder):

../../script/bench.sh

Results

Motion Estimation and Compensation on discrete GPU

Optimization (threads)	Time (s)	Time per frame (ms)
Baseline (Host only)	120.502	401.674
Opt_1: Naive (8x8)	28.791	95.968
Opt_2: Parallel (32x32)	22.254	74.179
Opt_3: Parallel w/c (32x32)	22.061	73.536
Opt_4: Parallel w/c (16x16)	21.182	70.606
Opt_5: Parallel w/c (8x8)	20.895	69.648

Table: Results for encoding all 300 frames of foreman.yuv. Each configuration was run 10 times, and the reported values are the averages. All tests were run on x86-1 (Intel Core i5-4590, Quadro K2200 GPU with compute capability 5.0, 8 GB RAM).
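
As a rough check on how the two columns relate, the per-frame time is the total time divided by the 300 frames, and the overall speedup is the baseline time divided by the time of the fastest configuration (Opt_5). The rounding here is ours:

```latex
\frac{120.502\ \text{s}}{300\ \text{frames}} \approx 401.67\ \text{ms/frame},
\qquad
\frac{120.502\ \text{s}}{20.895\ \text{s}} \approx 5.77\times
```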

DCT/iDCT on NVIDIA Tegra (SoC)

Optimization	Time (s)	Time/Frame (ms)
Baseline (Home Exam 1)	21.642	70.866
SoC Dram shared	24.631	82.103
Scale Vectorized (not pre)	24.496	81.654
Scale Vectorized (pre)	24.383	81.276
Quant divide	23.775	79.251
Quant multiply	23.826	79.421
Dequant divide	24.017	80.058
Dequant multiply	23.690	78.969
DCT_2D	19.362	64.541
iDCT_2D	15.875	52.917
Quantize DCT Threaded	13.266	44.220
Dequantize iDCT Threaded	8.769	29.229
Fixed core set	8.607	28.691
With 6 Cores	7.528	25.094
Dual FPU DCT_2D	7.629	25.431
Dual FPU iDCT_2D	7.278	24.261
Prefetch + Unroll iDCT/DCT	7.364	24.549
Prefetch + Unroll iDCT only	7.154	23.847
Thread Pool	6.532	21.773
FP16 DCT 2D	6.525	21.749
FP16 iDCT 2D	5.992	19.973
Memory Free and allocate Fix	4.574	15.249
Release	2.977	9.924

Table: Results for encoding all 300 frames of foreman.yuv. Each configuration was run 10 times, and the reported values are the averages. Computed on an NVIDIA Tegra Xavier 32 GB with compute capability 7.2 and ARMv8.2 NEON support (Tegra 3).
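
The DCT_2D and threaded rows in the table rely on the 2D DCT being separable into row passes. The following scalar sketch illustrates that structure under stated assumptions: it uses an orthonormal 8-point DCT-II with a small cosine lookup table, and the names `cos16`, `dct_1d`, `transpose_8x8`, and `dct_2d` are hypothetical. It is not the project's NEON code. Each call to `dct_1d` inside a pass touches only its own row, so the eight rows of a pass could be vectorized or handed to worker threads independently.

```c
#define N 8 /* block is N x N samples */

/* cos(j * pi / 16) for j = 0..8. */
static const float c16[9] = {
    1.0f, 0.98078528f, 0.92387953f, 0.83146961f, 0.70710678f,
    0.55557023f, 0.38268343f, 0.19509032f, 0.0f
};

/* cos(m * pi / 16) for any m >= 0, using the symmetries of cosine. */
static float cos16(int m)
{
    m %= 32;                            /* period is 2*pi              */
    if (m > 16) m = 32 - m;             /* cos(2*pi - t) =  cos(t)     */
    return m > 8 ? -c16[16 - m] : c16[m]; /* cos(pi - t)  = -cos(t)    */
}

/* 1D DCT-II of one 8-sample row with orthonormal scaling. */
static void dct_1d(const float *in, float *out)
{
    for (int k = 0; k < N; ++k) {
        float sum = 0.0f;
        for (int n = 0; n < N; ++n)
            sum += in[n] * cos16((2 * n + 1) * k);
        /* sqrt(1/8) for the DC term, sqrt(2/8) = 0.5 otherwise */
        out[k] = sum * (k == 0 ? 0.35355339f : 0.5f);
    }
}

static void transpose_8x8(const float *in, float *out)
{
    for (int y = 0; y < N; ++y)
        for (int x = 0; x < N; ++x)
            out[x * N + y] = in[y * N + x];
}

/* 2D DCT as: row pass, transpose, row pass, transpose.
 * The rows within each pass are independent of each other. */
static void dct_2d(const float *in, float *out)
{
    float a[N * N], b[N * N];
    for (int r = 0; r < N; ++r)
        dct_1d(in + r * N, a + r * N);
    transpose_8x8(a, b);
    for (int r = 0; r < N; ++r)
        dct_1d(b + r * N, a + r * N);
    transpose_8x8(a, out);
}
```

With this scaling, a constant block of value v produces a DC coefficient of 8v and (up to rounding) zeros elsewhere, which is a quick sanity check for any optimized variant.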

SISCI Dolphin PCIe

Optimization	Time (s)	Time/Frame (ms)
Baseline (Home Exam 1)	21.642	70.866
Home Exam 2	2.977	9.924
Three frames in pipeline	16.64	55.4

Table: Results for encoding all 300 frames of foreman.yuv. Each configuration was run 10 times, and the reported values are the averages.

Metrics	Average (ms)	std. (ms)
Encoder Waiting for Frames	25.33	45.94
Client Read to Written frame	82.78	40.25

Table: Timing metrics measured while encoding all 300 frames of foreman.yuv.

Metrics	Average (ms)	std. (ms)
Encoder Waiting for Frames	0.733	19.196
Client Read to Written frame	404.828	38.476

Table: Timing metrics measured while encoding all 690 frames of tractor.yuv.
