Implements transmission of frames from a client (Intel x86_64) machine to a server (NVIDIA Tegra Xavier, ARM64) machine and back, using a Dolphin PCIe adapter (PXH810) and the SCI API. The server performs motion estimation and compensation on the GPU using CUDA. DCT/iDCT is optimized with SIMD instructions, and a thread pool computes the DCT/iDCT rows in parallel.
Authors: Kjetil & Johannes
- Parallelized SAD computation for fast block matching.
- Optimized memory usage with shared memory in CUDA.
- Efficient reduction to determine the best motion vector.
- DCT/iDCT with ARM NEON to utilize SIMD operations on CPU.
- Thread pool for computing DCT and iDCT rows on multiple threads.
- PCIe communication for transmitting raw input frames and encoded frames between machines.
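To make the block-matching features above concrete, here is a minimal CPU reference sketch in C of what the SAD search and the final reduction compute. It is not the project's CUDA kernel: the names `sad_8x8`, `best_motion_vector`, and `motion_vector`, the 8x8 block size, and the exhaustive search are assumptions made for illustration. On the GPU, each candidate offset can be evaluated by a separate thread, and the "keep the smallest SAD" step becomes a parallel reduction.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK 8 /* assumed macroblock size */

/* Hypothetical result type: offset into the reference frame and its SAD. */
typedef struct {
    int dx, dy;
    unsigned sad;
} motion_vector;

/* Sum of absolute differences between two 8x8 blocks sharing one row stride. */
unsigned sad_8x8(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sum = 0;
    for (int y = 0; y < BLOCK; ++y)
        for (int x = 0; x < BLOCK; ++x)
            sum += (unsigned)abs((int)cur[y * stride + x] - (int)ref[y * stride + x]);
    return sum;
}

/* Exhaustive search within +/-range pixels around the block at (bx, by).
 * Candidates that fall outside the reference frame are skipped. */
motion_vector best_motion_vector(const uint8_t *cur, const uint8_t *ref,
                                 int width, int height,
                                 int bx, int by, int range)
{
    assert(bx >= 0 && by >= 0 && bx + BLOCK <= width && by + BLOCK <= height);
    motion_vector best = {0, 0, UINT_MAX};
    for (int dy = -range; dy <= range; ++dy) {
        for (int dx = -range; dx <= range; ++dx) {
            int rx = bx + dx, ry = by + dy;
            if (rx < 0 || ry < 0 || rx + BLOCK > width || ry + BLOCK > height)
                continue;
            unsigned sad = sad_8x8(cur + by * width + bx,
                                   ref + ry * width + rx, width);
            if (sad < best.sad) /* sequential form of the min-reduction */
                best = (motion_vector){dx, dy, sad};
        }
    }
    return best;
}
```

Calling `best_motion_vector(cur, ref, width, height, bx, by, range)` for every block of the current frame yields one motion vector per block; a candidate with SAD 0 is an exact match.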
Set up CUDA paths (for NVIDIA Tegra with CUDA 11.4):
PATH=$PATH:/usr/local/cuda-11.4/bin
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.4/lib64:/lib
export PATH
export LD_LIBRARY_PATH
Set up CUDA paths (for Intel x86_64 with CUDA 12.8):
PATH=$PATH:/usr/local/cuda-12.8/bin
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.8/lib64:/lib
export PATH
export LD_LIBRARY_PATH
Ensure you have the CUDA Toolkit installed on your system. You also need a .yuv file to encode.
- Compile with run script for Home Exam 3:
./run.sh --tegra tegra-[machine_number]
OR
- Compile with CMake:
mkdir build
cd build
cmake ..
make
- Encode with:
./c63enc <FILE_TO_BE_ENCODED> output.c63
- Decode with:
./c63dec output.c63 final.yuv
- Test Motion Estimation with:
./c63pred output.c63 pred.yuv
Run the benchmark with bench.sh (from the build folder):
../../script/bench.sh
| Optimization (threads) | Time (s) | Time per frame (ms) |
|---|---|---|
| Baseline (Host only) | 120.502 | 401.674 |
| Opt_1: Naive (8x8) | 28.791 | 95.968 |
| Opt_2: Parallel (32x32) | 22.254 | 74.179 |
| Opt_3: Parallel w/c (32x32) | 22.061 | 73.536 |
| Opt_4: Parallel w/c (16x16) | 21.182 | 70.606 |
| Opt_5: Parallel w/c (8x8) | 20.895 | 69.648 |
Table: Results on foreman.yuv, encoding all 300 frames. Each configuration was run 10 times, and the reported values are the averages. All tests were conducted on x86-1 (Intel Core i5-4590, Quadro K2200 GPU, 8 GB RAM, compute capability 5.0).
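The per-frame column follows from the total time and the 300 frames of foreman.yuv. A small sketch of that arithmetic in C, with the hypothetical helpers `ms_per_frame` and `speedup` (names chosen here for illustration only):

```c
#include <assert.h>

/* Average time per frame in milliseconds, given the total time in seconds. */
double ms_per_frame(double total_seconds, int frames)
{
    assert(frames > 0);
    return total_seconds * 1000.0 / frames;
}

/* Speedup of an optimized run relative to a baseline run. */
double speedup(double baseline_seconds, double optimized_seconds)
{
    assert(optimized_seconds > 0.0);
    return baseline_seconds / optimized_seconds;
}
```

For example, `ms_per_frame(28.791, 300)` gives about 95.97 ms, matching the Opt_1 row, and `speedup(120.502, 20.895)` gives roughly 5.8x for Opt_5 over the host-only baseline. Small deviations in the last digit are expected because the table reports averages over 10 runs.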
| Optimization | Time (s) | Time/Frame (ms) |
|---|---|---|
| Baseline (Home Exam 1) | 21.642 | 70.866 |
| SoC Dram shared | 24.631 | 82.103 |
| Scale Vectorized (not pre) | 24.496 | 81.654 |
| Scale Vectorized (pre) | 24.383 | 81.276 |
| Quant divide | 23.775 | 79.251 |
| Quant multiply | 23.826 | 79.421 |
| Dequant divide | 24.017 | 80.058 |
| Dequant multiply | 23.690 | 78.969 |
| DCT_2D | 19.362 | 64.541 |
| iDCT_2D | 15.875 | 52.917 |
| Quantize DCT Threaded | 13.266 | 44.220 |
| Dequantize iDCT Threaded | 8.769 | 29.229 |
| Fixed core set | 8.607 | 28.691 |
| With 6 Cores | 7.528 | 25.094 |
| Dual FPU DCT_2D | 7.629 | 25.431 |
| Dual FPU iDCT_2D | 7.278 | 24.261 |
| Prefetch + Unroll iDCT/DCT | 7.364 | 24.549 |
| Prefetch + Unroll iDCT only | 7.154 | 23.847 |
| Thread Pool | 6.532 | 21.773 |
| FP16 DCT 2D | 6.525 | 21.749 |
| FP16 iDCT 2D | 5.992 | 19.973 |
| Memory Free and allocate Fix | 4.574 | 15.249 |
| Release | 2.977 | 9.924 |
Table: Results on foreman.yuv, encoding all 300 frames. Each configuration was run 10 times, and the reported values are the averages. Measured on an NVIDIA Tegra Xavier 32 GB with compute capability 7.2 and ARMv8.2 NEON support (Tegra 3).
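The threaded rows in the table above rely on splitting DCT/iDCT rows across CPU cores. The following C sketch shows only the row-partitioning idea with POSIX threads. It is a simplification under stated assumptions: it spawns workers per call instead of keeping a persistent thread pool, and `split_rows`, `process_rows_parallel`, `row_fn`, and the placeholder `double_row` are hypothetical names, not the project's code. In the real encoder the per-row callback would be the NEON 1D DCT or iDCT.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 16 /* assumed upper bound on worker threads */

typedef void (*row_fn)(float *row, int width);

typedef struct {
    float *data;
    int width;
    int row_begin, row_end; /* half-open range [row_begin, row_end) */
    row_fn fn;
} row_job;

/* Placeholder row transform standing in for a 1D DCT over one row. */
void double_row(float *row, int width)
{
    for (int x = 0; x < width; ++x)
        row[x] *= 2.0f;
}

/* Row range for worker i of `workers`, balanced to within one row. */
void split_rows(int rows, int workers, int i, int *begin, int *end)
{
    assert(rows >= 0 && workers > 0 && i >= 0 && i < workers);
    int base = rows / workers, extra = rows % workers;
    *begin = i * base + (i < extra ? i : extra);
    *end = *begin + base + (i < extra ? 1 : 0);
}

static void *run_job(void *arg)
{
    row_job *job = arg;
    for (int r = job->row_begin; r < job->row_end; ++r)
        job->fn(job->data + (size_t)r * job->width, job->width);
    return NULL;
}

/* Applies fn to every row using `workers` threads.
 * Returns 0 on success, -1 if a thread could not be created. */
int process_rows_parallel(float *data, int rows, int width,
                          int workers, row_fn fn)
{
    assert(workers > 0 && workers <= MAX_WORKERS);
    pthread_t threads[MAX_WORKERS];
    row_job jobs[MAX_WORKERS];
    int started = 0, status = 0;
    for (int i = 0; i < workers; ++i) {
        jobs[i] = (row_job){data, width, 0, 0, fn};
        split_rows(rows, workers, i, &jobs[i].row_begin, &jobs[i].row_end);
        if (pthread_create(&threads[i], NULL, run_job, &jobs[i]) != 0) {
            status = -1;
            break;
        }
        ++started;
    }
    for (int i = 0; i < started; ++i)
        pthread_join(threads[i], NULL);
    return status;
}
```

A call such as `process_rows_parallel(data, rows, width, 6, double_row)` mirrors the six-core configuration in the table; compile with `-pthread`. A persistent pool avoids the thread creation cost that this sketch pays on every call.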
| Optimizations | Time (s) | Time/Frame (ms) |
|---|---|---|
| Baseline (Home Exam 1) | 21.642 | 70.866 |
| Home Exam 2 | 2.977 | 9.924 |
| Three frames in pipeline | 16.64 | 55.4 |
Table: Results on foreman.yuv, encoding all 300 frames. Each configuration was run 10 times, and the reported values are the averages.
| Metric | Average (ms) | Std. dev. (ms) |
|---|---|---|
| Encoder Waiting for Frames | 25.33 | 45.94 |
| Client Read to Written frame | 82.78 | 40.25 |
Table: Timing metrics on foreman.yuv, encoding all 300 frames.
| Metric | Average (ms) | Std. dev. (ms) |
|---|---|---|
| Encoder Waiting for Frames | 0.733 | 19.196 |
| Client Read to Written frame | 404.828 | 38.476 |
Table: Timing metrics on tractor.yuv, encoding all 690 frames.