
Conversation

Contributor

@NicolasHug NicolasHug commented Oct 21, 2025

Towards #943

This PR speeds up the BETA CUDA interface when we need to fall back to the CPU. The idea is simple: we do the color conversion on the GPU instead of on the CPU. This has two benefits:

  • the color conversion is faster since it runs on the GPU
  • the CPU -> GPU transfer is faster since we now transfer a smaller YUV frame instead of a larger RGB frame (see the size comparison sketch below).

Before

Decode Frame on CPU (fallback) -> YUV to RGB Conversion **on CPU** -> Send bigger RGB frame from CPU to GPU

Now

Decode Frame on CPU (fallback) -> Send smaller YUV frame from CPU to GPU -> YUV to RGB conversion **on GPU**
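For a rough sense of the transfer savings (an illustrative back-of-the-envelope sketch, not code from this PR): an 8-bit 4:2:0 frame (NV12 or YUV420P) is 1.5 bytes per pixel, while an RGB24 frame is 3 bytes per pixel, so the CPU -> GPU copy moves half as many bytes.

width, height = 1920, 1080  # 1080p

rgb_bytes = width * height * 3        # RGB24: 3 bytes per pixel
nv12_bytes = width * height * 3 // 2  # 8-bit 4:2:0: 1.5 bytes per pixel

print(rgb_bytes)   # 6220800 (~6.2 MB per frame)
print(nv12_bytes)  # 3110400 (~3.1 MB per frame)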

This is for the BETA interface only. I'll handle the FFmpeg interface as a follow-up.

Benchmarks

Benchmarks show a 1.6X speed-up on 1080p frames. We should expect even larger speed-ups at higher resolutions. I used this snippet.

import torch
from time import perf_counter_ns
import argparse
from pathlib import Path
from torchcodec.decoders import VideoDecoder, set_cuda_backend


def bench(f, *args, num_exp=100, warmup=0, **kwargs):
    for _ in range(warmup):
        f(*args, **kwargs)

    times = []
    for _ in range(num_exp):
        start = perf_counter_ns()
        f(*args, **kwargs)
        end = perf_counter_ns()
        times.append(end - start)
    return torch.tensor(times).float()


def report_stats(times, unit="ms"):
    mul = {
        "ns": 1,
        "µs": 1e-3,
        "ms": 1e-6,
        "s": 1e-9,
    }[unit]
    times = times * mul
    std = times.std().item()
    med = times.median().item()
    print(f"{med = :.2f}{unit} +- {std:.2f}")
    return med


def decode_one_video(video_path):
    with set_cuda_backend("beta"):
        decoder = VideoDecoder(str(video_path), device="cuda:0", seek_mode="approximate")
    indices = torch.arange(len(decoder))
    decoder.get_frames_at(indices)

    torch.cuda.synchronize()


parser = argparse.ArgumentParser()
parser.add_argument("video_path")
args = parser.parse_args()

times = bench(decode_one_video, video_path=args.video_path, warmup=1, num_exp=10)
report_stats(times)

It's impossible to benchmark the new and old strategies side by side, since switching between them requires recompilation. I also had to modify our code to force the CPU fallback on videos that would otherwise not fall back (basically just changed if (!nativeNVDECSupport(codecContext))... to if (true)).

#  OLD
~/dev/torchcodec-cuda (fallback-colorconversion*) » python benchmark_fallback.py h264_1080.mp4
med = 543.16ms +- 56.39
~/dev/torchcodec-cuda (fallback-colorconversion*) » python benchmark_fallback.py h264_1080.mp4
med = 533.95ms +- 55.50
~/dev/torchcodec-cuda (fallback-colorconversion*) » python benchmark_fallback.py h264_1080.mp4
med = 659.17ms +- 56.99
~/dev/torchcodec-cuda (fallback-colorconversion*) » python benchmark_fallback.py h264_1080.mp4
med = 549.27ms +- 53.50
~/dev/torchcodec-cuda (fallback-colorconversion*) » python benchmark_fallback.py h264_1080.mp4
med = 538.66ms +- 34.52

#  NEW
~/dev/torchcodec-cuda (fallback-colorconversion*) » python benchmark_fallback.py h264_1080.mp4
med = 318.23ms +- 8.21
~/dev/torchcodec-cuda (fallback-colorconversion*) » python benchmark_fallback.py h264_1080.mp4
med = 319.42ms +- 11.30
~/dev/torchcodec-cuda (fallback-colorconversion*) » python benchmark_fallback.py h264_1080.mp4
med = 325.05ms +- 9.74
~/dev/torchcodec-cuda (fallback-colorconversion*) » python benchmark_fallback.py h264_1080.mp4
med = 325.94ms +- 12.64

@meta-cla meta-cla bot added the CLA Signed label Oct 21, 2025
@NicolasHug NicolasHug marked this pull request as ready for review October 24, 2025 21:42
Contributor Author

As can be seen above, we are now using swscale in BetaCudaInterface.cpp to do the YUV -> NV12 conversion. So I had to extract all the swscale logic out of CpuDeviceInterface.cpp and put it into FFmpegCommon.cpp so it can be reused across the two interfaces. Almost everything below this comment can be treated as moving code around. Just pay attention to the test at the bottom.

beta_frame = beta_dec.get_frame_at(0)

torch.testing.assert_close(ffmpeg.data, beta.data, rtol=0, atol=0)
assert psnr(ref_frames.data.cpu(), beta_frame.data.cpu()) > 25
Contributor

Does comparing frames on GPU vs CPU change floating point precision, or is there some other reason to move the frames here?

Contributor Author

There will be a small difference due to floating point precision, but that wasn't the reason.
In fact... there was no good reason to call .cpu(); it was probably left over from copy/pasting my earlier write_png debugging. I removed it, thanks for catching!

UniqueAVFrame nv12CpuFrame(av_frame_alloc());
TORCH_CHECK(nv12CpuFrame != nullptr, "Failed to allocate NV12 CPU frame");

nv12CpuFrame->format = AV_PIX_FMT_NV12;
Contributor

Is it accurate to say that NV12 is very similar to AV_PIX_FMT_YUV420P (it uses YUV and has 4:2:0 chroma subsampling), but we use NV12 here because that is the format the NPP library requires? As explained in this comment

Contributor Author

Yes, that's exactly right. NV12 contains the exact same values as AV_PIX_FMT_YUV420P, just arranged a bit differently.
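To make "arranged a bit differently" concrete, here's a small illustrative sketch (an assumed helper, not code from this PR): YUV420P stores three separate planes (Y, then U, then V), while NV12 keeps the same full-resolution Y plane followed by a single half-height plane where U and V bytes are interleaved.

import torch

# Illustrative only: repack YUV420P planes into the NV12 layout.
# y is (H, W); u and v are (H//2, W//2) because of 4:2:0 subsampling.
def yuv420p_to_nv12(y: torch.Tensor, u: torch.Tensor, v: torch.Tensor):
    # Interleave U and V column-wise, giving rows of U0 V0 U1 V1 ...
    uv = torch.stack((u, v), dim=-1).reshape(u.shape[0], -1)
    return y, uv  # same values as YUV420P, just a different arrangement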

AVFilterGraph* filterGraph,
enum AVPixelFormat outputFormat);

struct SwsFrameContext {
Contributor

A refactoring we should probably do is pull all sws logic into its own class in separate .h and .cpp files, similar to what we have with FilterGraph. But not in this PR.

convertedHeight == height, "sws_scale failed for CPU->NV12 conversion");

int ySize = width * height;
int uvSize = ySize / 2; // NV12: UV plane is half the size of Y plane
Contributor

Is integer rounding okay here? This is implicitly a floor operation.

Contributor Author

Thanks for catching this, I forgot to look into it. Will report back.

Contributor Author

I decided to go the easy route and just TORCH_CHECK for evenness. I think frame dimensions are always expected to be even anyway.
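For reference, here's a rough Python sketch of the plane-size arithmetic and the evenness guard being discussed; the actual change uses TORCH_CHECK in C++, and the names below are illustrative only.

def nv12_plane_sizes(width: int, height: int) -> tuple[int, int]:
    # NV12 stores U and V at quarter resolution, so the divisions below are
    # only exact when both dimensions are even.
    assert width % 2 == 0 and height % 2 == 0, (
        f"NV12 conversion expects even dimensions, got {width}x{height}"
    )
    y_size = width * height   # full-resolution luma plane
    uv_size = y_size // 2     # interleaved U/V plane: height/2 rows of width bytes
    return y_size, uv_size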

nv12CpuFrame->data[1],
nv12CpuFrame->linesize[1],
width,
height / 2,
Contributor

Ditto for integer rounding - okay here?

beta_frame = beta_dec.get_frame_at(0)

torch.testing.assert_close(ffmpeg.data, beta.data, rtol=0, atol=0)
assert psnr(ref_frames.data, beta_frame.data) > 25
Contributor

How did you choose 25? I ask because I'm probably going to have to do something similar with decoder-native transforms.

Contributor Author

This is just the highest PSNR threshold I could find that still passes the test. It's not great; 25 isn't a very good PSNR, but visually the frames look OK. The difference is mainly due to the color-conversion algorithms being different on the GPU and on the CPU. Eventually, when the FFmpeg interface follows the same fallback mechanism, both frames will be exactly equal.

We should still try to have more CPU vs GPU tests for these code-paths though.
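For context on the threshold: PSNR is 10 * log10(MAX^2 / MSE) in dB, so higher means the frames are closer, and 25 dB is on the low side (consistent with the note above). A hypothetical helper along these lines, assuming uint8 frames in [0, 255] (the actual test utility may differ):

import torch

def psnr(a: torch.Tensor, b: torch.Tensor, max_val: float = 255.0) -> float:
    # Peak signal-to-noise ratio in dB between two frames.
    mse = (a.float() - b.float()).pow(2).mean()
    if mse == 0:
        return float("inf")
    return (10 * torch.log10(max_val * max_val / mse)).item()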

Contributor

@scotts scotts left a comment

Awesome improvement! :)

@NicolasHug NicolasHug merged commit 44ae3d5 into meta-pytorch:main Oct 29, 2025
59 checks passed
@NicolasHug NicolasHug deleted the fallback-colorconversion branch October 29, 2025 18:29