Skip to content

Conversation

@camel-cdr
Copy link
Contributor

@camel-cdr camel-cdr commented Oct 7, 2025

This PR adds a custom odiff implementation using the RISC-V vector extension (RVV).

Since Zig doesn't expose SIMD intrinsics and the Zig Vector API is very limited (and can't express essential features of RVV), I implemented the code in C using RVV intrinsics.
Assembly would've also been possible, but since Zig ships its own C compiler, I thought C would be more maintainable.

Comparing two random images, I measured a 5.9x speedup for the diff computation on the BananaPI BPI-F3 (ignoring file IO and image decoding).

The tests all pass, even with a lower error rate than the current implementation:

Click to expand
$ zig build -Dcpu=generic_rv64+rva20u64+v -Ddynamic=true test-all # RISC-V with RVV
test-all
└─ test-integration
   └─ run test stderr
=== Color Delta Comparison Test ===
Format: Original(float) | SIMD(float) | Diff | Error%
-----------------------------------------------------------
slight RGB difference    :   0.505300 |   0.505300 | 0.000000 |  0.000%
black vs white           : 32857.133157 | 32857.132813 | 0.000345 |  0.000%
red vs green             : 16214.709176 | 16214.708984 | 0.000192 |  0.000%
blue vs yellow           : 35214.748955 | 35214.750000 | 0.001045 |  0.000%
semi-transparent slight difference:   0.127318 |   0.127318 | 0.000000 |  0.000%
transparent vs opaque black: 32857.133157 | 32857.132813 | 0.000345 |  0.000%
gray slight difference   :   2.021200 |   2.021200 | 0.000000 |  0.000%
pinkish slight difference:   1.784546 |   1.784546 | 0.000000 |  0.000%
minimal blue difference  :   0.160096 |   0.160096 | 0.000000 |  0.000%
dark red vs dark green   : 4085.533182 | 4085.533203 | 0.000021 |  0.000%
-----------------------------------------------------------
Max Error: 0.000%  |  Average Error: 0.000%

Specific test - Gray pixels:
Original: 0.505300010106008
SIMD: 0.5053000450134277
Difference: 0.000000034907419732554956
identical transparent: orig=0.000000, fixed=0.000000, diff=0.000000
identical opaque black: orig=0.000000, fixed=0.000000, diff=0.000000
identical white: orig=0.000000, fixed=0.000000, diff=0.000000
very small difference: orig=0.493481, fixed=0.493481, diff=0.000000
very small difference near white: orig=0.493481, fixed=0.493481, diff=0.000000
$ zig build -Dcpu=generic_rv64+rva20u64 -Ddynamic=true test-all # RISC-V without RVV
test-all
└─ test-integration
   └─ run test stderr
=== Color Delta Comparison Test ===
Format: Original(float) | SIMD(float) | Diff | Error%
-----------------------------------------------------------
slight RGB difference    :   0.505300 |   0.504395 | 0.000905 |  0.179%
black vs white           : 32857.133157 | 32813.811768 | 43.321390 |  0.132%
red vs green             : 16214.709176 | 16200.840332 | 13.868844 |  0.086%
blue vs yellow           : 35214.748955 | 35182.234131 | 32.514825 |  0.092%
semi-transparent slight difference:   0.127318 |   0.126953 | 0.000365 |  0.286%
transparent vs opaque black: 32857.133157 | 32813.811768 | 43.321390 |  0.132%
gray slight difference   :   2.021200 |   2.018311 | 0.002889 |  0.143%
pinkish slight difference:   1.784546 |   1.782715 | 0.001831 |  0.103%
minimal blue difference  :   0.160096 |   0.159912 | 0.000184 |  0.115%
dark red vs dark green   : 4085.533182 | 4082.038574 | 3.494608 |  0.086%
-----------------------------------------------------------
Max Error: 0.286%  |  Average Error: 0.135%

Specific test - Gray pixels:
Original: 0.505300010106008
SIMD: 0.50439453125
Difference: 0.0009054788560080018
identical transparent: orig=0.000000, fixed=0.000000, diff=0.000000
identical opaque black: orig=0.000000, fixed=0.000000, diff=0.000000
identical white: orig=0.000000, fixed=0.000000, diff=0.000000
very small difference: orig=0.493481, fixed=0.492676, diff=0.000805
very small difference near white: orig=0.493481, fixed=0.492676, diff=0.000805

The RVV backend is currently only enabled when the target ISA supports the "V" extension. I don't think it's worth adding runtime detection, because the expectation is that distros will move the baseline to RVA23, which supports RVV by default.

Some of the test logs needed to be modified to only log the fp64 result, such that current SIMD and my implementation can share the test code.

I wasn't able to simply add new riscv64 entries to build_targets, because currently zig fails do build libjpeg-turbo statically.
Not sure how to best add it to CI, but for development I used the following setup:

$ # on local system
$ apt install qemu-system-riscv64 qemu-user gcc-riscv64-linux-gnu-base podman
$ export QEMU_LD_PREFIX="/usr/riscv64-linux-gnu"
$ export QEMU_CPU="rva23u64"
$ podman run --platform=linux/riscv64 -it ubuntu:latest bash
$ # in container
$ apt update -y && apt upgrade -y && apt install -y libspng-dev libjpeg-dev libtiff-dev pkg-config wget tar xz-utils git
$ wget 'https://ziglang.org/download/0.15.1/zig-riscv64-linux-0.15.1.tar.xz'
$ tar xf zig-riscv64-linux-0.15.1.tar.xz && mv zig-riscv64-linux-0.15.1 zig
$ git clone --depth=1 https://github.com/camel-cdr/odiff/ && cd odiff
$ /zig/zig build -Dcpu=generic_rv64+rva20u64+v -Ddynamic=true test-all # test with RVV
$ /zig/zig build -Dcpu=generic_rv64+rva20u64 -Ddynamic=true test-all # test without RVV

@camel-cdr camel-cdr changed the title add RVV support Optimize for RISC-V Vector Extension Oct 7, 2025
@dmtrKovalenko dmtrKovalenko requested a review from Copilot October 7, 2025 22:27
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a custom odiff implementation using the RISC-V Vector Extension (RVV) for enhanced performance. The implementation provides a 5.9x speedup for diff computation on RISC-V systems with vector extensions while maintaining high accuracy.

  • Implements RVV-specific optimizations in C using vector intrinsics
  • Adds conditional compilation to use RVV when the target supports the "V" extension
  • Modifies test infrastructure to support both RVV and existing SIMD implementations

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
src/rvv.c New C implementation using RISC-V vector intrinsics for pixel difference calculations
src/diff.zig Adds RVV comparison function and conditional logic to use RVV when available
src/test_color_delta.zig Updates test infrastructure to work with both RVV and existing SIMD implementations
build.zig Adds RVV C source file to compilation and fixes test linking issues

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

src/rvv.c Outdated
Comment on lines 148 to 150
if (writeDiff) {
__riscv_vse32(m, diff, __riscv_vmv_v_x_u32m4(diffcol, vl), vl);
}
Copy link

Copilot AI Oct 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diff pointer is incremented in the loop but may be null when writeDiff is false. This could lead to undefined behavior when accessing diff += vl in the loop increment when diff is null.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's technically UB to increment a pointer past the object it's pointing, but compilers don't really exploit that, only if you deference it. Usually in RVV code incrementing pointers looks cleaner than using indices, so I thought I'd increment diff as well, but considering this is now hopefully going into the library I'll change it to use indices.

Comment on lines +344 to +345
pub noinline fn compareRVV(base: *const Image, comp: *const Image, diff_output: *?Image, diff_count: *u32, diff_lines: ?*DiffLines, ignore_regions: ?[]struct { u32, u32 }, max_delta: f64, options: DiffOptions) !void {
_ = ignore_regions;
Copy link

Copilot AI Oct 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ignore_regions parameter is unused but accepted. Consider removing it from the function signature or implementing the functionality to avoid confusion.

Suggested change
pub noinline fn compareRVV(base: *const Image, comp: *const Image, diff_output: *?Image, diff_count: *u32, diff_lines: ?*DiffLines, ignore_regions: ?[]struct { u32, u32 }, max_delta: f64, options: DiffOptions) !void {
_ = ignore_regions;
pub noinline fn compareRVV(base: *const Image, comp: *const Image, diff_output: *?Image, diff_count: *u32, diff_lines: ?*DiffLines, max_delta: f64, options: DiffOptions) !void {

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left that unused argument there on purpose for future expansion.
Ideally you'd also handle the ignore_regions with SIMD code, but I wasn't sure how to best do it/how ignore_regions is commonly used.

@shreyassanthu77
Copy link
Collaborator

shreyassanthu77 commented Oct 9, 2025

btw @dmtrKovalenko before merging this I should first make libjpeg build for riscv64 so probably a good idea to merge #137, fix static linking libjpeg for riscv and then merge this one with everything working

oh and @camel-cdr dw about CI I will add it once I fix the build :)

@shreyassanthu77
Copy link
Collaborator

shreyassanthu77 commented Oct 10, 2025

heyy @camel-cdr can you rebase your changes on top of #137 and see if you can build and run on riscv statically please

EDIT: you can do zig build test-all -Dtarget=riscv64-linux -fqemu to run the tests directly in qemu btw

@camel-cdr camel-cdr force-pushed the main branch 11 times, most recently from 3daafcc to 277cc90 Compare October 10, 2025 20:45
@camel-cdr
Copy link
Contributor Author

camel-cdr commented Oct 10, 2025

Alright, this seems to work.
I've also added riscv64 to the CI tests, which required adding an additional check for builtin.cpu.arch, and two new new CI exports, one for the basic riscv64 target, and one for riscv64 with the feature flag RVA23.

@dmtrKovalenko
Copy link
Owner

I tried to fix your PR but I can not rebase the branch

@dmtrKovalenko
Copy link
Owner

The code looks and works correctly, I've run it on my rp linux machine so it should be good to merge

#include <stdint.h>
#include <stddef.h>

#if !__riscv_vector
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do this more neatly? I would prefer this to be exposed by the

Suggested change
#if !__riscv_vector
#ifndef __riscv_vector
#error "Vector intrinsics require the vector extension."
#endif

#if !__riscv_vector

/* unused stubs */
uint32_t odiffRVV(uint32_t *src1, uint32_t *src2, size_t n, float max_delta, uint32_t *diff, uint32_t diffcol) { return 0; }
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering about this being unused symbol. Should we define a separate build var for test only targets and include this only for tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure how to best do an conditional extern function in zig, so I decided to always make it available but only use it for rvv.

#define YIQ_I_WEIGHT 0.299
#define YIQ_Q_WEIGHT 0.1957

#if 0
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part doesn't do anything, but document how I arrived at the different computation, which is equivilant to the more complex one used by the zig code.

#include <riscv_vector.h>

static inline vfloat32m4_t
rvv_yiq_diff(vuint8m1x4_t v4, vuint8m1x4_t y4, size_t vl)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it make sense to use __riscv_v_min_vlen it is not a blocker for sure just wondering what vector size was on the CPU you were testing this patch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested for 128, 256, 512 and 1024 in qemu and 256 in hardware.
The code should work for any vector length.
The standard "V" extension reauires a minimum VLEN of 128, I thought __riscv_vector tests for the stadard "V", but checking the spec again, it's also defined for some of the embeded subsets.
So I meant to use __riscv_v, which actually checks for "V", but __riscv_vector && __riscv_v_elen_fp >= 32 should also work and potentially cover some embedded implementations.

Comment on lines +119 to +121
v = __riscv_vfmul(__riscv_vfmul(db, Y6, vl), db, vl);
v = __riscv_vfmacc(v, dg, __riscv_vfmacc(__riscv_vfmul(dg, Y4, vl), Y5, db, vl), vl);
v = __riscv_vfmacc(v, dr, __riscv_vfmacc(__riscv_vfmacc(__riscv_vfmul(dr, Y1, vl), Y2, dg, vl), Y3, db, vl), vl);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you considered doing the int math instead? I am not sure but I have a feeling that fpus on riscv machine are usually much slower than standard arithmetic units

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

floating-point is usually fast on RISC-V hardware: https://camel-cdr.github.io/rvv-bench-results/bpi_f3/index.html

I don't think fixed-point is worth it these days, unless it allows you to reduce the element width.

I presume that implementations that can't pay the price for fast fp, will only be in embedded and use a RVV subset without fp support, so not standard "V", but e.g. zve32x

One thing I did try is using fp16, which almost worked, but failed the accuracy tests by a few percent.

@dmtrKovalenko
Copy link
Owner

dmtrKovalenko commented Oct 14, 2025

closed in favor of the PR where I added a little bit of CI test runners and fixed a few minor things.

Thank you for the contribution

P.S. Your contribution commit is left untouched

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants