-
Notifications
You must be signed in to change notification settings - Fork 100
Optimize for RISC-V Vector Extension #136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds a custom odiff implementation using the RISC-V Vector Extension (RVV) for enhanced performance. The implementation provides a 5.9x speedup for diff computation on RISC-V systems with vector extensions while maintaining high accuracy.
- Implements RVV-specific optimizations in C using vector intrinsics
- Adds conditional compilation to use RVV when the target supports the "V" extension
- Modifies test infrastructure to support both RVV and existing SIMD implementations
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/rvv.c | New C implementation using RISC-V vector intrinsics for pixel difference calculations |
| src/diff.zig | Adds RVV comparison function and conditional logic to use RVV when available |
| src/test_color_delta.zig | Updates test infrastructure to work with both RVV and existing SIMD implementations |
| build.zig | Adds RVV C source file to compilation and fixes test linking issues |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
src/rvv.c
Outdated
| if (writeDiff) { | ||
| __riscv_vse32(m, diff, __riscv_vmv_v_x_u32m4(diffcol, vl), vl); | ||
| } |
Copilot
AI
Oct 7, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The diff pointer is incremented in the loop but may be null when writeDiff is false. This could lead to undefined behavior when accessing diff += vl in the loop increment when diff is null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's technically UB to increment a pointer past the object it's pointing, but compilers don't really exploit that, only if you deference it. Usually in RVV code incrementing pointers looks cleaner than using indices, so I thought I'd increment diff as well, but considering this is now hopefully going into the library I'll change it to use indices.
| pub noinline fn compareRVV(base: *const Image, comp: *const Image, diff_output: *?Image, diff_count: *u32, diff_lines: ?*DiffLines, ignore_regions: ?[]struct { u32, u32 }, max_delta: f64, options: DiffOptions) !void { | ||
| _ = ignore_regions; |
Copilot
AI
Oct 7, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ignore_regions parameter is unused but accepted. Consider removing it from the function signature or implementing the functionality to avoid confusion.
| pub noinline fn compareRVV(base: *const Image, comp: *const Image, diff_output: *?Image, diff_count: *u32, diff_lines: ?*DiffLines, ignore_regions: ?[]struct { u32, u32 }, max_delta: f64, options: DiffOptions) !void { | |
| _ = ignore_regions; | |
| pub noinline fn compareRVV(base: *const Image, comp: *const Image, diff_output: *?Image, diff_count: *u32, diff_lines: ?*DiffLines, max_delta: f64, options: DiffOptions) !void { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left that unused argument there on purpose for future expansion.
Ideally you'd also handle the ignore_regions with SIMD code, but I wasn't sure how to best do it/how ignore_regions is commonly used.
|
btw @dmtrKovalenko before merging this I should first make libjpeg build for riscv64 so probably a good idea to merge #137, fix static linking libjpeg for riscv and then merge this one with everything working oh and @camel-cdr dw about CI I will add it once I fix the build :) |
|
heyy @camel-cdr can you rebase your changes on top of #137 and see if you can build and run on riscv statically please EDIT: you can do zig build test-all -Dtarget=riscv64-linux -fqemu to run the tests directly in qemu btw |
3daafcc to
277cc90
Compare
|
Alright, this seems to work. |
|
I tried to fix your PR but I can not rebase the branch |
|
The code looks and works correctly, I've run it on my rp linux machine so it should be good to merge |
| #include <stdint.h> | ||
| #include <stddef.h> | ||
|
|
||
| #if !__riscv_vector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we do this more neatly? I would prefer this to be exposed by the
| #if !__riscv_vector | |
| #ifndef __riscv_vector | |
| #error "Vector intrinsics require the vector extension." | |
| #endif |
| #if !__riscv_vector | ||
|
|
||
| /* unused stubs */ | ||
| uint32_t odiffRVV(uint32_t *src1, uint32_t *src2, size_t n, float max_delta, uint32_t *diff, uint32_t diffcol) { return 0; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering about this being unused symbol. Should we define a separate build var for test only targets and include this only for tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure how to best do an conditional extern function in zig, so I decided to always make it available but only use it for rvv.
| #define YIQ_I_WEIGHT 0.299 | ||
| #define YIQ_Q_WEIGHT 0.1957 | ||
|
|
||
| #if 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part doesn't do anything, but document how I arrived at the different computation, which is equivilant to the more complex one used by the zig code.
| #include <riscv_vector.h> | ||
|
|
||
| static inline vfloat32m4_t | ||
| rvv_yiq_diff(vuint8m1x4_t v4, vuint8m1x4_t y4, size_t vl) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it make sense to use __riscv_v_min_vlen it is not a blocker for sure just wondering what vector size was on the CPU you were testing this patch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested for 128, 256, 512 and 1024 in qemu and 256 in hardware.
The code should work for any vector length.
The standard "V" extension reauires a minimum VLEN of 128, I thought __riscv_vector tests for the stadard "V", but checking the spec again, it's also defined for some of the embeded subsets.
So I meant to use __riscv_v, which actually checks for "V", but __riscv_vector && __riscv_v_elen_fp >= 32 should also work and potentially cover some embedded implementations.
| v = __riscv_vfmul(__riscv_vfmul(db, Y6, vl), db, vl); | ||
| v = __riscv_vfmacc(v, dg, __riscv_vfmacc(__riscv_vfmul(dg, Y4, vl), Y5, db, vl), vl); | ||
| v = __riscv_vfmacc(v, dr, __riscv_vfmacc(__riscv_vfmacc(__riscv_vfmul(dr, Y1, vl), Y2, dg, vl), Y3, db, vl), vl); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have you considered doing the int math instead? I am not sure but I have a feeling that fpus on riscv machine are usually much slower than standard arithmetic units
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
floating-point is usually fast on RISC-V hardware: https://camel-cdr.github.io/rvv-bench-results/bpi_f3/index.html
I don't think fixed-point is worth it these days, unless it allows you to reduce the element width.
I presume that implementations that can't pay the price for fast fp, will only be in embedded and use a RVV subset without fp support, so not standard "V", but e.g. zve32x
One thing I did try is using fp16, which almost worked, but failed the accuracy tests by a few percent.
|
closed in favor of the PR where I added a little bit of CI test runners and fixed a few minor things. Thank you for the contribution
|
This PR adds a custom odiff implementation using the RISC-V vector extension (RVV).
Since Zig doesn't expose SIMD intrinsics and the Zig Vector API is very limited (and can't express essential features of RVV), I implemented the code in C using RVV intrinsics.
Assembly would've also been possible, but since Zig ships its own C compiler, I thought C would be more maintainable.
Comparing two random images, I measured a 5.9x speedup for the diff computation on the BananaPI BPI-F3 (ignoring file IO and image decoding).
The tests all pass, even with a lower error rate than the current implementation:
Click to expand
The RVV backend is currently only enabled when the target ISA supports the "V" extension. I don't think it's worth adding runtime detection, because the expectation is that distros will move the baseline to RVA23, which supports RVV by default.
Some of the test logs needed to be modified to only log the
fp64result, such that current SIMD and my implementation can share the test code.I wasn't able to simply add new riscv64 entries to
build_targets, because currently zig fails do build libjpeg-turbo statically.Not sure how to best add it to CI, but for development I used the following setup: