Describe the bug
sz_find_byte_serial incorrectly handles bytes > 0x7F when compiled with -fsigned-char
Summary
sz_find returns NULL for valid single-byte needles with values > 0x7F (128-255) when compiled with -fsigned-char. The bug is in the SWAR byte broadcast in sz_find_byte_serial and sz_rfind_byte_serial.
Root cause
In include/stringzilla/find.h, lines 395 and 430:
n_vec.u64 = (sz_u64_t)n[0] * 0x0101010101010101ull;
When -fsigned-char is active (common in some toolchains, e.g. the GCC 15 aarch64 toolchain), n[0] is treated as a signed char. For byte value 0xBE:
- Without -fsigned-char: n[0] = 190 (unsigned) → (sz_u64_t)190 = 0xBE → broadcast = 0xBEBEBEBEBEBEBEBE (correct)
- With -fsigned-char: n[0] = -66 (signed) → (sz_u64_t)(-66) = 0xFFFFFFFFFFFFFFBE (sign extension!) → the multiplication yields 0xBDBDBDBDBDBDBDBE, which matches no byte of the intended broadcast pattern
The NEON path (sz_find_byte_neon) is unaffected because it uses vld1q_dup_u8((sz_u8_t const *)n) which correctly loads as unsigned. However, the serial tail loop (which runs for the last < 16 bytes after the NEON main loop) triggers the bug.
Fix
Cast n[0] to sz_u8_t before the multiplication:
// Before:
n_vec.u64 = (sz_u64_t)n[0] * 0x0101010101010101ull;
// After:
n_vec.u64 = (sz_u64_t)(sz_u8_t)n[0] * 0x0101010101010101ull;
Two instances in find.h:
- Line 395 (sz_find_byte_serial)
- Line 430 (sz_rfind_byte_serial)
Reproduction
#define _GNU_SOURCE /* memmem is a GNU extension declared in <string.h> */
#include <stringzilla/stringzilla.h>
#include <stdio.h>
#include <string.h>
int main(void) {
    char haystack[] = "abc\xBE xyz";
    char needle = '\xBE';
    void *expected = memmem(haystack, 8, &needle, 1);
    const char *actual = sz_find(haystack, 8, &needle, 1);
    printf("memmem: %p, sz_find: %p\n", expected, (void *)actual);
    // With -fsigned-char: memmem finds it, sz_find returns NULL
    return 0;
}
Compile with:
gcc -O3 -fsigned-char -o test test.c
Impact
Any project using -fsigned-char (which is the default on some ARM toolchains) will silently get wrong results for single-byte substring searches involving bytes 0x80-0xFF. This affects binary data, UTF-8 continuation bytes, and any non-ASCII content.
Environment
- Stringzilla v4.6.0 (also confirmed on v3.12.6)
- GCC 15.2.0, aarch64
- Compiler flags: -fsigned-char -O3
Check x86 as well
Steps to reproduce
See details
Expected behavior
See details
StringZilla version
V4.6
Operating System
AL2
Hardware architecture
Arm
Which interface are you using?
C implementation
Contact Details
No response
Are you open to being tagged as a contributor?
Is there an existing issue for this?
Code of Conduct