Bug: sz_find_byte_serial incorrectly handles bytes > 0x7F when compiled with -fsigned-char #306

@belugabehr

Describe the bug

sz_find_byte_serial incorrectly handles bytes > 0x7F when compiled with -fsigned-char

Summary

sz_find returns NULL for valid single-byte needles with values > 0x7F (128-255) when compiled with -fsigned-char. The bug is in the SWAR byte broadcast in sz_find_byte_serial and sz_rfind_byte_serial.

Root cause

In include/stringzilla/find.h, lines 395 and 430:

n_vec.u64 = (sz_u64_t)n[0] * 0x0101010101010101ull;

When -fsigned-char is active (as with the GCC 15 aarch64 toolchain used here), plain char is signed, so n[0] is read as a signed value. For byte value 0xBE:

  • Without -fsigned-char: n[0] = 190 (unsigned) → (sz_u64_t)190 = 0xBE → broadcast = 0xBEBEBEBEBEBEBEBE (correct)
  • With -fsigned-char: n[0] = -66 (signed) → (sz_u64_t)(-66) = 0xFFFFFFFFFFFFFFBE (sign extension) → broadcast = 0xBDBDBDBDBDBDBDBE (only the lowest byte lane is 0xBE; the other seven lanes hold 0xBD)
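The arithmetic in the two bullets can be checked in isolation. The sketch below uses an explicit signed char so that it behaves the same with or without -fsigned-char; broadcast_naive, broadcast_fixed, and check_broadcast are hypothetical helper names for illustration and are not part of StringZilla.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of StringZilla. */

/* Mirrors the buggy expression: the byte is widened while still signed. */
static uint64_t broadcast_naive(signed char c) {
    return (uint64_t)c * 0x0101010101010101ull;
}

/* Mirrors the proposed fix: narrow to an unsigned byte before widening. */
static uint64_t broadcast_fixed(signed char c) {
    return (uint64_t)(uint8_t)c * 0x0101010101010101ull;
}

/* Self-check for the 0xBE example. */
static void check_broadcast(void) {
    signed char c = (signed char)-66; /* bit pattern 0xBE */
    assert((uint64_t)c == 0xFFFFFFFFFFFFFFBEull);           /* sign-extended */
    assert(broadcast_naive(c) == 0xBDBDBDBDBDBDBDBEull);    /* only the lowest lane is 0xBE */
    assert(broadcast_fixed(c) == 0xBEBEBEBEBEBEBEBEull);    /* every lane is 0xBE */
    assert(broadcast_naive(0x41) == broadcast_fixed(0x41)); /* bytes <= 0x7F are unaffected */
}
```

Calling check_broadcast() from a test passes silently when the values match. The last assertion shows why ASCII-only tests would not catch the problem.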

The NEON path (sz_find_byte_neon) is unaffected because it uses vld1q_dup_u8((sz_u8_t const *)n) which correctly loads as unsigned. However, the serial tail loop (which runs for the last < 16 bytes after the NEON main loop) triggers the bug.

Fix

Cast n[0] to sz_u8_t before the multiplication:

// Before:
n_vec.u64 = (sz_u64_t)n[0] * 0x0101010101010101ull;

// After:
n_vec.u64 = (sz_u64_t)(sz_u8_t)n[0] * 0x0101010101010101ull;

Two instances in find.h:

  • Line 395 (sz_find_byte_serial)
  • Line 430 (sz_rfind_byte_serial)
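To show where the cast sits in a SWAR search, here is a minimal, self-contained sketch of a forward byte search. It is not StringZilla's implementation: find_byte_swar and check_find_byte_swar are hypothetical names, and the loop structure is an assumption chosen for brevity. It uses the classic zero-byte test on the XOR of the haystack word and the broadcast needle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical SWAR byte search for illustration; not StringZilla's code. */
static const char *find_byte_swar(const char *h, size_t h_length, const char *n) {
    /* The cast to uint8_t keeps the broadcast correct even when char is signed. */
    uint64_t n_vec = (uint64_t)(uint8_t)n[0] * 0x0101010101010101ull;
    size_t i = 0;
    for (; i + 8 <= h_length; i += 8) {
        uint64_t h_vec;
        memcpy(&h_vec, h + i, 8);   /* unaligned-safe 8-byte load */
        uint64_t x = h_vec ^ n_vec; /* matching lanes become 0x00 */
        /* Zero-byte test: the result is nonzero iff some lane of x is zero. */
        if ((x - 0x0101010101010101ull) & ~x & 0x8080808080808080ull) break;
    }
    /* Pin down the exact position, or scan the tail, byte by byte. */
    for (; i < h_length; ++i)
        if ((unsigned char)h[i] == (unsigned char)n[0]) return h + i;
    return NULL;
}

/* Self-check using the haystack and needle from the reproduction. */
static void check_find_byte_swar(void) {
    const char haystack[] = "abc\xBE xyz";
    const char needle = '\xBE';
    assert(find_byte_swar(haystack, 8, &needle) == haystack + 3);
    assert(find_byte_swar(haystack, 8, &needle) ==
           (const char *)memchr(haystack, (unsigned char)needle, 8));
}
```

Without the (uint8_t) cast, the XOR would leave lane 3 nonzero for this input whenever plain char is signed, and the search would fall through to NULL.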

Reproduction

#define _GNU_SOURCE // memmem is a GNU extension in glibc's <string.h>
#include <stringzilla/stringzilla.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    char haystack[] = "abc\xBE xyz"; // 0xBE sits at offset 3
    char needle = '\xBE';

    void *expected = memmem(haystack, 8, &needle, 1);
    const char *actual = sz_find(haystack, 8, &needle, 1);

    printf("memmem: %p, sz_find: %p\n", expected, (const void *)actual);
    // With -fsigned-char: memmem finds it, sz_find returns NULL
    return 0;
}

Compile with:

gcc -O3 -fsigned-char -o test test.c

Impact

Any project using -fsigned-char (which is the default on some ARM toolchains) will silently get wrong results for single-byte substring searches involving bytes 0x80-0xFF. This affects binary data, UTF-8 continuation bytes, and any non-ASCII content.
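One way to tell whether a given build is exposed is to check CHAR_MIN from <limits.h>, which is negative exactly when plain char is signed. The sketch below is an illustration only; plain_char_is_signed and check_plain_char are hypothetical helper names.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: returns 1 when plain char is signed in this build. */
static int plain_char_is_signed(void) {
    return CHAR_MIN < 0;
}

/* Self-check: a signed plain char keeps -1 negative, an unsigned one wraps to 255. */
static void check_plain_char(void) {
    assert(plain_char_is_signed() == ((char)-1 < 0));
}
```

A build where plain_char_is_signed() returns 1 reads bytes 0x80-0xFF through a char pointer as negative values, which is the condition that triggers the sign extension described in this report.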

Environment

  • StringZilla v4.6.0 (also confirmed on v3.12.6)
  • GCC 15.2.0, aarch64
  • -fsigned-char -O3

x86 should be checked as well.

Steps to reproduce

See the Reproduction section above.

Expected behavior

sz_find should return the same pointer as memmem for needles > 0x7F; see the details above.

StringZilla version

V4.6

Operating System

AL2

Hardware architecture

Arm

Which interface are you using?

C implementation

Contact Details

No response

Are you open to being tagged as a contributor?

  • I am open to being mentioned in the project .git history as a contributor

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Labels

bug (Something isn't working)