
Case-Insensitive UTF-8 Search with AVX-512 🌾🪡🌾#286

Merged
ashvardanian merged 147 commits into main from main-dev
Dec 15, 2025

ashvardanian commented Nov 29, 2025

The numbers below compare search throughput for unique "word" tokens across several languages of the Leipzig Wikipedia Corpora, using a case-insensitive substring search that respects all Unicode 17.0 case-folding rules. Besides PCRE2, this is arguably the only library providing full Unicode spec compliance for search operations, and PCRE2 is often one or more orders of magnitude slower than even our serial baseline because of the extreme complexity of combining a complete RegEx engine with Unicode compliance.

| Corpus Language | Script | Serial Baseline, GB/s | AVX-512 for Ice Lake+, GB/s | Speedup |
| :-- | :-- | --: | --: | --: |
| **Latin (Basic)** | | | | |
| 🇬🇧 English | Latin | 1.15 | 10.93 | 11.9× |
| 🇮🇹 Italian | Latin | 0.81 | 10.63 | 14.7× |
| 🇳🇱 Dutch | Latin | 0.85 | 10.91 | 13.3× |
| **Latin (Extended)** | | | | |
| 🇩🇪 German | Latin+ß | 0.74 | 9.36 | 13.6× |
| 🇫🇷 French | Latin+Acc | 0.73 | 8.37 | 15.1× |
| 🇪🇸 Spanish | Latin+ñ | 0.99 | 8.86 | 10.8× |
| 🇵🇹 Portuguese | Latin+Acc | 0.77 | 9.58 | 14.3× |
| 🇵🇱 Polish | Latin+Ext | 0.62 | 7.51 | 14.2× |
| 🇨🇿 Czech | Latin+Háčky | 0.43 | 6.10 | 17.1× |
| 🇹🇷 Turkish | Latin+İ/ı | 0.81 | 6.78 | 11.7× |
| 🇻🇳 Vietnamese | Latin+Tones | 0.41 | 6.38 | 17.9× |
| **Cyrillic** | | | | |
| 🇷🇺 Russian | Cyrillic | 0.54 | 3.41 | 10.6× |
| 🇺🇦 Ukrainian | Cyrillic | 0.56 | 4.03 | 10.6× |
| **Greek** | | | | |
| 🇬🇷 Greek | Greek | 0.31 | 7.04 | 22.5× |
| **Caucasian** | | | | |
| 🇦🇲 Armenian | Armenian | 0.34 | 4.18 | 17.5× |
| 🇬🇪 Georgian | Georgian | 0.65 | 10.56 | 24.2× |
| **Semitic** | | | | |
| 🇮🇱 Hebrew | Hebrew | 0.65 | 9.52 | 13.7× |
| 🇸🇦 Arabic | Arabic | 1.17 | 9.85 | 9.8× |
| 🇮🇷 Persian | Arabic+Ext | 0.41 | 11.83 | 43.1× |
| **Indic** | | | | |
| 🇮🇳 Hindi | Devanagari | 1.25 | 10.99 | 16.3× |
| 🇧🇩 Bengali | Bengali | 0.72 | 11.03 | 25.9× |
| 🇮🇳 Tamil | Tamil | 1.09 | 11.70 | 21.0× |
| **CJK & East Asian** | | | | |
| 🇯🇵 Japanese | CJK+Kana | 0.52 | 11.56 | 26.7× |
| 🇰🇷 Korean | Hangul | 2.98 | 11.58 | 3.5× |
| 🇨🇳 Chinese | CJK | 0.43 | 20.07 | 103.0× |
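
As a point of reference for what "respecting case-folding rules" involves, here is a minimal Python sketch built on the standard library's `str.casefold`. It is not the algorithm or API from this PR: the helper names `fold_with_offsets` and `find_case_insensitive` are hypothetical, the folding tables are whatever Unicode version the running interpreter ships (not necessarily 17.0), and normalization is ignored. It illustrates two things any such search has to get right: foldings that change length (ß folds to "ss") and offsets in mixed-width text such as "中ABC".

```python
def fold_with_offsets(text: str):
    """Case-fold `text` one character at a time, remembering where each
    folded character came from (hypothetical helper, for illustration)."""
    folded, origin, is_start = [], [], []
    for index, char in enumerate(text):
        for k, folded_char in enumerate(char.casefold()):
            folded.append(folded_char)
            origin.append(index)      # source character of this folded char
            is_start.append(k == 0)   # first char of its expansion?
    return "".join(folded), origin, is_start


def find_case_insensitive(haystack: str, needle: str) -> int:
    """Return the code-point offset of the first match in `haystack`, or -1.
    Matches that begin or end inside an expansion (half of 'ß') are rejected."""
    folded_needle = needle.casefold()
    if not folded_needle:
        return 0
    folded, origin, is_start = fold_with_offsets(haystack)
    position = folded.find(folded_needle)
    while position != -1:
        end = position + len(folded_needle)
        ends_clean = end == len(folded) or is_start[end]
        if is_start[position] and ends_clean:
            return origin[position]
        position = folded.find(folded_needle, position + 1)
    return -1


print(find_case_insensitive("Die Straße", "STRASSE"))  # 4
offset = find_case_insensitive("中ABC", "abc")
print(offset, len("中ABC"[:offset].encode("utf-8")))    # 1 3
```

The second call shows why code-point and byte offsets differ: "中" is one code point but three UTF-8 bytes, so the match starts at character 1 and byte 3.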

The new tests detect a bug in handling inputs like "中ABC".

New behaviour differs for `str` and `bytes` args.

This design is cleaner, but I'm not seeing any gains on AMD Zen5.

Closes #240
Port pressure went down from 8+6 on p5 and p0, respectively, to 6+5.
Naive: 12 p5
Before: 10 p5 ops + 1 p0 op
After: 8 p5 ops + 4 p0 ops
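
To see why moving work from p5 to p0 can help even when the total op count grows, here is a rough Python model of the three variants listed above. It assumes the two ports issue independently and that each op occupies its port for one cycle, and it ignores latency, front-end, and memory limits, so it is a back-of-the-envelope bound rather than a measurement. The name `bottleneck_ops` is hypothetical.

```python
def bottleneck_ops(ops_per_port: dict) -> int:
    """Lower bound on cycles per iteration: the busiest port sets the pace."""
    return max(ops_per_port.values())


# Op counts per port, copied from the commit message.
variants = {
    "naive": {"p5": 12},
    "before": {"p5": 10, "p0": 1},
    "after": {"p5": 8, "p0": 4},
}

for name, ports in variants.items():
    print(name, "total ops:", sum(ports.values()), "bound:", bottleneck_ops(ports))
```

Under this model the "after" variant issues more ops in total than "before" (12 versus 11), yet its busiest port carries 8 instead of 10.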
Combines XOR + VPTERNLOG + VPTESTNMB to reduce port pressure on Intel CPUs.
Yields a 30% performance improvement in such megakernels with a sequential memory access pattern.
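
The commit message names the instructions but not how they are combined, so the following is only a Python emulation of their documented per-bit and per-byte semantics, applied to a toy ASCII-only comparison chosen for illustration. The helpers `vpternlog`, `vptestnmb`, and `ascii_caseless_equal_mask` are hypothetical names, and the `0x14` truth table (computing `(a ^ b) & ~c` in one step) is an assumption about one plausible fusion, not the kernel from this PR.

```python
def vpternlog(a: int, b: int, c: int, imm8: int, width: int = 8) -> int:
    """Emulate VPTERNLOG on one lane: each result bit is looked up in imm8,
    indexed by the corresponding bits of a (MSB), b, and c (LSB)."""
    result = 0
    for bit in range(width):
        index = (((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1)
        result |= ((imm8 >> index) & 1) << bit
    return result


def vptestnmb(a: bytes, b: bytes) -> int:
    """Emulate VPTESTNMB: mask bit i is set when (a[i] & b[i]) == 0."""
    mask = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if (x & y) == 0:
            mask |= 1 << i
    return mask


XOR_ANDNOT = 0x14  # truth table of (a ^ b) & ~c


def ascii_caseless_equal_mask(haystack: bytes, needle: bytes) -> int:
    """Bit i is set where haystack[i] equals needle[i], ignoring ASCII case."""
    # Ignore the 0x20 case bit only where the needle byte is an ASCII letter.
    ignore = bytes(0x20 if n < 0x80 and chr(n).isalpha() else 0 for n in needle)
    diff = bytes(
        vpternlog(h, n, c, XOR_ANDNOT) for h, n, c in zip(haystack, needle, ignore)
    )
    return vptestnmb(diff, diff)  # set where no differing bits remain


print(bin(ascii_caseless_equal_mask(b"HeLLo", b"hello")))  # 0b11111
print(bin(ascii_caseless_equal_mask(b"HeLLa", b"hello")))  # 0b1111
```

In this toy version, one ternary-logic op stands in for a separate XOR and AND-NOT, and the test-and-mask op replaces a compare against zero.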
150x improvement over PyICU `icu.StringSearch` baseline
ashvardanian merged commit ca7e505 into main on Dec 15, 2025
32 checks passed