-
Notifications
You must be signed in to change notification settings - Fork 13
4.3 Ultra fast sub string search (en)
To search for a sub-string in another one, we often use the strstr in C or the std::string::find in C++. The least we can say is that it is far, very far from optimal...
There is another way to speed up find: Intrinsics Instructions.
These instructions make it possible to manipulate 128-bits or 256-bits registers in one step. AVX512, the latest version, even uses 512-bit registers...
A character is encoded on 8 bits, which yields: 32 characters per block of 256 bits (8x32= 256).
The idea here is to browse our string by block of 32 characters at a time to determine if our "sub-string" is present in one of these blocks.
This is not necessarily the most frequent case (though), but it will serve as a foundation for the rest of our improvements...
Note: We use the following abbreviations in our code:
//We only deal with unsigned char strings
#define USTR(s) (unsigned char*)s.c_str()
//For some reasons, Windows does not use the same naming as other systems (go figure)
//bitscanforward returns the position of the first bit set to 1 to the left
#ifdef WIN32
#define bitscanforward(r,x) if (!_BitScanForward(&r,x)) r = 0
#else
#define bitscanforward(r,x) r = _bit_scan_forward(x)
#endif
The algorithm is actually very simple. We read our string by block of 32 characters that we transform each time into a 256-bit number. Then, we check if the character we are looking for is present in this 32-character block.
- If it is absent, we read 32 more characters.
- If it is present, its position is returned.
// "c" is the character we're looking for in "src"...
char c = search[0];
//We construct a 256-bit number where each byte is equal to "c";
__m256i firstchar = _mm256_set1_epi8(c);
// We then scan our string in 32-character blocks
for (; (i + 31) < lensrc; i += 32) {
//"s" points to the current block
s=USTR(src)+i;
//We then transform these 32 bytes into a 256-bit number...
__m256i current_bytes = _mm256_loadu_si256((const __m256i *)s);
//We check if "current_byte" contains "c" somewhere.
current_bytes = _mm256_cmpeq_epi8(firstchar, current_bytes);
//If current_byte is different from 0, then "c" is present somewhere.
// For each block of 8 bits other than zero, one bit is set to 1 in q...
q = _mm256_movemask_epi8(current_bytes);
if (q) {
//Magic operation, which finds the position of the first bit at 1 to the left
//it returns in 'q' this position
bitscanforward(q, q);
return (i+q);
}
As you can see on this piece of code, the algorithm is not very complicated. Furthermore, the system very quickly eliminates entire sections of the string that it knows in advance will not contain the character it is looking for.
In fact, it is possible to extend this method to two, four or even eight characters.
For two characters, we combine them into a 16-bit integer...
int16_t cc = search[1];
cc <<= 8;
cc |= search[0];
// FINALLY, we create our FILTER on a 16-bit block...
__m256i firstchar = _mm256_set1_epi16(cc);
Then, for the comparison, _mm256_cmpeq_epi16 is used instead of _mm256_cmpeq_epi8, so that a 16-bit block comparison is performed instead of an 8-bit block comparison.
We can still tighten the constraint. Finding a sequence of four characters is even more specific than a sequence of two. We then use a 32-bit integer...
int32_t cc = search[3];
for (shift = 2; shift >= 0; shift--) {
cc <<= 8;
cc |= search [shift];
}
// FINALLY, we create our FILTER on a 32-bit block...
__m256i firstchar = _mm256_set1_epi32(cc);
We then do the comparison with _mm256_cmpeq_epi32 instead of _mm256_cmpeq_epi16.
For eight characters, the filter would be computed as:
int32_t cc = search[7];
for (shift = 6; shift >= 0; shift--) {
cc <<= 8;
cc |= search [shift];
}
// FINALLY, we create our FILTER on a 64-bit block...
__m256i firstchar = _mm256_set1_epi64x(cc);
Except that if you try to do a comparison byte by byte, the sequences must be perfectly aligned for it to work.
Therefore, a simple test is no longer enough:
- For 2 characters, you need 2 tests with a shift.
- For 4 characters, you need 4 tests with a shift.
- For 8 characters, you need 8 tests with a shift.
Then all hell breaks loose...
Because, it is not your grand-mother's delicate code anymore...
It's brutal...
First, we have our little loop to prepare our filter for 4 characters (32 bits)...
int32_t cc = search[3];
for (shift = 2; shift >= 0; shift--) {
cc <<= 8;
cc |= search[shift];
}
__m256i firstchar = _mm256_set1_epi32(cc);
Then we loop, lensrc is the actual size of the string, which we are looking our sub-string into...
As you can see here, we detect all potential targets for our filter.
//We then scan our string for the first two characters...
while (i < lenmx) {
//Note that !q is part of the "for" condition
for (; i < lenmx && !q; i += 32) {
//we load our section, the length should be larger than 16
//We compute all our potential targets...
for (shift = 3; shift >= 0; shift--) {
s=src + i + shift;
//we read 16 characters at a time, but with a sliding window
//of 15 characters (since we are looking for a double character)
current_bytes = _mm256_loadu_si256((const __m256i *)s);
//we check if we have a the first character of our string somewhere
//in this 16 character window...
current_bytes = _mm256_cmpeq_epi32(firstchar, current_bytes);
q <<= 1;
q |= _mm256_movemask_epi8(current_bytes);
}
}
Note that epi8 performs the comparison on 8 bits (thus one character at a time), epi16 on 16-bit block, epi32 on 32-bit block and epi64 on 64-bit blocks...
If we could not find any targets during the whole loop, we stop...
//We stop
if (!q)
break;
If we found a target and our sub-string length is 4, then we have our position
if (lensearch == 4) {
bitscanforward(j,q);
return (i+j-32);
}
However, if the length was over 4, then we need to check the remaining characters...
q bits encode their positions. (sort of...)
//i has been incremented by 32 once too many
//We go back to its actual value
pos = i - 32;
//q bits are set to 1 for each target filter found, 4 bits at a time then...
while (q) {
bitscanforward(j,q);
if (j) {
pos+=j;
s+=j;
q >>= j;
}
//we check the next remaining characters (see below)
if (charcomp(s+3,search+3,lensearch-3))
return (pos);
//we move to the next target..
//4 characters at a time
q >>= 4;
s+=4;
pos+=4;
}
}
charcomp checks if two strings are equal over a specific length...
// A small character-by-character comparison of two strings
inline bool charcomp(unsigned char* src, unsigned char* search, long lensearch) {
while (--lensearch) {
if (src [lensearch] != search [lensearch])
return false;
}
return true;
}
We built a small program in Tamgu to compare the versions with and without the SIMD instructions.
string s = @"
//Then we search for the first two characters of our string...
while (i < lenmx) {
for (; i < lenmx && !q; i += 32) {
//we load our section, the length must be greater than 16
//We calculate all our potential targets...
for (shift = 3; shift >= 0; shift--) {
s=src + i + shift ;
//we read 16 characters at a time, but with a sliding window
current_bytes = _mm256_loadu_si256((const __m256i *)s) ;
//we check if we have the first character of our string somewhere
//in this 16 characters window...
current_bytes = _mm256_cmpeq_epi32(firstchar, current_bytes) ;
//Then we keep our targets
q <<= 1 ;
q |= _mm256_movemask_epi8(current_bytes) ;
}
}
"@;
svector sv = ["we", "16", "characters", "skjqdk", "qslkdqsdk","§§", "cu", "c", "cur", "conser"];
string e;
int cpt;
int i;
chrono t1(c_millisecond);
for (i in <0,100000,1>) {
for (e in sv) {
if (e in s)
cpt++;
}
}
chrono t2(c_millisecond);
Test machine:
iMac Retina:
4.2 GHz Intel Core i7 quad-core 32 GB 2400 MHz DDR4 Performance
- With SIMD instructions: 2050ms
- Without SIMD instructions: 2500ms
The difference is 450ms knowing that part of the time is spent in common code...
Note that this code is only effective if the string in which we do our search is large (at least 32 characters). But in this case, this method can be three to four times faster than a conventional method. A grep for example with this type of algo becomes a killer...
This code is part of Tamgu and is implemented in find_intel_byte