Optimize mb_strtoupper/mb_strtolower for UTF-8 enc and ASCII input #9166
Conversation
Force-pushed: 220e136 to fc9ec56, fc9ec56 to f70ffe4, f70ffe4 to 0e418f8.
```c
char *newstr = mbstring_convert_case(PHP_UNICODE_CASE_UPPER, str, str_len, &ret_len, enc);
// optimize performance for UTF-8 encoding and input string consisting of lower/7-bit ASCII characters only
```
Yeah, but that deoptimizes the performance for non-ASCII strings. I think we need to measure this.
The performance drop for non-ASCII strings is around 5%, which is negligible; in short, one ASCII-only string conversion wins back the cost of the next 20 non-ASCII conversions. 😅
Please note that the zend_str_is_utf8_pure_ascii check is optimized with SSE2.
Okay, but how often would there be pure ASCII strings? In the worst case, never.
Often 😃. Consider this function used for lowercasing a name list to build a case-insensitive index, for example. For short strings there is a "worst case gain" of 250%; for longer strings the gain is even higher.
@alexdowad, thoughts on this one?
Thanks for asking. My first thought is: thanks very much to @mvorisek for working on performance! That is good. I think what we really want is not to check if a UTF-8 string is pure ASCII and then call an optimized ASCII-only routine, but rather to combine the ASCII check with the conversion itself in a single pass. I have written such code; it's on my hard drive, but I am working on integrating the new text conversion filters first before integrating it.
@alexdowad the worst-case performance drop is 5% and the average gain is around 300% for shorter strings. I would be very surprised if you can get near the performance of the ASCII-only case conversion functions, as basically every extra condition or function invocation decreases performance. IMHO the fastest conversion for real data will always require a check whether the string consists of 7-bit pure ASCII or not. The only improvement I can see is to process the input string in chunks, so the optimization applies not only to the whole string but to reasonably sized parts. One real-world example is a long text with one emoji character.
If @mvorisek wants to create a specialized conversion function for UTF-8, that would be welcome. It is similar to the optimized code I mentioned: process 16 bytes at a time, converting them while also determining the length of their ASCII-only prefix. Then, if the 'prefix' was actually the entire 16 bytes, we repeat. Otherwise, the remaining suffix of those 16 bytes must be handled separately. I was still working on the part which handles the remaining non-ASCII suffix; the code worked, but I was trying to find the fastest and best way to write it. The part which handled the ASCII-only prefix was fine.
Well, of course it does. But running over the entire string to check whether it's ASCII also decreases performance. See my comment above about combining the ASCII check with the lowercase conversion in the same pass, so we don't have to run over the string twice.
@mvorisek Just one more point to give some context: right now ...
I totally agree that we should specialize for UTF-8, but feel that the specialized version should work on all UTF-8 strings. If you want to scan a UTF-8 string to find where the first non-ASCII character is, do vectorized conversion of a run of ASCII characters, then convert a run of non-ASCII characters using separate code, then again scan and convert a region of ASCII characters, until you reach the end of the string... that would be great. Then we would benefit not only on pure-ASCII strings, but on any UTF-8 string which contains any significant number of ASCII characters.
Yes. I am not sure I have enough internals knowledge of optimal string reallocation/concatenation because of https://3v4l.org/5E9Um. Also, I believe the ... Given that this PR has only a 5% performance drop when the optimization cannot be used (the string is not purely ASCII) and that the performance gain for pure-ASCII strings is huge, @alexdowad, are you against merging it? It would improve the performance of most real-world inputs, and the complete solution can land later, after PHP 8.2. Actually, as the gain is so huge, I wonder whether this simple, fully BC optimization could even land in PHP 8.0, or at least PHP 8.1.
If the performance improvements I discussed above do not land in PHP 8.2, then merging this would be nice. But it would be better to go directly to the better solution if we can. How much time do we have left?
There is roughly a month till RC1, and even afterwards some minor improvements could be done.
I still seriously doubt that. In the worst case you've claimed a 5% performance decrease.
@alexdowad is this PR still relevant or not?
@alexdowad I agree a better implementation is possible. Would you mind looking into it, and/or should we merge this simple but still powerful optimization?
@mvorisek Thanks for the follow-up. Instead of optimizing just ASCII-only strings and nothing else, would it be possible to improve this PR so that (at least) we also optimize strings with an ASCII-only prefix? You could use the BLOCKCONV stuff from ... I think the same optimization can also be applied to all the EUC variants and Shift-JIS variants; they also use 0x00-7F for ASCII and >= 0x80 for other characters. I don't remember right now if we support any other text encodings which have that same property.
In my benchmarks #9166 (comment), the check whether the whole string is ASCII is very cheap, so the overall gain in the real world should be positive. Using some ideal chunk size and processing long strings with minimal non-ASCII content in chunks would be better for non-pure-ASCII strings, but the total length of such inputs is not known upfront, so I am not sure the current implementation in this PR can ever be beaten for pure-ASCII inputs. As I do not know all the internals, I will not pursue a better implementation myself. I would be happy if this PR could be benchmarked on real-world data and merged, but feel free to close it and/or reuse it for a better chunked implementation.
It doesn't seem like there is consensus to merge this PR; it has been stale since May and now also has conflicts. I am therefore closing it.
fixes https://bugs.php.net/bug.php?id=79545
🚀 Improves performance by 250% to 6000% 🚀
UTF-8 is a "special" encoding for two reasons: it is the default, and it is also the most widely (and almost exclusively) used encoding [1].
The pure-ASCII strtolower/strtoupper case conversion functions are highly optimized using SSE2. The mb_strtoupper/mb_strtolower case conversion functions are complex and involve many calls during the conversion, so they will always be slower. Much slower.

In English-speaking countries, most texts consist of 7-bit ASCII strings only. This use case can be optimized internally so the mbstring case conversion functions can be used without a performance drop.

Here is a benchmark:
code:
[1] https://w3techs.com/technologies/cross/character_encoding/ranking (~98% market share)