Optimize mb_strtoupper/mb_strtolower for UTF-8 enc and ASCII input #9166

mvorisek · 2022-07-27T08:42:47Z

fixes https://bugs.php.net/bug.php?id=79545

🚀 improve performance by 250 - 6 000 % 🚀

UTF-8 is "special" encoding for 2 reasons - it is default and also mostly (and almost only) used [1].

Pure ASCII strtolower/strtoupper case conversion functions are highly optimized using SSE2.

mb_strtoupper/mb_strtolower case conversion functions are complex and a lot of calls are involved during the conversion. Thus they will always be slower. Much slower.

In English speaking countries, most texts consist of 7-bit ASCII strings only. This usecase can be optimized internally so mbstring case conversion functions can be used without a performance drop.

Here is a benchmark:

code:

<?php

$cnt = 1_000_000;

$strs = [
    'empty' => '',
    'short' => 'zluty kun',
    'short_with_uc' => 'zluty Kun',
    'long' => str_repeat('this is about 10000 chars long string', 270),
    'long_with_uc' => str_repeat('this is about 10000 chars long String', 270),
    'short_utf8' => 'žlutý kůň',
    'short_utf8_with_uc' => 'Žlutý kŮň',
];

foreach ($strs as $k => $str) {
    $a1 = microtime(true);
    for($i=0; $i < $cnt; ++$i){
        $res = strtolower($str);
    }
    $t1 = microtime(true) - $a1;
    // echo 'it took ' . round($t1 * 1000, 3) . ' ms for ++$i'."\n";

    $a2 = microtime(true);
    for($i=0; $i < $cnt; $i++){
        $res = mb_strtolower($str);
    }
    $t2 = microtime(true) - $a2;
    // echo 'it took ' . round($t2 * 1000, 3) . ' ms for $i++'."\n";

    echo 'strtolower is ' . round($t2/$t1, 2) . 'x faster than mb_strtolower for ' . $k . "\n\n";
}

-strtolower is 3.4x faster than mb_strtolower for empty
+strtolower is 1.16x faster than mb_strtolower for empty

-strtolower is 4.72x faster than mb_strtolower for short
+strtolower is 1.23x faster than mb_strtolower for short

-strtolower is 2.76x faster than mb_strtolower for short_with_uc
+strtolower is 1.13x faster than mb_strtolower for short_with_uc

-strtolower is 61.34x faster than mb_strtolower for long
+strtolower is 1.7x faster than mb_strtolower for long

-strtolower is 32.89x faster than mb_strtolower for long_with_uc
+strtolower is 1.37x faster than mb_strtolower for long_with_uc

-strtolower is 5.38x faster than mb_strtolower for short_utf8
+strtolower is 5.63x faster than mb_strtolower for short_utf8

-strtolower is 5.31x faster than mb_strtolower for short_utf8_with_uc
+strtolower is 5.52x faster than mb_strtolower for short_utf8_with_uc

[1] https://w3techs.com/technologies/cross/character_encoding/ranking (~98% market share)

cmb69 · 2022-07-27T11:46:55Z

ext/mbstring/mbstring.c

 	}

-	char *newstr = mbstring_convert_case(PHP_UNICODE_CASE_UPPER, str, str_len, &ret_len, enc);
+	// optimize performance for UTF-8 encoding and input string consisting of lower/7-bit ASCII characters only


Yeah, but that deoptimizes the performance for non ASCII strings. I think we need to measure this.

The performance drop for non-ASCII strings is around 5%. Which is highly negligible, in short, one ASCII only string conversion gains this performance drop back for the next 20 non-ASCII conversions. 😅

Please note, the zend_str_is_utf8_pure_ascii check is optimized with SSE2.

Okay, but how often would there be pure ASCII strings. In the worst case never.

often 😃, consider this function for example for lowercasing a name list to build a case-insensitive index (for short strings, there is "worst case gain" of 250%, for longer strings the gain is even more)

cmb69 · 2022-07-27T12:03:52Z

@alexdowad, thoughts on this one?

alexdowad · 2022-07-27T12:30:44Z

@alexdowad, thoughts on this one?

Thanks for asking. My first thought is: thanks very much to @mvorisek for working on performance! That is good.

I think what we really want is not to check if a UTF-8 string is pure ASCII and then call the optimized strtolower for ASCII... what would be better would be to write a special, vectorized version of mb_strtolower which is dedicated for UTF-8, and which can handle all UTF-8 strings, not just those which are ASCII-only.

I have written such code, it's on my hard drive, but I am working on integrating the new text conversion filters first before integrating it.

mvorisek · 2022-07-27T12:45:46Z

@alexdowad the worst case performance drop is 5% and the average gain is around 300 % for shorter strings. I would be very suprised if you can get near the ASCII only case conversion functions performance as basically every condition or function invocation decreases the performance.

IMHO the fastest conversion for real data would always require a check if the string is consisting of 7-bit pure ASCII or not. The only improvement I can see is to process the input string in a chunks, so the optimization is done not only on the whole string, but on reasonably sized parts. One real world example is long text with one emoji character.

alexdowad · 2022-07-27T12:49:41Z

If @mvorisek wants to create a specialized mb_strtolower function just for UTF-8, I may dig up the SSE2 code I had written before and post it somewhere.

It is similar to the optimized strtolower... but while strtolower can just take 16-byte chunks of text and convert them to lowercase, mb_strtolower must check what prefix of those 16-bytes is ASCII only, generate a bitmask which covers the ASCII-only prefix, then convert that prefix to lowercase. (All of these operations are vectorized.)

Then if the 'prefix' was actually the entire 16-bytes, we repeat. Otherwise, the remaining suffix of those 16 bytes must be handled separately.

I was still working on the part which handles the remaining suffix which is not ASCII-only. The code worked, but I was trying to find the fastest and best way to write it. The part which handled the ASCII-only prefix was fine.

alexdowad · 2022-07-27T12:51:17Z

@alexdowad the worst case performance drop is 5% and the average gain is around 300 % for shorter strings. I would be very suprised if you can get near the ASCII only case conversion functions performance as basically every condition or function invocation decreases the performance.

Well, of course it does. But running over the entire string to check if it's ASCII or not also decreases performance.

See my comment above about how to combine the operation of checking whether the string is ASCII while also doing the lowercase conversion at the same time, so we don't have to run over it twice.

alexdowad · 2022-07-27T13:03:00Z

@mvorisek Just one more point to give some context: right now mb_strtolower (and just about all the other mbstring functions) are piggishly slow because the entire library is implemented in a very inefficient way. I am converting the library over to work on buffers of codepoints rather than "one codepoint for each function call", and this turns out to make most mbstring functions around 2-5 times faster.

mb_strtolower has not been converted over yet, but when it is, the overhead of making an extra pass over the string will look larger (because the 'base' function will be faster).

I totally agree that we should specialize for UTF-8, but feel that the specialized version should work on all UTF-8 strings. If you want to scan a UTF-8 string to find where the first non-ASCII character is, do vectorized conversion of a run of ASCII characters, then convert a run of non-ASCII characters using separate code, then again scan and convert a region of ASCII characters, until you reach the end of the string... that would be great. Then we would benefit not only on pure-ASCII strings, but on any UTF-8 string which contains any significant number of ASCII characters.

mvorisek · 2022-07-27T14:08:09Z

Yes. I am not sure if I have enought internals knowledge of optimal string reallocation/concatenation because of https://3v4l.org/5E9Um.

Also, I belive the mbstring_convert_case (or some lower level mbstring function) has some per call overhead, thus it should be probably called with larger than 16 bytes/SSE2 chunks to balance the performance between pure ASCII and pure non-ASCII strings.

Given the fact this PR has only 5% performance drop when the optimization cannot be used (string is non purely ASCII) and the fact the performance gain for pure ASCII strings is huge, @alexdowad are you against merging it? It would improve performance of most real world inputs and the complete solution can land later and after PHP 8.2.

Actually, as the gain is so huge, I ask if this simple fully BC optimization can even be landed into PHP 8.0 or at least PHP 8.1.

alexdowad · 2022-07-27T14:11:35Z

Given the fact this PR has only 5% performance drop when the optimization cannot be used (string is non purely ASCII) and the fact the performance gain for pure ASCII strings is huge, @alexdowad are you against merging it? It would improve performance of most real world inputs and the complete solution can land later and after PHP 8.2.

If the performance improvements which I discussed above will not land in PHP 8.2, then merging this would be nice. But it would be better to go directly to the better solution if we can.

How much time do we have left?

cmb69 · 2022-07-27T14:32:55Z

How much time do we have left?

There is roughly a month till RC1, and even afterwards some minor improvements could be done.

It would improve performance of most real world inputs […]

I still seriously doubt that. In the worst case you've claimed a 5% performance decrease.

Girgias · 2023-02-05T17:53:17Z

@alexdowad is this PR still relevant or not?

mvorisek · 2023-04-29T14:23:12Z

@alexdowad I agree better impl. is possible, would you mind to look into it and/or should we merge this simple, but still powerful, optimization?

alexdowad · 2023-04-29T14:58:49Z

@mvorisek Thanks for the follow-up.

Instead of optimizing just ASCII-only strings and nothing else, would it be possible to improve this PR so that (at least) we also optimize strings with an ASCII-only prefix?

You could use the BLOCKCONV stuff from zend_operators.c to upcase/downcase whatever prefix of the string is ASCII-only, then use the existing case conversion code in php_unicode.c to finish the remaining part of the string (if there is a non-ASCII suffix). Because php_unicode_convert_case does not allow for the possibility that a prefix has already been case-converted, you might need to either adjust the parameters for php_unicode_convert_case, or else pull out the main part of the function into a static function which can be called from two different higher-level functions.

I think the same optimization can also applied to all the EUC variants and Shift-JIS variants; they also use 0x00-7F for ASCII and >= 0x80 for other characters. I don't remember right now if we support any other text encodings which have that same property.

mvorisek · 2023-05-18T13:54:15Z

optimize strings with an ASCII-only prefix

zend_str_toupper_impl impl. has not check if data needs to/will be transformed or not and adding a check will more or less imply the current/this PR impl.

In my benchmarks #9166 (comment) the check if the whole string is ASCII is very cheap, the overall gain in the real world should be positive.

Using some ideal chunk size and processing long strings with minimal non-ASCII count in chunks will be better for non-pure-ASCII strings, but the total length of such inputs is not known upfront, thus I am not sure if the performance of the current impl./this PR can ever be beaten for pure-ASCII inputs. As I do not know all the internals I will not pursue a better impl. by myself. I would be happy if this PR can be benchmark on real world data and merged, but feel free to also close it and/or possible reuse it for better chunked impl.

derickr · 2023-09-28T12:07:39Z

It doesn't seem like there is consensus to merge this PR, has been stale since May, and now also has conflicts. I am therefore closing this PR>

github-actions bot added the Extension: mbstring label Jul 27, 2022

mvorisek force-pushed the optimize_mb_strtolower branch 2 times, most recently from 220e136 to fc9ec56 Compare July 27, 2022 10:53

github-actions bot added the Category: Engine label Jul 27, 2022

mvorisek force-pushed the optimize_mb_strtolower branch from fc9ec56 to f70ffe4 Compare July 27, 2022 10:56

Optimize mb_strtoupper/mb_strtolower for UTF-8 enc and ASCII input

0e418f8

mvorisek force-pushed the optimize_mb_strtolower branch from f70ffe4 to 0e418f8 Compare July 27, 2022 11:05

mvorisek marked this pull request as ready for review July 27, 2022 11:23

cmb69 reviewed Jul 27, 2022

View reviewed changes

derickr closed this Sep 28, 2023

mvorisek deleted the optimize_mb_strtolower branch October 9, 2023 12:50

mvorisek restored the optimize_mb_strtolower branch October 14, 2023 22:47

mvorisek deleted the optimize_mb_strtolower branch October 14, 2023 23:02

Uh oh!

Optimize mb_strtoupper/mb_strtolower for UTF-8 enc and ASCII input #9166

Optimize mb_strtoupper/mb_strtolower for UTF-8 enc and ASCII input #9166

Uh oh!

Conversation

mvorisek commented Jul 27, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🚀 improve performance by 250 - 6 000 % 🚀

Uh oh!

cmb69 Jul 27, 2022

Choose a reason for hiding this comment

Uh oh!

mvorisek Jul 27, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cmb69 Jul 27, 2022

Choose a reason for hiding this comment

Uh oh!

mvorisek Jul 27, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cmb69 commented Jul 27, 2022

Uh oh!

alexdowad commented Jul 27, 2022

Uh oh!

mvorisek commented Jul 27, 2022

Uh oh!

alexdowad commented Jul 27, 2022

Uh oh!

alexdowad commented Jul 27, 2022

Uh oh!

alexdowad commented Jul 27, 2022

Uh oh!

mvorisek commented Jul 27, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexdowad commented Jul 27, 2022

Uh oh!

cmb69 commented Jul 27, 2022

Uh oh!

Girgias commented Feb 5, 2023

Uh oh!

mvorisek commented Apr 29, 2023

Uh oh!

alexdowad commented Apr 29, 2023

Uh oh!

mvorisek commented May 18, 2023

Uh oh!

derickr commented Sep 28, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

mvorisek commented Jul 27, 2022 •

edited

Loading

mvorisek Jul 27, 2022 •

edited

Loading

mvorisek Jul 27, 2022 •

edited

Loading

mvorisek commented Jul 27, 2022 •

edited

Loading