Remove zend_strtod mutex by arnaud-lb · Pull Request #13974 · php/php-src

arnaud-lb · 2024-04-15T17:39:21Z

zend_strtod.c uses global state (mostly an allocation freelist) protected by a mutex in ZTS builds. This state is used by zend_dtoa(), zend_strtod(), and their variants, which creates a lot of contention under concurrent load. zend_dtoa() is used to format floats as strings, e.g. in sprintf, json_encode, serialize, uniqid.

In this PR I move the global state to the thread-specific executor_globals and remove the mutex.
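
A minimal sketch of the idea, not the actual patch: a size-class freelist kept in per-thread storage, so popping and pushing blocks needs no lock. Here C11 `_Thread_local` stands in for the thread-specific executor_globals, and the names `bigint`, `strtod_state`, `balloc`, `bfree` and `KMAX` are illustrative assumptions rather than the identifiers used in zend_strtod.c.

```c
#include <stdlib.h>

#define KMAX 7 /* illustrative number of cached size classes */

typedef struct bigint {
    struct bigint *next; /* freelist link */
    int k;               /* size class: capacity is 1 << k words */
    int maxwds;          /* capacity in words */
    unsigned int x[];    /* digits */
} bigint;

typedef struct strtod_state {
    bigint *freelist[KMAX + 1]; /* one list per size class */
} strtod_state;

/* Assumption: thread-local storage plays the role of executor_globals. */
static _Thread_local strtod_state state;

/* Allocate a block of size class k, reusing a cached one when possible. */
static bigint *balloc(int k)
{
    bigint *b;

    if (k <= KMAX && (b = state.freelist[k]) != NULL) {
        /* Pop from this thread's list: no other thread can touch it. */
        state.freelist[k] = b->next;
    } else {
        int words = 1 << k;
        b = malloc(sizeof(bigint) + (size_t)words * sizeof(unsigned int));
        if (b == NULL) {
            return NULL;
        }
        b->k = k;
        b->maxwds = words;
    }
    b->next = NULL;
    return b;
}

/* Return a block to this thread's freelist, or free it if it is too large. */
static void bfree(bigint *b)
{
    if (b == NULL) {
        return;
    }
    if (b->k > KMAX) {
        free(b);
        return;
    }
    b->next = state.freelist[b->k];
    state.freelist[b->k] = b;
}
```

With a process-wide freelist, both functions would have to take a lock around the list manipulation, which is where the contention comes from.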

The impact on non-concurrent environments is nil or negligible, but there is a considerable speedup in concurrent environments, especially on Alpine/musl. Comparing master to this branch, the frankenphp-demo is about 10% faster under Apache/musl, 20% faster under FrankenPHP/glibc, and 40% faster under FrankenPHP/musl. The synthetic json_encode.php benchmark is more than 80% faster under Apache.

Benchmarks:

I'm using two benchmarks:

frankenphp-demo (requesting /api/monsters.jsonld). In this benchmark, the frankenphp-demo app is set up in dev mode
json_encode.php is a synthetic benchmark encoding an array of 100 floats

In 3 separate environments:

php-cgi without concurrency
Apache mpm_event mod_php ZTS (100 concurrent requests)
FrankenPHP in worker mode (100 concurrent requests)

Opcache is enabled in the php-cgi and apache benchmarks, otherwise compilation time dominates. It is disabled in FrankenPHP because it is redundant in worker mode.

Bookworm uses glibc, Alpine (3.19.1) uses musl (1.2.4).

Results:

php-cgi -T10,500 frankenphp-demo repeated 5 times:

master-bookworm:      mean:  1.7227;  stddev:  0.0035;  diff:  +0.00% (baseline)
branch-bookworm:      mean:  1.7195;  stddev:  0.0037;  diff:  -0.19%  

master-alpine:        mean:  1.8676;  stddev:  0.0031;  diff:  -0.00% (baseline)
branch-alpine:        mean:  1.8700;  stddev:  0.0029;  diff:  +0.13%  

master-zts-bookworm:  mean:  1.7909;  stddev:  0.0026;  diff:  -0.00% (baseline)
branch-zts-bookworm:  mean:  1.7943;  stddev:  0.0014;  diff:  +0.19%  

master-zts-alpine:    mean:  1.9928;  stddev:  0.0059;  diff:  -0.00% (baseline)
branch-zts-alpine:    mean:  1.9900;  stddev:  0.0031;  diff:  -0.14%

Also in Valgrind:

valgrind php-cgi -T1,10:

master-bookworm:      mean:  1200273448.0000;  stddev:  0.0000;  diff:  +0.00% (baseline)
branch-bookworm:      mean:  1200174784.0000;  stddev:  0.0000;  diff:  -0.01%

master-alpine:        mean:  1193572485.0000;  stddev:  0.0000;  diff:  +0.00% (baseline)
branch-alpine:        mean:  1193577574.0000;  stddev:  0.0000;  diff:  +0.00%

master-zts-bookworm:  mean:  1245688708.0000;  stddev:  0.0000;  diff:  +0.00% (baseline)
branch-zts-bookworm:  mean:  1245684830.0000;  stddev:  0.0000;  diff:  -0.00%

master-zts-alpine:    mean:  1275329963.0000;  stddev:  0.0000;  diff:  +0.00% (baseline)
branch-zts-alpine:    mean:  1275302862.0000;  stddev:  0.0000;  diff:  -0.00%

php-cgi -T10,5000 json_encode.php repeated 5 times:

master-bookworm:      mean:  0.2541;  stddev:  0.0003;  diff:  -0.00% (baseline)
branch-bookworm:      mean:  0.2532;  stddev:  0.0002;  diff:  -0.33%

master-alpine:        mean:  0.2694;  stddev:  0.0057;  diff:  +0.00% (baseline)
branch-alpine:        mean:  0.2702;  stddev:  0.0098;  diff:  +0.30%

master-zts-bookworm:  mean:  0.4092;  stddev:  0.0014;  diff:  -0.00% (baseline)
branch-zts-bookworm:  mean:  0.2665;  stddev:  0.0023;  diff:  -34.86%

master-zts-alpine:    mean:  0.5691;  stddev:  0.0015;  diff:  -0.00% (baseline)
branch-zts-alpine:    mean:  0.2862;  stddev:  0.0006;  diff:  -49.71%
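
For reading the tables: the `diff` column looks like the relative change of the branch mean against the master mean. That is an assumption about the benchmark script, which is not shown here, and `relative_diff_percent` is a hypothetical helper name. A small sketch:

```c
/* Relative change of `value` against `baseline`, in percent.
 * Negative means the branch is faster (lower time or instruction count). */
static double relative_diff_percent(double baseline, double value)
{
    return (value - baseline) / baseline * 100.0;
}
```

Feeding in the rounded zts-bookworm means above (0.4092 and 0.2665) gives about -34.87, which matches the reported -34.86% up to the rounding of the printed means.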

Valgrind:

master-bookworm:      mean:  67901890.0000;  stddev:  0.0000;  diff:  +0.00% (baseline)
branch-bookworm:      mean:  68082973.0000;  stddev:  0.0000;  diff:  +0.27%

master-alpine:        mean:  39943906.0000;  stddev:  0.0000;  diff:  +0.00% (baseline)
branch-alpine:        mean:  40117586.0000;  stddev:  0.0000;  diff:  +0.43%

master-zts-bookworm:  mean:  74304991.0000;  stddev:  0.0000;  diff:  +0.00% (baseline)
branch-zts-bookworm:  mean:  69262486.0000;  stddev:  0.0000;  diff:  -6.79%

master-zts-alpine:    mean:  45453755.0000;  stddev:  0.0000;  diff:  +0.00% (baseline)
branch-zts-alpine:    mean:  41906922.0000;  stddev:  0.0000;  diff:  -7.80%

Apache mpm_event mod_php ZTS frankenphp-demo:

master-zts-bookworm: 10.863000; +0.00% (baseline)
branch-zts-bookworm: 10.876000; +0.12%

master-zts-alpine: 12.218000; +0.00% (baseline)
branch-zts-alpine: 10.885000; -10.91%

Apache mpm_event mod_php ZTS json_encode.php:

master-zts-bookworm: 1.476000; +0.00% (baseline)
branch-zts-bookworm: 0.228000; -84.55%

master-zts-alpine: 1.499000; +0.00% (baseline)
branch-zts-alpine: 0.243000; -83.79%

FrankenPHP frankenphp-demo:

master-bookworm:  77   +0.00% (baseline)
branch-bookworm:  62   -18.99%

master-alpine:    120  +0.00% (baseline)
branch-alpine:    68   -43.57%

arnaud-lb · 2024-04-15T17:51:40Z

Unsurprisingly the change is very visible in perf:
(here for FrankenPHP on Alpine)

# Event 'cpu_atom/cycles/P'
#
# Baseline  Delta Abs  Shared Object        Symbol                                                                 
# ........  .........  ...................  .......................................................................
#
    30.24%    -30.23%  [kernel.kallsyms]    [k] native_queued_spin_lock_slowpath
     7.80%     +9.31%  libphp.so            [.] execute_ex
     2.39%     +4.45%  libphp.so            [.] zend_gc_collect_cycles
     4.62%     +4.21%  libz.so.1.3.1        [.] 0x0000000000003da8
     3.74%     -3.74%  ld-musl-x86_64.so.1  [.] pthread_mutex_timedlock
     2.90%     -2.89%  ld-musl-x86_64.so.1  [.] pthread_mutex_lock
     0.98%     +2.19%  libphp.so            [.] gc_scan
     1.89%     -1.88%  ld-musl-x86_64.so.1  [.] pthread_mutex_unlock
     1.06%     +1.43%  libphp.so            [.] zend_hash_find
     1.30%     +1.38%  frankenphp           [.] 0x0000000000009f06
     1.76%     +1.34%  ld-musl-x86_64.so.1  [.] memcpy
     1.22%     -1.21%  [kernel.kallsyms]    [k] futex_wake
     1.21%     -1.19%  [kernel.kallsyms]    [k] try_to_wake_up

iluuu1994

Nice! Only a shallow review, but couldn't find any mistakes. 👍

bwoebi · 2024-04-16T10:44:38Z

Zend/zend_strtod.c

+#ifdef MULTIPLE_THREADS
 static MUTEX_T dtoa_mutex;
 static MUTEX_T pow5mult_mutex;
 #endif /* ZTS */



Suggested change

#ifdef MULTIPLE_THREADS
static MUTEX_T dtoa_mutex;
static MUTEX_T pow5mult_mutex;
#endif /* ZTS */

You forgot to remove these?

And the now obsolete usages of the acquire/release macros. I suppose you just did the minimum to draft this for now.

arnaud-lb · 2024-04-16

I didn't plan to remove this and the acquire/release macros as they are part of the "API" of the file, which appears to be a reusable piece of code imported from elsewhere. These macros are documented at the beginning of the file:

php-src/Zend/zend_strtod.c

Lines 147 to 155 in 077891f

 * #define MULTIPLE_THREADS if the system offers preemptively scheduled
 *	multiple threads. In this case, you must provide (or suitably
 *	#define) two locks, acquired by ACQUIRE_DTOA_LOCK(n) and freed
 *	by FREE_DTOA_LOCK(n) for n = 0 or 1. (The second lock, accessed
 *	in pow5mult, ensures lazy evaluation of only one copy of high
 *	powers of 5; omitting this lock would introduce a small
 *	probability of wasting memory, but would otherwise be harmless.)
 *	You must also invoke freedtoa(s) to free the value s returned by
 *	dtoa. You may do so whether or not MULTIPLE_THREADS is #defined.

There are many knobs like this in this file, many of which we will never use, like KR_headers.

So I only removed the definition of MULTIPLE_THREADS, and left the default no-op definitions of ACQUIRE_DTOA_LOCK and FREE_DTOA_LOCK.
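
To illustrate the pattern (a sketch, not the code in zend_strtod.c): when MULTIPLE_THREADS is not defined, the lock macros expand to nothing, so call sites can stay in place at no cost. The pthread branch and the `bump` function are assumptions added for the example.

```c
/* MULTIPLE_THREADS is intentionally left undefined here. */
#ifdef MULTIPLE_THREADS
#include <pthread.h>
/* Assumption: two pthread mutexes back the two documented locks. */
static pthread_mutex_t dtoa_locks[2] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};
#define ACQUIRE_DTOA_LOCK(n) pthread_mutex_lock(&dtoa_locks[n])
#define FREE_DTOA_LOCK(n)    pthread_mutex_unlock(&dtoa_locks[n])
#else
/* Default no-op definitions: the call sites compile to empty statements. */
#define ACQUIRE_DTOA_LOCK(n) /* nothing */
#define FREE_DTOA_LOCK(n)    /* nothing */
#endif

static int counter;

/* Hypothetical call site that keeps its lock/unlock pair either way. */
static int bump(void)
{
    int v;

    ACQUIRE_DTOA_LOCK(0);
    v = ++counter;
    FREE_DTOA_LOCK(0);
    return v;
}
```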

I don't mind removing their use as well if you think it's better.

I feel that we should eventually replace this code with more modern implementations of strtod and dtoa. It should be possible to implement these without memory allocations. Also, I don't know if we still need to support VAX/IBM arithmetic. However, this feels risky and largely out of scope for this PR.

dkarlovi · 2024-04-16T14:01:47Z

This already shows the value of looking into supporting musl, amazing work!

devnexen · 2024-04-16T20:31:10Z

@arnaud-lb just curious, any change in the perf improvement since you moved to system allocation?

arnaud-lb · 2024-04-17T12:04:52Z

@devnexen no, results are the same

I switched back to system malloc because zend_dtoa may be used outside of the request lifecycle via e.g. zend_error("... %f").

dstogov · 2024-04-17T19:25:12Z

I've never looked into zend_strtod.c code before and I got "a culture shock" :)

As I understand it, they implemented their own malloc cache, then added mutexes to make it thread-safe...
I would suggest trying to remove these caches (remove the freelists and modify Balloc/Bfree to use [e]malloc/[e]free).

p5s is a linked list that caches precomputed numbers - 5**n where n < 32 - (5, 25, 125, 625, ..., 5**31).
It should be possible to pre-compute 32 numbers...

zend_dtoa() uses a thread-safe variable to keep the resulting string. But we use it in just two places and explicitly free the allocated memory. Switching to explicit [e]malloc and [e]free won't make any difference.

Anyway, I don't object to this PR. It doesn't make things worse.

arnaud-lb · 2024-04-18T15:57:10Z

@dstogov thank you for the review. I've tried here, but the micro benchmark is 20% slower after removing the freelist (2.5% under valgrind). There is no slowdown on other benchmarks, however. Let me know what you prefer.

dstogov · 2024-04-19T08:23:54Z

> @dstogov thank you for the review. I've tried here, but the micro benchmark is 20% slower after removing the freelist (2.5% under valgrind). There is no slowdown on other benchmarks, however. Let me know what you prefer.

Thanks! I'll take a look on Monday.

crrodriguez · 2024-04-19T14:37:58Z

@arnaud-lb I'm with @dstogov here, the freelist cache is not something you really want to have. It will hide bugs for very little benefit.

dstogov · 2024-04-22T10:14:33Z

> @arnaud-lb I'm with @dstogov here, the freelist cache is not something you really want to have. It will hide bugs for very little benefit.

I just asked to test whether the freelists pay off, and @arnaud-lb showed that the benefit is significant: 20%.

Of course, we shouldn't accept a 20% slowdown, even for synthetic tests.
So the idea of per-thread freelist caches makes sense.

dstogov

I have to admit that my first impression of the dtoa() implementation was wrong.
Its freelist cache implementation makes sense.
Making these caches thread-local also makes sense.

This happens because on ZTS we execute `executor_globals_ctor`, which resets the `freelist` and `p5s` pointers, while on NTS we don't. On NTS we can reuse the caches, but on ZTS we can't. The easiest fix is to call `zend_shutdown_strtod` when preloading is shut down. This regressed in GH-13974 and therefore only exists in PHP 8.4 and higher. Closes GH-16602.
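
A rough sketch of why the order matters, under the assumption that the caches are plain singly linked lists: a constructor that only resets the pointers leaks whatever is still cached, so the lists have to be freed first. The names `strtod_caches`, `caches_ctor`, `caches_shutdown` and `cache_push` are hypothetical and only mirror the roles of `executor_globals_ctor` and `zend_shutdown_strtod` described above.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct node {
    struct node *next;
} node;

typedef struct strtod_caches {
    node *freelist; /* cached allocations */
    node *p5s;      /* cached powers of 5 */
} strtod_caches;

/* Resets the pointers WITHOUT freeing: leaks if the lists are non-empty. */
static void caches_ctor(strtod_caches *c)
{
    c->freelist = NULL;
    c->p5s = NULL;
}

/* Frees every node of one list and returns how many were freed. */
static size_t free_list(node **head)
{
    size_t n = 0;

    while (*head != NULL) {
        node *next = (*head)->next;
        free(*head);
        *head = next;
        n++;
    }
    return n;
}

/* Must run before the state is re-constructed, otherwise the caches leak. */
static size_t caches_shutdown(strtod_caches *c)
{
    return free_list(&c->freelist) + free_list(&c->p5s);
}

/* Test helper: pushes one freshly allocated node onto a list. */
static int cache_push(node **head)
{
    node *n = malloc(sizeof *n);

    if (n == NULL) {
        return 0;
    }
    n->next = *head;
    *head = n;
    return 1;
}
```

Calling `caches_shutdown` before `caches_ctor` leaves nothing behind, which is the ordering the fix restores at preload shutdown.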

Remove zend_strtod mutex

077891f

github-actions bot added the Category: Engine label Apr 15, 2024

arnaud-lb changed the title ~~Remove zend_strtod mutex~~ [wip] Remove zend_strtod mutex Apr 15, 2024

iluuu1994 reviewed Apr 16, 2024

bwoebi reviewed Apr 16, 2024

Use system malloc

98a8324

arnaud-lb changed the title ~~[wip] Remove zend_strtod mutex~~ Remove zend_strtod mutex Apr 17, 2024

arnaud-lb marked this pull request as ready for review April 17, 2024 12:05

arnaud-lb requested a review from dstogov as a code owner April 17, 2024 12:05

dkarlovi mentioned this pull request Apr 22, 2024

Optional GNU C Library support crazywhalecc/static-php-cli#376

Closed

dstogov approved these changes Apr 22, 2024

arnaud-lb merged commit 9bbc195 into php:master Apr 23, 2024

ndossche mentioned this pull request Oct 25, 2024

Fix GH-16577: EG(strtod_state).freelist leaks with opcache.preload #16602

Closed
