Conversation
|
Unsurprisingly the change is very visible in |
iluuu1994
left a comment
There was a problem hiding this comment.
Nice! Only a shallow review, but couldn't find any mistakes. 👍
| #ifdef MULTIPLE_THREADS | ||
| static MUTEX_T dtoa_mutex; | ||
| static MUTEX_T pow5mult_mutex; | ||
| #endif /* ZTS */ | ||
|
|
There was a problem hiding this comment.
| #ifdef MULTIPLE_THREADS | |
| static MUTEX_T dtoa_mutex; | |
| static MUTEX_T pow5mult_mutex; | |
| #endif /* ZTS */ |
You forgot to remove these?
There was a problem hiding this comment.
And the now obsolete usages of the acquire/release macros. I suppose you just did the minimum to draft this for now.
There was a problem hiding this comment.
I didn't plan to remove this and the acquire/release macros as they are part of the "API" of the file, which appears to be a reusable piece of code imported from elsewhere. These macros are documented at the beginning of the file:
Lines 147 to 155 in 077891f
There are many knobs like this in this file, many of which we will never use, like KR_headers.
So I only removed the definition of MULTIPLE_THREADS, and left the default no-op definitions of ACQUIRE_DTOA_LOCK and FREE_DTOA_LOCK.
I don't mind removing their use as well if you think it's better.
I feel that we should eventually replace this code by more modern implementations of strtod and dtoa. It should be possible to implement these without memory allocations. Also I don't know if we still need to support VAX/IBM arithmetic. This feels risky and largely out of scope of this PR however.
|
Already value of looking into supporting musl, amazing work! |
|
@arnaud-lb just curious, any change in the perf improvement since you moved to system allocation ? |
|
@devnexen no, results are the same I switched back to system malloc because |
|
I've never looked into As I understood they implemented their own malloc cache, then added mutexes to make it thread safe...
Anyway, I don't object against this PR. It doesn't make things worse. |
|
@dstogov thank you for the review. I've tried here, but the micro benchmark is 20% slower after removing the freelist (2.5% under valgrind). There is no slowdown on other benchmarks, however. Let me know what you prefer. |
Thanks! I'll take a look on Monday. |
|
@arnaud-lb Im with @dstogov here, the freelists cache is not something you really want to have. it will hide bugs for very little benefit. |
I just asked to test the profitability of Of course, we shouldn't make 20% slowdown even for synthetic tests. |
dstogov
left a comment
There was a problem hiding this comment.
I have to admit that my first impression from dtoa() implementation was wrong.
Its freelists cache implementation makes sense.
Making this caches thread-local also makes sense.
This happens because on ZTS we execute `executor_globals_ctor` which reset the `freelist` and `p5s` pointers, while on NTS we don't. On NTS we can reuse the caches but on ZTS we can't, the easiest fix is to call `zend_shutdown_strtod` when preloading is shut down. This regressed in phpGH-13974 and therefore only exists in PHP 8.4 and higher.
This happens because on ZTS we execute `executor_globals_ctor` which reset the `freelist` and `p5s` pointers, while on NTS we don't. On NTS we can reuse the caches but on ZTS we can't, the easiest fix is to call `zend_shutdown_strtod` when preloading is shut down. This regressed in GH-13974 and therefore only exists in PHP 8.4 and higher. Closes GH-16602.
zend_strtod.cuses a global state (mostly an allocation freelist) protected by a mutex in ZTS builds. This state is used byzend_dtoa(),zend_strtod(), and variants. This creates a lot of contention in concurrent loads.zend_dtoa()is used to format floats to string, e.g. in sprintf, json_encode, serialize, uniqid.In this PR I move the global state to the thread specific
executor_globalsand remove the mutex.The impact on non-concurrent environments is null or negligible, but there is a considerable speed up on concurrent environments, especially on Alpine/Musl. When comparing master to this branch, the frankenphp-demo is sped up 10% under Apache/musl, 20% under FrankenPHP/glibc, and 40% under FrankenPHP/musl. Some synthetic benchmark is 80% faster.
Benchmarks:
I'm using two benchmarks:
/api/monsters.jsonld). In this benchmark, the frankenphp-demo app is setup in dev modeIn 3 separate environments:
Opcache is enabled in the php-cgi and apache benchmarks, otherwise compilation time dominates. It is disabled in FrankenPHP because it is redundant in worker mode.
Bookworm uses glibc, Alpine (3.19.1) uses musl (1.2.4).
Results:
php-cgi -T10,500 frankenphp-demo repeated 5 times:
Also in Valgrind:
php-cgi -T10,5000 json_encode.php repeated 5 times:
Valgrind:
Apache mpm_event mod_php ZTS frankenphp-demo:
Apache mpm_event mod_php ZTS json_encode.php:
FrankenPHP frankenphp-demo: