Skip to content

Conversation

arnaud-lb
Copy link
Member

@arnaud-lb arnaud-lb commented May 31, 2025

Related:

This part takes tail-calling and preserve_none from #17849:

  • Opcode handlers dispatch directly to the next handler by tail-calling, which reduces function call overhead and avoids returning to the executor loop
  • preserve_none reduces register saving overhead in opcode handlers

This also implements JIT support.

Non-dispatching opcode handlers

JIT needs non-dispatching opcode handlers (opcode handlers that return instead of calling the next one). I've tried two approaches for this:

  • Generate a second, non-dispatching, variant of each handler. Use the variant as call_handler
  • Use indirect dispatch similarly to the hybrid VM: zend_op->handler is a function that calls the real handler and dispatches.

I've tried both approach (the first one in this branch, and the second one in master...arnaud-lb:php-src:hybrid-tailcall.

The second approach resulted in a slightly slower VM due to indirect dispatching, and JIT generated more spilling when calling handlers as they clobber all registers.

Therefore I've taken the first approach in this PR.

A 3rd approach would be to control dispatching via an additional handler parameter, or to pass a dispatch function to handlers, but I suspect this would have been slower.

Fixed regs and preserved regs

Thanks to the preserve_none convention, JIT'ed code only has to preserve rbp, which reduces the size of prologue/epilogue. Instead of preserving it, I add it to the set of fixed registers, so it's not used. This results in faster code.

Also, quite conveniently, preserve_none receives its first arguments via registers that are callee-saved in sysv. Therefore we can use the arg1 and arg2 regs as our fixed registers. This avoids moving arg1 and arg2 to SP/IP in prologue, or setting arg1/arg2 when tail-calling other handlers.

Benchmarks:

Benchmark Mode vs base vs gcc vs valgrind
bench.php JIT -4% -0% -5%
bench.php Non-JIT -44% +3%
symfony demo JIT -0% -0%
symfony demo Non-JIT -2.8% -0%

base: Clang build of master, wall time
gcc: GCC build of master, wall time
valgrind: Clang build of master, valgrind instructions

Conclusion: Clang builds are now as fast GCC builds on the Symfony Demo benchmark in both JIT and non-JIT modes.

Issues

  • The preserve_none calling convention is documented as unstable. JIT would break if it changed. I suggest checking this at build time, and disabling this optimization (tailcalling + preserve_none) if preserve_none changed. Edit: added a configure check.
  • Clang currently fails to build PHP when using preserve_none and ASAN, therefore we disable this if ASAN is enabled. Edit: This seems to be fixed in recent Clang versions.

UPGRADING

  . PHP binaries built with Clang>=19 are now as fast as binaries built with
    GCC. The performance of binaries built with other compilers has also
    improved considerably.

TODO

  • x86
  • aarch64
  • IR PRs
  • preserve_none configure check

Comment on lines -436 to +508
ZEND_OPCODE_HANDLER_RET ZEND_FASTCALL zend_jit_func_trace_helper(ZEND_OPCODE_HANDLER_ARGS)
ZEND_OPCODE_HANDLER_RET ZEND_OPCODE_HANDLER_CCONV zend_jit_func_trace_helper(ZEND_OPCODE_HANDLER_ARGS)
{
ZEND_OPCODE_TAIL_CALL_EX(zend_jit_trace_counter_helper,
((ZEND_JIT_COUNTER_INIT + JIT_G(hot_func) - 1) / JIT_G(hot_func)));
zend_jit_op_array_trace_extension *jit_extension =
(zend_jit_op_array_trace_extension*)ZEND_FUNC_INFO(&EX(func)->op_array);
size_t offset = jit_extension->offset;
uint32_t cost = ((ZEND_JIT_COUNTER_INIT + JIT_G(hot_func) - 1) / JIT_G(hot_func));

*(ZEND_OP_TRACE_INFO(opline, offset)->counter) -= cost;

ZEND_OPCODE_TAIL_CALL(zend_jit_trace_counter_helper);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tail-calling was not possible due to the extra arg in zend_jit_trace_counter_helper, so I removed it.

Alternative would be to define these handlers in IR, as we do for the hybrid JIT.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't understand why extra_arg became a problem

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clang ˋmusttailˋ disallows calling functions with a different signature. Presumably this is to garantee portability. This is discussed here: https://blog.reverberate.org/2025/02/10/tail-call-updates.html

@arnaud-lb
Copy link
Member Author

arnaud-lb commented Jun 2, 2025

@dstogov this is still a WIP, but I would like to have your opinion on this. This is the same ideas as #17849, but with JIT support which implies a few things described above. I would like your opinion before handling the items in the todo list.

Copy link
Member

@dstogov dstogov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea makes sense and the benchmark results look promising.
It's also interesting how it affects the VM code size.

Unfortunately at this moment I can't review a huge path like this in all details.
Anyway, I think it makes sense to finalize and land it.

* https://github.com/llvm/llvm-project/blob/a414877a7a5f000d01370acb1162eb1dea87f48c/llvm/lib/Target/X86/X86RegisterInfo.cpp#L319
* https://github.com/llvm/llvm-project/blob/68bfe91b5a34f80dbcc4f0a7fa5d7aa1cdf959c2/llvm/lib/Target/X86/X86CallingConv.td#L1183
*/
jit->ctx.fixed_regset |= (1<<5);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed_regs are the registers reserved for the register variables.
I remember there were some problems with RBP usage, but I don't remember the exact problem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case I add the register only to prevent the RA from using it. This is less expensive than adding it to the set of preserved registers. The same thing is done for the HYBRID VM a few lines below.

Comment on lines -436 to +508
ZEND_OPCODE_HANDLER_RET ZEND_FASTCALL zend_jit_func_trace_helper(ZEND_OPCODE_HANDLER_ARGS)
ZEND_OPCODE_HANDLER_RET ZEND_OPCODE_HANDLER_CCONV zend_jit_func_trace_helper(ZEND_OPCODE_HANDLER_ARGS)
{
ZEND_OPCODE_TAIL_CALL_EX(zend_jit_trace_counter_helper,
((ZEND_JIT_COUNTER_INIT + JIT_G(hot_func) - 1) / JIT_G(hot_func)));
zend_jit_op_array_trace_extension *jit_extension =
(zend_jit_op_array_trace_extension*)ZEND_FUNC_INFO(&EX(func)->op_array);
size_t offset = jit_extension->offset;
uint32_t cost = ((ZEND_JIT_COUNTER_INIT + JIT_G(hot_func) - 1) / JIT_G(hot_func));

*(ZEND_OP_TRACE_INFO(opline, offset)->counter) -= cost;

ZEND_OPCODE_TAIL_CALL(zend_jit_trace_counter_helper);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't understand why extra_arg became a problem

@arnaud-lb arnaud-lb force-pushed the tailcall branch 3 times, most recently from c8cb9d5 to e334ae4 Compare June 27, 2025 16:42
@arnaud-lb
Copy link
Member Author

@dstogov I've addressed all issues, and all necessary IR changes have been merged, so this branch is now ready to be reviewed again, and hopefully merged :)

Since your last review I've made the following notable change: As adding preserve_none to IR requires longer term changes, I took the decision to not rely on IR for calling preserve_none functions. Instead I call these functions with the default convention in JIT. Arguments are passed via fixed regs, and the return register is the same as the default convention. This is safe because we always tailcall these functions, and only from other preserve_none functions, therefore we don't need to care about preserving regs around these calls.

@arnaud-lb arnaud-lb marked this pull request as ready for review July 26, 2025 13:11
@arnaud-lb arnaud-lb requested a review from dstogov July 26, 2025 13:11
Copy link
Member

@dstogov dstogov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my questions...

Comment on lines 1171 to 1174
// Helper with parameter. Must use trampoline dispatch due to
// incompatible signature for tailcall.
out($f, "#undef ZEND_VM_DISPATCH\n");
out($f, "#define ZEND_VM_DISPATCH(handler) ZEND_VM_DISPATCH_NOTAIL(handler)\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repeatable redefinition of the same macro looks very weird.
Can't we redefine it once - between generating code for TAIL_CALL and regular CALL handlers?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this before VM helpers that take extra arguments, because __attribute__((musttail)) doesn't allow them to tailcall to helpers or handlers with a different number of args.

Currently, VM helpers with extra arguments are interleaved with normal VM helpers and opcode handlers.

Avoiding the repeated redefinition would require to group VM helpers with extra args together, however this may change code locality.

WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that all TAIL_CALL handlers and helpers should be generated after/before all CALL handlers and helpers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made changes so that we generate handlers in this order:

  • CALL handlers
  • TAILCALL handlers and helpers without extra args
  • TAILCALL helpers with extra args

The undef/define is used only once before the last group.

I feared this would affect the location of these function in memory, but it doesn't, and doesn't affect benchmarks.

Copy link
Member

@dstogov dstogov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically everything is fine.
See ideas about cleaning and reduction of amount of changes.

Copy link
Member

@dstogov dstogov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have objections anymore

@arnaud-lb arnaud-lb requested a review from a team August 19, 2025 11:39
@arnaud-lb arnaud-lb mentioned this pull request Aug 19, 2025
7 tasks
Copy link
Member

@edorian edorian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No RM objections

@arnaud-lb arnaud-lb closed this in 73b98a3 Aug 22, 2025
@henderkes
Copy link
Contributor

@arnaud-lb you've worked magic on this one. Since LTO works flawlessly without global register variables, Clang (21) actually consistently beats GCC (14) now, for phoronix-test-suite phpbench and my Symfony project in worker mode by 5-10%. That's a crazy improvement, considering that Clang was ~25-30% and 20% slower before, respectively.

Great job!

@arnaud-lb
Copy link
Member Author

arnaud-lb commented Oct 1, 2025

@henderkes Awesome! I hadn't tested with LTO yet, although CPython reported great results with LTO as well. Could you describe how you compile PHP with LTO?

@henderkes
Copy link
Contributor

Sure, I'm maintaining https://github.com/crazywhalecc/static-php-cli and https://github.com/static-php/spc-packages and switched over to zig cc (which bundles a platform agnostic cross-compiling Clang 21) a while ago, until I noticed that it produced much slower binaries than gcc.

The compilation logic can be found in https://github.com/crazywhalecc/static-php-cli, but in a nutshell, we compile static .a libraries and then later use them to compile php and extensions. There isn't really anything special going on for -flto, as it works just fine when global register variables are not used.

The exact CFLAGS/LDFLAGS used with zig are:

-g0 -fpic -O3 -fexceptions -pipe -fstack-protector-strong -fasynchronous-unwind-tables -fstack-clash-protection -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -fno-strict-aliasing -fcf-protection -mtls-dialect=gnu2 -m64 -march=x86-64-v3
-Wl,-z,relro -Wl,--as-needed -Wl,-z,now -Wl,-z,noexecstack -Wl,--build-id=sha1 

With gcc we add {% set specs_cflags = '-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1' %} {% set specs_ldflags = '-specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1' ~ (os == '10' ? ' -Wl,-z,pack-relative-relocs' : '') %}

Currently, results are as follows:

gcc 8.4.13 NTS (system package, uses -O2, otherwise same flags)

PHPBench 0.8.1:
    pts/phpbench-1.1.6
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 2 Minutes [15:42 UTC]
        Started Run 1 @ 15:41:07
        Started Run 2 @ 15:41:33
        Started Run 3 @ 15:41:59

    PHP Benchmark Suite:
        951064
        949161
        950943

    Average: 950389 Score
    Deviation: 0.11%

gcc 8.5.dev ZTS:

PHPBench 0.8.1:
    pts/phpbench-1.1.6
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 2 Minutes [15:49 UTC]
        Started Run 1 @ 15:48:37
        Started Run 2 @ 15:49:03
        Started Run 3 @ 15:49:27

    [8192] Using null as an array offset is deprecated, use an empty string instead in pts_strings:586

    PHP Benchmark Suite:
        987486
        979930
        980166

    Average: 982527 Score
    Deviation: 0.44%

zig 8.5.dev ZTS

PHPBench 0.8.1:
    pts/phpbench-1.1.6
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 2 Minutes [15:27 UTC]
        Started Run 1 @ 15:26:08
        Started Run 2 @ 15:26:30
        Started Run 3 @ 15:26:53

    [8192] Using null as an array offset is deprecated, use an empty string instead in pts_strings:586

    PHP Benchmark Suite:
        1061308
        1062223
        1059743

    Average: 1061091 Score
    Deviation: 0.12%

It's 1010-1020k without LTO. I really wanted to benchmark with gcc 15 as well, but the homebrew/linuxbrew version is broken and redhat hasn't published a gcc-toolset yet. Some dependent libraries also don't compiled under -std=gnu23 yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants