Fix: Variadic function cast causing segfaults on Apple ARM64#2708
Conversation
2a88d90 to
43fbd08
Compare
Fix for Segfaults on Apple ARM64. Apple silicon does not support
re-casting functions with fixed params as variadic, and instead
requires the the function pointer to match the signature of the
definition. If they do not match arguments will not be correctly
marshaled, causing the function implementation to look for
parameters in the wrong places. To enable Apple ARM64 support, we
must explicitly cast to the correct function signature instead of
using varargs `...`.
Co-Authored-By: mjc <mjc@kernel.org>
43fbd08 to
af8a2f5
Compare
| bif = (Eterm (*)(Process*, ...)) *pc++; | ||
| t = (*bif)(build_proc, esp-1); | ||
| bif = (BIF_RETTYPE (*)(BIF_ALIST)) *pc++; | ||
| t = (*bif)(build_proc, esp-1, NULL); |
There was a problem hiding this comment.
I think I've got everything updated correctly. The one thing I'm not entirely sure of is if this should be copied into bif_args, if this is an optimization for this specific case, or something else? I'm new to c; so, if I'm understanding this correctly, esp-1 works here without copying into bif_args because esp-1 and bif_args[0] essentially point to the same value and both are pointers so semantically it's all good? Then by not copying into bif_args[0] we save an allocation/step?
And, please feel free to tell me to just google more 😄 !
Thank you both for your time reviewing this, and please let me know if there's anything else I can change, improve, or have missed 😅 here.
There was a problem hiding this comment.
The copying of the arguments to bif_args when there are 2 or 3 arguments is only necessary because the arguments would be in the wrong order if we would pass esp-2 or esp-3, respectively. When there is a single argument, there is no need to copy it because it can't be in the wrong order.
|
Many thanks for working on this. I have added this pull request to our daily builds. If everything goes well, we will merge it within a few days. |
|
Thanks for your pull request. |
Overview
Howdy!
This is a fix for segfaults happening on Apple ARM64. The segfaults are caused by differences in how Apple ARM64 treats variadic functions. x86 will only use the stack when necessary, after filling the available registers; while, 🍎 ARM64 will always places variadic arguments onto the stack regardless of many registers are available. The segfault manifests as what look like errors in bif functions (e.g.
is_integer_1) and do not leave an erl crash dump (only available logs were from the system's crash report).This was originally developed against to
master(otp-24) but has been back-ported here tomaint. The updated build configuration provided by #2700 is required so it's currently included in this branch.I'm running the tests now and will get back with the results when they finish (in the morning).
FWIW, so far, Cowboy's test suite[1] and
cowlibpassed all tests without error on master (otp-24). Using this branch (based offmaint) with Elixir'smaster-otp-23(from asdf I'll get the elixirshalater) Phoenix's test suite passed all tests without error.Issue
The problem itself (seems) to boil down to variadic casts happening when calling bifs in
erts/emulator/beam/erl_db_util.c - db_prog_match:bifpoints to functions fromerts/emulator/beam/erl_bif_op.cwhich have a fixed number of arguments e.g.Casting the fixed arguments to be variadic then causes the registers to be incorrectly marshaled and the VM segfaults.
The Fix
All of the functions in
erts/emulator/beam/erl_bif_op.care defined using a set of macros defined inerts/emulator/beam/bif.hwith the relevant parts being:Using these macros (on 🍎 ARM64), we were then able to declare function pointers in
db_prog_matchmatching the expected function signature:and then call it:
Closing Remarks
I know, after battling these segfaults for a week. this is a very hot path of code execution , and I'll gladly admit I don't have the expertise required right now to know if there is a better approach for this fix or if this poses its own set of problems. This has been way out of my comfort zone (but damn it felt good! I'll be back for more); so, I'd like to emphasize I'm very receptive to feedback. The change in the PR as of opening it, is the simplest and most obvious (?) solution I could muster... and so that's exactly what's here.
I have a gigantic amount of symbolicated crash logs, I'd normally attach to help others without the platform see the issues, but... I'm not sure how much I can and can't share since I'm bound to the 🍎 developer kit NDA. I'm going to ask in their forums later if I can share some of the crash logs here or via email (if Apple needs that) for posterity.
Finally I'm happy, and waiting 😄 , to rebase after #2700 can be backported to
maintas that's a dependency for this patch to work... Thank you for time![1] I'm not really sure how cowboy's test suite ran as it looks (based on the output) like it needs
goand I definitely do not have a working instance ofgoon the machine.Co-Authored-By: mjc mjc@kernel.org