Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LDC 1.3, OS X, -flto=full: Fatal error in EH code: _Unwind_RaiseException #2208

Closed
jondegenhardt opened this issue Jul 15, 2017 · 18 comments
Closed

Comments

@jondegenhardt
Copy link
Contributor

jondegenhardt commented Jul 15, 2017

Seen this error before. Very similar to #1512. Occurs in-conjunction with exception handling being used to process erroneous command line arguments. This time with the tsv-filter program in tsv-utils. So a fair bit of exception handling. Example:

$ tsv-filter --regex z:abc foobar.tsv
Fatal error in EH code: _Unwind_RaiseException failed with reason code: 5
Abort trap: 6

This will result from an exception raised by trying to convert z to a size_t using std.conv.to!size_t. It only seems to happen with a few exception handling cases.

This is with LDC 1.3 on OSX (xcode 8.3.3). Only occurs with optimized builds using -flto=full. Does not get triggered with -flto=thin or having no flto option. Interestingly, it did not occur when Phobos is built with LTO support (as described in #2168). Using -flto=full did not cause a problem in this case. (Note that #2168 involved std.conv.to!double. Possible relationship?) Also, does not occur with LDC 1.2, this is new in 1.3.

I have switched my programs to use -flto=thin. This appears to work fine, so it is not a pressing issue for me. Naturally, my quick attempts to produce a simplified case did not succeed.

For these reasons (and the time spent on #2168), I'm not planning on further investigation unless requested. It'd of course be nice if adding Phobos LTO support addressed this as well as #2168.

@kinke
Copy link
Member

kinke commented Jul 15, 2017

If I recall correctly, we've had a bunch of failing _Unwind_RaiseException calls on ARM. This might be related to #2024 (comment).

@joakim-noah
Copy link
Contributor

There are two reasons this happened on ARM:

In your case, you aren't compiling druntime differently between when you hit this error and you don't? In that case, you might be hitting some version of the unsolved second reason. The only way to know is to dig down into the assembly, as Dan does in that linked forum post.

@jondegenhardt
Copy link
Contributor Author

In your case, you aren't compiling druntime differently between when you hit this error and you don't? In that case, you might be hitting some version of the unsolved second reason.

One test I did was to build LDC turning on "Phobos LTO" as provided to by Kinke in the #2168 thread. I'm guessing that also turns on LTO for druntime as well, though it was not explicitly discussed. Is this what you are referring to?

Either way, inlining seems a reasonable hypothesis. As mentioned, this only occurs with -flto=full, and the build as released (no phobos/druntime LTO). Does not occur '-flto=thin`, no '-flto' option, or when phobos/druntime is built with LTO. LTO does more than inlining, but inlining seems a key piece.

@joakim-noah
Copy link
Contributor

I have never tried LTO, so my question was whether the changes you're trying affect the ldc exception-handling code in druntime, as happens with the first reason listed above that has nothing to do with inlining, or only the codegen for your own D code, as happens with the second reason. It sounds like the latter, but I don't know how pervasive LTO is.

@kinke
Copy link
Member

kinke commented Jul 17, 2017

I'm guessing that also turns on LTO for druntime as well

Yep.

I don't know how pervasive LTO is

It can apparently be extremely pervasive, especially wrt. (cross-module) inlining.

Does not occur '-flto=thin`, no '-flto' option, or when phobos/druntime is built with LTO

Edit: Right, forgot about the fact that it doesn't happen with (full-)LTO-enabled stdlibs. So yeah, a brittle alignment issue seems likely, possibly caused by that LLVM EH handler bug seen on ARM and a special inlining scenario.
LLVM suffers from another EH unwinding issue on Windows/MSVC, not with respect to alignment but with proper non-volatile register restoring. So it seems as if their EH isn't tested very well (at least on more exotic platforms).

@kinke
Copy link
Member

kinke commented Jul 17, 2017

Btw SemaphoreCI used to fail during unwinding for a 32-bit library unittest (#2074 (comment)), which went away when setting the proper default inlining threshold...

@dnadlinger
Copy link
Member

I'm not sure whether it is clear beyond doubt yet that the issue is in LLVM's EH table generation, rather than our personality function code.

@kinke
Copy link
Member

kinke commented Jul 18, 2017

I quickly looked at druntime's ldc.eh.libunwind and noticed

struct _Unwind_Exception
{
    ulong exception_class;
    ...
}

vs.

struct _Unwind_Control_Block
{
    char[8] exception_class;
    ...
}

The former being the general variant and the latter the ARM version. While identical in size, this may influence the alignment (and padded size) of the struct. On a 32-bit x86 target, a D ulong has an alignment of 4, so I hope an alignment of 8 for the struct wasn't the intention behind that.
I then found https://reviews.llvm.org/D22543 with interesting comments:

// The implementation of _Unwind_Exception uses an attribute mode on the
// above fields which has the side effect of causing this whole struct to
// round up to 32 bytes in size. To be more explicit, we add pad fields
// added for binary compatibility.
...
// The Itanium ABI requires that _Unwind_Exception objects are "double-word
// aligned". GCC has interpreted this to mean "use the maximum useful
// alignment for the target"; so do we.

and then this in the current LLVM libunwind source:

long long int :0; /* Enforce the 8-byte alignment */

So I guess an align(8) for both ARM and non-ARM structs (and maybe cross-checking more of these low-level structs) can't hurt.

@joakim-noah
Copy link
Contributor

I'm not sure whether it is clear beyond doubt yet that the issue is in LLVM's EH table generation, rather than our personality function code.

Not only that, but having debugged that issue, the problem there was the way collectExceptionMsg and its arguments were inlined interacted with the exception-handler to cause stack corruption, then the code ran for a bit longer, passed a bad argument because of that corrupt stack to a subsequent call to array(), which then errored out when it tried to enforce that argument is valid, possibly finally dumping out of the exception-handler with a similar message also because of the stack corruption.

In other words, the real issue is stack corruption, which manifests finally in the exception-handler. It so happens in that case that the exception-handler causes the stack corruption in the first place, but Jon's situation might just be stack corruption caused some other way. It needs to be investigated.

maybe cross-checking more of these low-level structs

Yeah, I've noted some small disparities before too, but I doubt they're the issue here.

@joakim-noah
Copy link
Contributor

I just tried to reproduce on linux/x64 with the final official release of ldc 1.3.0 and can't seem to. I extracted the command to build tsv-filter from the output of running the makefile and added the -flto=full option, then ran tsv-filter --regex z:abc dub.json. It errors out normally both with LTO and without. The binary is about 10% smaller with the LTO option, tried it with both ld.bfd and ld.gold, so something different has been done.

@jondegenhardt
Copy link
Contributor Author

jondegenhardt commented Jul 20, 2017

@joakim-noah If you still have the linux tsv-filter built with -flto-full, there are a couple other ways to try to trigger it. Try these commands and see if any fail. All fail on OS X.

$ tsv-filter foo.tsv --ff-reldiff-gt a:b:2
$ tsv-filter foo.tsv --ff-absdiff-gt a:b:2
$ tsv-filter tests/input1.tsv --ff-absdiff-le a:b:2

Why I suggest this, and a potential clue - If you look at the tsv-filter code calling getoptInorder/std.getopt, you'll see a long list of command line options. Nearly all trigger an exception on invalid input. The first 25 that take option args, through istr-in-fld, all work fine. All the options that follow generate the EH trap on OS X. This is 19 options, from istr-not-in-fld to ff-reldiff-gt.

My reason for suggesting trying a few different cases on Linux is that perhaps the behavior might show up at a different point in this chain than on OS X.

The other obvious thing is that there is a clear differentiating line. Having all the initial options work fine and all starting with istr-not-in-fld fail cannot be a coincidence.

If you're wondering what getoptInorder is, and could it be involved in the problem - It's a cover I wrote to std.getopt, which is calls under the covers. It takes the additional step of processing args in the order issued on the command line (run-time order). I'm not aware of any bugs in it, but of course you never know. I'm assuming the same behavior would be seen if the same call was made to std.getopt directly. I haven't tested that though.

@joakim-noah
Copy link
Contributor

I'm unable to reproduce with the three other commands you gave either. Clearly this is a codegen issue related to LTO, not bugs in your code, if it doesn't hit you with other compile options. Of the three ways we hit this on ARM that I listed above, only the first was a problem with the D source, because it used a bunch of casts that didn't take ARM alignment requirements into account.

If Johan or someone else can reproduce on OS X, it looks like this might be limited to that OS. I can't reproduce on linux.

@jondegenhardt
Copy link
Contributor Author

jondegenhardt commented Jul 20, 2017

FWIW, I encountered this both on travis-ci (OS X) and on my own machine, so two environments. The travis build: here. Note that in the travis builds -flto=full is used on OS X but not Linux.

@kinke
Copy link
Member

kinke commented Oct 30, 2017

So this is working with LDC 1.5 and Xcode 9.0.1 now, right Jon?

@jondegenhardt
Copy link
Contributor Author

jondegenhardt commented Oct 30, 2017

Sorry, no. Fails in the exactly the same way, when using -flto=full. Happens when using LTO on just the app code and when using LTO on the runtime libraries too. Thin LTO is fine. LDC 1.5, Xcode 9.0.1.

The interesting property I reported in comment from July 19 still holds. The failure is triggered by exception handlers used in option callbacks a fair way down a long list of callbacks provided to std.getopt.

(Apologies for not reporting this when the 1.5.0 release came out. It doesn't affect my normal testing because I had switched to Thin LTO. It showed up later when I ran the full test suite with Full LTO.)

@joakim-noah
Copy link
Contributor

Does the problem show up on linux now?

@jondegenhardt
Copy link
Contributor Author

This problem occurs only on Xcode with Full LTO.

@jondegenhardt
Copy link
Contributor Author

I'm not able to reproduce these issues with the latest version of the compiler (ldc 1.24.0-beta1). There doesn't seem to be any value in continuing to leave this open.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants