Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1.8.0-beta1, OS X: Uncaught exceptions using [Thin|Full]LTO+PGO; FullLTO alone #2585

Closed
jondegenhardt opened this issue Feb 19, 2018 · 27 comments

Comments

@jondegenhardt
Copy link
Contributor

This is another variant of #2208 (feel free to close as a duplicate, but I wanted to highlight new info).

One of my tools, tsv-filter, has uncaught exceptions when built with FullLTO (with or without PGO) or ThinLTO+PGO. The latter case is new in 1.8.0-beta1. In prior versions (1.3.0 - 1.7.0) this failure occurs with FullLTO, but not ThinLTO. This is material for my tsv-filter tool because it means that LTO & PGO cannot be used together in 1.8.0-beta1.

As background - The tsv-filter tool has a fair number of delegates (about 45) invoking code that can throw exceptions, plus a top-level exception catcher. Exceptions are triggered when checking command line args. Unit tests provide an invalid input for each. Approximately the first 30 delegates throw exceptions that are caught successfully, the but exceptions thrown by the remaining delegates are not caught.

There is more info in the printed output in 1.8.3-beta1 that may be helpful tracking down the cause. The executable itself exits printing:

Illegal instruction: 4

This is printed to the console, independent of stdout or stderr. The message to stderr follows a form like:

uncaught exception
object.Exception@/Users/jdegenhardt/devtools/dlang/ldc2-1.8.0-beta1-osx-x86_64/bin/../import/std/getopt.d(880): Missing value for argument --ff-lt.
----------------
0   tsv-filter                          0x0000000102750156 object.Throwable.TraceInfo core.runtime.defaultTraceHandler(void*) + 246

The errors are all similar, differing primarily in text describing cause. In particular, all refer to the same object.Thowable.TraceInfo core.runtime.defaultTraceHandler(void*). There was no reference to in output of previous LDC versions. Perhaps clue here.

I'll take a shot at a reduced test case as well.

@jondegenhardt
Copy link
Contributor Author

The reduced test case isn't going to be as easy to find as I hoped. Starting to remember trying to find one before....

@JohanEngelen
Copy link
Member

for our info, this is with LTO'd Phobos+druntime, right?

@jondegenhardt
Copy link
Contributor Author

for our info, this is with LTO'd Phobos+druntime, right?

Yes. Sorry, that should have been in the report.

@jondegenhardt
Copy link
Contributor Author

A reduced case:
getopt_uncaught.d.txt

Was able to reproduce only after including std.getopt. The getopt arg list contains a 51 callback handlers, each of which can throw. The first 33 handlers listed work fine. However, handlers 34-51 fail when a throw is triggered.

Build commands to generate the failure case:

ldc-build-runtime --dFlags="-flto=full" BUILD_SHARED_LIBS=OFF
ldc2 -release -O3 -flto=full -L-L./ldc-build-runtime.tmp/lib -ofgetopt_uncaught.full-lto getopt_uncaught.d

The (contrived) program prints the sum of validated arguments. Validation is that a number is less-than. eg. --lt10 9 is valid, --lt10 11 is not. The result of getopt_uncaught --lt10 3 --lt10 8 is 11. There are 51 options: --lt0, --lt1, ..., --lt50, specified in that order in the getopt call. Causing the first 33 to throw works fine. The remainder fail. Here's a session showing success and failure:

$ ./getopt_uncaught.full-lto --lt30 10
Sum of values: 10

$ ./getopt_uncaught.full-lto --lt40 10
Sum of values: 10

$ ./getopt_uncaught.full-lto --lt30 30
Error: Invalid input '30'. Must be less than 30

$ ./getopt_uncaught.full-lto --lt32 32
Error: Invalid input '32'. Must be less than 32

$ ./getopt_uncaught.full-lto --lt33 33
uncaught exception
object.Exception@getopt_uncaught.d(14): Invalid input '33'. Must be less than 33
----------------
0   getopt_uncaught.full-lto            0x0000000103a6e176 object.Throwable.TraceInfo core.runtime.defaultTraceHandler(void*) + 246
Illegal instruction: 4

$ ./getopt_uncaught.full-lto --lt32 abc
Error: Error processing input 'abc': no digits seen

$ ./getopt_uncaught.full-lto --lt33 abc
uncaught exception
object.Exception@getopt_uncaught.d(13): Error processing input 'abc': no digits seen
----------------
0   getopt_uncaught.full-lto            0x000000010535b176 object.Throwable.TraceInfo core.runtime.defaultTraceHandler(void*) + 246
Illegal instruction: 4

@jondegenhardt
Copy link
Contributor Author

Further reduced.
getopt_uncaught2.d.txt

This simplifies the previous version. As before, getopt take 50 callback functions. However, the functions take no arguments and simply throw a string. As with the previous example, exceptions thrown by options --opt0 to --opt32 are caught, exceptions thrown by options --opt33 through --opt50 are not. Also, passing an option that is not defined results in an uncaught exception, this time the exception is thrown by getopt.

Same build instructions:

ldc-build-runtime --dFlags="-flto=full" BUILD_SHARED_LIBS=OFF
ldc2 -release -O3 -flto=full -L-L./ldc-build-runtime.tmp/lib -ofgetopt_uncaught2.full-lto getopt_uncaught2.d

Example of failures:

$ ./getopt_uncaught2.full-lto --opt0
Exception caught. Exiting: opt0

$ ./getopt_uncaught2.full-lto --opt32
Exception caught. Exiting: opt32

$ ./getopt_uncaught2.full-lto --opt33
uncaught exception
object.Exception@getopt_uncaught2.d(39): opt33
----------------
0   getopt_uncaught2.full-lto           0x000000010805d976 object.Throwable.TraceInfo core.runtime.defaultTraceHandler(void*) + 246
Illegal instruction: 4

$ ./getopt_uncaught2.full-lto --opt50
uncaught exception
object.Exception@getopt_uncaught2.d(56): opt50
----------------
0   getopt_uncaught2.full-lto           0x00000001012c9976 object.Throwable.TraceInfo core.runtime.defaultTraceHandler(void*) + 246
Illegal instruction: 4

$ ./getopt_uncaught2.full-lto --nosuchoption
uncaught exception
std.getopt.GetOptException@/Users/jdegenhardt/devtools/dlang/ldc2-1.8.0-beta1-osx-x86_64/bin/../import/std/getopt.d(789): Unrecognized option --nosuchoption
----------------
0   getopt_uncaught2.full-lto           0x0000000105c70976 object.Throwable.TraceInfo core.runtime.defaultTraceHandler(void*) + 246
Illegal instruction: 4

@jondegenhardt
Copy link
Contributor Author

jondegenhardt commented Feb 20, 2018

One more reduced case:
uncaught_with_reduced_getopt.d.txt

This case puts a copy of std.getopt in the source file, using this rather than the one from the library. A large amount of getopt code was removed and pared down. Only the one type of option used is handled, etc. After this, attempts to remove additional code causes the failure to disappear. My guess at this point is a stack corruption of some kind. This is probably about as far as I can take it. I'm hoping someone else can take a look at this point.

Update: With this reduced case it is no longer necessary to build phobos/druntime with LTO. The case contains all the code needed to generate the failure.

Build and run below. Note that the failure occurs starting with --opt27:

$ ldc2 -release -O3 -flto=full -ofuncaught_with_reduced_getopt.full-lto uncaught_with_reduced_getopt.d

$ ./uncaught_with_reduced_getopt.full-lto --opt20
Exception caught. Exiting: Option-20

$ ./uncaught_with_reduced_getopt.full-lto --opt26
Exception caught. Exiting: Option-26

$ ./uncaught_with_reduced_getopt.full-lto --opt27
uncaught exception
object.Exception@uncaught_with_reduced_getopt.d(51): Option-27
----------------
0   uncaught_with_reduced_getopt.full-lto 0x00000001093dab26 object.Throwable.TraceInfo core.runtime.defaultTraceHandler(void*) + 246
Illegal instruction: 4

$ ./uncaught_with_reduced_getopt.full-lto --opt40
uncaught exception
object.Exception@uncaught_with_reduced_getopt.d(64): Option-40
----------------
0   uncaught_with_reduced_getopt.full-lto 0x000000010b134b26 object.Throwable.TraceInfo core.runtime.defaultTraceHandler(void*) + 246
Illegal instruction: 4

@JohanEngelen
Copy link
Member

With this reduced case it is no longer necessary to build phobos/druntime with LTO.

I can reproduce with phobos/druntime lto, but not with normal phobos/druntime. Are you sure?

@JohanEngelen
Copy link
Member

I'm hoping someone else can take a look at this point.

Don't have much time to look into it, so I've started dustmiting Phobos with it (reproducing with LTO-druntime).

@jondegenhardt
Copy link
Contributor Author

I can reproduce with phobos/druntime lto, but not with normal phobos/druntime. Are you sure?

Just repeated, yes I do see the problem without LTO of phobos/druntime. (It was good question though, I was in a hurry yesterday).

I'm using the pre-built LDC binary from the github releases page. OS X High Sierra 10.13.3 (17D47). xcode version:

$ xcodebuild -version
Xcode 9.2
Build version 9C40b

Don't have much time to look into it, so I've started dustmiting Phobos with it (reproducing with LTO-druntime).

Thank you!

@jondegenhardt
Copy link
Contributor Author

I can reproduce with phobos/druntime lto, but not with normal phobos/druntime. Are you sure?

The other bit is in that last reduced case I posted the code had gotten very sensitive to modification. For example, there are functions arguments no longer used after reducing the code. However, removing the parameters from the argument list caused the failure to stop. Similarly, there are assignment statement to a variables that no longer had an effect. Removing the assignment statement caused the failure to stop. There were a number of things like this. It might not take much of a change in the environment to trigger a change.

As to why dropping the LTO against phobos/druntime may have worked - After the modifications, there are only a very small number of phobos features being referenced.

@JohanEngelen
Copy link
Member

It's indeed very sensitive to modification. I have a reduced testcase that no longer depends on Phobos, but after #2589 it no longer reproduces this problem (while the original testcase still shows the problem) So the long time dustmiting appears wasted :( :( [I saved a few intermediate results, but haven't tested those yet]

Some findings:

  • Disabling frame pointer elimination (-disable-fp-elim) resolves the issue; with more indepth search, I found that applying @(llvmAttr("no-frame-pointer-elim","true")) on getoptImpl (in the testcase) resolves the issue. I noted that LLVM's optimization turns (some of the) "no-frame-pointer-elim"="true" into "no-frame-pointer-elim"="false... Don't know if this path is leading to the actual cause.
  • Perhaps the stack is overwritten somewhere, but I haven't been able to find out if so yet (or where).
  • The testcase code is supersensitive to failing/succeeding... Things like:
    arg = arg[1 .. $];
    arg = arg[1 .. $];

and the uncaught exception error is gone when changed to

    arg = arg[2 .. $];
  • -fsanitize=address on my reduced testcase without phobos dependency did not show memory errors (but druntime was not built with ASan)

@kinke
Copy link
Member

kinke commented Feb 25, 2018

[Wrt. the no-frame-pointer-elim function attribute, IIRC clang also specifies no-frame-pointer-elim-non-leaf which we don't.]

@jondegenhardt
Copy link
Contributor Author

Verified that -disable-fp-elim resolves the issue in my tools. Works for both ThinLTO+PGO and FullLTO. Hard to get an accurate read on performance impact, but looks to be about 1-5% depending.

I tried adding @llvmAttr("no-frame-pointer-elim","true") to getoptImpl (to the source code used by ldc-build-runtime). The problem remained after building with ThinLTO+PGO. Did not try FullLTO. The larger context may still allow the problem to persist.

@jondegenhardt
Copy link
Contributor Author

Realizing I might not have added the @llvmAttr("no-frame-pointer-elim","true") correctly. I added it to the files source files downloaded by ldc-build-runtime. However, I didn't modify the source in the LDC home directory itself (I used the prebuilt LDC binary). getoptImpl is a template, if it got picked up from the home directory source file then the @llvmAttr attribute wouldn't have gotten picked up.

@jondegenhardt
Copy link
Contributor Author

jondegenhardt commented Feb 26, 2018

Correcting my previous comment: Adding @llvmAttr("no-frame-pointer-elim","true") to getoptImpl fixes the issue for ThinLTO+PGO. I modified the source in both the LDC home directory and ldc-build-runtime retrieved source files. I haven't tried FullLTO yet.

Update: FullLTO and FullLTO+PGO both worked with my apps after adding the @llvmAttr... line to getoptImpl.

@jondegenhardt
Copy link
Contributor Author

Any thoughts on the approach or next steps to take for this? Enough has been learned that I know how to avoid this in my apps (build with -disable-fp-elim or don't use PGO). That allows me to move to newer LDC releases. So, it's not pressing from that perspective.

On the LDC side, debugging would be ideal, but sounds hard. Would it make sense to add the @llvmAttr("no-frame-pointer-elim","true") UDA to getoptImpl, or is that too much of a special purpose hack? Is disabling one of the frame pointer elimination optimization a better approach?

@JohanEngelen
Copy link
Member

I think trying to reduce it again and debugging is the way to go. To fully understand the issue. I don't have time to look into it soon though :/.

@jondegenhardt
Copy link
Contributor Author

Okay, sounds good. Also lets me how to proceed.

@jondegenhardt
Copy link
Contributor Author

jondegenhardt commented Apr 2, 2018

After switching from:

  • xcode 9.2 to xcode 9.3
  • ldc2-1.8.0-beta1 to ldc2-1.8.0

Now my tsv-filter program has this problem with both thinTLO and fullTLO. (Previously it was just fullLTO.) Using both xcode 9.3 and ldc2-1.8.0 are required for the failure to occur in tsv-filter. Building druntime/phobos with LTO is not necessary. However, these combinations no longer trigger an error in the reduce test case I posted. Likely just too sensitive to changes.

Adding -disable-fp-elim to the build continues to fix the problem.

@JohanEngelen
Copy link
Member

(sorry @jondegenhardt , no progress whatsoever on my end on this topic...)

@dnadlinger
Copy link
Member

Stumbled across this, although it's probably not related: https://gcc.gnu.org/bugzilla/show_bug.cgi?format=multiple&id=71951

@jondegenhardt
Copy link
Contributor Author

@JohanEngelen Not a problem. Unfortunately, I'd have to learn a new skill set to help with the debugging, so I have to defer to others on this.

@kinke
Copy link
Member

kinke commented Apr 22, 2018

@jondegenhardt: Please retest with 1.9-beta1 with LLVM 6.0 and a resume-unwinding change. Also note that LTO-able druntime/Phobos are now available out-of-the-box, just select them via -defaultlib=phobos2-ldc-lto,druntime-ldc-lto.

@jondegenhardt
Copy link
Contributor Author

Hey, progress! 😄Still a couple issues though.

I tested a the cases below (all with 1.9-beta1, Xcode 9.3; macOS 10.13.4):

  • ThinLTO; libs via ldc-build-runtime; PGO - Works!!!
  • ThinLTO; libs via ldc-build-runtime; No PGO - Works!!!
  • ThinLTO; libs via new -defaultlib= option; No PGO - Works!!!
  • ThinLTO; app-only - Fails, uncaught exception
  • FullLTO; libs via ldc-build-runtime - Linker failure [1].
  • FullLTO; libs via new -defaultlib= option; No PGO - Fails, uncaught exception
  • FullLTO; app-only - Fails, uncaught exception

I didn't test the new -defaultlib= option with PGO (building an instrumented build, etc), because it'll take a bit more work to modify my build scripts to do this.

[1] The linker failure (FullLTO, libs via ldc-build-runtime):

/Users/jdegenhardt/devtools/dlang/ldc2-1.9.0-beta1-osx-x86_64/bin/ldc2 -release -O3 -boundscheck=off -singleobj -flto=full -odobj  -L-L/Users/jdegenhardt/open-source/forks/tsv-utils-dlang/ldc-build-runtime.full.v-1.9.0-beta1/lib  -of/Users/jdegenhardt/open-source/forks/tsv-utils-dlang/bin/tsv-filter -I/Users/jdegenhardt/open-source/forks/tsv-utils-dlang/common/src /Users/jdegenhardt/open-source/forks/tsv-utils-dlang/tsv-filter/src/tsv-filter.d /Users/jdegenhardt/open-source/forks/tsv-utils-dlang/common/src/tsvutil.d /Users/jdegenhardt/open-source/forks/tsv-utils-dlang/common/src/tsv_numerics.d /Users/jdegenhardt/open-source/forks/tsv-utils-dlang/common/src/getopt_inorder.d /Users/jdegenhardt/open-source/forks/tsv-utils-dlang/common/src/unittest_utils.d /Users/jdegenhardt/open-source/forks/tsv-utils-dlang/common/src/tsvutils_version.d
0  0x10a1332c0  __assert_rtn + 129
1  0x10a1a185f  ld::tool::OutputFile::addressOf(ld::Internal const&, ld::Fixup const*, ld::Atom const**) + 271
2  0x10a1a31b6  ld::tool::OutputFile::applyFixUps(ld::Internal&, unsigned long long, ld::Atom const*, unsigned char*) + 2970
3  0x10a1a7219  ld::tool::OutputFile::writeAtoms(ld::Internal&, unsigned char*) + 327
4  0x10a19f7de  ld::tool::OutputFile::writeOutputFile(ld::Internal&) + 858
5  0x10a198b78  ld::tool::OutputFile::write(ld::Internal&) + 178
6  0x10a134254  main + 1146
A linker snapshot was created at:
	/tmp/tsv-filter-2018-03-23-015206.ld-snapshot
ld: Assertion failed: (_mode == modeFinalAddress), function finalAddress, file /Library/Caches/com.apple.xbs/Sources/ld64/ld64-351.8/src/ld/ld.hpp, line 762.
clang: error: linker command failed with exit code 1 (use -v to see invocation)
Error: /usr/bin/gcc failed with status: 1

@joakim-noah
Copy link
Contributor

Thought this was interesting:

"we were discussing possibly turning on LTO on Mac, and how that should roughly just be about turning a switch.

Famous last words.

At least, that’s a somehow reasonable assumption. But when you have a codebase the size of Firefox, you’re up for 'interesting' discoveries.

This involved compiler bugs, linker bugs (with a special mention for a bug in ld64 that Apple has apparently fixed in Xcode 9 but hasn’t released the source of), build system problems, elfhack issues, crash report problems, clang plugin problems (would you have guessed that attribute((annotate("foo"))) can affect the generated machine code?), sccache issues, inline assembly bugs (getting inputs, outputs and clobbers correctly is hard), binutils bugs, and more."

@jondegenhardt
Copy link
Contributor Author

A good reminder for me to retest these cases after a release with LLVM 7.

@jondegenhardt
Copy link
Contributor Author

jondegenhardt commented Oct 4, 2020

Using 1.24.0-beta1 I was unable to reproduce the unhandled exception issues. I was testing this because of removal of the -disable-fp-elim option in LDC 1.24.0 (no longer supported in LLVM 11).

I think this issue can be closed now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants