Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abuse of TRY_READ for general error handling can lead to crashes and other problems #2487

Closed
markuspf opened this issue May 24, 2018 · 14 comments
Labels
kind: bug: crash Issues describing bugs that cause GAP to crash, and PRs fixing them (used for release notes) kind: bug Issues describing general bugs, and PRs fixing them topic: error handling

Comments

@markuspf
Copy link
Member

markuspf commented May 24, 2018

Consider the following code:

DeclareRepresentation( "IsMyOutputStreamRep",
                       IsComponentObjectRep,
                       ["kernel", "socket", "format"] );

MyOutputStreamType := NewType(
    StreamsFamily,
    IsOutputTextStream and IsMyOutputStreamRep );

InstallMethod( WriteAll,
    "output text string",
    [ IsOutputTextStream and IsMyOutputStreamRep,
      IsString ],
function( stream, string )
    local str, out;
    str := "";
    out := OutputTextString(str, false);
    PrintTo(out, string);
    return true;
end );

errstream := Objectify(MyOutputStreamType, rec( format := false ));

OutputLogTo(errstream);

x := 8

with gap-4.9.1:

  ┌───────┐   GAP 4.9.1 of 05-May-2018
 │  GAP  │   https://www.gap-system.org
 └───────┘   Architecture: x86_64-pc-linux-gnu-default64
 Configuration:  gmp 6.1.0, readline
 Loading the library and packages ...
 Packages:   AClib 1.3, Alnuth 3.1.0, AtlasRep 1.5.1, AutPGrp 1.9, Browse 1.8.7, Carat 2.2.2, CRISP 1.4.4, Cryst 4.1.17, CrystCat 1.1.8, CTblLib 1.2.2, 
             FactInt 1.6.2, FGA 1.4.0, GAPDoc 1.6.1, IO 4.5.1, IRREDSOL 1.4, LAGUNA 3.9.0, Polenta 1.3.8, Polycyclic 2.13.1, PrimGrp 3.3.1, RadiRoot 2.8, 
             ResClasses 4.7.1, SmallGrp 1.3, Sophus 1.24, SpinSym 1.5, TomLib 1.2.6, TransGrp 2.0.2, utils 0.54
 Try '??help' for help. See also '?copyright', '?cite' and '?authors'
Syntax error: ; expected in example.g:31
fish: “./gap example.g” terminated by signal SIGSEGV (Address boundary error)

with master

 ┌───────┐   GAP 4.10dev-714-gd6d1b20 of today
 │  GAP  │   https://www.gap-system.org
 └───────┘   Architecture: x86_64-pc-linux-gnu-default64
 Configuration:  gmp 6.1.0, readline
 Loading the library and packages ...
 Packages:   AClib 1.2, Alnuth 3.1.0, AtlasRep 1.5.1, AutPGrp 1.8, Browse 1.8.7, CRISP 1.4.4, Cryst 4.1.13, CrystCat 1.1.6, CTblLib 1.2.2, FactInt 1.6.0, 
             FGA 1.3.1, GAPDoc 1.6.1, IO 4.5.1, IRREDSOL 1.4, LAGUNA 3.8.0, Polenta 1.3.8, Polycyclic 2.11, PrimGrp 3.3.0, RadiRoot 2.7, ResClasses 4.7.1, 
             smallgrp 1.2, Sophus 1.23, SpinSym 1.5, TomLib 1.2.6, TransGrp 2.0.2, utils 0.49
 Try '??help' for help. See also '?copyright', '?cite' and '?authors'
Syntax error: ; expected in ../../tmp/gap-4.9.1/example.g:31
^
gap> 

I have traced this down to the following cause:

Ultimately PRINT_OR_APPEND_TO_STREAM is called when outputting the error message to the output log stream, and it contains a TRY_READ wrapper. In 4.9.1 STATE(NrError) is >0 at this point, so TRY_READ immediately bails. In master @fingolfin moved the increment of STATE(NrError) to after the error message is printed, so the problem doesn't occur.

I suspect this change was not made intentionally to prevent this bug (after all GAP should not be crashing). I still don't understand the TRY_READ mechanism well enough to know what the correct fix is, probably @fingolfin knows better.

Incidentally PRINT_OR_APPEND does not wrap the same code into TRY_READ as the stream variant...

@fingolfin
Copy link
Member

fingolfin commented May 25, 2018

Long rant follows, sadly not yet helpful -- but writing it helped me clear my mind in parts, and perhaps some of it is helpful for some people anyway. I'd like to try and write some documentation based on what I learned one of these days, and then hopefully we can together refine it and learn from each other's partial knowledge of the system to piece it all together. And perhaps e.g. @stevelinton knows / remembers some more things and can add them.

First off, I named TRY_READ badly -- because unlike C++ try/catch, it may never even try to execute its main code block. Namely, as its first action checks if NrError is still zero. If it is not, it skips its code block (resp. goes to the CATCH_READ_ERROR block). Catching errors raised by its main code block is in a sense only secondary.

Back when I turned READ_ERROR into TRY_READ, I simply did not understand things well enough, now I know far more about it (but still I feel it is far too little :-( ). But naming is hard, so I am not yet quite sure how to name it instead -- perhaps TRY_READ_IF_NO_ERROR_YET would be better, but that's rather clunky. Suggestions welcome ;-).

I believe the original purpose of TRY_READ resp. its predecessor READ_ERROR was to deal with parser errors, and also to catch errors raised via ErrorQuit in kernel code resp. via Error() in GAP code, but in a very specific way. For this, it executes the code it wraps only if NrError == 0. And NrError really only gets modified in three places:

  • SyntaxError increments it
  • ClearError resets it
  • ReadEvalError causes it to be incremented inside of any enclosing TRY_READ which "catches" it. And (at least back in the days), this function was invoked both by the kernel's ErrorQuit and friends), and also GAP's Error() function

Originally (as in: back in 1997), with just two exceptions, READ_ERROR was used exclusively inside read.c, and there again with two exceptions in exactly one way: to wrap calls into the interpreter.: once there is a syntax error, we increment NrError. After this point, we should stop interpreting/coding; but we also want to continue parsing, until we come to a point where we can "safely" resume it (so that one broken function in a file does not cause us to skip the whole file). Hence we wrap calls to the interpreter into TRY_READ blocks. Additionally, if the interpreter executes something which raises an error, we catch that error, and again resume parsing w/o interpreting.

And that's essentially how it was and where it stopped; the only uses of READ_ERROR outside of read.c where in gap.c, inside the REPL (to suppress output if there was an error) and in intrprtr.c (to deal with a syntax error in the middle of coding a function). That's it. And the two exceptions in read.c were (and are) in ReadEvalCommand and ReadEvalFile which essentially use it to adjust their return value to signal an error.

As for ReadEvalError, the only way to increment NrError other than using SyntaxError, it was only used in ErrorMode(), the precursor to our modern day CallErrorInner().

Finally, back then, ReadJmpError really was only used for one purpose, and by one thing: READ_ERROR!

But then READ_ERROR resp. TRY_READ started to get used in more places, as did ReadEvalError and ReadJmpError, and at some point the whole error and break loop handling was rewritten (I believe by @stevelinton), and things changed -- and I think at times people used TRY_READ, ReadJmpError resp. ReadEvalError without quite 100% knowing what they were doing (e.g. me... ;-).

And so, I think today, the error handling system of GAP is a rather complicated mystery. I recently started to understand it better, which lead to some fixes where I put TRY_READ calls in places they were "obviously" missing ("obvious" in a very relative sense, only after thoroughly understanding what was going on). But the more I learn, the more confused I feel as to why and how things in GAP's error handling actually works these days sigh. Esp. the (ab)use of ReadJmpError worries me. I wonder if we use it in some places where we shouldn't, resp. don't use it in places where we should, resp. use it "wrong"... it's hard to tell, because of the organic growth of the code, and the lack of documentation which at least gives a hint at how it all is "supposed" to work.

I'll think about the issue and write some more later, but for now, have to go.

@olexandr-konovalov olexandr-konovalov added this to the GAP 4.9.2 milestone Jun 15, 2018
@olexandr-konovalov olexandr-konovalov added the kind: bug: crash Issues describing bugs that cause GAP to crash, and PRs fixing them (used for release notes) label Jun 15, 2018
fingolfin added a commit to fingolfin/gap that referenced this issue Jul 2, 2018
This imitates a change already in the master branch.
olexandr-konovalov pushed a commit that referenced this issue Jul 3, 2018
This imitates a change already in the master branch.
@olexandr-konovalov
Copy link
Member

A workaround to prevent this crash (which imitates a change already in the master branch) has been added to stable-4.9 in #2596 by @fingolfin. I keep the issue open, it is useful to untangle the error handling system of GAP, but move it to GAP 4.10.0 milestone (maybe the title should be changed, too).

@fingolfin fingolfin changed the title 4.9.1 crashes when error occurs and OutputLog points to a stream which uses a stream Abuse of TRY_READ for general error handling can lead to crashes and other problems Jul 3, 2018
@fingolfin
Copy link
Member

I changed the title. To elaborate a bit further on the problem here: PRINT_OR_APPEND_TO_STREAM (and other kernel functions) are apparently using TRY_READ / CATCH_READ_ERROR as if they were C++ try/catch, i.e., to catch any errors possibly raised inside PrintObj (which can execute arbitrary GAP code). If such an error occurs, then the CATCH_READ_ERROR block closes the output and "re-throws" the error, by using a longjmp.

But "rethrowing" only makes sense if STATE(ReadJmpError) was in a valid state before the call to PRINT_OR_APPEND_TO_STREAM. Now, normally, we are being called from the REPL via FuncPRINT_TO_STREAM resp. FuncAPPEND_TO_STREAM, this is the case: the TRY_READ of the REPL is active, and we return to that.

But if the call to the REPL was Read or something similar, and we are called later as a side effect, as in this example, then TRY_READ blocks inside src/read.c will have executed. And these do not bother to save and restore STATE(ReadJmpError). So instead, STATE(ReadJmpError) refers to the last of those TRY_READ blocks in the reader we executed; but we already left that block, so a longjmp to it will result in a crash. Which is what happens here.

The crash does not normally happen in master because the "rethrow" part does not get activated; but it would be, if the recursive call to PrintObj generated an error. (Also: your output log stream has a problem if it itself triggers an error, as that error would be logged, which would trigger an error, which gets logged, which ...)

But here it gets activated, because STATE(NrError) actually is already non-zero before PRINT_OR_APPEND_TO_STREAM is called. Which means that its TRY_READ never executes its main block, but instead immediately goes to the catch block, which then performs the longjmp into nothing...

So this mess has to be cleaned up.

While looking at that code, I noticed another problem: FuncPrint and PRINT_OR_APPEND_TO contain very similar code; but their TRY_READ only wraps the call to PrintObj. But PrintFunction can also dispatch to a GAP function, and thus might raise an error. And even PrintString1 could do so, depending on the active output stream...

More generally: handling errors while currently handling an error is a rather delicate matter...

@markuspf
Copy link
Member Author

If one replaces the last line of the code at the top by just x, the GAP master branch crashes. This is the cause for gap-packages/JupyterKernel#73

@dimpase
Copy link
Member

dimpase commented Oct 2, 2018

While working on Sagemath interface to libgap, I got bitten by infinite recursion that happens if PRINT_OR_APPEND_TO_STREAM is invoked on a dead stream - then it calls ErrorQuit which ends up calling PRINT_OR_APPEND_TO_STREAM on a dead stream...

Sagemath's (based on GAP 4.8.6) libGAP patches another place that leads to an infinite recursion of this sort, in SyWriteandcheck.

Not sure whether this needs another issue.

@embray
Copy link
Contributor

embray commented Nov 20, 2018

I've spent quite a bit of time understanding GAP's error handling and have found many of the same problems already discussed on this issue. I think overall the situation is not so bad, there are just a few inconsistencies and areas where a bit more care is needed.

I have some ideas for how to improve and clean up the situation, but it will be a non-trivial pile of changes (albeit, I think, not overly complex either). I could take a stab at it and post a PR but it will take me a bit of time to get right, and I don't want to step on any toes or work at cross-purposes with any existing plans you might have for this.

@embray
Copy link
Contributor

embray commented Nov 20, 2018

@dimpase

While working on Sagemath interface to libgap, I got bitten by infinite recursion that happens if PRINT_OR_APPEND_TO_STREAM is invoked on a dead stream - then it calls ErrorQuit which ends up calling PRINT_OR_APPEND_TO_STREAM on a dead stream...

Same 🤣 I documented that one at #3028.

@ChrisJefferson
Copy link
Contributor

The main problem (from experience) with trying to clean up GAP's error handling is that it's quite easy to end up breaking how GAP's break loop currently works. A giant pile of changes is extremely likely to break something important for some user or package, and we can't do a "big bang" change to packages.

Of course, I'm also happy to see work in this area, but I'd definately write out a high-level plan of how you think the behaviour, both internally and externally, would change, before you starting writing code.

@embray
Copy link
Contributor

embray commented Nov 20, 2018

That's not a big problem because for library use the break loop is effectively irrelevant--undesired even. Most of said giant pile of changes would actually be somewhat limited in scope I think, but would just require changes in a few dozen places in the kernel code, but would barely touch package code.

The only two packages I found that really muck with error handling insofar as they mess with the ErrorInner implementation are Browse and SCSCP, and in the latter case it's only in its SCSCP server implementation.

I think that what I have in mind is a still just a tad too fuzzy to give a precise high-level plan. I think what I might do if nobody minds is to hack on it a bit and work out the details a bit better, and then write up a "high-level plan" retroactively once I think I have something that will work. But to be clear, I don't think I will be proposing very vast, sweeping changes: This will mostly be in service just to the libgap API level of things.

@embray
Copy link
Contributor

embray commented Nov 20, 2018

To give a very fuzzy overview:

  • As @fingolfin wrote, TRY_IF_NO_ERROR, formerly TRY_READ, has started being used outside its original context in read.c. That's fine actually, but it needs to be used slightly more consistently. I will probably move its definition out of read.h instead to error.h and implement/document it as the more general exception mechanism that it is actually being used for.
    • Additionally probably make it easier to use recursively and eliminating the need for the manual memcpy calls littered about.
  • Clean up usage of ReadEvalError in places where those "exceptions" are not really properly handled.
  • Ensure that all functions made available through libgap-api.h handle all internal exceptions so that users of the API are never exposed to this exception handling mechanism.
  • Add an option which can be enabled for library usage (possibly enabled by default in GAP_Initialize) that would disable ErrorInner error handling altogether, and instead pass error messages for errors that are not otherwise handled internally to an application-specific callback function that can be used to register that a (recoverable) error occurred in GAP.
    • Possible option for including a stack track as well, in case library consumers wish to print a GAP stack trace.
  • Also rework Panic() to fit into this error callback mechanism in such a way that applications which integrate GAP can recover from GAP panics (perhaps with no guarantee that the GAP kernel itself can be recovered, so there has to be a way to distinguish this case from a "recoverable" error).
  • Handling of syntax errors also needs a bit of reworking.

That's a very high-level, very fuzzy overview of what I have in mind to do. The details will emerge better once I start tooling around in there...

@fingolfin fingolfin added the kind: bug Issues describing general bugs, and PRs fixing them label Mar 21, 2019
@fingolfin
Copy link
Member

Just wanted to mention that things got quite a bit better recently, with the introduction of GAP_TRY/GAP_CATCH and a lot of refactoring. Not saying it is perfect now, but some weird corner cases in code were resolved, and some semantics and assumptions etc. cleaned up

@dimpase
Copy link
Member

dimpase commented Jul 29, 2020

Thanks. Perhaps we can use this in Sage's libgap interface.

@fingolfin
Copy link
Member

@dimpase you shouldn't have to if you are using GAP_Enter and GAP_Leave

@fingolfin
Copy link
Member

I consider this resolved for the time being, in so far as that the code is now (IMHO) much cleaner and clearer and saner. That is, I don't claim it's bug free (I am sure it is not); however, in order to improve it, we need more concrete information on actual issues, as I already fixed the hypothetical ones I could think of ;-).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind: bug: crash Issues describing bugs that cause GAP to crash, and PRs fixing them (used for release notes) kind: bug Issues describing general bugs, and PRs fixing them topic: error handling
Projects
None yet
Development

No branches or pull requests

6 participants