Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]: pybind11 2.10.0 - error_fetch_and_normalize changes conflicts with PyPy behavior for PyErr_NormalizeException #4075

Open
3 tasks done
jbarlow83 opened this issue Jul 17, 2022 · 19 comments
Labels
triage New bug, unverified

Comments

@jbarlow83
Copy link
Contributor

jbarlow83 commented Jul 17, 2022

Required prerequisites

Problem description

PyPy 3.8 triggers the newly introduced error message

if (exc_type_name_norm != m_lazy_error_string) {
std::string msg = std::string(called)
+ ": MISMATCH of original and normalized "
"active exception types: ";
msg += "ORIGINAL ";
msg += m_lazy_error_string;
msg += " REPLACED BY ";
msg += exc_type_name_norm;
msg += ": " + format_value_and_trace();
pybind11_fail(msg);
}

MISMATCH of original and normalized active exception types: ORIGINAL OSError REPLACED BY FileNotFoundError: [Errno 2] No such file or directory: 'this_file_does_not_exist'

In PyPy, PyErr_NormalizeException causes the "raw" OSError to be normalized into FileNotFoundError. Which seems fair enough.

It's unclear to me whether PyPy is doing something "naughty" here that violates the expectations set by CPython, or if this new exception handling code is just imposing an expectation that doesn't hold for PyPy. As such I don't know if simply disabling that test is appropriate for PyPy.

CPython does not make the same substitution. There is no error (and the code behaves correctly) with pybind11 2.9.

Reproducible example code

    m.def("test_pypy_oserror_normalization", []() {
        // We want to do "with suppress(FileNotFoundError): open(f)" in C++
        try {
            py::module_::import("io").attr("open")("this_file_does_not_exist", "r");
        } catch (const py::error_already_set &e) {
            // FileNotFoundError
        }
    });
@jbarlow83 jbarlow83 added the triage New bug, unverified label Jul 17, 2022
@Skylion007
Copy link
Collaborator

Ping @rwgk

@rwgk
Copy link
Collaborator

rwgk commented Jul 17, 2022

As such I don't know if simply disabling that test is appropriate for PyPy.

Do you mean #ifdefing out the quoted code? That would break a unit test or two, but the pybind11 test code is already badly cluttered with xfail for PyPy, a couple more is fine with me.

The main reason for the code quoted in the OP is to catch "errors in the error handling" as close to the root cause as possible. See the corresponding unit test:

# PyErr_NormalizeException replaces the original FlakyException with ValueError:

Unfortunately there is no good way to know if the change of exception type is the result of a bug, or working as intended.

My opinion: changing the exception type during normalization is evil. If something is counting on the exception to get normalized, it can do so straightaway. Otherwise, client code will have to accommodate two exception types for the same exception, depending on the control flow. I feel that's like pulling out the rug from underneath people's feet. — Another line of thought: deferred normalization was meant to be an optimization. An optimization is an optimization only if the rest of the logic/control flow works identically with and without.

From the perspective of someone wanting to keep a large-scale environment healthy, I'm very reluctant to remove the quoted code. It is far more likely to catch bugs than to block something good and valid that could not easily be achieved differently. I believe, in this particular case, what's in the interest of keeping a large-scale environment healthy, is also in the interest of the world at large. The quoted code helps catching bugs closer to the root cause, saving a lot of time debugging.

@jbarlow83
Copy link
Contributor Author

I read your comment and examined the code for PyErr_NormalizeException in both CPython and PyPy.

In CPython, PyErr_NormalizeException permits an exception to change its class when normalizing. Nothing about it is treated as an error or warning; it's permitted. After all, any class that implements __new__ might return a different object than its expected instance.

https://github.com/python/cpython/blob/07aeb7405ea42729b95ecae225f1d96a4aea5121/Python/errors.c#L333-369

I agree it's anti-pattern to change the exception type, but I don't think pybind11 is the right tool to be enforcing this rule, especially by treating as a fatal error. It happens to be, as far as I can tell, an implementation detail that PyPy uses NormalizeException to transform OSError into its specific subclasses - CPython doesn't for that code path, but there may be other places where the standard library does similar things. It happens to be the first case that cropped up, but there's a lot of third party Python and C extension libraries that may depend on NormalizeException working the way it normally does in CPython, and will be tripped up if it happens to pass through the pybind11 exception handler.

I think we can also agree that it's better to avoid #ifdef PYPY in pybind11 when it's avoidable, and I think there may be a way:

  • Permit exceptions to change to a subclass of the original type after normalization (which allows pybind11 to do its thing)
  • Fail if the exception type changes to an unrelated class, since this case is more likely to be a bug

@rwgk
Copy link
Collaborator

rwgk commented Jul 18, 2022

pybind11 is the right tool to be enforcing this rule, especially by treating as a fatal error

Maybe there is a misunderstanding, or another bug in your environment (PyPy)? — It's not fatal, but a std::runtime_error that you can catch. See unit test, which does exactly that:

with pytest.raises(RuntimeError) as excinfo:

Oh!

@pytest.mark.skipif("env.PYPY", reason="PyErr_NormalizeException Segmentation fault")

Of course, it's right there!

Note that I ran Google-global testing, exercising 800+ pybind11 extensions many (maybe most, idk) 3rd party, usually that is very conclusive. We've been using PR #1985 Google-internally since June 2 and there we no complaints at all. Let's wait a couple more weeks to see how many more complaints we get. If there is too much pushback, we could make this opt-in vs always enabled. At the moment this bug is the only complaint I know about.

  • Permit exceptions to change to a subclass of the original type after normalization (which allows pybind11 to do its thing)
  • Fail if the exception type changes to an unrelated class, since this case is more likely to be a bug

Sounds good. Do you want to try?

Nothing about it is treated as an error or warning; it's permitted.

I believe they didn't think it through. They wanted to optimize, but forgot to handle the distinction between "accident/bug" and "intent". While working on #1895 I searched the documentation for PyErr_Occurred() conditions, then the source code, and was very surprised to see that that is not handled at all. "... optimization is the root of all evil." In this case not so much "premature" (Knuth), but "rushed", because they opened the door wide to masking errors in the error handling, which I can tell you from first-hand experience, can be a terrible time sink; weeks worth lost of my time, and that's just me.

rwgk pushed a commit to rwgk/pybind11 that referenced this issue Jul 18, 2022
… segfault with unknown root cause.

Change prompted by pybind#4075
@rwgk
Copy link
Collaborator

rwgk commented Jul 18, 2022

FYI: I'm testing #4079 , but it's late here. I'll try to finish the PR asap and will explain then.

@rwgk
Copy link
Collaborator

rwgk commented Jul 18, 2022

pybind11 is the right tool to be enforcing this rule, especially by treating as a fatal error

Maybe there is a misunderstanding, or another bug in your environment (PyPy)? — It's not fatal, but a std::runtime_error that you can catch. See unit test, which does exactly that:

with pytest.raises(RuntimeError) as excinfo:

Oh!

@pytest.mark.skipif("env.PYPY", reason="PyErr_NormalizeException Segmentation fault")

Of course, it's right there!

No, false!

The claim that we are treating a change in exception type "as a fatal error" completely got me on the wrong track. It turns out to be a clean RuntimeError in all PyPy CI jobs, that preserves all the existing information, but explains that the exception type has changed:

https://github.com/pybind/pybind11/runs/7385124173

You could adjust to that by catching RuntimeError in addition to FileNotFoundError. That's a very small inconvenience compared to the huge gain in safety everywhere else (original errors are no longer masked by errors in the error handling).

If PyPy is then changed to normalize the error immediately, instead of relying on pybind11 to do that for PyPy, the world at large will be in a better state than it was before.

Working on this bug reminded me that this

@pytest.mark.skipif("env.PYPY", reason="PyErr_NormalizeException Segmentation fault")

was needed even before #1895 was merged. Fixing the root cause for that segfault would definitely be a good thing. As-is, PyPy appears to mask errors in the error handling with a segfault.

@jbarlow83
Copy link
Contributor Author

Apologies about the "fatal error" comment - I mistook pybind_fail for meaning that the interpreter is terminated.

For pybind11 2.10 replaces an OSError from PyPy with an RuntimeError - I can't agree this is a clean solution. It replaces with the expected exception, an OSError, with a RuntimeError. That is an API breaking change and a regression from 2.9. I don't think it's reasonable to ask client code to introduce special, hacky, PyPy-specifc exception handling everywhere until PyPy addresses hte underlying issue.

I agree with it does make sense to break client code when it's clearly doing something wrong, dangerous and most importantly fixable by client code, but that's not the case here. The code that's presumably broken is in PyPy (or perhaps cpyext). That segfault is also worrying. But client code can't fix the fundamental issues since those issues aren't in client code. I don't think it makes sense to punish client code over the issue.

Everyone who supports PyPy also supports CPython, so any code is raising denormalized polymorphic exceptions it will get flagged by the MISMATCH test on the CPython side. That will prove safety and allow PyPy to work.

I think that leaves two options:

  • report the issue to PyPy/cpyext and for now, use an #ifdef to drop the MISMATCH test from PyPy compiles
  • disable PyPy support from pybind11 if we're that uncomfortable with its software engineering decisions around exceptions

@Skylion007
Copy link
Collaborator

@rwgk Could this segfault be related to the fact PyErr_NormalizeException decrefs the last two arguments, and we don't inrec them? See this comment: #1895 (comment) . It also demonstrates the very small perf penalty of normalizing the eagerly if that would solve the issue.

@rwgk
Copy link
Collaborator

rwgk commented Jul 19, 2022

@rwgk Could this segfault be related to the fact PyErr_NormalizeException decrefs the last two arguments, and we don't inrec them?

Hm... I think we have been calling PyErr_NormalizeException correctly, before and after PR #1895, that aspect didn't change. Any manipulations of the refcount would be PyPy workarounds. I'm not sure it's worth the time experimenting with "something that avoids the PyPy segfault but does not introduce refcount leaks". Someone with a vested interested in PyPy (not me) debugging PyPy would seem like a better use of collective time.

See this comment: #1895 (comment) . It also demonstrates the very small perf penalty of normalizing the eagerly if that would solve the issue.

There was a lot of back and forth on this (even just in my own thinking, but also between reviewers), but in the end we settled on immediate normalization:

// Immediate normalization is long-established behavior (starting with
// https://github.com/pybind/pybind11/commit/135ba8deafb8bf64a15b24d1513899eb600e2011
// from Sep 2016) and safest. Normalization could be deferred, but this could mask
// errors elsewhere, the performance gain is very minor in typical situations
// (usually the dominant bottleneck is EH unwinding), and the implementation here
// would be more complex.

@rwgk
Copy link
Collaborator

rwgk commented Jul 19, 2022

I think that leaves two options:

  • report the issue to PyPy/cpyext and for now, use an #ifdef to drop the MISMATCH test from PyPy compiles

Yes. I'll do the #ifdef part, but hope someone else will do the reporting to PyPy.

  • disable PyPy support from pybind11 if we're that uncomfortable with its software engineering decisions around exceptions

No! :-)

My interest: guard against masking errors in the error handling, because I know from direct experience how bad that can be ("extremely expensive to debug").

But I don't want to break anything that doesn't share that interest, if I can help it, which I believe the #ifdef will do.

For pybind11 2.10 replaces an OSError from PyPy with an RuntimeError - I can't agree this is a clean solution.

I never meant to suggest it is a clean solution, I know it is not. The root of the problem is the missing intent vs bug distinction in PyErr_NormalizeException, leaving us stuck between a rock and a hard place:

  • Original errors getting masked by errors in the error handling (see unit test) == extremely expensive debugging that, in a large scale environment, or the world as a whole, will happen at a certain rate.
  • Turn a change of exception type into a noisy failure.
  • Or something smarter as you suggested before (inheritance), but that's a lot of added complexity, compared to pushing back with the noisy failure, which is basically a way of saying: please normalize such cases immediately yourself (rather than relying on pybind11 to do it for you).

Of course that's no good for someone merely wanting to use pybind11 & PyPy in combination, but not wanting to get into PyPy to add the immediate normalization there. I think the #ifdef is a good compromise.

rwgk pushed a commit that referenced this issue Jul 21, 2022
…#4079)

* For PyPy only, re-enable old behavior (likely to mask bugs), to avoid segfault with unknown root cause.

Change prompted by #4075

* Undo the change in tests/test_exceptions.py

I turns out (I forgot) that PyPy segfaults in `test_flaky_exception_failure_point_init` already before the `MISMATCH` code path is reached:

https://github.com/pybind/pybind11/runs/7383663596

```
RPython traceback:
test_exceptions.py .......X.........Error in cpyext, CPython compatibility layer:
  File "pypy_module_cpyext.c", line 14052, in wrapper_second_level__star_3_1
  File "pypy_module_cpyext_1.c", line 35750, in not_supposed_to_fail
Fatal Python error: Segmentation fault
Stack (most recent call first, approximate line numbers):
  File "/home/runner/work/pybind11/pybind11/tests/test_exceptions.py", line 306 in test_flaky_exception_failure_point_init
The function PyErr_NormalizeException was not supposed to fail
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/_pytest/python.py", line 185 in pytest_pyfunc_call
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/pluggy/_callers.py", line 9 in _multicall
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/pluggy/_manager.py", line 77 in _hookexec
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/pluggy/_hooks.py", line 244 in __call__
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/_pytest/python.py", line 1716 in runtest
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/_pytest/runner.py", line 159 in pytest_runtest_call
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/pluggy/_callers.py", line 9 in _multicall
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/pluggy/_manager.py", line 77 in _hookexec
Fatal error in cpyext, CPython compatibility layer, calling PyErr_NormalizeException
Either report a bug or consider not using this particular extension
<SystemError object at 0x7fcc8cea6868>
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/pluggy/_hooks.py", line 244 in __call__
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/_pytest/runner.py", line 261 in <lambda>
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/_pytest/runner.py", line 317 in from_call
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/_pytest/runner.py", line 246 in call_runtest_hook
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/_pytest/runner.py", line 218 in call_and_report
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/_pytest/runner.py", line 118 in runtestprotocol
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/_pytest/runner.py", line 110 in pytest_runtest_protocol
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/pluggy/_callers.py", line 9 in _multicall
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/pluggy/_manager.py", line 77 in _hookexec
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/pluggy/_hooks.py", line 244 in __call__
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/_pytest/main.py", line 335 in pytest_runtestloop
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/pluggy/_callers.py", line 9 in _multicall
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/pluggy/_manager.py", line 77 in _hookexec
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/pluggy/_hooks.py", line 244 in __call__
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/_pytest/main.py", line 318 in _main
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/_pytest/main.py", line 255 in wrap_session
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/_pytest/main.py", line 314 in pytest_cmdline_main
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/pluggy/_callers.py", line 9 in _multicall
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/pluggy/_manager.py", line 77 in _hookexec
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/pluggy/_hooks.py", line 244 in __call__
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/_pytest/config/__init__.py", line 133 in main
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/_pytest/config/__init__.py", line 181 in console_main
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/site-packages/pytest/__main__.py", line 1 in <module>
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/lib-python/3/runpy.py", line 62 in _run_code
  File "/opt/hostedtoolcache/PyPy/3.7.13/x64/lib-python/3/runpy.py", line 170 in _run_module_as_main
  File "<builtin>/app_main.py", line 109 in run_toplevel
  File "<builtin>/app_main.py", line 652 in run_command_line
  File "<builtin>/app_main.py", line 996 in entry_point
Segmentation fault (core dumped)
```

* Add test_pypy_oserror_normalization

* Disable new `PYPY_VERSION` `#if`, to verify that the new test actually fails.

* Restore PYPY_VERSION workaround and update comment to reflect what was learned.

* [ci skip] Fix trivial oversight in comment.
@jbarlow83
Copy link
Contributor Author

Sorry for delay in respond.
LGTM.

I'll submit a quick PR to clarify the comments many types of OSError may do this, not just FileNotFoundError.

@jbarlow83
Copy link
Contributor Author

Reported to PyPy over here.
https://foss.heptapod.net/pypy/pypy/-/issues/3786

@mattip
Copy link

mattip commented Jul 22, 2022

Hi. I am really confused. It seems there are at least two issues here:

  • Some C-API call to open a file is setting an OSError on PyPY and a FileNotFoundError on CPython
  • pybind11 calls PyErr_NormalizeException and there is some inconsistency in what PyPy does compared to what CPython does.

Do I understand correctly up to here? If so read on. If not please correct me.

The first one is a PyPy bug, and PyPy (i.e. me) should fix it. Do you know what C-API call is being used?

The second one seems to be that PyPy's implementation of PyErr_NormalizeException will unconditionally create a new error object from type and value, and does not follow the "If the values are already normalized, nothing happens" part of the spec. Is that right?

@rwgk
Copy link
Collaborator

rwgk commented Jul 22, 2022

Hi. I am really confused. It seems there are at least two issues here:

  • Some C-API call to open a file is setting an OSError on PyPY and a FileNotFoundError on CPython

I'm not too sure about the exact internal steps, too, but yes, at the moment when py::error_already_set is constructed:

  • C-Python exc_type is FileNotFoundError
  • PyPy exc_type is OSError

The py::error_already_set constructor calls PyErr_Fetch(), followed immediately by PyErr_NormalizeException(). After the latter:

  • C-Python exc_type is still FileNotFoundError (unchanged)
  • PyPy exc_type is still FileNotFoundError (changed)

It was "always" that way, but PR #1895 changed the behavior of pybind11, for safety. It now does this:

std::string msg = std::string(called)
+ ": MISMATCH of original and normalized "
"active exception types: ";

Instead of the FileNotFoundError, a RuntimeError with that message is raised, explaining what happened. This is meant to catch bugs. See prior comments here for more background.

The first one is a PyPy bug, and PyPy (i.e. me) should fix it. Do you know what C-API call is being used?

I never looked deeper than what I described above.

The second one seems to be that PyPy's implementation of PyErr_NormalizeException will unconditionally create a new error object from type and value, and does not follow the "If the values are already normalized, nothing happens" part of the spec. Is that right?

I don't think that ever was a problem for pybind11, before and after PR #1895.

From my limited understanding / biased viewpoint: PyPy seems to "exploit" PyErr_NormalizeException() to arrive at FileNotFoundError. If PyPy was changed to call PyErr_NormalizeException() "as soon as possible", and not rely on pybind11 to do that as a side-effect (of using py::error_already_set in the processing of the error), everything would be fine.


A related but separate PyPy problem that existed before and still exists after PR #1895:

https://github.com/pybind/pybind11/blob/master/tests/test_exceptions.py#L306

It would be great for PyPy users if that got fixed. As-is, errors in the error handling (the unit test) lead to a PyPy segfault. I'm guessing that could be very time consuming to debug, especially if the error condition is difficult to reproduce, which I've seen quite often for production jobs (as a general observation, I don't think Google is using PyPy in production).

@mattip
Copy link

mattip commented Jul 26, 2022

PyPy now emits the proper normalized error in nightlies, and this will be part of the 7.3.10 release. Thanks for pointing it out.

@mattip
Copy link

mattip commented Oct 4, 2022

I added a PyPy-nightly run to the binary-run repo where I test c-extension packages. ubuntu and windows are passing, macos is failing this test. I wonder what is different about macOS. Using clang instead of gcc/MSVC maybe?

@rwgk
Copy link
Collaborator

rwgk commented Oct 4, 2022

I added a PyPy-nightly run to the binary-run repo where I test c-extension packages. ubuntu and windows are passing, macos is failing this test. I wonder what is different about macOS. Using clang instead of gcc/MSVC maybe?

Yes, most likely: I've only ever seen clang being used when testing pybind11 on macOS. Maybe related: Apple's versioning doesn't seem to match "normal" clang versioning. (It's one of those things I've been confused about but never got a chance to drill down.)

@Skylion007
Copy link
Collaborator

@rwgk Apple's version = Clang version - 2, usually.

@rwgk
Copy link
Collaborator

rwgk commented Oct 4, 2022

@rwgk Apple's version = Clang version - 2, usually.

Interesting ... thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
triage New bug, unverified
Projects
None yet
Development

No branches or pull requests

4 participants