
[BUG]: ABI incompatibility between 2.9.2 and 2.10 #4105

Closed
3 tasks done
whophil opened this issue Aug 2, 2022 · 36 comments
Labels
regression (Regression in a recent release), triage (New bug, unverified)

Comments

@whophil

whophil commented Aug 2, 2022

Required prerequisites

Problem description

I have two pybind11 extension libraries, A and B, where B uses classes from A.

My library A was compiled with pybind11 2.9 and B was compiled with pybind11 2.10. They both declare the same PYBIND11_INTERNALS_VERSION of 4, yet they do not seem to work with each other. They do work if I compile both A and B with pybind11 2.9.

Based on discussion with @henryiii on Gitter, these should be ABI compatible.
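
For concreteness, a minimal sketch (not the reporter's actual code, and not a reproducer) of the kind of setup described: extension A binds a C++ class, and extension B consumes instances of that class by relying on A's registration being visible through pybind11's shared internals. The names liba, libb, and Mesh are purely illustrative.

// ---- liba.cpp, built against pybind11 2.9 (hypothetical) ----
#include <pybind11/pybind11.h>
namespace py = pybind11;

// In practice this struct would come from a C++ header shared by both projects.
struct Mesh {
    int n_vertices = 0;
};

PYBIND11_MODULE(liba, m) {
    py::class_<Mesh>(m, "Mesh")
        .def(py::init<>())
        .def_readwrite("n_vertices", &Mesh::n_vertices);
}

// ---- libb.cpp, built against pybind11 2.10 (hypothetical) ----
#include <pybind11/pybind11.h>
namespace py = pybind11;

struct Mesh {  // in practice pulled in from A's shared header
    int n_vertices = 0;
};

PYBIND11_MODULE(libb, m) {
    // libb never registers Mesh itself; converting the Python-side Mesh
    // argument back to C++ only works if both extensions resolve to the same
    // pybind11 internals and therefore see liba's registration of the type.
    m.def("vertex_count", [](const Mesh &mesh) { return mesh.n_vertices; });
}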

Reproducible example code

N/A
whophil added the triage (New bug, unverified) label on Aug 2, 2022
@henryiii
Collaborator

henryiii commented Aug 2, 2022

From Gitter:

pybind11 2.10 issue

ocp is a package on conda-forge consisting of a collection of pybind11 extensions for OpenCascade.

I have another closed source project, Project B, which contains its own pybind11 code, using classes from ocp.

In the past, this all worked fine so long as both projects were built using the same compiler version and pybind11-abiversion, using those from conda-forge-pinning.

Recently, however, the build for Project B started breaking. This seems to be related to the release of pybind11 version 2.10. If I build Project B with pybind11=2.10 the two extension libraries can't talk to each other. If I pin the build of Project B to pybind11=2.9.*, which is the version that the ocp package was built with, then everything works fine.

I thought that this is what the pybind11-abi global pin was supposed to solve. Is it possible that pybind11-abi is not sufficient for this use case? If so, should pybind11 itself be a global pin as well?

@rwgk
Collaborator

rwgk commented Aug 2, 2022

We need something to work from, ideally a reproducer, but at least a careful description of the steps and symptoms needed to reproduce. I think I've seen all changes between 2.9 and 2.10, but I wouldn't know how we may have accidentally broken ABI compatibility. There is also a chance that it only worked by luck before, and some small perturbation makes it fall over now. ABI compatibility is very tricky to be certain about, especially because we don't even have any directly related unit testing (I believe it's impractical).

henryiii added the regression (Regression in a recent release) label on Aug 8, 2022
@whophil
Author

whophil commented Aug 9, 2022

Thanks @rwgk . I can put together an example using ocp, but it's a massive project so I don't know if it will be the best candidate for debugging.

@rwgk
Collaborator

rwgk commented Aug 9, 2022

Thanks @rwgk . I can put together an example using ocp, but it's a massive project so I don't know if it will be the best candidate for debugging.

Sorry, I sure won't have time to debug this.

I'm just hoping for some concrete clues. If that doesn't lead to easy insights, I'd lean toward bumping the internals version. We have been holding back an improvement behind an ifdef for some time already. If we now have an unintentional break, I'd say let's make that a trigger for a version bump rather than spend time looking back.

Sooner or later ABI breaks are unavoidable.

@henryiii
Collaborator

@whophil Could you check this again with current master? #4100 might be related.

@henryiii
Collaborator

I can put together an example using ocp, but it's a massive project so I don't know if it will be the best candidate for debugging.

Honestly, even if you have a massive project, as long as it can compile in finite time, and can be compared with live checkouts of pybind11, I could probably fire and forget a git bisect and detect where it broke. Currently it's a handwavy "something broke in my library".

@whophil
Author

whophil commented Oct 22, 2022

Thanks @henryiii - I guess you mean that I can build Project A with pybind11 2.9.2, and Project B with pybind11 on current master, and see if the ABI compatibility issue is resolved?

@henryiii
Collaborator

henryiii commented Oct 22, 2022

Yes. If it's not, and if you could provide an ocp example I can test, I can run a git bisect and see where it started failing.

@rwgk
Collaborator

rwgk commented Oct 22, 2022

I have a suspicion that PR #1895 could be the reason. When I merged it I was thinking we're safe, but in the meantime I did some unrelated work (#4043, #4050) and what I learned made me think maybe not.

That PR cannot easily be rolled back, even locally (git revert a05bc3d2359d12840ef2329d68f613f1a7df9c5d has merge conflicts). It's definitely impossible to roll back on the smart_holder branch, because the original error_already_set is too unsafe for Google's production environment.

I wanted to let you know this suspicion before you spend time on a reproducer.

I really want to know, too, what the root cause is, but then we will have the question:

  • backtrack
  • patch
  • increment the ABI version

What is better overall? I think backtracking could be extremely disruptive. Patching will take effort to figure out & validate. Incrementing the ABI version, after many years, is a clean, simple move.

The fact that nobody has given us a reproducer suggests to me that ABI compatibility is important, but not actually critical.

Ultimately: We don't have any unit testing for ABI compatibility. That would be a useful contribution to make for someone with a vested interest, protecting us from accidents in the future.

@whophil
Author

whophil commented Oct 22, 2022

Thanks @henryiii and @rwgk !

Just to provide some clarity on my own use case - I don't particularly care if versions 2.9 and 2.10 are ABI compatible.

However, it would be nice to know when an ABI-compatibility breaking change has been made. In the conda-forge feedstock for pybind11, the PYBIND11_INTERNALS_VERSION is used to ascertain which versions are ABI-compatible. As long as this number is bumped when necessary, the dependency mechanisms of conda-forge should handle the rest.

To this end, the unit testing suggested by @rwgk would be more personally useful, rather than attempting to make 2.10 ABI-compatible with 2.9.
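
As a hedged aside (not something run in this thread): a trivial program like the one below, compiled against each pybind11 checkout, shows which ABI bucket a given set of headers falls into. PYBIND11_INTERNALS_VERSION and PYBIND11_INTERNALS_ID are the macros pybind11's headers use to key the shared internals; if the ID strings printed by two builds differ, extensions from those builds will not share types at runtime.

// Build with the Python and pybind11 include paths of the toolchain in question.
#include <cstdio>

#include <pybind11/pybind11.h>

int main() {
    std::printf("PYBIND11_INTERNALS_VERSION: %d\n", PYBIND11_INTERNALS_VERSION);
    std::printf("PYBIND11_INTERNALS_ID:      %s\n", PYBIND11_INTERNALS_ID);
    return 0;
}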

@henryiii
Collaborator

This is the one and only report of ABI breakage. I've worked through the other two issues I thought might be ABI breakages, and both of them were something else. Without any example code or reproducer, I think this one issue of private code with no reproducer is not enough evidence to say 2.10 has broken ABI. We have many major users of our ABI that haven't complained. I am hoping it's because they upgraded and were fine, and not because they haven't upgraded for other reasons! I think we should continue with 2.10.1, bump the ABI for 2.11, which will follow not too far in the future, and move on. We really should add an ABI test in the future.

In the conda-forge feedstock

This is only true if you use the pybind11 conda-forge package, which many packages do not. I checked; SciPy does, while PyTorch does not. Also, conda-forge is not the only packaging ecosystem. Many packages ship on PyPI, and there's no ABI control there. If we bump the ABI, every single PyTorch extension must be recompiled when PyTorch picks up our ABI bump. They will not be cross compatible anymore - something users have been able to rely on since 2.2-2.4 or so. Also, @rwgk wants to allow a selectable ABI, with one release having several choices, which totally breaks the conda-forge model of a single ABI. So it's nice, but ABI bumps are not trivial things for a large number of pybind11's users just because conda-forge has a nice mechanism for them. ;)

@rwgk
Collaborator

rwgk commented Oct 23, 2022

This is the one and only report of ABI breakage. I've worked through the other two issues I thought might be ABI breakages, and both of them were something else.

Thanks for doing that! Do you have the issue numbers handy?

Also, @rwgk wants to allow selectable ABI with one release having several choices

Are you referring to #4218? What's in there is barely more than floating an idea. I defended it a little bit when @jbms suggested we could bundle the GIL-related internals change under version 5, but if there is more pushback I'll quickly give up. There will be another internals change under PR #4254; I already thought my "increment by one for each modification" idea might get out of hand too easily.

We have had selectable ABI versions, 4 or 5, since PR #3275 was merged on Sep 20, 2021. Google-internally we have been running with 5 since that same day (an automatic patch is applied to internals.h when we import the smart_holder branch). This was after the 2.7.1 release in Aug 2021, but before the 2.8.0 release on Oct 4, 2021.

I.e. the selectable ABI situation has already existed for over a year in released versions. Google cannot wait a year or more for external internals version bumps; some sort of easy switch is essential for us. I'm not aware of any issues caused by the mere presence of the switch. I believe just lumping everything that comes up together under one increment, until the default is bumped, will work fine. It's a little less obvious how many independent modifications were made, but we can dig that up from the git history without too much trouble.


Coming back to the problem reported here, my suspicion under #4105 (comment) is related to PYBIND11_EXPORT_EXCEPTION:

pytypes.h:class PYBIND11_EXPORT_EXCEPTION error_already_set : public std::exception {

detail/common.h:class PYBIND11_EXPORT_EXCEPTION builtin_exception : public std::runtime_error {

detail/common.h:#define PYBIND11_RUNTIME_EXCEPTION(name, type)                                                    \
detail/common.h:    class PYBIND11_EXPORT_EXCEPTION name : public builtin_exception {                             \

I'm beginning to think that if that's actually behind the problem reported here, it is not an internals-related ABI break but a straight-up ODR violation. If that's true, any change to the definitions of error_already_set, builtin_exception, or any of the PYBIND11_RUNTIME_EXCEPTION expansions between two pybind11 versions strictly speaking means that mixing them causes ODR violations. Note that this is just a theory at the moment; I could be completely off. Tagging @laramiel and @amauryfa to ask if that makes sense to them.

Also note that it is really tricky to get a handle on ODR violations. How they play out is extremely platform dependent, and I'm not aware of reliable tools for detecting them. There is this in ASAN, but I know for sure that it does not detect something as severe as a pybind11::detail::type_caster ODR violation (e.g. including pybind11/stl.h & pybind11/stl_bind.h, using PYBIND11_MAKE_OPAQUE for a given type in one translation unit, but forgetting it in another). That doesn't mean ASAN cannot detect PYBIND11_EXPORT_EXCEPTION-related ODR violations, but I don't know, and wouldn't bet on it. IOW, I don't know how to get from guessing and suspecting to a firm conclusion.
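
To make the type_caster example in the previous paragraph concrete, here is a hedged two-translation-unit sketch (file names hypothetical); each file compiles on its own, and linking both into the same extension is what produces the ODR violation described:

// ---- caster_tu_one.cpp (hypothetical) ----
#include <vector>

#include <pybind11/pybind11.h>
#include <pybind11/stl_bind.h>

PYBIND11_MAKE_OPAQUE(std::vector<int>);  // opaque caster in this TU

void append_one(std::vector<int> &v) { v.push_back(1); }

// ---- caster_tu_two.cpp (hypothetical) ----
#include <vector>

#include <pybind11/pybind11.h>
#include <pybind11/stl.h>  // value-converting caster; PYBIND11_MAKE_OPAQUE forgotten here

std::vector<int> make_vec() { return {1, 2, 3}; }

// Linking both translation units into one module leaves two conflicting
// definitions of pybind11::detail::type_caster<std::vector<int>>: the kind of
// ODR violation that, per the comment above, ASAN does not currently flag.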

@rwgk
Collaborator

rwgk commented Oct 23, 2022

This is the one and only report of ABI breakage. I've worked through the other two issues I thought might be ABI breakages, and both of them were something else.

@henryiii could you please let me know the two issue numbers?

I'm thinking of working on a demo for the suspected ODR. I expect that it will be a non-trivial effort. I'd like to look at the two issues to see if there is anything that could be related and maybe help me.

@jbms
Contributor

jbms commented Oct 23, 2022

Without any information to reproduce the issue, or even a description of exactly how the two libraries are failing to work together, it is difficult to do much with this report. Even a large reproduction would probably make it easy to identify the problem.

The exception types would only create potential ABI incompatibility if they are possibly thrown from one extension and caught in another. I haven't analyzed the internals data structures enough yet to see if that can happen. If it can, then the exception types would indeed be part of the ABI. Going forward, it would be better if the internals data structures didn't involve the pybind11 exception types.
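
A hedged sketch of the pattern described, compressed into one listing (in reality the two halves would live in separately built extensions, which is exactly where a mismatch in the exception types' definitions would surface). The function names are invented for illustration.

#include <pybind11/pybind11.h>
namespace py = pybind11;

// "Extension A" side: a helper exported from A's shared library that signals
// failure with a pybind11 exception type (py::value_error is one of the
// PYBIND11_RUNTIME_EXCEPTION classes discussed above).
void a_validate(int n) {
    if (n < 0) {
        throw py::value_error("n must be non-negative");
    }
}

// "Extension B" side: calls into A across the shared-library boundary and
// catches A's pybind11 exception. If A and B were built against headers in
// which the layout or visibility of these exception classes differs, this
// catch is where the ABI/ODR problem would show up.
bool b_is_valid(int n) {
    try {
        a_validate(n);
    } catch (const py::value_error &) {
        return false;
    }
    return true;
}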

@rwgk
Collaborator

rwgk commented Oct 23, 2022

It's only relatively recently that we added PYBIND11_EXPORT, then changed that to PYBIND11_EXPORT_EXCEPTION:

PR 2999 added a "Note" that makes me think only some environments actually need it, near the bottom of this section:

Making a very big jump from here: @whophil could change pybind11/detail/common.h to replace this entire block

#if !defined(PYBIND11_EXPORT_EXCEPTION)
#    ifdef __MINGW32__
// workaround for:
// error: 'dllexport' implies default visibility, but xxx has already been declared with a
// different visibility
#        define PYBIND11_EXPORT_EXCEPTION
#    else
#        define PYBIND11_EXPORT_EXCEPTION PYBIND11_EXPORT
#    endif
#endif

with just

#define PYBIND11_EXPORT_EXCEPTION

Does that resolve your issues?

@rwgk
Collaborator

rwgk commented Oct 23, 2022

I haven't analyzed the internals data structures enough yet to see if that can happen.

I think PYBIND11_EXPORT_EXCEPTION completely bypasses the internals.

I believe #2999 unintentionally punched a hole into the ABI protection layer (sloppily speaking).

#2999 was merged between pybind11 releases v2.6.2 (Jan 26, 2021) & v2.7.0 (Jul 16, 2021), but I don't expect issues between those because v2.6.2 doesn't have the hole yet.

I also don't expect issues between following releases, assuming we didn't touch the definitions of the exported exceptions.

But 2.10 changed the definition of error_already_set. That would explain why we're seeing the consequences of #2999 only now.

Still a lot of guessing here. @whophil if you could help with experiments from your end that would be extremely valuable.

@jbms
Contributor

jbms commented Oct 23, 2022

Regarding #2999 --- the original report in #2847 didn't involve pybind11 exception types at all, and changing the visibility of the pybind11 exception types would have no effect there. However, #3016 changes the original example to inherit from a pybind11 exception type.

There are really 2 forms of ABI compatibility that we need to consider:

  1. C++ library A that does not itself use any pybind11 headers; pybind11 extension PA that wraps A; C++ library B that depends on A but not on PA or pybind11; pybind11 library PB that wraps B, imports PA as a Python library, and does not include any C++ headers from PA.
  2. C++ library A that depends on pybind11. C++ library B that depends on A and pybind11.

Case 1 is what I have normally considered when thinking about ABI compatibility --- the only interaction happens through the internals data structures, so as long as they are compatible it is okay if PA and PB use a different version of pybind11 headers. (But note that I have not yet confirmed that exception types never pass through functions registered in the internals data structures.)

For case 2, essentially any pybind11 type could be used across an ABI boundary, and therefore it requires a much stronger form of ABI compatibility.

I think we should say that in case 2, library A and B must use exactly the same version of pybind11. Unfortunately, I don't know of a way to try to enforce this restriction automatically, so that an error is produced if it is violated.
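
A hedged illustration of case 2 (names and signatures invented for the example): library A's public C++ API traffics in pybind11 types, so library B calls those symbols directly rather than going only through the internals, and every pybind11 class in the signature becomes part of the A/B ABI.

#include <pybind11/pybind11.h>
namespace py = pybind11;

// ---- part of C++ library A's public API (hypothetical) ----
py::dict summarize(const py::object &record) {
    py::dict d;
    d["repr"] = py::repr(record);
    return d;
}

// ---- inside C++ library B, which depends on A and on pybind11 ----
py::dict b_report(const py::object &record) {
    // py::object and py::dict cross the A/B boundary here. If A and B were
    // built against pybind11 headers in which these classes differ at all,
    // this call mixes mismatched definitions, which is why the comment above
    // argues that in this configuration A and B must use exactly the same
    // pybind11 version.
    return summarize(record);
}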

@henryiii
Collaborator

henryiii commented Oct 24, 2022

could you please let me know the two issue numbers?

They are already linked above on Aug 24:

#4150 & #4137. Both turned out to not be ABI breakages, but something else. I don't think they are helpful in debugging. This issue is the only report of ABI breakage we have, and it's for private code we can't see.

The exception types would only create potential abi incompatibility if they are possibly thrown from one extension and caught in another.

It's quite possible this is the problem. Many, I might even postulate most, pybind11 extensions don't use ABI compatibility at all - it's a nice but slightly niche feature (at least between separately compiled extensions - I'm not counting multiple extensions in one package, as those are always compiled with the same version of pybind11 anyway so it doesn't matter in this context). Of those that do use it (like PyTorch and several other ecosystem-like packages - due to pybind11's usage even a few percent is potentially hundreds of packages), it might be quite rare to produce and catch errors across extensions. If this is true, then 2.10 is only an ABI breaking release if you fit into this narrow category.

Our ABI support is currently a best-effort project, and it has held up pretty well from 2.4 to 2.9/2.10. It's extremely useful for our users to be able to upgrade pybind11 without worrying about consequences - most users want to use the latest Python version and such, but don't want to have to force their ecosystem to upgrade. For example, PyTorch would have to force all dependent packages to upgrade pybind11 if it upgraded past an ABI boundary. They've pushed us in the past to find ways to make ABI protection looser, not stronger.

And our ABI protection fixes nothing (not counting systems like conda-forge that use it to do migrations). It just replaces the potential for a bad error with an arguably better/arguably worse error: the extension simply won't work, because it won't be seen as compatible - someone developing an extension might actually prefer a segfault over it simply not being seen at all, since at least then you have a clue it's configured correctly but incompatible. And "only matching versions" would be useless - someone can do that already by manually forcing matching pybind11 versions. But that means you fully control the stack, which is not the case we are trying to address with ABI versioning in the first place.

Repeat: you only get a different error with the ABI number (outside of conda-forge's ABI pinning & migrations!). Bumping the ABI number doesn't somehow mean you can use 2.4-2.9 and 2.x together; it just means they don't work together at all - avoiding a segfault or similar, but not "fixing" anything. I very much doubt we'll convince PyTorch and OCP to update to 2.x with a new ABI soon and force them to migrate all dependent users on our schedule. So unless we know the ABI is broken beyond repair, continuing to support 2.10.x will benefit most of our users!

In an ideal world, my proposal would be:

  • Make a 2.10.1 release tomorrow at 1:00 PM Eastern (at the Python 3.11 release party).
  • Include a very minor note about potential ABI incompatibility
  • Maybe / probably mention we are going to move forward with a forced ABI bump soon
  • Start adding the 3-6 different cleanups and projects stalled waiting for an ABI bump
  • Release pybind11 2.11 or 3.0 with the combined version bumps. Possibly keep bumping the version for a release or two until we are stable again. Right after a Python version release is a great time to break ABI, actually.
  • Possibly redesign the ABI to be more restrictive / easier to maintain / better at ignoring compiler version and such
  • Implement tests for ABI compatibility so we are better protected in the future. ABI compatibility is still no better than a best-attempt guess without testing.

Of course, we don't have a developer dedicated to the ABI - Google doesn't care about an ABI since they control the stack. I'm not full time on pybind11, my focus is on scikit-build for the next few years (and teaching for the next few weeks). Etc. So we'll probably do a subset of the above, but I think it's a good plan in general.

In summary, I firmly believe 2.10.1 will be the best version of pybind11 we've ever released. It's the culmination of a lot of hard work, real-world usage, and bug reports. There might be some small concern about ABI compatibility, but it's no worse than 2.10.0, which generated only this one bug report. There might be some concern about GIL reentry, but it's no worse than 2.9, 2.8, 2.7, 2.6, 2.5, 2.4, and many versions before that too - it's simply not supported yet (and 2.10 offers a possible workaround, so that's a net improvement). In basically every other way, it is a clear upgrade.

Also, I think we should always be willing to provide a patch release if one is needed. Patch releases never promise to fix every known bug; they just provide small iterative fixes for whatever bugs we can address.

@rwgk
Collaborator

rwgk commented Oct 24, 2022

Thanks @henryiii for the issue numbers. Good for me to be able to start from a certain place.

I looked at #4137 and closed it after confirming conclusively it's a GIL-not-held issue. Hopefully #4246 will stave off such bug reports in the future, after we merge it. Similarly, I closed #4150 after I realized and explained that it's a DECREF-after-interpreter-finalization.

Same conclusion as yours: nothing to do with an ABI break, for sure.

Google doesn't care about an ABI since they control the stack.

That's only true internally. Google has a big OSS presence. We do care. At least I do, in particular because I'm the one who's pushing the PyCLIF-pybind11 integration, with the promise that we connect better to the OSS world.

I want to take the PYBIND11_EXPORT_EXCEPTION concerns very seriously. If my current understanding is correct (comment above), and with some more guessing, I'm beginning to think that #2999 solved one problem (call it Problem-A, which I cannot define exactly at the minute, TBH) but created another one, the one here, call it Problem-B, although I also don't know that with certainty. Now:

  • What if that's true, and we have to tell people: You can have solution to Problem-A, or Problem-B, but not both at the same time?
  • Potential upshot: if PYBIND11_EXPORT_EXCEPTION sets default visibility, we may need to std::terminate() when we detect a pybind11 version mismatch (e.g. < 2.10 vs >= 2.10), to avoid ODR-related undefined behavior.

2.7.0 was the release that brought in the suspected compatibility trap, but only 2.10.0 triggered it, via #1895. We don't know that with certainty, but we have a very strong suspicion. I feel it would be wise not to make a public move before we at least hear back from @whophil (trying the empty PYBIND11_EXPORT_EXCEPTION define).

@henryiii
Collaborator

You are welcome to take the PYBIND11_EXPORT_EXCEPTION concerns very seriously; we can release a 2.10.2 if we come up with a further fix. This situation is so complex that no one has even been able to write a test for it in the two and a half months this issue has been open. 2.10.1 is a solid improvement, and "abandoning our 2.10 users" is out of the question. Because users have to upgrade to 2.10.0+ to support Python 3.11, many of our users have already upgraded (like SciPy, Matplotlib, etc.). So we need to support them with fixes and not abandon them. And if bumping the ABI is as serious as I think it is, many users may be stuck on 2.10.x for years to come.

Releasing 2.10.1 does not mean we promise to fix every known bug (otherwise, we have 400+ open issues someone needs to work on). It means we fix some known bugs. And we have: a Python 3.11 embedding bug, an MSVC 2019 C++14 bug, an NVCC 11.4-11.8 bug, a PyPy fix, a C++23 bug, a vcpkg fix, a Python 3.12 fix, and a pkg-config file added. Combined, these affect far more users than PYBIND11_EXPORT_EXCEPTION. I've been promising on these various issues that we would release soon; I've had a plan to release before 3.11 since August. Most of these fixes have already waited too long to be released; we should make patch releases more often, not less! We should not delay patch releases because we want to stuff more fixes in, only because we want to make sure they are stable and not adding new breakages.

I won't release 2.10.1 without at least one other maintainer's approval, and I'd really like @rwgk's approval.

@rwgk
Collaborator

rwgk commented Oct 24, 2022

So we need to support them with fixes and not abandon them.

Why slip into such radical language?

I just want to wait a day or two to hopefully hear back from @whophil.

If he confirms that the empty PYBIND11_EXPORT_EXCEPTION define fixes his issues, we need to let the 2.10 users know. Consequences of ODR violations tend to be extremely confusing and time consuming to debug.

@henryiii
Collaborator

henryiii commented Oct 24, 2022

Why slip into such radical language?

I'm just quoting you. You have said we need to abandon the 2.10 release series. Multiple times (#4218 (comment), #4125 (comment)). If we could "yank" the 2.10.0 release, that would be an option. But we can't - it's the only 3.11 compatible release, and already in heavy use. I've moved over every project I work on to it, and many others have done the same.

This issue is two and a half months old. I've been saying we would make a 2.10.1 release before 3.11 for months. I've let it slip till the last possible second (today is release day). I don't think one issue with one unverified problem should stop the release. We can always make a 2.10.2! Numbers are cheap. Patch releases should be frequent (say monthly).

Consequences of ODR violations tend to be extremely confusing and time consuming to debug.

Our time matters too. I've spent a lot of time building the other issues' code to verify they were not ABI issues. I can't work on this one because there's nothing public to work on. If someone hits this, hopefully they will open an issue with public code that we can work on.

a day or two

If you can set a date: I disagree that waiting is necessary in order to make a patch release with unrelated fixes, but I'll agree to wait till then - I'm not releasing without maintainer sign-offs. I had already set a date, and nothing happened before that date.

@rwgk
Collaborator

rwgk commented Oct 24, 2022

"Google doesn't care" and turning "abandon a release series to protect users" into "abandon users" will not get us anywhere good.

Our time matters too

I know. I'll focus on doing homework now, similar to what I did with PRs #4022, #4050, and #4072.

Please don't try to coerce me into approvals while I figure this out, by suggesting carelessness and bad intentions. Abandoning users is of course irresponsible.

Already yesterday I sent a comment saying that the developments here ("it's not an internals break, but ODR") made me think releasing 2.10.1 could be a good idea. I do want to get more clarity, though, before sending thousands of people into upgrading. I'll respect the votes of other maintainers if they have a different opinion.

@henryiii
Collaborator

henryiii commented Oct 24, 2022

"Abandon a release series to protect users" is abandoning users. If users have already moved to 2.10 (as many of them have), refusing to release important bug fixes because you are afraid they will hit a bug you can't fix and then associate your name to that bug is abandoning them. They can't downgrade to 2.9 without losing Python 3.11 support! All software might have bugs. We have to try to do our best on anything we maintain, and accept there will be things we can't address every release.

Even if there were a major ABI incompatibility, a large fraction of our users don't care about ABI compatibility, and those users matter too. The ABI compatibility feature of pybind11 is a powerful but rather specialized feature that we currently don't even test. A few users really care about it and use it, just like a (smaller) number of users really care about and use the GIL dissociation feature. And 2.10.1 wouldn't have been any worse than 2.10.0; it would have been the same.

We've now missed the Python 3.11 release party so I don't care anymore. My work was for nothing. Do what you want. I have to go back to teaching.

@rwgk
Collaborator

rwgk commented Oct 24, 2022

We've now missed the Python 3.11 release party so I don't care anymore. My work was for nothing. Do what you want. I have to go back to teaching.

I'm really sorry you feel that way. I reviewed #4279 as quickly as I could and wrote that I'm not opposed to the release, based on believing we're probably up against an ODR issue that cannot be fixed by bumping the internals version (contrary to my belief a few days ago). I explicitly encouraged other maintainers to vote their mind, and there is the approval from @Skylion007.

Everybody makes mistakes, me included. I believe it's important that everybody acts to the best of their knowledge, we vote, and we go with that, averaging out individual mistakes. That process worked out in your favor; I totally respect that, as I wrote before.

TBH, I only have a very crude idea of what the Python 3.11 release party is (I'm still not sure), so I didn't give it a lot of weight.

I'm quite often in a situation where I have to decide: make a move? or more due diligence? I got burned pretty badly a few times moving too fast internally, sending some teams into distress. Each time I decided to go a notch slower in general. That's what I'm still going by w.r.t this bug.

I also want to add that I don't feel great about being the one who merged both #2999 and #1895, overlooking the ODR potential arising from the combination of the two.

@whophil I made what I was hoping could be a reproducer based on my guessing, but the one time I was hoping the CI doesn't pass, it passed very easily:

This has happened to me many times before; guess-constructing a failure is a little bit like predicting the next lottery numbers. If there is any chance you could try out my very simple suggestion, that could be an extremely valuable clue. Reducing from something big often looks daunting at first, but it is a much safer (and in aggregate more efficient) way of finding root causes.

I'll keep experimenting and asking around until I find a way to tease out observable failures caused by the ODR in #4283, or until I convince myself that the combination of #2999 & #1895 is actually inter-version-ABI-safe on all platforms.

@whophil
Author

whophil commented Oct 25, 2022

Hi all. Sorry I haven't been able to follow all the action on this thread!

I worked on creating a reproducer over the weekend, but am having trouble producing a minimal one. The real private project is still failing when I build it with pybind11 2.10 and try to use it with a pybind11 2.9.2-compiled shared library, so there is definitely an issue. I think the jury is out on whether it is actually an ABI issue - I suppose that's what the reproducer aims to find out.

I will work on the reproducer as much as I can, but can't make any guarantees on when it will be done.

@rwgk
Collaborator

rwgk commented Oct 25, 2022

@whophil If I understand your situation correctly (you're building from 2.10), could you please try this tiny side-project, with everything else exactly as you have it already?

Simply apply this one-line change:

diff --git a/include/pybind11/detail/common.h b/include/pybind11/detail/common.h
index 1da323f3..c7fd5145 100644
--- a/include/pybind11/detail/common.h
+++ b/include/pybind11/detail/common.h
@@ -102,7 +102,7 @@
 // different visibility
 #        define PYBIND11_EXPORT_EXCEPTION
 #    else
-#        define PYBIND11_EXPORT_EXCEPTION PYBIND11_EXPORT
+#        define PYBIND11_EXPORT_EXCEPTION
 #    endif
 #endif

I.e. simply chop off PYBIND11_EXPORT in the one changed line, then build and run your real private project exactly as before. Do you still observe the same issue?

(I do have another idea that is only slightly more involved, but let's try this one first.)

@henryiii
Collaborator

henryiii commented Oct 25, 2022

Have you tested the latest master of pybind11? We fixed a variety of things that could possibly be related.

Have you tested building the parent library with 2.10? IMO, this would be the most important thing to test. Showing 2.10 - 2.10 works would verify this is related to ABI (or a related leakage or ODR violation).

Both of these checks require no code. :)

Also, what sort of error are you getting?

@whophil
Author

whophil commented Oct 25, 2022

@henryiii @rwgk I have some rather embarrassing news to report.

In building my reproducer, I was encountering really weird and inconsistent behavior. Digging deeper, I found that the issue does not seem to be in pybind11 at all - but rather related to some weird interaction between conda multi-python-version builds, pybind11, and CMake, resulting in CMake building Python extensions that did not match the Python interpreter version. For some reason, these problems only arise when building with pybind11 2.10, but aren't due to an ABI incompatibility between 2.9.2 and 2.10.

Building manually, I can confirm that my OCP extension library built with pybind11 2.10 works with a version of OCP built with pybind11 2.9.2.

I don't know if this comes as good or bad news to you. But in my specific application, this problem does seem to be resolved for now.

I want to say thanks very much for your help in trying to address my issues!

@henryiii
Collaborator

resulting in CMake building Python extensions that did not match the Python interpreter version

I believe this was fixed with #3577, which would have been in 2.10.1. This is why I said "We fixed a variety of things that could possibly be related." I didn't know it would be this one, but that's why I want to ship fixes even if not every issue is resolved.

@henryiii
Collaborator

Thank you so much for working on this! I'm sorry we didn't release 2.10.1 sooner and probably resolve your issue without forcing you to debug, but I am very grateful you did.

If you can try latest master, it should work. But don't worry about it, we should release soon.

@rwgk
Collaborator

rwgk commented Oct 25, 2022

Awesome, thanks for letting us know!

I was really spooked by having broken something with #1895.

A couple of small good things came out of this: I believe #2999 went far beyond what was actually needed (see #4284); we can trim it back after the 2.10.1 release. Also, I managed to tease out "real" ODR violation behavior under #4283, so strictly speaking the combination of #1895 & #2999 actually did break something, but only for highly unusual environments (roughly, using RTLD_GLOBAL under Linux & having cross-module exceptions).

rwgk pushed a commit to rwgk/pybind11 that referenced this issue Oct 31, 2022
Background: pybind#2999, pybind#4105, pybind#4283, pybind#4284

In a nutshell:

* Only macOS actually needs `PYBIND11_EXPORT_EXCEPTION` (pybind#4284).

* Evidently (pybind#4283), under macOS `PYBIND11_EXPORT_EXCEPTION` does not run the risk of introducing ODR violations,

* but evidently (pybind#4283) under Linux it does, in the presumably rare/unusual situation that `RTLD_GLOBAL` is used.

* Windows does not have an equivalent of `RTLD_GLOBAL`, therefore `PYBIND11_EXPORT_EXCEPTION` has no practical benefit there; on the contrary, noisy warning suppression pragmas are needed, therefore it is best left empty.
henryiii pushed a commit that referenced this issue Oct 31, 2022
…#4298)

Background: #2999, #4105, #4283, #4284

In a nutshell:

* Only macOS actually needs `PYBIND11_EXPORT_EXCEPTION` (#4284).

* Evidently (#4283), under macOS `PYBIND11_EXPORT_EXCEPTION` does not run the risk of introducing ODR violations,

* but evidently (#4283) under Linux it does, in the presumably rare/unusual situation that `RTLD_GLOBAL` is used.

* Windows does not have an equivalent of `RTLD_GLOBAL`, therefore `PYBIND11_EXPORT_EXCEPTION` has no practical benefit there; on the contrary, noisy warning suppression pragmas are needed, therefore it is best left empty.
@henryiii
Collaborator

2.10.1 has been released.

@rwgk
Collaborator

rwgk commented Nov 1, 2022

@whophil It's very likely that the deadlocks I was struggling with here have nothing to do with pybind11; rather, threads & fork are simply an unsafe combination: #4305, #4306

Unfortunately, neither PR works internally (for interesting reasons unrelated to deadlocks), therefore I cannot easily verify that the deadlocks are gone. (Maybe it's easy, but I don't know how.)

@rwgk
Collaborator

rwgk commented Nov 2, 2022

Update:

Unfortunately, neither PR works internally (for interesting reasons unrelated to deadlocks), therefore I cannot easily verify that the deadlocks are gone. (Maybe it's easy, but I don't know how.)

I tried really hard to reproduce the flakes on my corp dev workstation (Debian derived), using Python 3.9 and Python 3.10, using current pybind11 master with this tiny tweak only:

test_gil_scoped.py:

-ALL_BASIC_TESTS_PLUS_INTENTIONAL_DEADLOCK = ALL_BASIC_TESTS + (_intentional_deadlock,)
-SKIP_IF_DEADLOCK = True  # See PR #4216
+ALL_BASIC_TESTS_PLUS_INTENTIONAL_DEADLOCK = ALL_BASIC_TESTS # (_intentional_deadlock,)
+SKIP_IF_DEADLOCK = False  # See PR #4216

I ran pytest 100 times with these options:

-s -vv --flake-finder --flake-runs=1000 -n 16

In those 100,000 total runs there wasn't a single deadlock (or any other failure).

Unfortunately that means I still don't know how to prove that #4305 and/or #4306 resolve the "flakes" we saw in the GitHub CI over the past couple years.

Giving up.

We need to decide if we want to go with #4305 or #4306. The only tie breaker I can think of: which one makes test_gil_scoped.py run fastest?

Method: Run the test twice without doing anything else in between, take the 2nd "real" time. Results:

Small detail for completeness: the "spawn" timings are essentially identical for 1. using multiprocessing.get_context("spawn") in test_gil_scoped.py, 2. using multiprocessing.set_start_method("spawn") in conftest.py.

@henryiii @Skylion007 "forkserver" is clearly winning this one. Let's go with that and hope that we will never see any flakes or deadlocks again from test_gil_scoped.py

rwgk pushed a commit to rwgk/pybind11 that referenced this issue Nov 2, 2022
rwgk pushed a commit that referenced this issue Nov 18, 2022
…#4298)

Background: #2999, #4105, #4283, #4284

In a nutshell:

* Only macOS actually needs `PYBIND11_EXPORT_EXCEPTION` (#4284).

* Evidently (#4283), under macOS `PYBIND11_EXPORT_EXCEPTION` does not run the risk of introducing ODR violations,

* but evidently (#4283) under Linux it does, in the presumably rare/unusual situation that `RTLD_GLOBAL` is used.

* Windows does not have an equivalent of `RTLD_GLOBAL`, therefore `PYBIND11_EXPORT_EXCEPTION` has no practical benefit there; on the contrary, noisy warning suppression pragmas are needed, therefore it is best left empty.
rwgk pushed a commit to rwgk/pybind11 that referenced this issue Nov 23, 2022
rwgk pushed a commit that referenced this issue Nov 23, 2022
* Use `multiprocessing` `start_method` `"forkserver"`

Alternative to PR #4305

* Add link to comment under PR #4105

* Unconditionally `pytest.skip("DEADLOCK")` for PyPy Windows

* Remove `SKIP_IF_DEADLOCK` entirely, for simplicity. Hopefully this PR will resolve the deadlocks for good.

* Add "In a nutshell" comment, in response to request by @EricCousineau-TRI
@rwgk
Collaborator

rwgk commented Dec 1, 2022

FYI: We still seem to have a deadlock problem, even with multiprocessing start_method "forkserver". I'll track what I happen to see under #4373.

Note that this first one is with PYBIND11_SIMPLE_GIL_MANAGEMENT=True
