Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PLAN] Tracking issue for 2.10.1 #4125

Closed
henryiii opened this issue Aug 8, 2022 · 33 comments · Fixed by #4279
Closed

[PLAN] Tracking issue for 2.10.1 #4125

henryiii opened this issue Aug 8, 2022 · 33 comments · Fixed by #4279

Comments

@henryiii henryiii changed the title [PLAN] Tracking issue for 3.10.1 [PLAN] Tracking issue for 2.10.1 Aug 8, 2022
@rwgk
Copy link
Collaborator

rwgk commented Oct 10, 2022

@henryiii @Skylion007 @ax3l

I've been turning over in my mind what we should do about #4105It leaves the 2.10 release in a doubt-full state, but unfortunately nobody provided even the slightest clue what the problem could be. I don't want to act on suspicions, as they could lead us astray badly and waste a lot of time. (Furthermore, if nobody cares enough to provide actionable clues, it puts the importance of the issue for them into doubt, too.)

In the meantime we also got:

Considering all of this together, I believe it will be best to abandon the 2.10 line b/o the doubts about ABI compatibility, and
go straight to 2.11
, bumping the internals version (to resolve the doubts) but leave the #ifs of #4218 (internals.h) in place for now, so people can choose to go back to internals version 4 if it works for them and they are ok with the doubts.

@henryiii
Copy link
Collaborator Author

I'm still strongly in favor of a 2.10.1. It will be be no worse than 2.10.0, and it needs to be done before 3.11 is released, which is combine up soon. Both of those "bigger" changes above really should sit in master for a bit before release.

Also, I don't think an ABI bump with a "go back" option is all that useful. Distributions like conda-forge can't handle "selectable" ABI bumps. A version of pybind11 has a set ABI, and they can run a migration if it changes. Let's release a 2.10.1 later this week, then put in those larger changes, maybe go ahead and merge some of the other ABI bump ideas we've had in the past, then do a 2.11 properly.

@rwgk
Copy link
Collaborator

rwgk commented Oct 10, 2022

'm still strongly in favor of a 2.10.1. It will be be no worse than 2.10.0

I think it is: it gives 2.10 more legitimacy than it deserves. Unsuspecting users will upgrade to 2.10.1, only to discover that they will want to 2.11, i.e. we're leading people into wasting effort.

I'm happy to cut the 2.11 release, as soon as #4201 is merged.

How much longer do you think you need to fix up the scikit_build_example? All the rest looks like it could be done with a few hours of effort at most, if we don't drag our feet, but I don't know about scikit.

Also, I don't think an ABI bump with a "go back" option is all that useful.

If we explain well (a couple sentences will do here I think) we give people a choice. I don't like to force people's hands, unless there really is no other way, which is evidently (#4228) not the case here.

then do a 2.11 properly.

We cannot do 2.10 properly because of the doubts-without-clues, but I don't see why we cannot do 2.11 properly very soon, minus the scikit issue that's been lingering for 2 months already. But we could just document that that's broken since 2.10 and we still haven't been able to fix.

@henryiii
Copy link
Collaborator Author

I'd prefer not to leave a release series "broken" - some users will undoubtedly be stuck on 2.10 for a while. A 2.10.1 with better Python 3.11 support would be seen as a "safer" upgrade than 2.11.0. I've moved to 2.10 everywhere and have been very happy with it. The ABI problem doesn't affect all users.

Even if we release 2.11 a couple of weeks later, I think we should leave the 2.10 series in the best possible state.

If we explain well (a couple sentences will do here I think) we give people a choice.

Only if people read it, and only if they can. Conda-forge, for example, doesn't have a choice and this will trigger a migration. Given 2.10 is the 'last' of a very long ABI series, I wouldn't be surprised if it's a mini Python 2.

I'm happy to cut the 2.11 release, as soon as #4201 is merged.

"bigger" changes really should sit in master for a bit before release.

How much longer do you think you need to fix up the scikit_build_example?

Probably this week. By next Monday, at the latest.

@henryiii
Copy link
Collaborator Author

By "broken" I'm referring to the Python 3.11 fixes, not the ABI incompatibility, which I'm not too worried about, at the most I think it's likely quite small.

@rwgk
Copy link
Collaborator

rwgk commented Oct 10, 2022

How about:

I'd prefer not to leave a release series "broken" - some users will undoubtedly be stuck on 2.10 for a while.

If they can upgrade to 2.10.1, with the option to pin internals version 4, why would they not want to take a 2.11 instead? Especially to get rid of #4226, which is a long-standing but nevertheless serious issue.

"bigger" changes really should sit in master for a bit before release.

Why? Our testing is extremely thorough. We're not forcing anyone to upgrade. Let people choose their own pace. Holding developments back because of version numbering considerations looks to me like putting the cart in front of the horse: version numbers are meant to inform, e.g. to help staying organized in a progressing environment, not to stifle progress.

Using my own situation as an example: the Eigen::Tensor work looks super useful. Nobody has been asking me for it, but I want it deployed Google-internally asap, I bet it will get picked up quickly when people see it. — Only for completeness: Releases don't matter to me directly (I merge to smart_holder, then deploy from there) but having a close match with releases helps keeping everybody on the same page internally and externally. Therefore I'm happy to put in the effort cutting 2.11.

@henryiii
Copy link
Collaborator Author

That's fine and about what I was thinking - I'm not against a 3.11 in the near future, I just want a 3.10.1 too. I don't think we are really thinking that differently.

Our testing is extremely thorough.

It's never as thorough as real world usage, and we have quite a few projects running off of master, as evidenced by the quick issues and PRs whenever we break master. Usually breaking changes to master get noticed within a week. Our testing is on lots of systems, but only our unit testing, which isn't full coverage (in fact, we have prioritized a quick test suite over a thorough one). Google's test suite is great, but focused - some things it just doesn't catch. ABI breakages, for example, can't be caught by either. New features are often not caught by either (though a little rough edges on a new feature is better than breaking old code). Some things just aren't used at all by Google that are important to other projects. I'd recommend at least letting any large PRs sit in master for at least week before releasing. If it's entirely additive (#4201), I don't think it's as important, but #4226 is not additive.

version numbers are meant to inform

I think they are also a bit of a mark of stability. master could have bugs - we try our best, we keep tests passing, but it is fresh, un-vetted code. stable points at the last release, and doesn't update otherwise - releases are "stable", not fresh, code. If you release immediately after a big PR is merged, that's not "stable". A version number is intended to be more stable. It's a point at which we say the code is "ready for use". And if we catch a mistake, we backport fixes and release a new patch. So far, everything we have caught is enough for a combined patch release, but not so critical we need to rush a patch. (Ideally, we should have probably released two patch releases, actually).

If you live at head, then you want master, and that's fine, but it means you have thorough CI checking things. Many projects would rather stable code and are happy if it's a little stale. Many projects also live at head which is way waiting a little bit is useful. :)

We branch before merging ...

#4228 is already merged? Is it something I need to keep out of the 2.10.1 release? I'm totally okay with branching, it's not that much work and much better than holding up progress. If you want to merge something (that's been reviewed), tag it with 2.11 and I'll work in the v2.10 branch.

@henryiii
Copy link
Collaborator Author

Unless I missed something, #4228 is the NVIDIA workaround? That's fine for a patch?...

I finally understand what the problem is with scikit_build_example. Not totally sure what the best fix for it is, but I know what's going wrong.

@ax3l
Copy link
Collaborator

ax3l commented Oct 11, 2022

Yes, the enable_if version in that PR is great as a patch for the nvcc issue :)

@rwgk
Copy link
Collaborator

rwgk commented Oct 11, 2022

That's fine and about what I was thinking - I'm not against a 3.11 in the near future, I just want a 3.10.1 too. I don't think we are really thinking that differently.

Cool, thanks! (typo? 2.11, 2.10.1? mainly mentioning in case someone looks here later)

#4201 good another great push today and I hope we can merge it tomorrow or so. (I hope to merge it into smart_holder and deploy internally right after.)

What's the best way to handle the 2.10 branch?

Is see you marked master already as 2.11.0.dev1 (8d82f29). Do we need to do something special before merging #4201?

@rwgk
Copy link
Collaborator

rwgk commented Oct 18, 2022

@henryiii the branch point for 2.10.1 is b926396 (that commit to be included in 2.10.1)

@rwgk
Copy link
Collaborator

rwgk commented Oct 19, 2022

@Henry Centos7 has been failing for a whole week now. I see EOL is still 1 yr 9 mo off (https://cloud.google.com/compute/docs/eol/centos-eol-guidance), but I'm thinking we should drop this ancient platform anyway (release date was 2014-07-07, before pybind11 even existed). The centos builds are regular troublemakers, nibbling away more than their fair share from our time. Could we re-host PGI 22.3 on a less troublesome OS, ideally a modern one?

@henryiii
Copy link
Collaborator Author

CetnOS 7 is not failing, and it's a major and incredibly important platform. It's the base for manylinux2014. As the last LTS release of CentOS 7, it's still common for a while longer elsewhere, too. What's failing is the PGI compilers on CentOS 7. I'm quite fine to move the PGI build over to whatever else they support. We'll still have a (still working) CentOS 7 build. But I rather think that's not the problem, it's the files they are hosting for PGI.

@rwgk
Copy link
Collaborator

rwgk commented Oct 20, 2022

CetnOS 7 is not failing, and it's a major and incredibly important platform.

I see now, we have "centos:7" too. I only looked for "CentOS7" before. I never meant to suggest dropping something that's still working and doesn't limit anything at the moment.

What could be a good host OS to pick for PGI?

@henryiii henryiii pinned this issue Oct 20, 2022
@henryiii
Copy link
Collaborator Author

CentOS 7? Given the issue is Nvidia stopped signing the binary, I don't think it matters what OS it's on. I think we just have to update from 22.3, current is 22.9.

@ax3l
Copy link
Collaborator

ax3l commented Oct 20, 2022

I think that's a good idea. I'll bump it in a PR to 22.7 (which we currently have at NERSC's Perlmutter supercomputer). But I am also generally ok to bump it higher if needed to address compiler bugs. Update, oh you did this already in #4260 👍

@Skylion007
Copy link
Collaborator

One more minor update we could squeeze in: #4269

@rwgk
Copy link
Collaborator

rwgk commented Oct 21, 2022

Link to my concerns about rushing out a 2.10 release: #4271 (comment)

@rwgk
Copy link
Collaborator

rwgk commented Oct 22, 2022

@henryiii this it to acknowledge that I have received your DM in response to my comment:

I don't want to stop you from making the release, but based on what I know at the moment, I'd definitely want to remove my name from README.rst on the v2.10 branch.

Thank you for letting me know your assessment of my wish in such clear terms.

I'm open to changing my mind as I receive new information, but based on what I know at this minute, I still wish to distance myself form the 2.10.1 release. As I wrote before, I believe it distracts us — and many others — from making a clean move going to 2.11, resolving the ABI doubt, and flipping the default for the GIL management implementation.

I'm also open to alternative suggestions for distancing myself from the 2.10.1 release. Simply removing my name from the list of contributors was meant to be a practically effortless way of doing that. I hope that my simple wish can be respected in one way or another.

I will continue to look at PRs, related to the release or not, and comment if I feel I have something helpful to contribute, even though I don't agree with the goal of continuing the 2.10 release series.

@henryiii
Copy link
Collaborator Author

Even if we did skip the 2.10.1 release, I would insist the default for GIL management would not switch until after 2.11. We need at least one release to warn users that supported API is going to disappear, and give them time to react. Having a 2.10.1 gives us a way to a) let users know about the ABI uncertainty, b) let users respond to the upcoming API removal, and c) fix various issues with distribution, packaging, and compilers that are causing issues.

@rwgk
Copy link
Collaborator

rwgk commented Oct 22, 2022

is going to disappear

I'm OK to keep the non-simple code for a year+ as long as it is opt-in and not opt-out (even though it makes the pybind11 code quite a bit more complicated b/o the internals pointer).

Having to add a define is in all likelihood a much smaller effort than having to debug a complex production system running into one of the reported issues.

@henryiii
Copy link
Collaborator Author

I think we can drop it on the first release after the known projects adapt to using their own code. I just require it to be opt-out for one release (either 2.10.1 or 2.11.0, your choice). I have a 10K+ word essay on versioning caping, and a key takeaway is that in the Python ecosystem, following the example of CPython, we provide deprecation periods and do not try to break users immediately.

@rwgk
Copy link
Collaborator

rwgk commented Oct 22, 2022

provide deprecation periods

That makes sense for deprecating features already replaced by new ones, but it makes no sense to apply the same logic to deprecating traps, especially in our situation, when we know of just 2 cases that need to make a trivial adjustment to buy safety for everybody else.

Could you please give me a firm timeline for when the default will be flipped from unsafe to safe according to your plans?

@henryiii
Copy link
Collaborator Author

but it makes no sense to apply the same logic to deprecating traps

Do we have an example of anything it fixes? I setup and tested the issue it was supposed to fix and it didn't fix it. I'd like a concrete example of it fixing something before calling it "safe/unsafe" instead of "simple/custom".

I think deprecation in 2.10.1, switch default in 2.11 is reasonable. I like that it makes the non-PyPy compatible code opt-in, rather than making something you can easily create that then doesn't work on PyPy. I would really like to see a) both known packages either add the flag or the custom workaround, and b) a concrete example of how to do this custom workaround manually that we can place in the upgrade guide.

when we know of just 2 cases

It is not on me to find all possible uses. I can't see private code, I didn't check gitlab or bitbucket or even all of GitHub. There are over 100K uses of PYBIND11_MODULE on GitHub public code alone. Currently we know it might help one user (PyTorch) that has already worked around the issue, and it will break two significant packages. It won't fix currently broken code, it just will enable more flexible code in the future (assuming it does work).

@rwgk
Copy link
Collaborator

rwgk commented Oct 24, 2022

To add to my #4105 (comment) comment here:

My previous resistance to continuing the 2.10 release series was based on the assumption that incrementing the internals version will resolve the ABI incompatibility doubts. I no longer believe that. I think that we're up against a different kind of ABI issue, a bad interaction between #2999 and #1895. I also think that we may be able to find a way to handle or at least mitigate the situation, at a minimum with a clear explanation in the documentation. With that it would make sense to continue the 2.10 series, after we have full, agreed-on clarity.

@Skylion007
Copy link
Collaborator

One other potential blocker. Should be a quick fix once we figure out the right behavior: #4288

@henryiii
Copy link
Collaborator Author

"something we haven't fixed but is present in the last release" is not a "blocker". If that was true, there would never be a release. Blockers are only issues that are not in a previous release. However, it's fine to include a quick fix if it makes it in and looks safe. But I object to it being called a blocker. :)

@Skylion007
Copy link
Collaborator

Skylion007 commented Oct 26, 2022

@henryiii I suppose technically it could be a regression since the exception used to be in the ctor and therefore you could have a try catch around the error through which isn't viable anymore. Also I think part of this might be due to our more aggressive conversion to a std::string.

@henryiii
Copy link
Collaborator Author

In 2.10.0? Or after 2.10.0? It's only a blocker if it happened after 2.10.0. A blocker means the release is worse than the release before. After the release, a missed blocker becomes a regression. There is no requirement that we fix all regressions in the first point release. There can always be a second point release.

Again, happy to get it fixed if it's easy.

@Skylion007
Copy link
Collaborator

Under those terms, it's a regression. Still a pretty nasty one though.

@rwgk
Copy link
Collaborator

rwgk commented Oct 31, 2022

CI for #4297 (fixes #4288) is running, once that's merged let's pull the trigger on 2.10.1?

@henryiii is there a chance that you could help with the cherry picks and changelog update? (I don't think I've ever done that, and I already lost a ton of time on #4216, with the deadlocks still hanging over my head.)

@henryiii
Copy link
Collaborator Author

Yes, I can do that, I should have time tomorrow or Tuesday. I've already done most of the cherry picks, I'll just need the last 1-2 PRs.

@rwgk
Copy link
Collaborator

rwgk commented Oct 31, 2022

Thanks a lot @henryiii!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants