Skip to content

gh-127266: avoid data races when updating type slots v2 #133177

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 24 commits into from
May 28, 2025

Conversation

nascheme
Copy link
Member

@nascheme nascheme commented Apr 29, 2025

This is an updated version of GH-131174, which was reverted. I figured the cleanest thing to do is make a new PR.

This is the same as the previous PR with the following additional change. The update_all_slots() and type_setattro() functions are now more careful when the world is stopped. Instead of doing the MRO lookups while the world is stopped, we do them all first and collect the slot pointers to be updated. Then, we stop the world and do those updates. This makes it much easier to confirm the code running during the stop-the-world is safe and that should avoid the deadlocks.

The test_opcache test has become quite a bit slower. It seems to be due to mutex contention in the __getitem__ and __getattribute__ method assignment tests. I reduced the items count from 1000 to 100 to keep the test from becoming much slower.

In the free-threaded build, avoid data races caused by updating type
slots or type flags after the type was initially created.  For those
(typically rare) cases, use the stop-the-world mechanism.  Remove the
use of atomics when reading or writing type flags.  The use of atomics
is not sufficient to avoid races (since flags are sometimes read without
a lock and without atomics) and are no longer required.
To avoid deadlocks while the world is stopped, we need to avoid calling APIs
like _PyObject_HashFast() and _PyDict_GetItemRef_KnownHash().  Collect the
slot updates to be done and then apply them all at once.  This reduces the
amount of code running in the stop-the-world condition.
@nascheme nascheme added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Apr 30, 2025
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @nascheme for commit d511ca6 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F133177%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Apr 30, 2025
Now that stop-the-world is used to do the slot update, these tests
are a lot slower in the free-threaded build.  Test with fewer items,
which will still hopefully be enough to find bugs in the specializer.
The clearing of Py_TPFLAGS_HAVE_VECTORCALL must be done when the world
is stopped too.
@nascheme nascheme added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Apr 30, 2025
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @nascheme for commit 3cb2256 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F133177%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Apr 30, 2025
@nascheme nascheme marked this pull request as ready for review April 30, 2025 13:29
@nascheme nascheme requested a review from markshannon as a code owner April 30, 2025 13:29
@nascheme nascheme requested a review from colesbury April 30, 2025 13:33
Since we stack allocate one chunk, we need to check 'n' to see if there
are actually any updates to make.  It's pretty common that no updates
are actually needed.
@nascheme
Copy link
Member Author

nascheme commented May 2, 2025

@colesbury Not sure if you would like this approach but it is a variation on your idea to modify the critical section to prevent release of the mutex. I changed it to work with a PyCriticalSection2 as well.

https://github.com/nascheme/cpython/tree/type-slot-ts-release-hack

Not using a critical section for the type dict looks kind of difficult to do. For example, _Py_dict_lookup asserts that the dict is locked. I think we would have to duplicate some dictobject functions to make non-lock-asserting versions.

@colesbury
Copy link
Contributor

The type-slot-ts-release-hack approach makes sense to me

@nascheme
Copy link
Member Author

nascheme commented May 5, 2025

See GH-133467 for some remaining data race issues.

If the two mutex form of the critical section is used, need to put the
other mutex into '_cs_mutex'.
@nascheme nascheme added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label May 5, 2025
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @nascheme for commit 3f6222b 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F133177%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label May 5, 2025
@nascheme nascheme added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label May 27, 2025
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @nascheme for commit c1f3ed5 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F133177%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label May 27, 2025
The apply_slot_updates_world_stopped() name implies that the world
is already stopped, based on convention.  Rename for clarity.
@nascheme nascheme merged commit fbbbc10 into python:main May 28, 2025
45 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-free-threading type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants