-
-
Notifications
You must be signed in to change notification settings - Fork 33.5k
Description
Creating nested functions can be a source of reference count contention in the free-threaded build. Consider the following (contrived) example:
from concurrent.futures import ThreadPoolExecutor
def func(n):
sum = 0
for i in range(n):
def square(x):
return x ** 2
sum += square(i)
with ThreadPoolExecutor(max_workers=8) as executor:
future = executor.submit(func, 100000)
print(future.result())Creating many square functions concurrently causes reference count contention on the func_code, func_globals, and func_builtins fields:
cpython/Include/cpython/funcobject.h
Lines 11 to 16 in 21d2a9a
| #define _Py_COMMON_FIELDS(PREFIX) \ | |
| PyObject *PREFIX ## globals; \ | |
| PyObject *PREFIX ## builtins; \ | |
| PyObject *PREFIX ## name; \ | |
| PyObject *PREFIX ## qualname; \ | |
| PyObject *PREFIX ## code; /* A code object, the __code__ attribute */ \ |
The code object and builtins and globals dictionaries are already configured to support deferred reference counting, but the references in PyFunctionObject are not _PyStackRefs -- they are normal "counted" references.
Note that this is an issue for nested functions (closures), but not module-level functions or methods because those are typically created infrequently.
I outline a few possibly ways to address this below. My preference is for 2a below.
Option 1: Use deferred reference counting in PyFunctionObject
Variant 1a: Use _PyStackRef in PyFunctionObject
Instead of PyObject *func_code we have _PyStackRef func_code. We use this strategy effectively in a number of other places, including the frame object and generators.
The downside of this approach is that the fields of PyFunctionObject are exposed in public headers (cpython/funcobject.h), even though they are not documented. Changing the type of func_code, func_globals, and func_builtins risks breaking backwards compatibility with some C API extensions.
Variant 1b: Use PyObject* and new bitfield
Instead of using _PyStackRef, we can keep the fields as PyObject * and store whether the field uses a deferred reference in a separate field. This was the approach I took by the nogil-3.9 fork.
This has fewer compatibility issues than using _PyStackRef, but there are still compatibility hazards. It would not be safe for extensions to change func_code/func_globals/func_builtins with something like Py_SETREF(func->func_globals, new_globals) because the reference counting semantics are different.
Option 2: Use per-thread reference counting
We already use per-thread reference counting for the references from instances to their types (i.e., ob_type), if the type is a heap type. Storing the reference counts per-thread avoids most of the reference count contention on the object. This also avoids the compatibility issues with option 1 because you can use a per-thread incref with a normal Py_DECREF -- the only risk is performance, not correctness.
The challenge with this approach is that we need some quick and reliable way to index the per-thread reference count array. For heap types, we added a new field unique_id in the free-threaded build. We can do something similar for code objects, but the globals and builtins are just "normal" dictionaries and I don't think we want to add a new field for every dictionary.
Variant 2a: Allocate space for identifier when creating globals and builtins.
When we create globals and builtins dictionaries we allocate space for an extra Py_ssize_t unique id at the end after the PyDictObject. The type would still just PyDict_Type, so Python and the rest of the C API would just see a normal dictionary. We can identify these special dictionaries using a bit in ob_gc_bits or by stealing another bit from ma_version_tag.
If the globals or builtins dictionaries are replaced by user defined dictionaries, than things would still work, they'd just might have scaling bottlenecks.
Variant 2b: Use a hash table for per-thread references
We can use a hash table to map PyObject* to their per-thread reference counts. This is less efficient than having a unique index into the per-thread reference count array, but avoids the need for an extra field.