Description
Creating nested functions can be a source of reference count contention in the free-threaded build. Consider the following (contrived) example:
```python
from concurrent.futures import ThreadPoolExecutor

def func(n):
    sum = 0
    for i in range(n):
        def square(x):
            return x ** 2
        sum += square(i)
    return sum

with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(func, 100000) for _ in range(8)]
    for future in futures:
        print(future.result())
```
Creating many `square` functions concurrently causes reference count contention on the `func_code`, `func_globals`, and `func_builtins` fields:
(See `cpython/Include/cpython/funcobject.h`, lines 11 to 16 at 21d2a9a, for the relevant `PyFunctionObject` fields.)
The code object and the builtins and globals dictionaries are already configured to support deferred reference counting, but the references in `PyFunctionObject` are not `_PyStackRef`s -- they are normal "counted" references.
Note that this is an issue for nested functions (closures), but not module-level functions or methods because those are typically created infrequently.
I outline a few possible ways to address this below. My preference is for option 2a.
### Option 1: Use deferred reference counting in `PyFunctionObject`

#### Variant 1a: Use `_PyStackRef` in `PyFunctionObject`
Instead of `PyObject *func_code`, we have `_PyStackRef func_code`. We use this strategy effectively in a number of other places, including the frame object and generators.
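As a rough sketch, only showing the three affected fields and assuming the surrounding header context (this is an illustration, not a proposed patch):

```c
/* Sketch of how the struct in Include/cpython/funcobject.h might change
 * under Variant 1a. Only the three hot fields are shown; the rest of the
 * struct is elided. Note that _PyStackRef is currently an internal-only
 * type, which is part of the compatibility concern discussed below. */
typedef struct {
    PyObject_HEAD
    _PyStackRef func_globals;    /* was: PyObject *func_globals  */
    _PyStackRef func_builtins;   /* was: PyObject *func_builtins */
    _PyStackRef func_code;       /* was: PyObject *func_code     */
    /* ... remaining fields unchanged ... */
} PyFunctionObject;
```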
The downside of this approach is that the fields of `PyFunctionObject` are exposed in public headers (`cpython/funcobject.h`), even though they are not documented. Changing the type of `func_code`, `func_globals`, and `func_builtins` risks breaking backwards compatibility with some C API extensions.
#### Variant 1b: Use `PyObject*` and a new bitfield
Instead of using `_PyStackRef`, we can keep the fields as `PyObject *` and store whether each field uses a deferred reference in a separate field. This was the approach I took in the nogil-3.9 fork.
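For illustration, a rough sketch of the bookkeeping this would involve (the `func_deferred_bits` field, the flag constants, and the helper are hypothetical names, not actual CPython API):

```c
#include <stdint.h>

/* Hypothetical flag bits recording which fields hold deferred
 * (uncounted) references. */
#define FUNC_DEFERRED_CODE     0x1
#define FUNC_DEFERRED_GLOBALS  0x2
#define FUNC_DEFERRED_BUILTINS 0x4

typedef struct {
    PyObject_HEAD
    PyObject *func_globals;
    PyObject *func_builtins;
    PyObject *func_code;
    uint8_t func_deferred_bits;   /* which fields skip incref/decref */
    /* ... remaining fields unchanged ... */
} PyFunctionObject;

/* Deallocation has to consult the bitfield instead of releasing the
 * reference unconditionally. */
static void
func_clear_code(PyFunctionObject *op)
{
    if (!(op->func_deferred_bits & FUNC_DEFERRED_CODE)) {
        Py_CLEAR(op->func_code);
    }
    else {
        op->func_code = NULL;   /* deferred reference: nothing to release */
    }
}
```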
This variant has fewer compatibility issues than using `_PyStackRef`, but there are still compatibility hazards. It would not be safe for extensions to change `func_code`/`func_globals`/`func_builtins` with something like `Py_SETREF(func->func_globals, new_globals)` because the reference counting semantics are different.
### Option 2: Use per-thread reference counting
We already use per-thread reference counting for the references from instances to their types (i.e., `ob_type`), if the type is a heap type. Storing the reference counts per-thread avoids most of the reference count contention on the object. This also avoids the compatibility issues with option 1 because you can use a per-thread incref with a normal `Py_DECREF` -- the only risk is performance, not correctness.
The challenge with this approach is that we need some quick and reliable way to index the per-thread reference count array. For heap types, we added a new field `unique_id` in the free-threaded build. We can do something similar for code objects, but the globals and builtins are just "normal" dictionaries and I don't think we want to add a new field for every dictionary.
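Conceptually, per-thread reference counting looks something like the following (the data structure and function names are made up for illustration; CPython's actual machinery differs):

```c
#include <Python.h>

/* Illustrative only: a per-thread array of reference counts, indexed by
 * an object's unique id. An incref from this thread touches only the
 * thread-local slot, so there is no contention on the shared refcount. */
typedef struct {
    Py_ssize_t *counts;     /* one slot per tracked object */
    Py_ssize_t size;
} thread_local_refcounts;

static void
local_incref(thread_local_refcounts *local, PyObject *op, Py_ssize_t unique_id)
{
    if (0 <= unique_id && unique_id < local->size) {
        local->counts[unique_id]++;   /* uncontended, thread-local */
    }
    else {
        Py_INCREF(op);                /* fall back to the shared count */
    }
}
```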
#### Variant 2a: Allocate space for an identifier when creating globals and builtins
When we create the globals and builtins dictionaries, we allocate space for an extra `Py_ssize_t` unique id at the end, after the `PyDictObject`. The type would still be just `PyDict_Type`, so Python and the rest of the C API would just see a normal dictionary. We can identify these special dictionaries using a bit in `ob_gc_bits` or by stealing another bit from `ma_version_tag`.
If the globals or builtins dictionaries are replaced by user-defined dictionaries, things would still work; they just might have scaling bottlenecks.
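A minimal sketch of the layout and lookup (the struct and helpers are hypothetical; the placeholder predicate stands in for the reserved-bit check described above):

```c
#include <Python.h>

/* Hypothetical layout for Variant 2a: a normal PyDictObject followed by
 * one extra Py_ssize_t holding the unique id. The object's type is still
 * PyDict_Type; only the allocation is larger. */
typedef struct {
    PyDictObject dict;
    Py_ssize_t unique_id;    /* index into the per-thread refcount arrays */
} dict_with_unique_id;

static int
dict_has_trailing_id(PyObject *op)
{
    /* Placeholder: the real check would test a reserved bit in
     * ob_gc_bits or ma_version_tag, as described above. */
    (void)op;
    return 0;
}

static Py_ssize_t
dict_unique_id(PyObject *op)
{
    if (dict_has_trailing_id(op)) {
        return ((dict_with_unique_id *)op)->unique_id;
    }
    return -1;   /* plain user-supplied dict: use normal reference counting */
}
```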
#### Variant 2b: Use a hash table for per-thread references
We can use a hash table to map `PyObject*`s to their per-thread reference counts. This is less efficient than having a unique index into the per-thread reference count array, but it avoids the need for an extra field.
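A toy sketch of the idea (open addressing with linear probing; everything here is illustrative, and resizing/eviction is omitted):

```c
#include <Python.h>
#include <stdint.h>

/* Illustrative per-thread table mapping PyObject* -> local refcount.
 * Slower than indexing an array by unique id, but it needs no extra
 * field on the object being counted. */
typedef struct {
    PyObject *key;
    Py_ssize_t count;
} local_ref_entry;

typedef struct {
    local_ref_entry *entries;   /* capacity is a power of two */
    size_t mask;                /* capacity - 1 */
} local_ref_table;

static Py_ssize_t *
local_ref_slot(local_ref_table *t, PyObject *op)
{
    size_t i = ((uintptr_t)op >> 4) & t->mask;   /* cheap pointer hash */
    while (t->entries[i].key != NULL && t->entries[i].key != op) {
        i = (i + 1) & t->mask;                   /* linear probing */
    }
    t->entries[i].key = op;
    return &t->entries[i].count;   /* caller increments or decrements this */
}
```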