Description
Creating nested functions can be a source of reference count contention in the free-threaded build. Consider the following (contrived) example:
```python
from concurrent.futures import ThreadPoolExecutor

def func(n):
    sum = 0
    for i in range(n):
        def square(x):
            return x ** 2
        sum += square(i)
    return sum

with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(func, 100000) for _ in range(8)]
    for future in futures:
        print(future.result())
```
Creating many `square` functions concurrently causes reference count contention on the `func_code`, `func_globals`, and `func_builtins` fields:
(See `cpython/Include/cpython/funcobject.h`, lines 11 to 16 at 21d2a9a, for the relevant `PyFunctionObject` fields.)
The code object and the builtins and globals dictionaries are already configured to support deferred reference counting, but the references in `PyFunctionObject` are not `_PyStackRef`s -- they are normal "counted" references.
Note that this is an issue for nested functions (closures), but not module-level functions or methods because those are typically created infrequently.
I outline a few possible ways to address this below. My preference is for option 2a.
### Option 1: Use deferred reference counting in `PyFunctionObject`

#### Variant 1a: Use `_PyStackRef` in `PyFunctionObject`
Instead of `PyObject *func_code`, we have `_PyStackRef func_code`. We use this strategy effectively in a number of other places, including the frame object and generators.
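As a rough sketch, only showing the three affected fields and assuming the surrounding header context (this is an illustration, not a proposed patch):

```c
/* Sketch of how the struct in Include/cpython/funcobject.h might change
 * under Variant 1a. Only the three hot fields are shown; the rest of the
 * struct is elided. Note that _PyStackRef is currently an internal-only
 * type, which is part of the compatibility concern discussed below. */
typedef struct {
    PyObject_HEAD
    _PyStackRef func_globals;    /* was: PyObject *func_globals  */
    _PyStackRef func_builtins;   /* was: PyObject *func_builtins */
    _PyStackRef func_code;       /* was: PyObject *func_code     */
    /* ... remaining fields unchanged ... */
} PyFunctionObject;
```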
The downside of this approach is that the fields of `PyFunctionObject` are exposed in public headers (`cpython/funcobject.h`), even though they are not documented. Changing the type of `func_code`, `func_globals`, and `func_builtins` risks breaking backwards compatibility with some C API extensions.
#### Variant 1b: Use `PyObject*` and a new bitfield
Instead of using `_PyStackRef`, we can keep the fields as `PyObject *` and store whether each field uses a deferred reference in a separate field. This was the approach I took in the nogil-3.9 fork.
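For illustration, a rough sketch of the bookkeeping this would involve (the `func_deferred_bits` field, the flag constants, and the helper are hypothetical names, not actual CPython API):

```c
#include <stdint.h>

/* Hypothetical flag bits recording which fields hold deferred
 * (uncounted) references. */
#define FUNC_DEFERRED_CODE     0x1
#define FUNC_DEFERRED_GLOBALS  0x2
#define FUNC_DEFERRED_BUILTINS 0x4

typedef struct {
    PyObject_HEAD
    PyObject *func_globals;
    PyObject *func_builtins;
    PyObject *func_code;
    uint8_t func_deferred_bits;   /* which fields skip incref/decref */
    /* ... remaining fields unchanged ... */
} PyFunctionObject;

/* Deallocation has to consult the bitfield instead of releasing the
 * reference unconditionally. */
static void
func_clear_code(PyFunctionObject *op)
{
    if (!(op->func_deferred_bits & FUNC_DEFERRED_CODE)) {
        Py_CLEAR(op->func_code);
    }
    else {
        op->func_code = NULL;   /* deferred reference: nothing to release */
    }
}
```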
This variant has fewer compatibility issues than using `_PyStackRef`, but there are still compatibility hazards. It would not be safe for extensions to change `func_code`/`func_globals`/`func_builtins` with something like `Py_SETREF(func->func_globals, new_globals)` because the reference counting semantics are different.
### Option 2: Use per-thread reference counting
We already use per-thread reference counting for the references from instances to their types (i.e., `ob_type`), if the type is a heap type. Storing the reference counts per-thread avoids most of the reference count contention on the object. This also avoids the compatibility issues with option 1 because you can use a per-thread incref with a normal `Py_DECREF` -- the only risk is performance, not correctness.
The challenge with this approach is that we need some quick and reliable way to index the per-thread reference count array. For heap types, we added a new field `unique_id` in the free-threaded build. We can do something similar for code objects, but the globals and builtins are just "normal" dictionaries and I don't think we want to add a new field for every dictionary.
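Conceptually, per-thread reference counting looks something like the following (the data structure and function names are made up for illustration; CPython's actual machinery differs):

```c
#include <Python.h>

/* Illustrative only: a per-thread array of reference counts, indexed by
 * an object's unique id. An incref from this thread touches only the
 * thread-local slot, so there is no contention on the shared refcount. */
typedef struct {
    Py_ssize_t *counts;     /* one slot per tracked object */
    Py_ssize_t size;
} thread_local_refcounts;

static void
local_incref(thread_local_refcounts *local, PyObject *op, Py_ssize_t unique_id)
{
    if (0 <= unique_id && unique_id < local->size) {
        local->counts[unique_id]++;   /* uncontended, thread-local */
    }
    else {
        Py_INCREF(op);                /* fall back to the shared count */
    }
}
```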
#### Variant 2a: Allocate space for an identifier when creating globals and builtins
When we create the globals and builtins dictionaries, we allocate space for an extra `Py_ssize_t` unique id at the end, after the `PyDictObject`. The type would still be just `PyDict_Type`, so Python and the rest of the C API would just see a normal dictionary. We can identify these special dictionaries using a bit in `ob_gc_bits` or by stealing another bit from `ma_version_tag`.
If the globals or builtins dictionaries are replaced by user-defined dictionaries, things would still work; they just might have scaling bottlenecks.
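A minimal sketch of the layout and lookup (the struct and helpers are hypothetical; the placeholder predicate stands in for the reserved-bit check described above):

```c
#include <Python.h>

/* Hypothetical layout for Variant 2a: a normal PyDictObject followed by
 * one extra Py_ssize_t holding the unique id. The object's type is still
 * PyDict_Type; only the allocation is larger. */
typedef struct {
    PyDictObject dict;
    Py_ssize_t unique_id;    /* index into the per-thread refcount arrays */
} dict_with_unique_id;

static int
dict_has_trailing_id(PyObject *op)
{
    /* Placeholder: the real check would test a reserved bit in
     * ob_gc_bits or ma_version_tag, as described above. */
    (void)op;
    return 0;
}

static Py_ssize_t
dict_unique_id(PyObject *op)
{
    if (dict_has_trailing_id(op)) {
        return ((dict_with_unique_id *)op)->unique_id;
    }
    return -1;   /* plain user-supplied dict: use normal reference counting */
}
```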
#### Variant 2b: Use a hash table for per-thread references
We can use a hash table to map `PyObject*`s to their per-thread reference counts. This is less efficient than having a unique index into the per-thread reference count array, but it avoids the need for an extra field.
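A toy sketch of the idea (open addressing with linear probing; everything here is illustrative, and resizing/eviction is omitted):

```c
#include <Python.h>
#include <stdint.h>

/* Illustrative per-thread table mapping PyObject* -> local refcount.
 * Slower than indexing an array by unique id, but it needs no extra
 * field on the object being counted. */
typedef struct {
    PyObject *key;
    Py_ssize_t count;
} local_ref_entry;

typedef struct {
    local_ref_entry *entries;   /* capacity is a power of two */
    size_t mask;                /* capacity - 1 */
} local_ref_table;

static Py_ssize_t *
local_ref_slot(local_ref_table *t, PyObject *op)
{
    size_t i = ((uintptr_t)op >> 4) & t->mask;   /* cheap pointer hash */
    while (t->entries[i].key != NULL && t->entries[i].key != op) {
        i = (i + 1) & t->mask;                   /* linear probing */
    }
    t->entries[i].key = op;
    return &t->entries[i].count;   /* caller increments or decrements this */
}
```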