
gh-125985: Add free threading scaling micro benchmarks #125986

Merged (6 commits) Oct 28, 2024

Conversation

colesbury (Contributor) commented Oct 25, 2024

These consist of a number of short snippets that help identify scaling bottlenecks in the free threaded interpreter.

The current bottlenecks are in benchmarks that call functions (due to `LOAD_ATTR` not yet using deferred reference counting) and in accessing thread-local data.
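The snippets themselves are not shown in this thread, but a minimal sketch of the measurement idea behind such scaling microbenchmarks might look like the following (hypothetical harness; function name `bench` and its parameters are illustrative, not the PR's actual code). It times the same per-thread workload on one thread and on N threads; on a free-threaded build with no contention, the N-thread run should take about as long as the single-thread run:

```python
import threading
import time

def bench(work, nthreads, iters=100_000):
    """Run `work` in 1 thread and in `nthreads` threads; return the
    parallel speedup (perfect scaling would approach `nthreads`)."""
    def run():
        for _ in range(iters):
            work()

    def timed(n):
        threads = [threading.Thread(target=run) for _ in range(n)]
        start = time.perf_counter()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return time.perf_counter() - start

    t1 = timed(1)         # baseline: one thread doing `iters` calls
    tn = timed(nthreads)  # same per-thread work on `nthreads` threads
    # Total work grows n-fold, so perfect scaling keeps tn close to t1.
    return t1 * nthreads / tn

if __name__ == "__main__":
    speedup = bench(lambda: sum(range(10)), nthreads=4)
    print(f"{speedup:.1f}x")
```

A ratio well below `nthreads` (or below 1, reported as "Nx slower" in the results below) indicates a scaling bottleneck such as reference-count contention on a shared object.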

colesbury (Contributor, Author) commented Oct 25, 2024

Some results below:

CPython 3.14t results

```
object_cfunction    1.1x faster
cmodule_function    1.1x slower
mult_constant       9.7x faster
generator           9.5x faster
pymethod            1.1x faster
pyfunction          9.7x faster
module_function     1.2x slower
load_string_const   9.5x faster
load_tuple_const    9.8x faster
create_pyobject     9.5x faster
create_closure      9.9x faster
create_dict         9.8x faster
thread_local_read   2.2x slower
```

CPython 3.13t results

```
object_cfunction    9.8x faster
cmodule_function    9.6x faster
mult_constant       8.8x faster
generator           9.5x faster
pymethod            9.7x faster
pyfunction          9.7x faster
module_function    10.0x faster
load_string_const   1.8x slower
load_tuple_const    9.5x faster
create_pyobject     9.5x faster
create_closure      9.8x faster
create_dict         8.8x faster
thread_local_read   2.0x slower
```

nogil fork (3.9) results

```
object_cfunction   10.4x faster
cmodule_function    9.2x faster
mult_constant      10.0x faster
generator           9.0x faster
pymethod            9.5x faster
pyfunction         10.0x faster
module_function    10.0x faster
load_string_const   9.8x faster
load_tuple_const    9.8x faster
create_pyobject     9.6x faster
create_closure      9.2x faster
create_dict         9.6x faster
thread_local_read   9.0x faster
```

As mentioned in the PR description, we have known scaling issues related to `LOAD_ATTR` not using deferred reference counting yet. We also have a scaling issue when reading thread-local data -- we should probably enable deferred reference counting on `_thread._local` objects.

The 3.13 release avoids the `LOAD_ATTR` scaling issues due to immortalization. However, we apparently have a bug related to string immortalization (`load_string_const` is slow) and the thread-local bottleneck is also present.

Note that small variations (e.g. 8.8x vs. 10.4x) are not meaningful.
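As an illustration of the thread-local bottleneck discussed above, a snippet in the spirit of the `thread_local_read` benchmark might look like this (a hypothetical reconstruction, not the PR's actual code). Each read of `local.x` touches the shared `threading.local` object, and without deferred reference counting its reference-count updates contend across threads:

```python
import threading

# Shared threading.local instance; attribute values are per-thread,
# but the local object itself is shared, so non-deferred refcount
# updates on it are a cross-thread contention point.
local = threading.local()
local.x = 1

def thread_local_read(iters=1_000):
    """Repeatedly read an attribute of a threading.local object."""
    total = 0
    for _ in range(iters):
        total += local.x
    return total
```

Running this loop concurrently on a free-threaded build is what produces the "thread_local_read ... slower" rows in the tables above.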

@colesbury colesbury marked this pull request as ready for review October 25, 2024 17:20
@colesbury colesbury requested review from mpage and Yhg1s October 25, 2024 17:20
@colesbury colesbury requested review from mpage and tomasr8 October 28, 2024 16:16
@colesbury colesbury merged commit 00ea179 into python:main Oct 28, 2024
34 checks passed
@colesbury colesbury deleted the gh-125985-ftscalingbench branch October 28, 2024 21:47
picnixz pushed a commit to picnixz/cpython that referenced this pull request Dec 8, 2024
ebonnal pushed a commit to ebonnal/cpython that referenced this pull request Jan 12, 2025