gh-122881: Reduce asyncio heapq scheduling overhead #122882

Closed
wants to merge 14 commits

Conversation

@bdraco bdraco (Contributor) commented Aug 10, 2024

Wrap the `TimerHandle` in a tuple with the `when` value at the front to avoid having to call `TimerHandle.__lt__` for heapq operations.

Wrap the TimerHandle in tuples with the when value at the front
to avoid having to call `TimerHandle.__lt__` when working with
the `heapq`
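The idea can be sketched as follows. This is a simplified illustration, not the PR's actual code: `FakeTimerHandle` is a hypothetical stand-in for `asyncio.TimerHandle`, and the monotonic counter used to break ties between equal `when` values is an assumption of this sketch.

```python
import heapq
import itertools

# Assumption of this sketch: a counter breaks ties between equal deadlines,
# so tuple comparison never falls through to comparing the handles.
_tie_breaker = itertools.count()

class FakeTimerHandle:
    # Hypothetical minimal handle; the real asyncio.TimerHandle carries more
    # state, and its __lt__ is what gets called when bare handles sit in the heap.
    def __init__(self, when, callback):
        self.when = when
        self.callback = callback

    def __lt__(self, other):
        return self.when < other.when

def schedule(heap, handle):
    # With the float `when` first, heapq compares floats directly;
    # FakeTimerHandle.__lt__ is never invoked.
    heapq.heappush(heap, (handle.when, next(_tie_breaker), handle))

def pop_next(heap):
    _when, _tie, handle = heapq.heappop(heap)
    return handle

heap = []
for when in (3.0, 1.0, 2.0):
    schedule(heap, FakeTimerHandle(when, callback=None))

order = [pop_next(heap).when for _ in range(len(heap))]
# handles come back in deadline order: [1.0, 2.0, 3.0]
```

Comparing two small tuples headed by floats stays entirely in C, whereas comparing two handle objects dispatches into a Python-level `__lt__` on every heap sift step.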
@bdraco bdraco marked this pull request as ready for review August 10, 2024 14:09
…_n6W-.rst

Co-authored-by: Peter Bierma <zintensitydev@gmail.com>
@picnixz picnixz (Contributor) left a comment:

Some nitpicks.

@picnixz picnixz self-requested a review August 12, 2024 09:20
bdraco and others added 2 commits August 12, 2024 07:33
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
@picnixz picnixz (Contributor) left a comment:

If you have numbers, you could also add them to the NEWS entry, saying that you improved performance by a factor of X for instance; otherwise I'm good (still requires a core dev for the final acceptance).

@bdraco bdraco (Contributor, Author) commented Aug 12, 2024

> If you have numbers, you could also add them to the NEWS entry, saying that you improved performance by a factor of X for instance; otherwise I'm good (still requires a core dev for the final acceptance).

It works out to:

  • ~10.2% speed-up in `_run_once` on my production Home Assistant instance, but that workload is already well optimized to reduce scheduling as much as possible, so I'd expect other use cases to see more substantial speed-ups
  • ~9.5% speed-up scheduling and running timer handles

@bdraco bdraco (Contributor, Author) commented Aug 12, 2024

Added the number for the `TimerHandle` improvement. `_run_once` is going to vary greatly based on the use case, since some use cases will schedule far more than others. My production use case previously saw an 18% speed-up in `_run_once`, before aiohttp was optimized to remove a lot of unneeded `TimerHandle` creation.

@bdraco bdraco (Contributor, Author) commented Aug 12, 2024

Thanks!

@itamaro itamaro (Contributor) commented Aug 12, 2024

I have a few questions

  1. could you please split out all the attribute caching optimizations to a separate PR (and benchmark them separately)? it's not clear that they improve things in a measurable way, and it hurts readability.
  2. speaking of benchmarking - the microbenchmarks look good, but is there a measurable impact on the asyncio pyperformance benchmarks?
  3. what is the memory impact? iiuc, this would use strictly more memory (extra tuple and when per heapq item), but I have no idea whether the increase is meaningful or negligible.

@bdraco bdraco (Contributor, Author) commented Aug 12, 2024

> I have a few questions
>
> 1. could you please split out all the attribute caching optimizations to a separate PR (and benchmark them separately)? it's not clear that they improve things in a measurable way, and it hurts readability.

This has already been discussed above. Is this strictly required?

> 2. speaking of benchmarking - the microbenchmarks look good, but is there a measurable impact on the asyncio pyperformance benchmarks?

I'm not sure pyperformance would be a better benchmark here, as I think the real-world use case impact is more important.

> 3. what is the memory impact? iiuc, this would use strictly more memory (extra tuple and when per heapq item), but I have no idea whether the increase is meaningful or negligible.

```python
>>> (1.12, object()).__sizeof__()
40
```

The memory impact should be negligible compared to the cost of `TimerHandle`.

@bdraco bdraco (Contributor, Author) commented Aug 24, 2024

> In recent versions of Python, attribute caching of bound methods usually hurts performance. So yes, let's remove it, unless we can demonstrate meaningful improvement in a benchmark.

@hauntsaninja I pushed the commit I had originally held off on (a5b6647), but then I realized you deleted your comment.

@hauntsaninja (Contributor) commented:

Ah, I deleted my comment because only the `ready_popleft` change was a pessimisation and I didn't really want to get into it :-) That said, I do like having `self` if it's cheap; I find it easier to reason about state. Thanks for the change!

@bdraco bdraco (Contributor, Author) commented Aug 24, 2024

> Ah, I deleted my comment because only the `ready_popleft` change was a pessimisation and I didn't really want to get into it :-) That said, I do like having `self` if it's cheap; I find it easier to reason about state. Thanks for the change!

The `self` changes made a tiny difference compared to avoiding the `__lt__` overhead, so it's not worth getting hung up on anyway.

@hauntsaninja hauntsaninja (Contributor) left a comment:

Thanks again!

@mdboom mdboom (Contributor) commented Aug 29, 2024

I ran this PR against the pyperformance suite on our benchmarking hardware. The results overall are basically "no change", within the noise. The vast majority of the benchmarks there don't use async, however, we do have the async benchmarks broken out separately, and even there the results are kind of inconclusive. The async benchmarks in pyperformance are actually known to have a great deal of inherent variability, so, honestly it's hard to conclude too much from them.

Results on async benchmarks only:

| Platform | Speedup |
| --- | --- |
| Linux aarch64 | 1% slower |
| Linux x86_64 | 0% slower |
| Windows x86_64 | 1% faster |
| Windows x86 | 2% faster |
| macOS arm | 1% slower |

From these results, I'd say there's no obvious win or obvious red flag to merging this. The microbenchmark in #122881 shows a significant improvement, and if the additional code complexity is acceptable to others, this is probably fine to merge.

@bdraco bdraco (Contributor, Author) commented Aug 29, 2024

I took a look at https://github.com/python/pyperformance/tree/main/pyperformance and I don't see any asyncio benchmarks that would generate a significant number of `TimerHandle`s, so I wouldn't expect to see any significant change in those benchmarks.

@bdraco bdraco (Contributor, Author) commented Aug 29, 2024

For context, I've been working on addressing some performance complaints from aiohttp users. What finally pushed me to submit this upstream was how much more time is spent in `aiohttp.web_ws.WebSocketResponse.receive()` and `aiohttp.client.ClientSession.get()` with a timeout.

@1st1 1st1 (Member) left a comment:

In my opinion these sorts of micro-optimizations aren't worth it, as they mostly just harm readability. The effect of this optimization will be barely detectable. If you want performance, use uvloop.

-1.

@willingc willingc (Contributor) commented:

> For context, I've been working on addressing some performance complaints from aiohttp users. What finally pushed me to submit this upstream was how much more time is spent in `aiohttp.web_ws.WebSocketResponse.receive()` and `aiohttp.client.ClientSession.get()` with a timeout.

Can you link to the issues re: aiohttp?

@willingc willingc (Contributor) left a comment:

I would like to see the aiohttp performance issues before I weigh in on the need for this change.

@Dreamsorcerer Dreamsorcerer (Contributor) commented Sep 10, 2024

> I would like to see the aiohttp performance issues before I weigh in on the need for this change.

bdraco was analysing performance issues, but some of the discussion around that is in aio-libs/aiohttp#8608 and the thread: aio-libs/aiohttp#8608 (comment)

I think one concern here is that `.reschedule()` is a public, high-level API: https://docs.python.org/3/library/asyncio-task.html#asyncio.Timeout.reschedule
Each call can trigger a heapq reschedule, so a user who calls it heavily could see a significant performance impact.

@willingc willingc (Contributor) commented:

> I think one concern here is that `.reschedule()` is a public, high-level API: https://docs.python.org/3/library/asyncio-task.html#asyncio.Timeout.reschedule
> Each call can trigger a heapq reschedule, so a user who calls it heavily could see a significant performance impact.

Thanks @Dreamsorcerer for the additional information.

I'm at -0 on this change at present. Personally, I would need to see more evidence of real-world impact to support this optimization. I will remove the "Do Not Merge" label in case another core team member feels strongly about merging this.

Thanks @bdraco for the PR too and the detailed explanations.

@vstinner vstinner (Member) commented Sep 18, 2024

The #122881 (comment) microbenchmark doesn't exercise the modified code: it doesn't use the asyncio event loop :-( It's a benchmark of `heapq.heappush()` and `heapq.heappop()`.

Can you write a benchmark using the asyncio event loop? For example, schedule 1000 callbacks with `call_at()`.
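Such a benchmark might look like this rough sketch (the callback count `N` and the near-immediate staggered deadlines are arbitrary choices for illustration): schedule many callbacks with `loop.call_at()` and time how long the loop takes to pop them all off the timer heap and run them.

```python
import asyncio
import time

N = 1000  # number of timer callbacks to schedule (arbitrary)

async def bench() -> float:
    loop = asyncio.get_running_loop()
    now = loop.time()
    done = asyncio.Event()
    remaining = N

    def callback() -> None:
        nonlocal remaining
        remaining -= 1
        if remaining == 0:
            done.set()

    start = time.perf_counter()
    for i in range(N):
        # Staggered absolute deadlines, all effectively immediate,
        # so every callback goes through the heapq scheduling path.
        loop.call_at(now + i / 1_000_000, callback)
    await done.wait()
    return time.perf_counter() - start

elapsed = asyncio.run(bench())
print(f"scheduled and ran {N} timer callbacks in {elapsed:.4f}s")
```

For stable numbers one would run this under pyperf with repetitions rather than a single wall-clock measurement.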

@bdraco bdraco (Contributor, Author) commented Sep 18, 2024

> The #122881 (comment) microbenchmark doesn't exercise the modified code: it doesn't use the asyncio event loop :-( It's a benchmark of `heapq.heappush()` and `heapq.heappop()`.
>
> Can you write a benchmark using the asyncio event loop? For example, schedule 1000 callbacks with `call_at()`.

#122881 (comment)

Is this what you are looking for?

@itamaro itamaro (Contributor) commented Sep 19, 2024

As a benchmarking data point: I applied a minimally modified version of this PR (ported to Cinder 3.10) to Instagram Server, and there was no measurable perf impact.

@1st1 1st1 (Member) commented Sep 23, 2024

> As a benchmarking data point: I applied a minimally modified version of this PR (ported to Cinder 3.10) to Instagram Server, and there was no measurable perf impact.

Yeah, basically what I expected. I think let's close it, unless the author comes up with a more significant and provable optimization.

@1st1 1st1 closed this Sep 23, 2024
@bdraco bdraco (Contributor, Author) commented Sep 23, 2024

I did link to a benchmark with `call_at` above: #122881 (comment)

If the Instagram server isn't doing a lot of `call_at` calls, I wouldn't expect any performance difference.
