5x speedups without JIT compiler + portability... #148
-
Wow, thanks! Your work was actually (part of) our inspiration. Another speedup project, Pyston, even mentioned your original bpo issue, bpo-14757. I don't know why we don't yet see such dramatic improvements, but it may be because there are already so many other optimizations in place; IIRC you worked on Python 3.3.
-
I've seen the "old" patch set from 2012. AFAIR, that is mostly just inline caching of interpreter instructions, including some nifty caching of LOAD_GLOBAL/LOAD_ATTR. The new paper I linked (which unfortunately was never accepted anywhere) describes some more aggressive optimizations, and those are where the speedups come from.
In combination, both techniques allow for substantial speedups. Of course, this also depends on the speedup potential of the relevant benchmarks. In the linked paper, I also did some analysis to approximate the overhead of interpretation on specific programs and found that the technique eliminates most of this overhead. Doing this analysis a priori on prospective benchmarks would be highly relevant, as it serves as a good guideline for how well the optimizations are performing. If I can be of any help to your efforts, let me know. Also happy to do a VC!
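To make the inline-caching-with-quickening idea concrete, here is a minimal, self-contained C toy. It is not taken from the 2012 patches or from CPython; all opcode and helper names (LOAD_NAME, LOAD_NAME_CACHED, slow_lookup, ns_version) are invented for illustration. The generic instruction looks a value up once, stores it in an inline cache slot inside the instruction, rewrites its own opcode, and guards the cache with a namespace version tag:

```c
/* Toy sketch of inline caching via quickening; not CPython code. */
#include <stdio.h>

enum { LOAD_NAME, LOAD_NAME_CACHED, PRINT, HALT };

/* toy "namespace": name -> value, plus a version tag bumped on writes */
static const char *names[]  = { "x", "y" };
static int         values[] = { 10, 20 };
static unsigned    ns_version = 1;

static int slow_lookup(int name_idx) {
    printf("  (slow lookup of %s)\n", names[name_idx]);
    return values[name_idx];
}

typedef struct {
    int op, arg;
    int cached_value;           /* inline cache: last looked-up value */
    unsigned cached_version;    /* namespace version when the cache was filled */
} instr;

static void run(instr *code) {
    int stack[16], sp = 0;
    for (instr *ip = code; ; ip++) {
        switch (ip->op) {
        case LOAD_NAME:
            stack[sp++] = slow_lookup(ip->arg);
            ip->cached_value   = stack[sp - 1];
            ip->cached_version = ns_version;
            ip->op = LOAD_NAME_CACHED;          /* quicken this instruction */
            break;
        case LOAD_NAME_CACHED:
            if (ip->cached_version == ns_version) {
                stack[sp++] = ip->cached_value; /* hit: no lookup at all */
            } else {
                ip->op = LOAD_NAME;             /* miss: de-optimize and retry */
                ip--;
            }
            break;
        case PRINT: printf("%d\n", stack[--sp]); break;
        case HALT:  return;
        }
    }
}

int main(void) {
    instr code[] = { {LOAD_NAME, 0, 0, 0}, {PRINT, 0, 0, 0}, {HALT, 0, 0, 0} };
    run(code);   /* first pass: slow lookup, instruction gets quickened */
    run(code);   /* second pass: the quickened instruction hits its cache */
    return 0;
}
```

In the real interpreter the cache lives alongside the bytecode and the guard checks things like the dict version or the receiver's type, but the control flow has the same shape.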
-
@sbrunthaler Did you benchmark your work on anything larger than the benchmarks in your paper? I'd be intrigued to see how your interpreter performs on the pyperformance benchmark suite (it doesn't need to be all of them, just those you can get to run). Also, did you maintain support for …?
-
Those were pretty much all the benchmarks available for comparing performance against other techniques. Regarding type stability, one of our grad students at UCI analyzed several programs and found that type stability was closer to 99%, compared to the 95% figure cited from Baden's 1982 paper. That notwithstanding, if type stability is an issue for a program, neither a JIT compiler nor an optimizing interpreter will be tremendously successful. (Maybe the interpreter will be a little bit faster overall, since there is no code generation, the optimization effort has low latency, and there is no code cache to maintain.)
I'll take a look, but I need to dig up a working system, so that will take a while 😉
Nope, I did not. I briefly looked at it yesterday evening and see no problem in supporting this. IIRC there were some hooks in the CPython interpreter that could be executed before or after each interpreter instruction, if this is needed for supporting …
-
I'd be happy to participate in such a chat, and I'm able to run Teams. I'm on CE(S)T and booked for at least the next 12 days, so a date after that would be perfect, if it works for the team.
-
FWIW, bm_django.py is a test of how quickly the implementation converts integers to unicode.
-
Where is bm_django.py? In pyperformance I can only find bm_django_template.py.
-
Ah sorry, it looks like at some point bm_django.py was renamed to bm_django_template.py.
-
Can you send an email to guido@python.org so we can discuss a date and time?
-
Here are the slides from Stefan's presentation.
-
FWIW: I wrote this as an experiment a few years back, and it could be useful here: https://gist.github.com/lpereira/3390dd11c17653d16049b505096a3f93 It allows one to emit direct calls to functions (x64 doesn't have a "call immediate-64" instruction, so you're usually forced to load the address into a register and do an indirect call), so we could generate a direct-threaded compiler from this very easily.
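For contrast with the gist (which emits real machine code), here is what direct-threaded dispatch looks like in portable C using the GCC/Clang labels-as-values extension. This is only an illustrative sketch of the dispatch technique, not the gist's approach, and the toy opcodes are invented: each slot of the "compiled" program holds the address of its handler, so dispatch is a single indirect jump with no central switch.

```c
/* Toy direct-threaded interpreter using computed gotos (GCC/Clang extension). */
#include <stdio.h>

int main(void) {
    /* handler addresses, taken with the "labels as values" extension */
    void *handlers[] = { &&op_push, &&op_add, &&op_print, &&op_halt };
    enum { PUSH, ADD, PRINT, HALT };

    /* "compiled" program: each slot is a handler address or an inline operand */
    void *program[] = {
        handlers[PUSH], (void *)(long)2,
        handlers[PUSH], (void *)(long)3,
        handlers[ADD],
        handlers[PRINT],
        handlers[HALT],
    };

    long stack[16];
    int sp = 0;
    void **ip = program;

    goto **ip++;                       /* start executing */

op_push:
    stack[sp++] = (long)*ip++;         /* operand follows the handler slot */
    goto **ip++;
op_add:
    sp--;
    stack[sp - 1] += stack[sp];
    goto **ip++;
op_print:
    printf("%ld\n", stack[sp - 1]);
    goto **ip++;
op_halt:
    return 0;
}
```

CPython's ceval loop already uses computed gotos where the compiler supports them; the gist goes a step further by emitting actual call instructions into executable memory.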
-
Hi guys,
one of my students just pointed me at your repository, and I wanted to briefly touch base and highlight some of my own prior work/research on optimizing the CPython interpreter. Pretty much ten years ago, I did extensive research on purely interpretative optimizations, mostly through inline caching with quickening. Subsequent research on combining multiple different techniques led to maximum speedups of 5.5x, without requiring a JIT compiler. AFAICT, my research would be ideally suited for stages 1 and 2 of your implementation plan. Due to some interest in this research expressed on Twitter about three months ago, I put the paper online, together with the many rejections it got in academia. If you're interested, please take a look: https://arxiv.org/abs/2109.02958
Based on my experience, it should be possible to obtain much of the proposed speedup by focusing just on the interpreter-based optimizations. A simple JIT is, IMHO, not going to provide much additional speedup beyond such an interpreter. (This is similar to comparing a template JIT with an optimizing interpreter: the template JIT mostly targets instruction dispatch costs, which are usually not the dominant costs in Python.) A nice benefit is that this strategy maintains portability, and it could even be used to provide optimizations across extensions written in C (numpy, etc.).
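To sketch what such a purely interpretative, type-specializing optimization looks like, here is a minimal self-contained C toy. It is an illustration only, not code from the paper or the patches, and all names (site, add_generic, add_int, the TAG_* constants) are invented: a generic handler observes operand types once, installs a specialized handler that keeps only a cheap guard, and de-optimizes on a type miss.

```c
/* Toy sketch of operand-type specialization via quickening at one site. */
#include <stdio.h>

enum { TAG_INT, TAG_OTHER };

typedef struct site site;
typedef long (*handler)(site *, int a_tag, int b_tag, long a, long b);
struct site { handler h; };               /* one "instruction" in the bytecode */

static long add_generic(site *s, int at, int bt, long a, long b);

static long add_int(site *s, int at, int bt, long a, long b) {
    if (at == TAG_INT && bt == TAG_INT)
        return a + b;                     /* fast path: guard + add, nothing else */
    s->h = add_generic;                   /* type miss: de-optimize the site */
    return add_generic(s, at, bt, a, b);
}

static long add_generic(site *s, int at, int bt, long a, long b) {
    printf("  (generic add)\n");
    if (at == TAG_INT && bt == TAG_INT) {
        s->h = add_int;                   /* quicken: specialize on observed types */
        return a + b;
    }
    return -1;                            /* stand-in for the fully general case */
}

int main(void) {
    site s = { add_generic };
    printf("%ld\n", s.h(&s, TAG_INT, TAG_INT, 2, 3));   /* generic path, then quickens */
    printf("%ld\n", s.h(&s, TAG_INT, TAG_INT, 4, 5));   /* specialized path */
    return 0;
}
```

In an actual interpreter the "site" is a bytecode instruction and installing the specialized handler means rewriting the opcode, but the shape of the optimization is the same.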
If I can provide any further explanations beyond the paper, please do let me know!
Other than that: Have a nice day and weekend, respectively & all the best from Munich,
--stefan