5x speedups without JIT compiler + portability... #148
-
Wow, thanks! Your work was actually (part of) our inspiration. Another speedup project, Pyston, even mentioned your original bpo issue, bpo-14757. I don't know why we don't yet see such dramatic improvements, but it may be because there are already so many other optimizations in place; IIRC you worked on Python 3.3.
-
I've seen the "old" patch set from 2012. AFAIR, that is mostly just inline caching of interpreter instructions, including some nifty caching of LOAD_GLOBAL/LOAD_ATTR. The new paper I linked (which unfortunately was never accepted anywhere) describes some more aggressive optimizations, and those are where the speedups come from.
In combination, both techniques allow for substantial speedups. Of course, this also depends on the speedup potential of the relevant benchmarks. In the linked paper, I also did some analysis to approximate the overhead of interpretation on specific programs and found that the technique eliminates most of this overhead. Doing this analysis a priori on prospective benchmarks would be highly relevant, as it serves as a good guideline for how well the optimizations are performing. If I can be of any help to your efforts, let me know. Also happy to do a VC!
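To make the inline-caching-with-quickening idea concrete, here is a minimal, self-contained C toy. It is not taken from the 2012 patches or from CPython; all opcode and helper names (LOAD_NAME, LOAD_NAME_CACHED, slow_lookup, ns_version) are invented for illustration. The generic instruction looks a value up once, stores it in an inline cache slot inside the instruction, rewrites its own opcode, and guards the cache with a namespace version tag:

```c
/* Toy sketch of inline caching via quickening; not CPython code. */
#include <stdio.h>

enum { LOAD_NAME, LOAD_NAME_CACHED, PRINT, HALT };

/* toy "namespace": name -> value, plus a version tag bumped on writes */
static const char *names[]  = { "x", "y" };
static int         values[] = { 10, 20 };
static unsigned    ns_version = 1;

static int slow_lookup(int name_idx) {
    printf("  (slow lookup of %s)\n", names[name_idx]);
    return values[name_idx];
}

typedef struct {
    int op, arg;
    int cached_value;           /* inline cache: last looked-up value */
    unsigned cached_version;    /* namespace version when the cache was filled */
} instr;

static void run(instr *code) {
    int stack[16], sp = 0;
    for (instr *ip = code; ; ip++) {
        switch (ip->op) {
        case LOAD_NAME:
            stack[sp++] = slow_lookup(ip->arg);
            ip->cached_value   = stack[sp - 1];
            ip->cached_version = ns_version;
            ip->op = LOAD_NAME_CACHED;          /* quicken this instruction */
            break;
        case LOAD_NAME_CACHED:
            if (ip->cached_version == ns_version) {
                stack[sp++] = ip->cached_value; /* hit: no lookup at all */
            } else {
                ip->op = LOAD_NAME;             /* miss: de-optimize and retry */
                ip--;
            }
            break;
        case PRINT: printf("%d\n", stack[--sp]); break;
        case HALT:  return;
        }
    }
}

int main(void) {
    instr code[] = { {LOAD_NAME, 0, 0, 0}, {PRINT, 0, 0, 0}, {HALT, 0, 0, 0} };
    run(code);   /* first pass: slow lookup, instruction gets quickened */
    run(code);   /* second pass: the quickened instruction hits its cache */
    return 0;
}
```

In the real interpreter the cache lives alongside the bytecode and the guard checks things like the dict version or the receiver's type, but the control flow has the same shape.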
-
@sbrunthaler Did you benchmark your work on anything larger than the benchmarks in your paper? I'd be intrigued to see how your interpreter performs on the pyperformance benchmark suite (it doesn't need to be all of them, just those you can get to run). Also, did you maintain support for …?
-
Those were pretty much all the benchmarks available for comparing performance against other techniques. Regarding type stability, one of our grad students at UCI analyzed several programs and found that type stability was closer to 99%, compared to the 95% figure cited from Baden's 1982 paper. That notwithstanding, if type stability is an issue for a program, neither a JIT compiler nor an optimizing interpreter will be tremendously successful. (Maybe the interpreter will be a little bit faster overall, since there is no code generation, the optimization effort has low latency, and there is no code cache to maintain.)
I'll take a look, but I need to dig up a working system, so that will take a while 😉
Nope, I did not. I briefly looked at it yesterday evening and see no problem in supporting this. IIRC there were some hooks in the CPython interpreter that could be executed before or after each interpreter instruction, if this is needed for supporting …
-
I'd be happy to participate in such a chat, and I'm able to run Teams. I'm on CE(S)T and booked for at least the next 12 days, so a date after that would be perfect, if it works for the team.
-
FWIW, bm_django.py is a test of how quickly the implementation converts integers to unicode.
-
Where is bm_django.py? In pyperformance I can only find bm_django_template.py.
-
Ah sorry, it looks like at some point bm_django.py was renamed to bm_django_template.py.
-
Can you send an email to guido@python.org so we can discuss a date and time?
-
Here are the slides from Stefan's presentation.
-
FWIW: I wrote this as an experiment a few years back, and it could be useful here: https://gist.github.com/lpereira/3390dd11c17653d16049b505096a3f93 It allows one to emit direct calls to functions (x64 doesn't have a "call immediate-64" instruction, so you're usually forced to load the address into a register and do an indirect call), so we could generate a direct-threaded compiler from this very easily.
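For contrast with the gist (which emits real machine code), here is what direct-threaded dispatch looks like in portable C using the GCC/Clang labels-as-values extension. This is only an illustrative sketch of the dispatch technique, not the gist's approach, and the toy opcodes are invented: each slot of the "compiled" program holds the address of its handler, so dispatch is a single indirect jump with no central switch.

```c
/* Toy direct-threaded interpreter using computed gotos (GCC/Clang extension). */
#include <stdio.h>

int main(void) {
    /* handler addresses, taken with the "labels as values" extension */
    void *handlers[] = { &&op_push, &&op_add, &&op_print, &&op_halt };
    enum { PUSH, ADD, PRINT, HALT };

    /* "compiled" program: each slot is a handler address or an inline operand */
    void *program[] = {
        handlers[PUSH], (void *)(long)2,
        handlers[PUSH], (void *)(long)3,
        handlers[ADD],
        handlers[PRINT],
        handlers[HALT],
    };

    long stack[16];
    int sp = 0;
    void **ip = program;

    goto **ip++;                       /* start executing */

op_push:
    stack[sp++] = (long)*ip++;         /* operand follows the handler slot */
    goto **ip++;
op_add:
    sp--;
    stack[sp - 1] += stack[sp];
    goto **ip++;
op_print:
    printf("%ld\n", stack[sp - 1]);
    goto **ip++;
op_halt:
    return 0;
}
```

CPython's ceval loop already uses computed gotos where the compiler supports them; the gist goes a step further by emitting actual call instructions into executable memory.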
-
Hi guys,
one of my students just pointed me at your repository, and I wanted to briefly touch base and highlight some of my own prior work/research on optimizing the CPython interpreter. Pretty much ten years ago, I did extensive research on purely interpretative optimizations, mostly through inline caching with quickening. Subsequent research on combining multiple different techniques led to maximum speedups of 5.5x, without requiring a JIT compiler. AFAICT, my research would be ideally suited for stages 1 and 2 of your implementation plan. Due to some interest in this research expressed on Twitter about three months ago, I put the paper online, together with the many rejections it got in academia. If you're interested, please take a look: https://arxiv.org/abs/2109.02958
Based on my experience, it should be possible to obtain much of the proposed speedup by focusing just on the interpreter-based optimizations. A simple JIT is, IMHO, not going to provide much additional speedup beyond such an interpreter. (This is similar to comparing a template JIT with an optimizing interpreter: the template JIT mostly targets instruction dispatch costs, which are usually not the dominant costs in Python.) A nice benefit is that this strategy maintains portability, and it could even be used to provide optimizations across extensions written in C (numpy, etc.).
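To sketch what such a purely interpretative, type-specializing optimization looks like, here is a minimal self-contained C toy. It is an illustration only, not code from the paper or the patches, and all names (site, add_generic, add_int, the TAG_* constants) are invented: a generic handler observes operand types once, installs a specialized handler that keeps only a cheap guard, and de-optimizes on a type miss.

```c
/* Toy sketch of operand-type specialization via quickening at one site. */
#include <stdio.h>

enum { TAG_INT, TAG_OTHER };

typedef struct site site;
typedef long (*handler)(site *, int a_tag, int b_tag, long a, long b);
struct site { handler h; };               /* one "instruction" in the bytecode */

static long add_generic(site *s, int at, int bt, long a, long b);

static long add_int(site *s, int at, int bt, long a, long b) {
    if (at == TAG_INT && bt == TAG_INT)
        return a + b;                     /* fast path: guard + add, nothing else */
    s->h = add_generic;                   /* type miss: de-optimize the site */
    return add_generic(s, at, bt, a, b);
}

static long add_generic(site *s, int at, int bt, long a, long b) {
    printf("  (generic add)\n");
    if (at == TAG_INT && bt == TAG_INT) {
        s->h = add_int;                   /* quicken: specialize on observed types */
        return a + b;
    }
    return -1;                            /* stand-in for the fully general case */
}

int main(void) {
    site s = { add_generic };
    printf("%ld\n", s.h(&s, TAG_INT, TAG_INT, 2, 3));   /* generic path, then quickens */
    printf("%ld\n", s.h(&s, TAG_INT, TAG_INT, 4, 5));   /* specialized path */
    return 0;
}
```

In an actual interpreter the "site" is a bytecode instruction and installing the specialized handler means rewriting the opcode, but the shape of the optimization is the same.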
If I can provide any further explanations beyond the paper, please do let me know!
Other than that: Have a nice day and weekend, respectively & all the best from Munich,
--stefan