bpo-26280: Port BINARY_SUBSCR to PEP 659 adaptive interpreter #27043
Conversation
I'll try to get some performance numbers tomorrow.
I have some microbenchmark numbers (similar to @serhiy-storchaka's from #9853). This is on a Mac with --enable-optimizations.

LIST:
TUPLE:
DICT:
STRING (NOT OPTIMIZED):
TUPLE WITH NEGATIVE INDEX (NOT OPTIMIZED):

In summary:
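For context, a minimal sketch of the kind of subscript microbenchmark being compared here might look like the following. This is an assumption about the setup: it uses pyperf's timeit runner, and the container sizes, index values, and benchmark names are illustrative, not the ones actually measured.

# Sketch of per-type subscript microbenchmarks (assumed pyperf timeit setup).
import pyperf

runner = pyperf.Runner()
runner.timeit("LIST subscript", stmt="x[3]", setup="x = list(range(10))")
runner.timeit("TUPLE subscript", stmt="x[3]", setup="x = tuple(range(10))")
runner.timeit("DICT subscript", stmt="x[3]", setup="x = {i: i for i in range(10)}")
runner.timeit("STRING subscript (not optimized)", stmt="x[3]", setup="x = 'abcdefghij'")
runner.timeit("TUPLE negative index (not optimized)", stmt="x[-3]", setup="x = tuple(range(10))")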
I'm not really an expert in this area, so I apologise in advance if some of my review comments aren't valid.
pyperformance results:

baseline.json: Performance version: 1.0.2
opt.json: Performance version: 1.0.2

2to3: Mean +- std dev: 361 ms +- 6 ms -> 347 ms +- 4 ms: 1.04x faster
chameleon: Mean +- std dev: 10.7 ms +- 0.1 ms -> 10.4 ms +- 0.2 ms: 1.03x faster
chaos: Mean +- std dev: 124 ms +- 1 ms -> 119 ms +- 1 ms: 1.04x faster
crypto_pyaes: Mean +- std dev: 137 ms +- 4 ms -> 130 ms +- 2 ms: 1.05x faster
deltablue: Mean +- std dev: 8.68 ms +- 0.16 ms -> 8.51 ms +- 0.21 ms: 1.02x faster
django_template: Mean +- std dev: 59.3 ms +- 3.3 ms -> 57.6 ms +- 2.8 ms: 1.03x faster
dulwich_log: Mean +- std dev: 105 ms +- 2 ms -> 104 ms +- 4 ms: 1.01x faster
fannkuch: Mean +- std dev: 561 ms +- 10 ms -> 537 ms +- 6 ms: 1.04x faster
float: Mean +- std dev: 127 ms +- 3 ms -> 125 ms +- 3 ms: 1.01x faster
go: Mean +- std dev: 252 ms +- 4 ms -> 237 ms +- 4 ms: 1.06x faster
hexiom: Mean +- std dev: 11.9 ms +- 0.1 ms -> 11.3 ms +- 0.3 ms: 1.06x faster
json_dumps: Mean +- std dev: 15.1 ms +- 0.2 ms -> 15.0 ms +- 0.3 ms: 1.01x faster
json_loads: Mean +- std dev: 29.3 us +- 0.3 us -> 28.8 us +- 0.4 us: 1.02x faster
logging_format: Mean +- std dev: 10.7 us +- 0.4 us -> 10.3 us +- 0.2 us: 1.04x faster
logging_silent: Mean +- std dev: 231 ns +- 7 ns -> 221 ns +- 7 ns: 1.04x faster
logging_simple: Mean +- std dev: 9.78 us +- 0.30 us -> 9.62 us +- 0.22 us: 1.02x faster
mako: Mean +- std dev: 17.1 ms +- 0.5 ms -> 16.5 ms +- 0.2 ms: 1.04x faster
meteor_contest: Mean +- std dev: 114 ms +- 2 ms -> 111 ms +- 3 ms: 1.02x faster
nbody: Mean +- std dev: 171 ms +- 5 ms -> 158 ms +- 4 ms: 1.08x faster
nqueens: Mean +- std dev: 124 ms +- 3 ms -> 115 ms +- 2 ms: 1.08x faster
pathlib: Mean +- std dev: 46.6 ms +- 1.3 ms -> 45.3 ms +- 0.9 ms: 1.03x faster
pickle: Mean +- std dev: 11.3 us +- 0.4 us -> 10.8 us +- 0.1 us: 1.05x faster
pickle_dict: Mean +- std dev: 26.3 us +- 0.2 us -> 23.9 us +- 0.6 us: 1.10x faster
pickle_list: Mean +- std dev: 4.14 us +- 0.12 us -> 3.88 us +- 0.05 us: 1.07x faster
pickle_pure_python: Mean +- std dev: 538 us +- 10 us -> 518 us +- 8 us: 1.04x faster
pidigits: Mean +- std dev: 190 ms +- 3 ms -> 196 ms +- 1 ms: 1.03x slower
pyflate: Mean +- std dev: 789 ms +- 16 ms -> 756 ms +- 10 ms: 1.04x faster
python_startup: Mean +- std dev: 13.2 ms +- 0.3 ms -> 13.4 ms +- 0.5 ms: 1.02x slower
python_startup_no_site: Mean +- std dev: 9.73 ms +- 0.32 ms -> 9.58 ms +- 0.27 ms: 1.02x faster
raytrace: Mean +- std dev: 611 ms +- 9 ms -> 581 ms +- 11 ms: 1.05x faster
regex_dna: Mean +- std dev: 183 ms +- 4 ms -> 194 ms +- 5 ms: 1.06x slower
regex_effbot: Mean +- std dev: 3.36 ms +- 0.05 ms -> 3.46 ms +- 0.04 ms: 1.03x slower
regex_v8: Mean +- std dev: 25.8 ms +- 1.0 ms -> 25.9 ms +- 1.0 ms: 1.00x slower
richards: Mean +- std dev: 95.3 ms +- 1.5 ms -> 93.7 ms +- 1.4 ms: 1.02x faster
scimark_fft: Mean +- std dev: 467 ms +- 5 ms -> 465 ms +- 6 ms: 1.01x faster
scimark_lu: Mean +- std dev: 214 ms +- 4 ms -> 209 ms +- 6 ms: 1.02x faster
scimark_monte_carlo: Mean +- std dev: 112 ms +- 2 ms -> 111 ms +- 3 ms: 1.01x faster
scimark_sor: Mean +- std dev: 229 ms +- 3 ms -> 226 ms +- 7 ms: 1.02x faster
scimark_sparse_mat_mult: Mean +- std dev: 6.30 ms +- 0.26 ms -> 6.54 ms +- 0.19 ms: 1.04x slower
spectral_norm: Mean +- std dev: 192 ms +- 2 ms -> 189 ms +- 2 ms: 1.02x faster
sqlalchemy_declarative: Mean +- std dev: 161 ms +- 6 ms -> 159 ms +- 4 ms: 1.01x faster
sqlalchemy_imperative: Mean +- std dev: 24.2 ms +- 0.5 ms -> 24.2 ms +- 0.7 ms: 1.00x slower
sqlite_synth: Mean +- std dev: 3.06 us +- 0.06 us -> 2.97 us +- 0.06 us: 1.03x faster
sympy_expand: Mean +- std dev: 639 ms +- 10 ms -> 630 ms +- 9 ms: 1.02x faster
sympy_integrate: Mean +- std dev: 27.0 ms +- 0.5 ms -> 26.2 ms +- 0.5 ms: 1.03x faster
sympy_str: Mean +- std dev: 382 ms +- 4 ms -> 375 ms +- 5 ms: 1.02x faster
sympy_sum: Mean +- std dev: 211 ms +- 6 ms -> 206 ms +- 2 ms: 1.02x faster
telco: Mean +- std dev: 7.21 ms +- 0.13 ms -> 7.09 ms +- 0.09 ms: 1.02x faster
tornado_http: Mean +- std dev: 174 ms +- 13 ms -> 180 ms +- 48 ms: 1.03x slower
unpack_sequence: Mean +- std dev: 59.7 ns +- 0.6 ns -> 55.9 ns +- 1.2 ns: 1.07x faster
unpickle: Mean +- std dev: 16.1 us +- 0.3 us -> 16.1 us +- 0.3 us: 1.00x faster
unpickle_list: Mean +- std dev: 5.02 us +- 0.14 us -> 5.06 us +- 0.16 us: 1.01x slower
unpickle_pure_python: Mean +- std dev: 382 us +- 6 us -> 364 us +- 4 us: 1.05x faster
xml_etree_generate: Mean +- std dev: 108 ms +- 2 ms -> 106 ms +- 2 ms: 1.02x faster
xml_etree_iterparse: Mean +- std dev: 116 ms +- 3 ms -> 117 ms +- 3 ms: 1.00x slower
xml_etree_parse: Mean +- std dev: 162 ms +- 5 ms -> 164 ms +- 5 ms: 1.01x slower
xml_etree_process: Mean +- std dev: 88.4 ms +- 2.2 ms -> 85.2 ms +- 0.9 ms: 1.04x faster

Skipped 1 benchmark only in baseline.json: regex_compile
Doesn't pyperf[ormance] have a flag to sort the benchmarks from "most faster" to "most slower" and to skip the "not significant" ones?
Yeah, it's -G. Edit: here are the docs: https://pyperf.readthedocs.io/en/latest/cli.html#pyperf-compare-to
@iritkatriel hmm, I just noticed something strange. Some of the speedups make sense (like nqueens, which has lots of ...
./python.exe -m pyperf compare_to baseline.json ../cpython/opt.json -G
Faster (43):
Benchmark hidden because not significant (9): dulwich_log, regex_v8, sqlalchemy_declarative, sqlalchemy_imperative, tornado_http, unpickle, unpickle_list, xml_etree_parse, xml_etree_iterparse

Geometric mean: 1.02x faster
I ran the benchmarks on the "faster cpython" team's benchmark machine (which sits in a secret location :-)) and got the following results, comparing this PR with the point where it branched off main (two commits behind). This primarily goes to show that there's a lot of noise in the data...
Slower (10):
Faster (12):
Benchmark hidden because not significant (36): 2to3, chameleon, deltablue, django_template, dulwich_log, float, json_dumps, json_loads, logging_format, logging_silent, logging_simple, mako, meteor_contest, pickle, pickle_dict, pickle_list, pyflate, python_startup, python_startup_no_site, raytrace, regex_v8, richards, scimark_lu, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, tornado_http, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process

Geometric mean: 1.00x faster
PS. I did not use --pgo when building. That might affect things.
I repeated it and got this. I configured with --enable-optimizations (Mac).
Huh, I guess clang (Mac) is better than gcc (Linux), or more stable or something. I had not used pgo/lto, and when I turned that on I got similar results, only more pronounced, but still with a geometric mean of 1.00, so I won't repeat the numbers here. (It wasn't always the same benchmarks that were slower/faster, although the top 5 or so on either list were roughly the same.)

I don't really understand what's going on with these benchmarks, maybe @vstinner has some insights? Are they just very noisy? I have a feeling that if we pay them too much attention we'll never make progress. Perhaps we should focus more on micro-benchmarks, and use PyPerformance to prove that we're not making things worse. Although I'd like to understand why unpack_sequence is 1.17x slower for me with pgo+lto.

(BTW, pgo+lto really does make a big difference. Comparing this PR without and with lto+pgo gives a geometric mean of 1.21x faster, with the extreme being richards at 1.46x. Puts our efforts in perspective. :-))
Huh, lto+pgo makes unpack_sequence 1.10x slower on your branch, and on main it makes it only 1.04x faster. So there's something weird about it. After all, the benchmark is just 500 lines of this: a, b, c, d, e, f, g, h, i, j = to_unpack
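For readers unfamiliar with that benchmark, a rough sketch of its shape is below; the repetition count and the use of the stdlib timeit module are assumptions for illustration, not the actual pyperformance harness.

# Approximation of the unpack_sequence workload: a long run of straight-line
# bytecode that does little besides UNPACK_SEQUENCE and stores.
import timeit

body = "a, b, c, d, e, f, g, h, i, j = to_unpack\n" * 400
print(timeit.timeit(body, setup="to_unpack = tuple(range(10))", number=1000))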
Ah, benchmarking Python! I suggest only measuring Python performance when it's built with PGO+LTO. On Linux, pyperf is more reliable with CPU isolation (see the pyperf system tune command). Usually I ignore differences smaller than 5%; I treat them as noise in the benchmark itself rather than in the change. To understand why, try building Python 5 times and measuring performance each time: in my experience, each PGO build performs a little differently because the PROFILE_TASK in the Makefile is not fully deterministic. In general, the Python build is not deterministic.
I added this feature recently in pyperf, so I don't have much experience with it yet. But I'm not impressed by "Geometric mean: 1.00x faster": the change doesn't seem to make Python faster or slower. Note that the geometric mean doesn't put weights on benchmarks; each benchmark is treated equally.
I'm also trying not to pay too much attention to speedups, but mostly to check whether the unavoidable slowdowns are acceptable. I have never seen a pyperformance result with no "slower" benchmarks; it's always a balance. That's why I added the geometric mean. More about the individual pyperformance benchmarks: https://pyperformance.readthedocs.io/benchmarks.html#available-benchmarks
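To make the "no weights" point concrete, here is a tiny sketch of how such a summary figure can be computed, assuming it is simply the unweighted geometric mean of the per-benchmark speedup ratios; the ratios below are made up.

import math

# Hypothetical per-benchmark ratios of old_time / new_time (>1.0 means faster).
ratios = [1.04, 1.08, 0.97, 1.00, 1.02]

geomean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
print(f"Geometric mean: {geomean:.2f}x faster")  # every benchmark counts equally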
This benchmark is a micro-benchmark. It was never reliable. I suggest removing it.
See unpack_sequence at speed.python.org: results over just one month. The differences are quite significant; timings range between 115 ms and 140 ms. IMO it's more a benchmark of the CPU's L1 instruction cache: it depends on how the compiler decides to organize the Python machine code using PGO, and thus a lot on code placement. See https://vstinner.github.io/analysis-python-performance-issue.html for an example of analyzing a code-placement issue.
By the way, if the "python" executable is a small binary (ex: 20 KB) which only calls Py_BytesMain(), which lives in the libpython dynamic library, function calls from libpython into libpython (Python calling its own C functions) can be slower because of semantic interposition (the ability to override symbols at runtime using LD_PRELOAD or whatever else). Using … See my analysis of … How Python is built makes a huge difference in performance. Note: clang uses …
The key is also running the benchmarks with CPU isolation; otherwise the noise is going to be in the range of 2-6% or even more, as the OS can schedule any process on the CPU you are using for benchmarking, or even schedule system interrupts and other RC there.
I repeated these benchmarks on my machine with PGO/LTO and CPU isolation, and I get very similar results to these. I will try to post them later.
I get the feeling, but macro benchmarks are the only way to really claim general speedups. These are also the benchmarks we show on speed.python.org, and the ones we have been using when doing similar work before to decide whether it was worth it or not. It is also the metric every other implementation uses to claim that it is "X times faster than CPython".
There are useful numbers that can be gathered that should be deterministic. What fraction of total executed instructions is …
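One crude way to approximate that kind of number without an instrumented interpreter is a static opcode count over compiled bytecode. This is only a sketch under that assumption: it counts occurrences in the code objects, not dynamically executed instructions, and the target module name is hypothetical.

import dis
from collections import Counter

def opcode_counts(code):
    # Walk the code object and any nested code objects, counting opcode names.
    counts = Counter()
    todo = [code]
    while todo:
        co = todo.pop()
        for instr in dis.get_instructions(co):
            counts[instr.opname] += 1
        todo.extend(c for c in co.co_consts if hasattr(c, "co_code"))
    return counts

source = open("target_module.py").read()  # hypothetical module to inspect
counts = opcode_counts(compile(source, "target_module.py", "exec"))
total = sum(counts.values())
print(f"BINARY_SUBSCR: {counts['BINARY_SUBSCR']} of {total} ({counts['BINARY_SUBSCR'] / total:.1%})")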
Looks good. Just needs a few minor changes.
The Windows compiler doesn't seem to like the reference to …
I fixed it locally with #include "pycore_long.h", but I can do this instead.
If you've already fixed it, then that's fine. No need to change it again.
I have made the requested changes; please review again. |
Thanks for making the requested changes! @markshannon: please review the changes made to this pull request. |
./python.exe -m pyperf compare_to -G baseline.json opt.json
Faster (32):
Benchmark hidden because not significant (14): 2to3, deltablue, dulwich_log, json_dumps, mako, pickle_list, regex_compile, sqlalchemy_declarative, sqlalchemy_imperative, tornado_http, unpickle_list, xml_etree_parse, xml_etree_iterparse, xml_etree_generate

Geometric mean: 1.02x faster
Micro Benchmarks:
@@ -310,17 +310,18 @@ too_many_cache_misses(_PyAdaptiveEntry *entry) {
     return entry->counter == saturating_zero();
 }
 
-#define BACKOFF 64
+#define ADAPTIVE_CACHE_BACKOFF 64
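As a reader aid, here is a small Python model of what a counter paired with this kind of backoff constant does in PEP 659-style specialization. It is only an illustration of the idea, not the C implementation in this PR, and the class and method names are invented.

ADAPTIVE_CACHE_BACKOFF = 64

class AdaptiveEntry:
    """Toy model: count specialization cache misses and signal de-optimization."""

    def __init__(self):
        self.counter = ADAPTIVE_CACHE_BACKOFF

    def record_miss(self):
        # Saturating decrement: the counter never goes below zero.
        if self.counter > 0:
            self.counter -= 1
        # Once it reaches zero, too many misses have accumulated and the
        # specialized instruction should fall back to the adaptive form.
        return self.counter == 0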
👍
Two very minor tweaks, otherwise ready to merge 🙂
Excellent. |
bpo-26280: Port BINARY_SUBSCR to PEP 659 adaptive interpreter (pythonGH-27043)
https://bugs.python.org/issue26280