Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-43754: Eliminate bindings for partial pattern matches #25229

Merged
merged 57 commits into from
May 2, 2021

Conversation

brandtbucher
Copy link
Member

@brandtbucher brandtbucher commented Apr 6, 2021

This draft PR contains a couple of pattern matching improvements I'm experimenting with. Most significantly, it implements behavior that several people requested during the PEP 634 feedback process: no more name bindings for partial matches.

(One caveat: all names are bound before evaluating the corresponding guard. If the guard fails, these bindings won't be rolled back before matching the next case.)

In short, this is accomplished by storing captured objects on the stack underneath the more immediately important stuff, popping them all on failure or storing them all only once the entire pattern for that case matches. The control flow needs to be greatly simplified in order to make this work correctly, so the patch replaces the current "push bools and jump if false a bunch of times" flow with something much more efficient: jumping straight into a run of POP_TOPs on failure. We already do something very similar to this in pattern_helper_sequence_unpack (the comments in the current code explain it well).

Parts of the new code are a bit more complex than before (especially or-patterns, which need to do a bit of juggling on the stack to reorder captured items and avoid popping important data on failure), and other parts can certainly still be improved... but I personally consider it a win to have more intuitive partial match behavior with no performance penalty.

https://bugs.python.org/issue43754

@brandtbucher brandtbucher requested a review from gvanrossum April 30, 2021 04:39
@brandtbucher
Copy link
Member Author

@pablogsal, @markshannon, @gvanrossum: how do you feel about this change?

I plan on cleaning this up more in the coming weeks, but I'd really like to get this initial rewrite into 3.10, if possible. Two main reasons:

  • We can avoid changing the behavior of partial matches between 3.10 and 3.11 (we're technically free to do this, but I'd really like to avoid it if possible).
  • We can keep the 3.10 and 3.11 pattern matching implementations similar, so it's easier to backport the inevitible bug fixes we'll have in early releases.

I know it's a big ask this late in the release cycle (I needed to wait for the AST redesign to finish), but I don't want to merge this without at least one approving review. I'd really appreciate any comments, or even just your thoughts on whether this is a good idea or not.

@brandtbucher
Copy link
Member Author

Also... since this replaces the ROT_* instructions with a new ROTATE instruction, I went ahead and ran the pyperformance suite about a week ago. The net impact looks like a wash (geometric mean of exactly 1, no benchmarks moved more than 6%):

$ python -m pyperf compare_to ~/pyperformance/json/*.json.gz --table --group-by-speed
+-------------------------+--------------------------------------+-------------------------------------------+
| Benchmark               | 2021-04-23_18-03-master-e047239eafef | 2021-04-24_23-41-patma-names-8e25f2679fb1 |
+=========================+======================================+===========================================+
| json_loads              | 50.8 us                              | 48.6 us: 1.05x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| pickle_list             | 6.98 us                              | 6.71 us: 1.04x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| sqlite_synth            | 5.24 us                              | 5.06 us: 1.04x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| logging_format          | 18.4 us                              | 17.9 us: 1.03x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| pathlib                 | 35.7 ms                              | 34.8 ms: 1.03x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| deltablue               | 13.2 ms                              | 12.9 ms: 1.02x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| django_template         | 88.1 ms                              | 86.2 ms: 1.02x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| pickle_dict             | 45.0 us                              | 44.1 us: 1.02x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| pickle                  | 16.1 us                              | 15.9 us: 1.02x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| sqlalchemy_imperative   | 57.4 ms                              | 56.6 ms: 1.01x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| json_dumps              | 22.7 ms                              | 22.4 ms: 1.01x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| tornado_http            | 272 ms                               | 269 ms: 1.01x faster                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| unpickle                | 23.1 us                              | 22.9 us: 1.01x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| python_startup_no_site  | 10.1 ms                              | 10.1 ms: 1.00x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| raytrace                | 845 ms                               | 850 ms: 1.01x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| float                   | 181 ms                               | 182 ms: 1.01x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| xml_etree_generate      | 163 ms                               | 165 ms: 1.01x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| sympy_sum               | 410 ms                               | 415 ms: 1.01x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| scimark_monte_carlo     | 183 ms                               | 186 ms: 1.01x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| unpack_sequence         | 97.0 ns                              | 98.4 ns: 1.02x slower                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| unpickle_pure_python    | 521 us                               | 529 us: 1.02x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| chameleon               | 16.0 ms                              | 16.2 ms: 1.02x slower                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| go                      | 397 ms                               | 404 ms: 1.02x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| scimark_sparse_mat_mult | 8.45 ms                              | 8.62 ms: 1.02x slower                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| mako                    | 26.2 ms                              | 26.7 ms: 1.02x slower                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| pickle_pure_python      | 814 us                               | 830 us: 1.02x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| chaos                   | 186 ms                               | 190 ms: 1.02x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| crypto_pyaes            | 205 ms                               | 210 ms: 1.02x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| telco                   | 11.4 ms                              | 11.7 ms: 1.03x slower                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| hexiom                  | 16.6 ms                              | 17.0 ms: 1.03x slower                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| richards                | 129 ms                               | 132 ms: 1.03x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| sympy_integrate         | 46.1 ms                              | 47.4 ms: 1.03x slower                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| regex_effbot            | 5.18 ms                              | 5.38 ms: 1.04x slower                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| pidigits                | 326 ms                               | 340 ms: 1.04x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| nbody                   | 239 ms                               | 251 ms: 1.05x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| scimark_sor             | 339 ms                               | 358 ms: 1.06x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| Geometric mean          | (ref)                                | 1.00x slower                              |
+-------------------------+--------------------------------------+-------------------------------------------+

Benchmark hidden because not significant (24): logging_silent, regex_v8, unpickle_list, dulwich_log, pyflate, scimark_fft, sympy_expand, xml_etree_parse, logging_simple, 2to3, spectral_norm, regex_dna, scimark_lu, python_startup, nqueens, genshi_xml, sympy_str, regex_compile, meteor_contest, sqlalchemy_declarative, xml_etree_process, fannkuch, genshi_text, xml_etree_iterparse

@pablogsal
Copy link
Member

Also... since this replaces the ROT_* instructions with a new ROTATE instruction, I went ahead and ran the pyperformance suite about a week ago. The net impact looks like a wash (geometric mean of exactly 1, no benchmarks moved more than 6%):

$ python -m pyperf compare_to ~/pyperformance/json/*.json.gz --table --group-by-speed
+-------------------------+--------------------------------------+-------------------------------------------+
| Benchmark               | 2021-04-23_18-03-master-e047239eafef | 2021-04-24_23-41-patma-names-8e25f2679fb1 |
+=========================+======================================+===========================================+
| json_loads              | 50.8 us                              | 48.6 us: 1.05x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| pickle_list             | 6.98 us                              | 6.71 us: 1.04x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| sqlite_synth            | 5.24 us                              | 5.06 us: 1.04x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| logging_format          | 18.4 us                              | 17.9 us: 1.03x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| pathlib                 | 35.7 ms                              | 34.8 ms: 1.03x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| deltablue               | 13.2 ms                              | 12.9 ms: 1.02x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| django_template         | 88.1 ms                              | 86.2 ms: 1.02x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| pickle_dict             | 45.0 us                              | 44.1 us: 1.02x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| pickle                  | 16.1 us                              | 15.9 us: 1.02x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| sqlalchemy_imperative   | 57.4 ms                              | 56.6 ms: 1.01x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| json_dumps              | 22.7 ms                              | 22.4 ms: 1.01x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| tornado_http            | 272 ms                               | 269 ms: 1.01x faster                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| unpickle                | 23.1 us                              | 22.9 us: 1.01x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| python_startup_no_site  | 10.1 ms                              | 10.1 ms: 1.00x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| raytrace                | 845 ms                               | 850 ms: 1.01x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| float                   | 181 ms                               | 182 ms: 1.01x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| xml_etree_generate      | 163 ms                               | 165 ms: 1.01x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| sympy_sum               | 410 ms                               | 415 ms: 1.01x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| scimark_monte_carlo     | 183 ms                               | 186 ms: 1.01x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| unpack_sequence         | 97.0 ns                              | 98.4 ns: 1.02x slower                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| unpickle_pure_python    | 521 us                               | 529 us: 1.02x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| chameleon               | 16.0 ms                              | 16.2 ms: 1.02x slower                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| go                      | 397 ms                               | 404 ms: 1.02x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| scimark_sparse_mat_mult | 8.45 ms                              | 8.62 ms: 1.02x slower                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| mako                    | 26.2 ms                              | 26.7 ms: 1.02x slower                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| pickle_pure_python      | 814 us                               | 830 us: 1.02x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| chaos                   | 186 ms                               | 190 ms: 1.02x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| crypto_pyaes            | 205 ms                               | 210 ms: 1.02x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| telco                   | 11.4 ms                              | 11.7 ms: 1.03x slower                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| hexiom                  | 16.6 ms                              | 17.0 ms: 1.03x slower                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| richards                | 129 ms                               | 132 ms: 1.03x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| sympy_integrate         | 46.1 ms                              | 47.4 ms: 1.03x slower                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| regex_effbot            | 5.18 ms                              | 5.38 ms: 1.04x slower                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| pidigits                | 326 ms                               | 340 ms: 1.04x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| nbody                   | 239 ms                               | 251 ms: 1.05x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| scimark_sor             | 339 ms                               | 358 ms: 1.06x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| Geometric mean          | (ref)                                | 1.00x slower                              |
+-------------------------+--------------------------------------+-------------------------------------------+

Benchmark hidden because not significant (24): logging_silent, regex_v8, unpickle_list, dulwich_log, pyflate, scimark_fft, sympy_expand, xml_etree_parse, logging_simple, 2to3, spectral_norm, regex_dna, scimark_lu, python_startup, nqueens, genshi_xml, sympy_str, regex_compile, meteor_contest, sqlalchemy_declarative, xml_etree_process, fannkuch, genshi_text, xml_etree_iterparse

Hummm, still the standard deviation is quite high: 6% slower is at the very least suspicious. I will try to reproduce this in the speed.python.org server and my benchmark machine today or tomorrow.

@pablogsal
Copy link
Member

pablogsal commented Apr 30, 2021

These are my results:

pyperf compare_to json/* --table --min-speed 3 --table --group-by-speed
+-------------------------+--------------------------------------+-------------------------------------------+
| Benchmark               | 2021-03-29_18-51-master-e920fbae4661 | 2021-04-30_17-09-patma-names-3f81effcbb0c |
+=========================+======================================+===========================================+
| scimark_sparse_mat_mult | 6.34 ms                              | 5.81 ms: 1.09x faster                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| scimark_fft             | 495 ms                               | 473 ms: 1.05x faster                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| nbody                   | 159 ms                               | 164 ms: 1.03x slower                      |
+-------------------------+--------------------------------------+-------------------------------------------+
| unpack_sequence         | 69.1 ns                              | 71.5 ns: 1.03x slower                     |
+-------------------------+--------------------------------------+-------------------------------------------+
| Geometric mean          | (ref)                                | 1.00x slower                              |
+-------------------------+--------------------------------------+-------------------------------------------+

Benchmark hidden because not significant (55): chameleon, telco, genshi_xml, json_loads, scimark_lu, xml_etree_generate, meteor_contest, json_dumps, xml_etree_process, logging_silent, go, float, mako, crypto_pyaes, 2to3, pickle_list, sqlite_synth, regex_v8, tornado_http, dulwich_log, scimark_monte_carlo, spectral_norm, logging_simple, regex_compile, pickle_dict, logging_format, pickle, unpickle_pure_python, regex_effbot, hexiom, fannkuch, regex_dna, pyflate, genshi_text, pidigits, sqlalchemy_declarative, sympy_integrate, pickle_pure_python, sqlalchemy_imperative, unpickle, chaos, django_template, richards, sympy_str, raytrace, sympy_sum, deltablue, xml_etree_iterparse, python_startup, xml_etree_parse, sympy_expand, python_startup_no_site, scimark_sor, pathlib, nqueens

This looks more reasonable (they are executed with CPU isolation so I assume that's why they are more stable).

@brandtbucher
Copy link
Member Author

brandtbucher commented Apr 30, 2021

Thanks @pablogsal!

That's interesting... I used CPU isolation too. I wonder why mine was so much more unstable.

@pablogsal
Copy link
Member

Unfortunately, I don't know if I will have time to go through this PR before next week because everything is slighly on fire. It would be great if @markshannon or @serhiy-storchaka could give it a look.

I checked and at least we are refleak free:

❯ ./python -m test test_patma test_compile -R :
0:00:00 load avg: 1.65 Run tests sequentially
0:00:00 load avg: 1.65 [1/2] test_patma
beginning 9 repetitions
123456789
.........
0:00:00 load avg: 1.65 [2/2] test_compile
beginning 9 repetitions
123456789
.........
test_compile passed in 39.8 sec

== Tests result: SUCCESS ==

All 2 tests OK.

Total duration: 40.6 sec
Tests result: SUCCESS

@brandtbucher
Copy link
Member Author

No problem @pablogsal. Feel free to nosy me on anything I might be able to help with.

Copy link
Member

@markshannon markshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the general idea. My concerns are with the bytecode changes.

Please leave ROT_TWO, ROT_THREE and ROT_FOUR.
Fixed sized stack manipulations are a lot more efficient: https://godbolt.org/z/97qYzY9Pj

The vast majority of ROTATEs can be translated into one of the above.

Python/compile.c Outdated

// Build all of the fail_pop blocks and reset fail_pop.
static int
cleanup_fail_pop(struct compiler *c, pattern_context *pc)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd break this into two, emitting and freeing. cleanup is ambiguous here, it could be emitting cleanup code or cleaning up the allocated memory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, since the callers allocate the memory, maybe just do the emitting here, and let the callers free the memory?

Copy link
Member Author

@brandtbucher brandtbucher May 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, not sure what you mean by "the callers allocate the memory". Allocation of fail_pop happens in ensure_fail_pop, which itself is almost always called by jump_to_fail_pop. jump_to_fail_pop typically gets called many times, often very far from where cleanup_fail_pop is called.

I'd prefer to keep these steps (emitting and freeing the fail_pop blocks) together. They are very tightly coupled: we never want to free them without emitting them first, and we never want to emit them without freeing them immediately after. Plus, the freeing part is just two lines that cannot fail.

Regarding the name, how about emit_and_reset_fail_pop? (I like "reset" better than "free" since we're also zeroing fail_pop_size).

@brandtbucher
Copy link
Member Author

brandtbucher commented May 2, 2021

Thanks for the review @markshannon! I've:

  • Added the ROT_* ops back. The peephole optimizer will replace ROT_N with these when possible.
  • Renamed the ROTATE op to ROT_N.
  • Renamed cleanup_fail_pop to emit_and_reset_fail_pop.

I also made a tweak to no longer store names in reverse order, since I noticed that this interacted badly with sequence patterns:

match ...:
    case a, b, c:
        pass

Before:

  2           0 LOAD_CONST               0 (Ellipsis)

  3           2 MATCH_SEQUENCE
              4 POP_JUMP_IF_FALSE       15 (to 30)
              6 GET_LEN
              8 LOAD_CONST               1 (3)
             10 COMPARE_OP               2 (==)
             12 POP_JUMP_IF_FALSE       15 (to 30)
             14 UNPACK_SEQUENCE          3
             16 ROT_THREE
             18 ROT_TWO
             20 STORE_NAME               0 (c)
             22 STORE_NAME               1 (b)
             24 STORE_NAME               2 (a)

  4          26 LOAD_CONST               2 (None)
             28 RETURN_VALUE
        >>   30 POP_TOP
             32 LOAD_CONST               2 (None)
             34 RETURN_VALUE

After:

  2           0 LOAD_CONST               0 (Ellipsis)

  3           2 MATCH_SEQUENCE
              4 POP_JUMP_IF_FALSE       13 (to 26)
              6 GET_LEN
              8 LOAD_CONST               1 (3)
             10 COMPARE_OP               2 (==)
             12 POP_JUMP_IF_FALSE       13 (to 26)
             14 UNPACK_SEQUENCE          3
             16 STORE_NAME               0 (a)
             18 STORE_NAME               1 (b)
             20 STORE_NAME               2 (c)

  4          22 LOAD_CONST               2 (None)
             24 RETURN_VALUE
        >>   26 POP_TOP
             28 LOAD_CONST               2 (None)
             30 RETURN_VALUE

@brandtbucher
Copy link
Member Author

If nobody raises any other concerns with this PR before noon tomorrow (PT), I'll go ahead and land this.

(I didn't see any refleaks locally, but I'm going to sic the buildbots on this anyways.)

@brandtbucher brandtbucher added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label May 2, 2021
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @brandtbucher for commit 99f0ec6 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label May 2, 2021
@markshannon
Copy link
Member

👍 for merging.
I have concerns about the efficiency of all the stack shuffling, but we can address that later.

@brandtbucher
Copy link
Member Author

brandtbucher commented May 2, 2021

The three s390x refleak buildbot failures look unrelated (test_asyncio timeouts).

@brandtbucher brandtbucher merged commit 0ad1e03 into python:master May 2, 2021
@python python deleted a comment from Stirfry70 May 8, 2021
@python python deleted a comment from Stirfry70 May 8, 2021
@brandtbucher brandtbucher deleted the patma-names branch July 21, 2022 20:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants