
gh-119118: Fix performance regression in tokenize module #119615

Merged · 7 commits · May 28, 2024

Conversation

@lysnikolaou (Contributor) commented May 27, 2024

  • Cache line object to avoid creating a Unicode object for all of the tokens in the same line.
  • Speed up byte offset to column offset conversion by using the smallest buffer possible to measure the difference.
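
The rough idea, sketched in Python for illustration (hypothetical names; the actual change lives in C in Python/Python-tokenize.c): keep the decoded line cached while tokens still point into it, and when converting a byte offset to a column offset, decode only the bytes between the previously converted offset and the new one instead of the whole line prefix.

```python
# Illustrative sketch only; the real fix operates on the C tokenizer's
# internal buffers. Names here are hypothetical.

class LineState:
    """Per-line state: the cached decoded line plus the last converted offset."""

    def __init__(self, line_bytes: bytes):
        self.line_bytes = line_bytes
        # Decoded once and reused by every token on this line, instead of
        # creating a new Unicode object per token.
        self.line_str = line_bytes.decode("utf-8")
        self.last_byte_offset = 0
        self.last_col_offset = 0

    def byte_to_col(self, byte_offset: int) -> int:
        """Convert a byte offset on this line into a character column offset.

        Only the small slice between the previous offset and the new one is
        decoded, so the cost is proportional to the gap, not the line length.
        """
        delta = self.line_bytes[self.last_byte_offset:byte_offset]
        self.last_col_offset += len(delta.decode("utf-8"))
        self.last_byte_offset = byte_offset
        return self.last_col_offset


state = LineState("dateb = '3 fév'\n".encode("utf-8"))
print(state.byte_to_col(len(b"dateb")))                           # 5
print(state.byte_to_col(len("dateb = '3 fév'".encode("utf-8"))))  # 15
```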

Python/Python-tokenize.c: 3 review comments (outdated, resolved)
@pablogsal (Member) commented:

Hummm, it also seems that this solution fails test_tokenize with -uall:

======================================================================
ERROR: test_random_files (test.test_tokenize.TestRoundtrip.test_random_files) (file='/Users/pgalindo3/github/python/main/Lib/test/test_difflib.py')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1959, in test_random_files
    self.check_roundtrip(f)
    ~~~~~~~~~~~~~~~~~~~~^^^
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1827, in check_roundtrip
    tokens2_from5 = [tok[:2] for tok in tokenize.tokenize(readline5)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 484, in tokenize
    yield from _generate_tokens_from_c_tokenizer(rl_gen.__next__, encoding, extra_tokens=True)
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 578, in _generate_tokens_from_c_tokenizer
    raise e from None
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 574, in _generate_tokens_from_c_tokenizer
    for info in it:
        yield TokenInfo._make(info)
  File "<string>", line 467
    dateb = '3 fév'
                   ^
IndentationError: unindent does not match any outer indentation level

======================================================================
ERROR: test_random_files (test.test_tokenize.TestRoundtrip.test_random_files) (file='/Users/pgalindo3/github/python/main/Lib/test/test_html.py')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1959, in test_random_files
    self.check_roundtrip(f)
    ~~~~~~~~~~~~~~~~~~~~^^^
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1827, in check_roundtrip
    tokens2_from5 = [tok[:2] for tok in tokenize.tokenize(readline5)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 484, in tokenize
    yield from _generate_tokens_from_c_tokenizer(rl_gen.__next__, encoding, extra_tokens=True)
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 578, in _generate_tokens_from_c_tokenizer
    raise e from None
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 574, in _generate_tokens_from_c_tokenizer
    for info in it:
        yield TokenInfo._make(info)
  File "<string>", line 85
    check('&notin;', '∉')
                         ^
IndentationError: unindent does not match any outer indentation level

======================================================================
ERROR: test_random_files (test.test_tokenize.TestRoundtrip.test_random_files) (file='/Users/pgalindo3/github/python/main/Lib/test/test_str.py')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1959, in test_random_files
    self.check_roundtrip(f)
    ~~~~~~~~~~~~~~~~~~~~^^^
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1827, in check_roundtrip
    tokens2_from5 = [tok[:2] for tok in tokenize.tokenize(readline5)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 484, in tokenize
    yield from _generate_tokens_from_c_tokenizer(rl_gen.__next__, encoding, extra_tokens=True)
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 578, in _generate_tokens_from_c_tokenizer
    raise e from None
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 574, in _generate_tokens_from_c_tokenizer
    for info in it:
        yield TokenInfo._make(info)
  File "<string>", line 1008
    try:
        ^
IndentationError: unindent does not match any outer indentation level

----------------------------------------------------------------------

@pablogsal (Member) commented May 28, 2024

Another idea: we already have the token as unicode (str), so if the line has not changed we can keep adding the size of the token itself to our state, and take into account the number of whitespace characters between tokens (these are always ASCII, so we can basically just calculate that as the diff of two pointers).
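
A rough Python rendering of that idea (hypothetical helper; the real tokenizer would track this in C with pointer arithmetic): advance the running column by the character length of the token we already hold as a str, plus the byte difference for the inter-token gap, which is pure ASCII whitespace so byte count and character count coincide.

```python
# Hypothetical sketch of the alternative approach described above.
# prev_end_col / prev_end_byte describe where the previous token ended on
# the current line; tok_start_byte is where the new token starts.

def advance(prev_end_col: int, prev_end_byte: int,
            tok_start_byte: int, tok_str: str) -> tuple[int, int, int]:
    # Whitespace between tokens is ASCII, so the byte gap equals the
    # character gap and no decoding is needed.
    start_col = prev_end_col + (tok_start_byte - prev_end_byte)
    # The token is already a str, so its character length is free.
    end_col = start_col + len(tok_str)
    end_byte = tok_start_byte + len(tok_str.encode("utf-8"))
    return start_col, end_col, end_byte


# For the line "x = 'fé'": token "x" ends at byte 1 / column 1.
print(advance(1, 1, 2, "="))      # (2, 3, 3)
print(advance(3, 3, 4, "'fé'"))   # (4, 8, 9)
```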

@pablogsal (Member) commented:

I'm discussing another fix with @lysnikolaou offline.

lysnikolaou and others added 2 commits May 28, 2024 16:19
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
@pablogsal (Member) commented:

The docs failure seems unrelated. @hugovk do we know what may be happening here?

@lysnikolaou (Contributor, Author) commented:

The latest results are:

cpython on  performance-tokenize [$] via C v15.0.0-clang via 🐍 v3.11.3 
❯ python tmp/t.py
cpython darwin 3.11.3 (main, May  8 2023, 13:16:43) [Clang 14.0.3 (clang-1403.0.22.14.1)]
Time taken: 0.5428769588470459

cpython on  performance-tokenize [$] via C v15.0.0-clang via 🐍 v3.11.3 
❯ ./python.exe tmp/t.py
cpython darwin 3.14.0a0 (heads/performance-tokenize-dirty:ab3437096a7, May 28 2024, 19:26:19) [Clang 15.0.0 (clang-1500.3.9.4)]
Time taken: 0.4570140838623047

The test failures appear to be unrelated. We can probably merge this.
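
For context, the timing script (tmp/t.py) itself isn't shown in the thread; a minimal stand-in that exercises the same path and prints output in the same shape might look roughly like this (hypothetical, not the actual script):

```python
# Hypothetical stand-in for the benchmark script; the real tmp/t.py is not
# included in this thread. Tokenizes a stdlib file repeatedly and prints
# the elapsed time, mirroring the "Time taken" output above.
import io
import sys
import time
import tokenize

source = open(tokenize.__file__, "rb").read()

start = time.perf_counter()
for _ in range(100):
    for _tok in tokenize.tokenize(io.BytesIO(source).readline):
        pass
print(sys.implementation.name, sys.platform, sys.version)
print("Time taken:", time.perf_counter() - start)
```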

@hugovk (Member) commented May 28, 2024

> The docs failure seems unrelated. @hugovk do we know what may be happening here?

Yep, need to merge in latest main, the new option has been added there (re: #119221).

@pablogsal added the "needs backport to 3.12" and "needs backport to 3.13" labels on May 28, 2024
@pablogsal enabled auto-merge (squash) on May 28, 2024 18:19
@pablogsal merged commit d87b015 into python:main on May 28, 2024
36 checks passed
@miss-islington-app commented:

Thanks @lysnikolaou for the PR, and @pablogsal for merging it 🌮🎉.. I'm working now to backport this PR to: 3.12, 3.13.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request May 28, 2024
…nGH-119615)

* pythongh-119118: Fix performance regression in tokenize module

- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.

(cherry picked from commit d87b015)

Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
@bedevere-app (bot) commented May 28, 2024

GH-119682 is a backport of this pull request to the 3.13 branch.

@bedevere-app (bot) removed the "needs backport to 3.13" label on May 28, 2024
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request May 28, 2024
…nGH-119615)

* pythongh-119118: Fix performance regression in tokenize module

- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.

(cherry picked from commit d87b015)

Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
@bedevere-app (bot) commented May 28, 2024

GH-119683 is a backport of this pull request to the 3.12 branch.

@bedevere-app (bot) removed the "needs backport to 3.12" label on May 28, 2024
lysnikolaou added a commit that referenced this pull request May 28, 2024
…19615) (#119682)

- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.

(cherry picked from commit d87b015)

Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
lysnikolaou added a commit that referenced this pull request May 28, 2024
…19615) (#119683)

- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.

(cherry picked from commit d87b015)

Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
noahbkim pushed a commit to hudson-trading/cpython that referenced this pull request Jul 11, 2024
…n#119615)

* pythongh-119118: Fix performance regression in tokenize module

- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.

Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
estyxx pushed a commit to estyxx/cpython that referenced this pull request Jul 17, 2024
…n#119615)

* pythongh-119118: Fix performance regression in tokenize module

- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.

Co-authored-by: Pablo Galindo <pablogsal@gmail.com>