
gh-119118: Fix performance regression in tokenize module #119615

Merged · 7 commits · May 28, 2024

Conversation

@lysnikolaou (Contributor) commented May 27, 2024

  • Cache line object to avoid creating a Unicode object for all of the tokens in the same line.
  • Speed up byte offset to column offset conversion by using the smallest buffer possible to measure the difference.
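
The rough idea, sketched in Python for illustration (hypothetical names; the actual change lives in C in Python/Python-tokenize.c): keep the decoded line cached while tokens still point into it, and when converting a byte offset to a column offset, decode only the bytes between the previously converted offset and the new one instead of the whole line prefix.

```python
# Illustrative sketch only; the real fix operates on the C tokenizer's
# internal buffers. Names here are hypothetical.

class LineState:
    """Per-line state: the cached decoded line plus the last converted offset."""

    def __init__(self, line_bytes: bytes):
        self.line_bytes = line_bytes
        # Decoded once and reused by every token on this line, instead of
        # creating a new Unicode object per token.
        self.line_str = line_bytes.decode("utf-8")
        self.last_byte_offset = 0
        self.last_col_offset = 0

    def byte_to_col(self, byte_offset: int) -> int:
        """Convert a byte offset on this line into a character column offset.

        Only the small slice between the previous offset and the new one is
        decoded, so the cost is proportional to the gap, not the line length.
        """
        delta = self.line_bytes[self.last_byte_offset:byte_offset]
        self.last_col_offset += len(delta.decode("utf-8"))
        self.last_byte_offset = byte_offset
        return self.last_col_offset


state = LineState("dateb = '3 fév'\n".encode("utf-8"))
print(state.byte_to_col(len(b"dateb")))                           # 5
print(state.byte_to_col(len("dateb = '3 fév'".encode("utf-8"))))  # 15
```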

Python/Python-tokenize.c: 3 review comments (outdated, resolved)
@pablogsal (Member) commented:

Hummm, it also seems that this solution fails test_tokenize with -uall:

======================================================================
ERROR: test_random_files (test.test_tokenize.TestRoundtrip.test_random_files) (file='/Users/pgalindo3/github/python/main/Lib/test/test_difflib.py')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1959, in test_random_files
    self.check_roundtrip(f)
    ~~~~~~~~~~~~~~~~~~~~^^^
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1827, in check_roundtrip
    tokens2_from5 = [tok[:2] for tok in tokenize.tokenize(readline5)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 484, in tokenize
    yield from _generate_tokens_from_c_tokenizer(rl_gen.__next__, encoding, extra_tokens=True)
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 578, in _generate_tokens_from_c_tokenizer
    raise e from None
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 574, in _generate_tokens_from_c_tokenizer
    for info in it:
        yield TokenInfo._make(info)
  File "<string>", line 467
    dateb = '3 fév'
                   ^
IndentationError: unindent does not match any outer indentation level

======================================================================
ERROR: test_random_files (test.test_tokenize.TestRoundtrip.test_random_files) (file='/Users/pgalindo3/github/python/main/Lib/test/test_html.py')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1959, in test_random_files
    self.check_roundtrip(f)
    ~~~~~~~~~~~~~~~~~~~~^^^
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1827, in check_roundtrip
    tokens2_from5 = [tok[:2] for tok in tokenize.tokenize(readline5)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 484, in tokenize
    yield from _generate_tokens_from_c_tokenizer(rl_gen.__next__, encoding, extra_tokens=True)
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 578, in _generate_tokens_from_c_tokenizer
    raise e from None
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 574, in _generate_tokens_from_c_tokenizer
    for info in it:
        yield TokenInfo._make(info)
  File "<string>", line 85
    check('&notin;', '∉')
                         ^
IndentationError: unindent does not match any outer indentation level

======================================================================
ERROR: test_random_files (test.test_tokenize.TestRoundtrip.test_random_files) (file='/Users/pgalindo3/github/python/main/Lib/test/test_str.py')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1959, in test_random_files
    self.check_roundtrip(f)
    ~~~~~~~~~~~~~~~~~~~~^^^
  File "/Users/pgalindo3/github/python/main/Lib/test/test_tokenize.py", line 1827, in check_roundtrip
    tokens2_from5 = [tok[:2] for tok in tokenize.tokenize(readline5)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 484, in tokenize
    yield from _generate_tokens_from_c_tokenizer(rl_gen.__next__, encoding, extra_tokens=True)
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 578, in _generate_tokens_from_c_tokenizer
    raise e from None
  File "/Users/pgalindo3/github/python/main/Lib/tokenize.py", line 574, in _generate_tokens_from_c_tokenizer
    for info in it:
        yield TokenInfo._make(info)
  File "<string>", line 1008
    try:
        ^
IndentationError: unindent does not match any outer indentation level

----------------------------------------------------------------------

@pablogsal (Member) commented May 28, 2024

Another idea: we already have the token as unicode (str), so if the line has not changed we can keep adding the size of the token itself to our state, and take into account the number of whitespace characters between tokens (these are always ASCII, so we can basically just calculate that as the diff of two pointers).
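
A rough Python rendering of that idea (hypothetical helper; the real tokenizer would track this in C with pointer arithmetic): advance the running column by the character length of the token we already hold as a str, plus the byte difference for the inter-token gap, which is pure ASCII whitespace so byte count and character count coincide.

```python
# Hypothetical sketch of the alternative approach described above.
# prev_end_col / prev_end_byte describe where the previous token ended on
# the current line; tok_start_byte is where the new token starts.

def advance(prev_end_col: int, prev_end_byte: int,
            tok_start_byte: int, tok_str: str) -> tuple[int, int, int]:
    # Whitespace between tokens is ASCII, so the byte gap equals the
    # character gap and no decoding is needed.
    start_col = prev_end_col + (tok_start_byte - prev_end_byte)
    # The token is already a str, so its character length is free.
    end_col = start_col + len(tok_str)
    end_byte = tok_start_byte + len(tok_str.encode("utf-8"))
    return start_col, end_col, end_byte


# For the line "x = 'fé'": token "x" ends at byte 1 / column 1.
print(advance(1, 1, 2, "="))      # (2, 3, 3)
print(advance(3, 3, 4, "'fé'"))   # (4, 8, 9)
```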

@pablogsal (Member) commented:

I'm discussing another fix with @lysnikolaou offline.

lysnikolaou and others added 2 commits May 28, 2024 16:19
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
@pablogsal (Member) commented:

The docs failure seems unrelated. @hugovk do we know what may be happening here?

@lysnikolaou (Contributor, Author) commented:

The latest results are:

cpython on  performance-tokenize [$] via C v15.0.0-clang via 🐍 v3.11.3 
❯ python tmp/t.py
cpython darwin 3.11.3 (main, May  8 2023, 13:16:43) [Clang 14.0.3 (clang-1403.0.22.14.1)]
Time taken: 0.5428769588470459

cpython on  performance-tokenize [$] via C v15.0.0-clang via 🐍 v3.11.3 
❯ ./python.exe tmp/t.py
cpython darwin 3.14.0a0 (heads/performance-tokenize-dirty:ab3437096a7, May 28 2024, 19:26:19) [Clang 15.0.0 (clang-1500.3.9.4)]
Time taken: 0.4570140838623047

The test failures appear to be unrelated. We can probably merge this.
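
For context, the timing script (tmp/t.py) itself isn't shown in the thread; a minimal stand-in that exercises the same path and prints output in the same shape might look roughly like this (hypothetical, not the actual script):

```python
# Hypothetical stand-in for the benchmark script; the real tmp/t.py is not
# included in this thread. Tokenizes a stdlib file repeatedly and prints
# the elapsed time, mirroring the "Time taken" output above.
import io
import sys
import time
import tokenize

source = open(tokenize.__file__, "rb").read()

start = time.perf_counter()
for _ in range(100):
    for _tok in tokenize.tokenize(io.BytesIO(source).readline):
        pass
print(sys.implementation.name, sys.platform, sys.version)
print("Time taken:", time.perf_counter() - start)
```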

@hugovk (Member) commented May 28, 2024

> The docs failure seems unrelated. @hugovk do we know what may be happening here?

Yep, need to merge in latest main, the new option has been added there (re: #119221).

@pablogsal added the "needs backport to 3.12" and "needs backport to 3.13" labels on May 28, 2024
@pablogsal enabled auto-merge (squash) on May 28, 2024 18:19
@pablogsal merged commit d87b015 into python:main on May 28, 2024
36 checks passed
@miss-islington-app commented:

Thanks @lysnikolaou for the PR, and @pablogsal for merging it 🌮🎉.. I'm working now to backport this PR to: 3.12, 3.13.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request May 28, 2024
…nGH-119615)

* pythongh-119118: Fix performance regression in tokenize module

- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.

(cherry picked from commit d87b015)

Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
@bedevere-app (bot) commented May 28, 2024

GH-119682 is a backport of this pull request to the 3.13 branch.

@bedevere-app (bot) removed the "needs backport to 3.13" label on May 28, 2024
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request May 28, 2024
…nGH-119615)

* pythongh-119118: Fix performance regression in tokenize module

- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.

(cherry picked from commit d87b015)

Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
@bedevere-app (bot) commented May 28, 2024

GH-119683 is a backport of this pull request to the 3.12 branch.

@bedevere-app (bot) removed the "needs backport to 3.12" label on May 28, 2024
lysnikolaou added a commit that referenced this pull request May 28, 2024
…19615) (#119682)

- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.

(cherry picked from commit d87b015)

Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
lysnikolaou added a commit that referenced this pull request May 28, 2024
…19615) (#119683)

- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.

(cherry picked from commit d87b015)

Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
noahbkim pushed a commit to hudson-trading/cpython that referenced this pull request Jul 11, 2024
…n#119615)

* pythongh-119118: Fix performance regression in tokenize module

- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.

Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
estyxx pushed a commit to estyxx/cpython that referenced this pull request Jul 17, 2024
…n#119615)

* pythongh-119118: Fix performance regression in tokenize module

- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.

Co-authored-by: Pablo Galindo <pablogsal@gmail.com>