during tokenize, use UTF8 encoding on all platforms #510

drammock · 2023-10-09T19:31:03Z

(This is a PR-as-issue, but if I've guessed the wrong solution please feel free to close or suggest a better fix.)

An MNE-Python user who was trying to build our docs on Windows hit this error today:

Traceback (most recent call last):
  File "C:\Users\Carina\mambaforge\envs\mnedev\Lib\site-packages\sphinx\events.py", line 97, in emit
    results.append(listener.handler(self.app, *args))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Carina\mambaforge\envs\mnedev\Lib\site-packages\numpydoc\numpydoc.py", line 214, in mangle_docstrings
    report = validate(doc)
             ^^^^^^^^^^^^^
  File "C:\Users\Carina\mambaforge\envs\mnedev\Lib\site-packages\numpydoc\validate.py", line 617, in validate
    ignore_validation_comments = extract_ignore_validation_comments(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Carina\mambaforge\envs\mnedev\Lib\site-packages\numpydoc\validate.py", line 145, in extract_ignore_validation_comments
    for token in tokenize.generate_tokens(file.readline):
  File "C:\Users\Carina\mambaforge\envs\mnedev\Lib\tokenize.py", line 454, in _tokenize
    line = readline()
           ^^^^^^^^^^
  File "C:\Users\Carina\mambaforge\envs\mnedev\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3922: character maps to <undefined>

Re-running after export PYTHONUTF8=1 resolved the issue, so I think explicitly specifying UTF-8 when reading the file should also prevent the error, without requiring any user action.
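A minimal sketch of the idea, not the actual numpydoc code: the helper name collect_comments and the sample file contents are made up for illustration. It assumes the source file is saved as UTF-8 and passes encoding="utf-8" to open() before handing readline to tokenize.generate_tokens, so the result no longer depends on the platform's default encoding.

```python
import tempfile
import tokenize
from pathlib import Path


def collect_comments(path):
    """Return the comment tokens of a Python source file (hypothetical helper)."""
    # An explicit encoding makes the read independent of the locale default
    # (for example cp1252 on many Windows installations).
    with open(path, encoding="utf-8") as file:
        return [
            token.string
            for token in tokenize.generate_tokens(file.readline)
            if token.type == tokenize.COMMENT
        ]


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "example.py"
    # "Á" is encoded in UTF-8 as the bytes C3 81; byte 0x81 has no mapping
    # in cp1252, which is the kind of byte the traceback complains about.
    source.write_text('# numpydoc ignore=GL08\nname = "Ánna"\n', encoding="utf-8")
    comments = collect_comments(source)
```

With the encoding left out, the same call would only fail on systems whose default encoding cannot decode the file, which is why the problem showed up on Windows but not elsewhere.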

Technically I think this is a backwards-incompatible change for Windows users who had any non-ASCII characters in their source files, if those characters are encoded differently in UTF-8 than they are in the system's default code page (which will vary with OS language settings). However, PYTHONUTF8=1 will become the effective default in 2026 with the release of Python 3.15 (so affected users will need to address this change eventually anyway).
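To make the compatibility concern concrete, here is a small illustration using cp1252 as a stand-in for a Windows default code page (the real default depends on OS language settings). The sample characters are chosen arbitrarily.

```python
text = "é"

# The same character is stored as different bytes in the two encodings.
utf8_bytes = text.encode("utf-8")      # two bytes: C3 A9
cp1252_bytes = text.encode("cp1252")   # one byte: E9

# Reading UTF-8 bytes as cp1252 silently yields the wrong characters.
misread = utf8_bytes.decode("cp1252")

# Reading cp1252 bytes as UTF-8 fails: a file saved in the legacy code page
# would stop loading once UTF-8 is enforced.
try:
    cp1252_bytes.decode("utf-8")
    legacy_file_breaks = False
except UnicodeDecodeError:
    legacy_file_breaks = True

# Some UTF-8 byte sequences contain bytes that cp1252 cannot decode at all,
# such as 0x81 in the encoding of "Á". This mirrors the reported error.
try:
    "Á".encode("utf-8").decode("cp1252")
    utf8_file_breaks_on_cp1252 = False
except UnicodeDecodeError:
    utf8_file_breaks_on_cp1252 = True
```

Pure-ASCII files are unaffected either way, since ASCII bytes decode identically under both encodings.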

drammock · 2023-10-09T21:44:56Z

I've now added a better test that captures the actual case we encountered. Ready for review from my end.

numpydoc/tests/test_validate.py

jarrodmillman · 2023-10-09T23:38:15Z

See #505 and #506.

rossbar

Thanks @drammock, the test is very helpful!

drammock added 2 commits October 9, 2023 13:55

during tokenize, use UTF8 encoding on all platforms

3bc6ec8

add test

49bdc96

drammock mentioned this pull request Oct 9, 2023

doc build error on windows mne-tools/mne-python#12093

Closed

jarrodmillman added the type: Bug fix label Oct 9, 2023

better test

8f1dbf5

drammock commented Oct 9, 2023

Review comment on numpydoc/tests/test_validate.py (outdated, resolved)

fix typo in pytest param ID

83929ea

jarrodmillman mentioned this pull request Oct 9, 2023

fix: encoding issue in numpydoc-validation #506

Closed

jarrodmillman approved these changes Oct 9, 2023

rossbar approved these changes Oct 10, 2023

rossbar merged commit 80a8708 into numpy:main Oct 10, 2023

rossbar mentioned this pull request Oct 10, 2023

Bug located in the numpydoc-validation pre-commit hook #505

Closed

drammock deleted the tokenize-encoding branch December 12, 2023 18:21
