untokenize() does not round-trip for code containing line breaks (\ + \n) #125553

Closed
@tomasr8

Description

Bug report

Bug description:

Code which contains backslash line continuations (\ + \n) is not round-trip invariant:

import tokenize, io

source_code = r"""
1 + \
    2
"""

tokens = list(tokenize.generate_tokens(io.StringIO(source_code).readline))
x = tokenize.untokenize(tokens)
print(x)
# 1 +\
#     2

Notice that the space between + and \ is now missing. The current untokenizer code simply inserts a bare backslash whenever it encounters two consecutive tokens whose rows differ:

cpython/Lib/tokenize.py, lines 179 to 182 at 9c2bb7d:

row_offset = row - self.prev_row
if row_offset:
    self.tokens.append("\\\n" * row_offset)
    self.prev_col = 0
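
To see how that drops the space, here is a standalone hand-run trace of the bookkeeping for the simplified source 1 + \ followed by     2 on the next line (no leading blank line, so + ends at (1, 3) and 2 starts at (2, 4)). It mirrors the quoted row_offset branch plus the col_offset handling that follows it in Untokenizer.add_whitespace; it is only a simulation, not the real untokenizer:

prev_row, prev_col = 1, 3            # end position of "+"
row, col = 2, 4                      # start position of "2"
out = ["1", " ", "+"]                # output accumulated so far

row_offset = row - prev_row
if row_offset:
    out.append("\\\n" * row_offset)  # bare backslash, no preceding space
    prev_col = 0                     # column tracking restarts at 0
col_offset = col - prev_col          # this only restores the indentation
if col_offset:                       # of the continuation line
    out.append(" " * col_offset)
out.append("2")

print("".join(out))
# 1 +\
#     2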

I think this should be fixed. The docstring of tokenize.untokenize says:

Round-trip invariant for full input:
Untokenized source will match input source exactly

To fix this, it will probably be necessary to inspect the raw line contents and count how much whitespace there is at the end of the line.
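
One possible way to recover that whitespace, sketched below, is to look at the raw physical line of the previous token (each TokenInfo carries it in its line attribute) and take the run of spaces or tabs sitting between the end of the line's content and the trailing backslash. The helper name trailing_ws_before_backslash is made up for illustration; this is not the eventual patch:

def trailing_ws_before_backslash(line):
    """Return the whitespace between the last token on a physical line
    and its trailing backslash, or '' if the line has no continuation."""
    stripped = line.rstrip("\r\n")
    if not stripped.endswith("\\"):
        return ""                          # no explicit continuation
    body = stripped[:-1]                   # drop the backslash itself
    return body[len(body.rstrip()):]       # trailing run of spaces/tabs

print(repr(trailing_ws_before_backslash("1 + \\\n")))   # ' '
print(repr(trailing_ws_before_backslash("1 +\\\n")))    # ''

Untokenizer.add_whitespace could then emit this whitespace right before the "\\\n" it already inserts, instead of emitting the backslash with nothing in front of it.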

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux
