This repository has been archived by the owner on Mar 5, 2022. It is now read-only.
Commit
Improve textwrap in presence of zero-width sequences
Fixes #287.

Example (with keyword highlighting on, but the highlight effect is
obviously lost in textual form):

Before:

    $ googler -n3 --np linux
    |
    | 1. Linux.org
    |    https://www.linux.org/
    |    5 days ago ... Friendly Linux Forum. ... This is a video from my
    |    series of chapters in my book "Essential Linux Command Line"...
    |    Continue… Load more…
    |
    | 2. Linux - Wikipedia
    |    https://en.wikipedia.org/wiki/Linux
    |    Linux is a family of open source Unix-like operating systems
    |    based on the Linux kernel, an operating system kernel first
    |    released on September 17, 1991, by ...
    |
    | 3. The Linux Foundation – Supporting Open Source Ecosystems
    |    https://www.linuxfoundation.org/
    |    The Linux Foundation supports the creation of sustainable open
    |    source projects and ecosystems in blockchain, deep learning, networking,
    |    and more.
    |
    |

After:

    $ googler -n3 --np linux
    |
    | 1. Linux.org
    |    https://www.linux.org/
    |    5 days ago ... Friendly Linux Forum. ... This is a video from my series
    |    of chapters in my book "Essential Linux Command Line"... Continue… Load
    |    more…
    |
    | 2. Linux - Wikipedia
    |    https://en.wikipedia.org/wiki/Linux
    |    Linux is a family of open source Unix-like operating systems based on the
    |    Linux kernel, an operating system kernel first released on September 17,
    |    1991, by ...
    |
    | 3. The Linux Foundation – Supporting Open Source Ecosystems
    |    https://www.linuxfoundation.org/
    |    The Linux Foundation supports the creation of sustainable open source
    |    projects and ecosystems in blockchain, deep learning, networking, and
    |    more.
    |
    |

The idea is to use a text wrapper that keeps track of the position of
each source character, so that zero-width sequences can be inserted at
known offsets afterwards.

So, now we have two hacks on top of PSL (Python standard library)
textwrap: a CJK monkey patch, and a position-tracking wrapper. Naturally
one would question whether it's cleaner to just implement a
variable-width capable text wrapper (variable-width *sequences* capable,
not just characters) from scratch. The answer is no.
Just look at the non-variable-width-capable implementation in PSL[1]
and one would conclude that piling on hacks is still cleaner.

[1] https://github.com/python/cpython/blob/3.8/Lib/textwrap.py

Admittedly the TrackedTextwrap implementation is ever so slightly
involved, so it would be nice to set up unit tests for it. I actually
have one written but can't really be bothered to set up the whole
unittest environment for it... So here I include it in the commit
message for posterity:

    import random
    import re

    import pytest

    # TrackedTextwrap is the class added to the googler script by this
    # commit; how to import it depends on your test setup.


    @pytest.mark.parametrize("iteration", range(50))
    def test_tracked_textwrap(iteration):
        whitespace = "\t\n\v\f\r "
        s = """This module provides runtime support for type hints as specified by PEP 484, PEP 526, PEP 544, PEP 586, PEP 589, and PEP 591. The most fundamental support consists of the types Any, Union, Tuple, Callable, TypeVar, and Generic. For full specification please see PEP 484. For a simplified introduction to type hints see PEP 483."""
        wrapped = TrackedTextwrap(s, 80)
        lines = wrapped.lines
        # ['This module provides runtime support for type hints as specified by PEP 484, PEP',
        #  '526, PEP 544, PEP 586, PEP 589, and PEP 591. The most fundamental support',
        #  'consists of the types Any, Union, Tuple, Callable, TypeVar, and Generic. For',
        #  'full specification please see PEP 484. For a simplified introduction to type',
        #  'hints see PEP 483.']

        # Test all coordinates point to expected characters.
        for offset, ch in enumerate(s):
            row, col = wrapped.get_coordinate(offset)
            assert col <= len(lines[row])
            if col == len(lines[row]):
                # Dropped whitespace
                assert ch in whitespace
            else:
                assert lines[row][col] == ch or (
                    ch in whitespace and lines[row][col] == " "
                )

        # Test insertion.
        # Make the entire paragraph blue.
        insertions = [("\x1b[34m", 0), ("\x1b[0m", len(s))]
        for m in re.finditer(r"PEP\s+\d+", s):
            # Mark all "PEP *" as bold.
            insertions.extend([("\x1b[1m", m.start()), ("\x1b[22m", m.end())])
        # Insert in random order.
        random.shuffle(insertions)
        for seq, offset in insertions:
            wrapped.insert_zero_width_sequence(seq, offset)
        assert wrapped.lines == [
            "\x1b[34mThis module provides runtime support for type hints as specified by \x1b[1mPEP 484\x1b[22m, \x1b[1mPEP",
            "526\x1b[22m, \x1b[1mPEP 544\x1b[22m, \x1b[1mPEP 586\x1b[22m, \x1b[1mPEP 589\x1b[22m, and \x1b[1mPEP 591\x1b[22m. The most fundamental support",
            "consists of the types Any, Union, Tuple, Callable, TypeVar, and Generic. For",
            "full specification please see \x1b[1mPEP 484\x1b[22m. For a simplified introduction to type",
            "hints see \x1b[1mPEP 483\x1b[22m.\x1b[0m",
        ]

Note that I did program very defensively here: the underlying
assumptions about the PSL textwrap algorithm should be sound (I read the
documentation carefully in full, and grokked the implementation), but
I'm still checking my assumptions and failing noisily in case any of
them turns out to be wrong.

Final note on minor changes in behavior: LFs in the abstract are no
longer dropped when rendering; they are now handled. I honestly don't
even think LFs would survive our parser, where we actively drop them
when constructing the abstract; the `abstract.replace('\n', '')` is
probably an artifact of the past (didn't bother to check). Anyway, now a
remaining LF (if any) is handled like any other whitespace when passed
through textwrap, which means it's replaced by a space and possibly
dropped.
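
For readers who want the position-tracking idea in isolation, here is a
minimal sketch. It is not the TrackedTextwrap code from this commit: the
names wrap_with_coords and insert_zero_width are made up for
illustration. It assumes that textwrap.wrap, called with
expand_tabs=False and replace_whitespace=False on text whose whitespace
has already been normalized to spaces, only ever drops whitespace at
line boundaries, so every wrapped line is a contiguous slice of the
source. That assumption is checked at runtime and the function fails
noisily if it does not hold.

```python
import textwrap

_WHITESPACE = "\t\n\v\f\r "


def wrap_with_coords(text, width):
    """Wrap text and map every source offset to a (row, col) pair.

    Hypothetical helper for illustration, not the commit's class.
    """
    # Replace each whitespace character by exactly one space, so that
    # offsets in the normalized text equal offsets in the source.
    normalized = text.translate(
        str.maketrans(_WHITESPACE, " " * len(_WHITESPACE))
    )
    lines = textwrap.wrap(
        normalized, width, expand_tabs=False, replace_whitespace=False
    ) or [""]  # whitespace-only input still gets one (empty) line

    coords = []
    offset = 0
    for row, line in enumerate(lines):
        # Assumption check: each line is a contiguous slice of the source.
        if normalized[offset:offset + len(line)] != line:
            raise RuntimeError("wrapped line does not match source text")
        coords.extend((row, col) for col in range(len(line)))
        offset += len(line)
        # Whitespace dropped at the line break is tracked at end of line.
        while offset < len(normalized) and normalized[offset] == " ":
            coords.append((row, len(line)))
            offset += 1
    # One past the last character maps to one past the last column.
    coords.append((len(lines) - 1, len(lines[-1])))
    return lines, coords


def insert_zero_width(lines, coords, insertions):
    """Return new lines with (offset, sequence) pairs spliced in.

    Offsets refer to the source text; sequences sharing an offset keep
    the order in which they were given.
    """
    per_row = {}
    for offset, seq in insertions:
        row, col = coords[offset]
        per_row.setdefault(row, []).append((col, seq))

    result = []
    for row, line in enumerate(lines):
        pieces, prev = [], 0
        # sorted() is stable, so ties on col keep their input order.
        for col, seq in sorted(per_row.get(row, []), key=lambda p: p[0]):
            pieces.append(line[prev:col])
            pieces.append(seq)
            prev = col
        pieces.append(line[prev:])
        result.append("".join(pieces))
    return result


if __name__ == "__main__":
    source = "alpha beta\ngamma delta"
    lines, coords = wrap_with_coords(source, 11)
    start = source.index("beta")
    end = start + len("beta\ngamma")
    # Bold on at "beta", bold off right after "gamma".
    for line in insert_zero_width(
        lines, coords, [(end, "\x1b[22m"), (start, "\x1b[1m")]
    ):
        print(repr(line))
    # 'alpha \x1b[1mbeta'
    # 'gamma\x1b[22m delta'
```

The LF in the sample source is treated like any other whitespace: it is
replaced by a space, lands on the line break, and is dropped, while its
offset still maps to the end of the first line. Wrapping first and
splicing the escape sequences in afterwards is what keeps them from
counting toward the line width.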