This repository has been archived by the owner on Mar 5, 2022. It is now read-only.

Improve textwrap in presence of zero-width sequences
Fixes #287.

Example (with keyword highlighting on, but the highlight effect is
obviously lost in textual form):

Before:

$ googler -n3 --np linux                                                       |
                                                                               |
 1.  Linux.org                                                                 |
     https://www.linux.org/                                                    |
     5 days ago ... Friendly Linux Forum. ... This is a video from my          |
     series of chapters in my book "Essential Linux Command Line"...           |
     Continue… Load more…                                                      |
                                                                               |
 2.  Linux - Wikipedia                                                         |
     https://en.wikipedia.org/wiki/Linux                                       |
     Linux is a family of open source Unix-like operating systems              |
     based on the Linux kernel, an operating system kernel first               |
     released on September 17, 1991, by ...                                    |
                                                                               |
 3.  The Linux Foundation – Supporting Open Source Ecosystems                  |
     https://www.linuxfoundation.org/                                          |
     The Linux Foundation supports the creation of sustainable open            |
     source projects and ecosystems in blockchain, deep learning, networking,  |
     and more.                                                                 |
                                                                               |

After:

$ googler -n3 --np linux                                                       |
                                                                               |
 1.  Linux.org                                                                 |
     https://www.linux.org/                                                    |
     5 days ago ... Friendly Linux Forum. ... This is a video from my series   |
     of chapters in my book "Essential Linux Command Line"... Continue… Load   |
     more…                                                                     |
                                                                               |
 2.  Linux - Wikipedia                                                         |
     https://en.wikipedia.org/wiki/Linux                                       |
     Linux is a family of open source Unix-like operating systems based on the |
     Linux kernel, an operating system kernel first released on September 17,  |
     1991, by ...                                                              |
                                                                               |
 3.  The Linux Foundation – Supporting Open Source Ecosystems                  |
     https://www.linuxfoundation.org/                                          |
     The Linux Foundation supports the creation of sustainable open source     |
     projects and ecosystems in blockchain, deep learning, networking, and     |
     more.                                                                     |
                                                                               |
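The short lines in the "Before" output are consistent with textwrap counting escape sequences toward the line width. A minimal sketch of that effect, assuming a made-up abstract, a width of 36, and bold on/off sequences around a single keyword (none of these values come from googler itself):

```python
import re
import textwrap

# Matches SGR escape sequences such as "\x1b[1m" and "\x1b[0m".
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")

# Hypothetical abstract and highlight, for illustration only.
plain = "Linux is a family of open source Unix-like operating systems"
highlighted = plain.replace("Linux", "\x1b[1mLinux\x1b[0m")

plain_lines = textwrap.wrap(plain, 36)
naive_lines = textwrap.wrap(highlighted, 36)

# What the terminal actually shows once the sequences are interpreted.
visible = [ANSI_SGR.sub("", line) for line in naive_lines]

print(plain_lines)
print(visible)
```

The eight invisible characters on the first line are charged against the width, so the first visible line breaks one word earlier than it does for the plain text.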

The idea is to use a text wrapper that keeps track of the position of
each source character, so that zero-width sequences can be inserted at
known offsets afterwards.
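A minimal sketch of that idea, not the TrackedTextwrap class from this commit: `wrap_with_coords` and `insert_sequences` are hypothetical helper names, the text is assumed to contain only plain spaces as whitespace, and insertions are applied from the highest offset down rather than by shifting stored coordinates.

```python
import textwrap
from typing import List, Tuple


def wrap_with_coords(text: str, width: int) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Wrap text and map every source offset to a (row, col) coordinate."""
    # Assumption: text contains no tabs or newlines, only plain spaces.
    lines = textwrap.wrap(text, width) or [""]
    coords = []  # indexed by offset in the source text
    offset = 0
    for row, line in enumerate(lines):
        assert text[offset : offset + len(line)] == line
        for col in range(len(line)):
            coords.append((row, col))
        offset += len(line)
        # Spaces dropped at the line break map to the end of this line.
        while offset < len(text) and text[offset] == " ":
            coords.append((row, len(line)))
            offset += 1
    # One past the final character is a valid offset too.
    coords.append((len(lines) - 1, len(lines[-1])))
    return lines, coords


def insert_sequences(lines, coords, insertions):
    """Insert (seq, offset) pairs, highest offset first, so that
    coordinates of earlier offsets stay valid."""
    for seq, offset in sorted(insertions, key=lambda pair: pair[1], reverse=True):
        row, col = coords[offset]
        lines[row] = lines[row][:col] + seq + lines[row][col:]


text = "Linux is a family of open source Unix-like operating systems"
lines, coords = wrap_with_coords(text, 36)
start = text.index("Unix-like")
insert_sequences(
    lines, coords, [("\x1b[1m", start), ("\x1b[0m", start + len("Unix-like"))]
)
print(lines)
```

Because wrapping happens before any escape sequence is inserted, the line breaks are the same as for the plain text.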

So, now we have two hacks on top of PSL textwrap: a CJK monkey patch,
and a position-tracking wrapper. Naturally one would question whether
it's cleaner to just implement a variable-width capable wrapper (capable
of variable-width *sequences*, not just characters) from scratch. The
answer is no. Just look at the non-variable-width-capable implementation
in PSL[1] and one would conclude that piling on hacks is still cleaner.

[1] https://github.com/python/cpython/blob/3.8/Lib/textwrap.py

Admittedly the TrackedTextwrap implementation is ever so slightly
involved, so it would be nice to set up unit tests for it. I actually
have one written but can't really be bothered to set up the whole
unittest environment for it... So here I include it in the commit
message for posterity:

    import random
    import re

    import pytest

    @pytest.mark.parametrize("iteration", range(50))
    def test_tracked_textwrap(iteration):
        whitespace = "\t\n\v\f\r "
        s = """This module provides runtime support for type hints as specified by PEP 484, PEP 526, PEP 544,
    PEP 586, PEP 589, and PEP 591. The most fundamental support consists of the types Any, Union, Tuple,
    Callable, TypeVar, and Generic. For full specification please see PEP 484. For a simplified
    introduction to type hints see PEP 483."""
        wrapped = TrackedTextwrap(s, 80)
        lines = wrapped.lines
        # ['This module provides runtime support for type hints as specified by PEP 484, PEP',
        # '526, PEP 544, PEP 586, PEP 589, and PEP 591. The most fundamental support',
        # 'consists of the types Any, Union, Tuple, Callable, TypeVar, and Generic. For',
        # 'full specification please see PEP 484. For a simplified introduction to type',
        # 'hints see PEP 483.']

        # Test all coordinates point to expected characters.
        for offset, ch in enumerate(s):
            row, col = wrapped.get_coordinate(offset)
            assert col <= len(lines[row])
            if col == len(lines[row]):
                # Dropped whitespace
                assert ch in whitespace
            else:
                assert lines[row][col] == ch or (
                    ch in whitespace and lines[row][col] == " "
                )

        # Test insertion.
        # Make the entire paragraph blue.
        insertions = [("\x1b[34m", 0), ("\x1b[0m", len(s))]
        for m in re.finditer(r"PEP\s+\d+", s):
            # Mark all "PEP *" as bold.
            insertions.extend([("\x1b[1m", m.start()), ("\x1b[22m", m.end())])
        # Insert in random order.
        random.shuffle(insertions)
        for seq, offset in insertions:
            wrapped.insert_zero_width_sequence(seq, offset)
        assert wrapped.lines == [
            "\x1b[34mThis module provides runtime support for type hints as specified by \x1b[1mPEP 484\x1b[22m, \x1b[1mPEP",
            "526\x1b[22m, \x1b[1mPEP 544\x1b[22m, \x1b[1mPEP 586\x1b[22m, \x1b[1mPEP 589\x1b[22m, and \x1b[1mPEP 591\x1b[22m. The most fundamental support",
            "consists of the types Any, Union, Tuple, Callable, TypeVar, and Generic. For",
            "full specification please see \x1b[1mPEP 484\x1b[22m. For a simplified introduction to type",
            "hints see \x1b[1mPEP 483\x1b[22m.\x1b[0m",
        ]

Note that I did program very defensively here: the underlying
assumptions about the PSL textwrap algorithm should be sound (I read the
documentation carefully in full, and grokked the implementation), but
I'm still checking my assumptions and failing noisily in case any of
them fails.

Final note on minor changes in behavior: LFs in the abstract are no
longer dropped when rendering; they are now handled. I honestly don't
even think LFs would survive our parser, where we actively drop them
when constructing the abstract; the `abstract.replace('\n', '')` is
probably an artifact of the past (didn't bother to check). Anyway, now a
remaining LF (if any) is handled like any other whitespace when passed
through textwrap, which means it's replaced by a space and possibly
dropped.
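A small sketch of that LF behavior using plain textwrap.wrap. The abstract and the widths are made up for illustration:

```python
import textwrap

# Hypothetical abstract with a stray LF in the middle.
abstract = "Linux is a family of\nopen source Unix-like operating systems"

# replace_whitespace defaults to True, so the LF becomes a space
# and the text wraps as if it had been a space all along.
wide = textwrap.wrap(abstract, 30, expand_tabs=False)

# With width 20 the line break falls exactly where the LF was,
# so the resulting space is dropped at the break.
narrow = textwrap.wrap(abstract, 20, expand_tabs=False)

print(wide)
print(narrow)
```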
zmwangx committed Nov 14, 2019
1 parent f258f34 commit 83bf875
Showing 1 changed file with 119 additions and 31 deletions.
150 changes: 119 additions & 31 deletions googler
@@ -55,6 +55,20 @@ try:
except (ImportError, Exception):
    pass

from typing import (
    Any,
    Dict,
    Generator,
    Iterable,
    Iterator,
    List,
    Match,
    Optional,
    Tuple,
    Union,
    cast,
)

# Basic setup

logging.basicConfig(format='[%(levelname)s] %(message)s')
@@ -152,6 +166,98 @@ def monkeypatch_textwrap_for_cjk():
monkeypatch_textwrap_for_cjk()


CoordinateType = Tuple[int, int]


class TrackedTextwrap:
    """
    Implements a text wrapper that tracks the position of each source
    character, and can correctly insert zero-width sequences at given
    offsets of the source text.
    Wrapping result should be the same as that from PSL textwrap.wrap
    with default settings except expand_tabs=False.
    """

    def __init__(self, text: str, width: int):
        self._original = text

        # Do the job of replace_whitespace first so that we can easily
        # match text to wrapped lines later. Note that this operation
        # does not change text length or offsets.
        whitespace = "\t\n\v\f\r "
        whitespace_trans = str.maketrans(whitespace, " " * len(whitespace))
        text = text.translate(whitespace_trans)

        self._lines = textwrap.wrap(
            text, width, expand_tabs=False, replace_whitespace=False
        )

        # self._coords track the (row, column) coordinate of each source
        # character in the result text. It is indexed by offset in
        # source text.
        self._coords = []  # type: List[CoordinateType]
        offset = 0
        try:
            if not self._lines:
                # Source text only has whitespaces. We add an empty line
                # in order to produce meaningful coordinates.
                self._lines = [""]
            for row, line in enumerate(self._lines):
                assert text[offset : offset + len(line)] == line
                col = 0
                for _ in line:
                    self._coords.append((row, col))
                    offset += 1
                    col += 1
                # All subsequent dropped whitespaces map to the last, imaginary column
                # (the EOL character if you wish) of the current line.
                while offset < len(text) and text[offset] == " ":
                    self._coords.append((row, col))
                    offset += 1
            # One past the final character (think of it as EOF) should
            # be treated as a valid offset.
            self._coords.append((row, col))
        except AssertionError:
            raise RuntimeError(
                "TrackedTextwrap: the impossible happened at offset {} of text {!r}".format(
                    offset, self._original
                )
            )

    # seq should be a zero-width sequence, e.g., an ANSI escape sequence.
    # May raise IndexError if offset is out of bounds.
    def insert_zero_width_sequence(self, seq: str, offset: int) -> None:
        row, col = self._coords[offset]
        line = self._lines[row]
        self._lines[row] = line[:col] + seq + line[col:]

        # Shift coordinates of all characters after the given character
        # on the same line.
        shift = len(seq)
        offset += 1
        while offset < len(self._coords) and self._coords[offset][0] == row:
            _, col = self._coords[offset]
            self._coords[offset] = (row, col + shift)
            offset += 1

    @property
    def original(self) -> str:
        return self._original

    @property
    def lines(self) -> List[str]:
        return self._lines

    @property
    def wrapped(self) -> str:
        return "\n".join(self._lines)

    # May raise IndexError if offset is out of bounds.
    def get_coordinate(self, offset: int) -> CoordinateType:
        return self._coords[offset]


### begin dim (DOM implementation with CSS support) ###
### https://github.com/zmwangx/dim/blob/master/dim.py ###

@@ -162,20 +268,6 @@ from collections import OrderedDict
from enum import Enum
from html.parser import HTMLParser

from typing import (
    Any,
    Dict,
    Generator,
    Iterable,
    Iterator,
    List,
    Match,
    Optional,
    Tuple,
    Union,
    cast,
)


SelectorGroupLike = Union[str, "SelectorGroup", "Selector"]

@@ -2284,27 +2376,23 @@ class Result(object):
            else:
                print(' ' * (indent + 5) + metadata)

        fillwidth = (columns - (indent + 6)) if columns > indent + 6 else len(abstract)
        wrapped_abstract = TrackedTextwrap(abstract, fillwidth)
        if colors and not self.nohl:
            # Start from the last match, as inserting the bold characters changes the offsets.
            for match in reversed(matches or []):
                abstract = (
                    abstract[: match['offset']]
                    + '\033[1m'
                    + match['phrase']
                    + '\033[0m'
                    + abstract[match['offset'] + len(match['phrase']) :]
                )
            # Highlight matches.
            for match in matches or []:
                offset = match['offset']
                span = len(match['phrase'])
                wrapped_abstract.insert_zero_width_sequence('\x1b[1m', offset)
                wrapped_abstract.insert_zero_width_sequence('\x1b[0m', offset + span)

        if colors:
            print(colors.abstract, end='')
        if columns > indent + 6:
            # Try to fill to columns
            fillwidth = columns - (indent + 6)
            for line in textwrap.wrap(abstract.replace('\n', ''), width=fillwidth):
                print('%s%s' % (' ' * (indent + 5), line))
            print('')
        else:
            print('%s%s\n' % (' ' * (indent + 5), abstract.replace('\n', ' ')))
        for line in wrapped_abstract.lines:
            print('%s%s' % (' ' * (indent + 5), line))
        if colors:
            print(colors.reset, end='')
        print('')

    def print(self):
        """Print the result entry."""