bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns. #4471

serhiy-storchaka · 2017-11-19T23:36:57Z

Also fixed searching patterns that could match an empty string.

This will fix bpo-852532, bpo-1647489, bpo-3262, bpo-25054, and maybe others.

https://bugs.python.org/issue25054
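
A minimal sketch of the kind of call this change enables, assuming Python 3.7 or later (the release this PR targets). The sample strings are made up for illustration.

```python
import re

# Zero-width lookahead: split *before* each hyphen without consuming it.
parts = re.split(r"(?=-)", "a-b-c")
print(parts)  # ['a', '-b', '-c']

# Word boundary: every transition between word and non-word characters splits.
words = re.split(r"\b", "Words, words, words.")
print(words)  # ['', 'Words', ', ', 'words', ', ', 'words', '.']
```

Both patterns only ever match the empty string, so nothing is removed from the input; joining the pieces gives back the original text.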

AraHaan · 2017-11-20T00:17:50Z

Looks like the tests are failing for some reason.

michaelkarlcoleman · 2017-11-20T02:55:53Z

Ha ha ha ha!!! Bless you--you are my hero for today! :-)

vadmium · 2017-12-02T07:30:19Z

Doc/whatsnew/3.7.rst

@@ -768,6 +772,22 @@ Changes in the Python API
  avoid a warning escape them with a backslash.
  (Contributed by Serhiy Storchaka in :issue:`30349`.)

+* The result of splitting a string on a :mod:`regular expression <re>`
+  that could match an empty string like has been changed.  For example


Drop "like".

vadmium · 2017-12-02T08:12:46Z

Doc/library/re.rst

+      >>> re.split(r'\b', 'Words, words, words.')
+      ['', 'Words', ', ', 'words', ', ', 'words', '.']
+      >>> re.split(r'\W*', 'Words, words, words.')
+      ['', 'W', 'o', 'r', 'd', 's', 'w', 'o', 'r', 'd', 's', 'w', 'o', 'r', 'd', 's', '']


An example with a shorter result may be easier to understand:

>>> re.split(r'\W*', 'Wo, rd.')
['', 'W', 'o', 'r', 'd', '']

I wanted an example similar to the ones above (but with * instead of +) to illustrate an incorrect usage that "worked" in previous versions only because of a bug. I'll simplify the example.
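
To make the `+` versus `*` contrast concrete, here is a small sketch assuming Python 3.7 or later. The exact placement of empty strings in the `\W*` result depends on the interpreter revision, so only the non-empty pieces are compared.

```python
import re

text = "Wo, rd."

# Intended usage: split on runs of one or more non-word characters.
plus = re.split(r"\W+", text)
print(plus)  # ['Wo', 'rd', '']

# Likely-mistaken usage: \W* can also match the empty string, so with
# zero-width splitting the text is cut between individual letters too.
star = re.split(r"\W*", text)
print(star)

# Ignoring empty pieces, only single characters remain.
print([piece for piece in star if piece])  # ['W', 'o', 'r', 'd']
```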

vadmium · 2017-12-02T08:39:12Z

Lib/test/test_re.py

+        self.assertEqual(re.split(r"\b", "a::bc"), ['', 'a', '::', 'bc', ''])
+        self.assertEqual(re.split(r"\b|:+", "a::bc"), ['', 'a', '', 'bc', ''])
+        self.assertEqual(re.sub(r"\b", "-", "a::bc"), '-a-::-bc-')
+        self.assertEqual(re.sub(r"\b|:+", "-", "a::bc"), '-a--bc-')


Perhaps use a callback to verify that the second match is empty, not the third.

I'll use the replacement template that includes the matched string.
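
A sketch of what such a check could look like, assuming Python 3.7 or later. The capturing group and the bracket format are my own choices for illustration, not necessarily what ends up in the test. The diff hunk above shows the revision under review, so the number of matches there may differ from a released interpreter.

```python
import re

pattern = r"(\b|:+)"
text = "a::bc"

# Template form: \1 is the text of each match, so empty matches show up as [].
templated = re.sub(pattern, r"[\1]", text)

# Equivalent callback form; the function receives each match object in order.
def bracket(match):
    return "[" + match.group() + "]"

via_callback = re.sub(pattern, bracket, text)

print(templated)
print(templated == via_callback)  # True
```

Because each replacement carries the matched text, the output shows directly which match was empty and which one consumed the colons.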

vadmium · 2017-12-02T09:25:17Z

Lib/test/test_re.py

+    def test_zerowidth(self):
+        # Issues 852532, 1647489, 3262, 25054.
+        self.assertEqual(re.split(r"\b", "a::bc"), ['', 'a', '::', 'bc', ''])
+        self.assertEqual(re.split(r"\b|:+", "a::bc"), ['', 'a', '', 'bc', ''])


Perhaps break this down so I can infer what is going on here :)

re.split(r"\b|:", "a:")  # How many matches after "a"?
re.split(r"\b|:", ":b")  # Is there an empty match before "b"?
re.split(r":??", ":")    # Does it match the colon?

\b matches too much. I'll add separate tests for the beginning and the end of words. They are less ambiguous.

But the main purpose of this test is to check that the new behavior differs from the old one. In older Python (2.7 and 3.4) re.split(r"\b|:+", "a::bc") returns ['a:', 'bc'], which doesn't look sane.
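
One way to express "beginning of a word" and "end of a word" separately is with lookarounds. The two patterns below are my own illustration and are not necessarily the ones added to the test suite; the outputs assume Python 3.7 or later.

```python
import re

WORD_START = r"(?<!\w)(?=\w)"  # zero-width: no word char before, one after
WORD_END = r"(?<=\w)(?!\w)"    # zero-width: word char before, none after

starts = re.split(WORD_START, "a::bc")
ends = re.split(WORD_END, "a::bc")

print(starts)  # ['', 'a::', 'bc']
print(ends)    # ['a', '::bc', '']
```

Each pattern matches at only two positions in "a::bc", which makes it easier to see which empty match produced which piece than with \b.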

vadmium · 2017-12-02T10:54:30Z

Doc/whatsnew/3.7.rst

@@ -364,6 +364,10 @@ The flags :const:`re.ASCII`, :const:`re.LOCALE` and :const:`re.UNICODE`
 can be set within the scope of a group.
 (Contributed by Serhiy Storchaka in :issue:`31690`.)

+:func:`re.split` now supports splitting on a pattern that matches an empty
+string like ``r'\b'``, ``'^$'`` or ``(?=-)``.


a pattern like . . . that matches an empty string.

vadmium · 2017-12-02T11:24:41Z

Doc/whatsnew/3.7.rst

+  non-empty strings also can be changed.  For example ``r'(?m)^\s*?$'``
+  will match in string ``'a\n\n'`` not only empty strings at positions
+  2 and 3, but also the string ``'\n'`` at positions 2--3.  For matching
+  only blank lines the pattern should be rewritten as ``r'(?m)^[\S\n]*?$'``.


This rewritten expression seems wrong. Won’t the \S (uppercase S) match the non-whitespace a at the start of the string, which is not a blank line? The first expression (which you say is wrong) seemed the most obvious; failing that perhaps r"(?m)^[^\S\n]*$". I.e. complement the character set, and no need for the non-greedy *? repetition.

Good catch! Yes, the negation was missing here. The actual pattern in doctest.py contains it.

There is no difference between greedy and non-greedy repetitions here, but I'll change the pattern to use the greedy repetition as you have suggested.
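
The three patterns discussed in this thread can be compared side by side. This is a sketch assuming Python 3.7 or later; the `spans` helper is hypothetical and exists only for this example.

```python
import re

text = "a\n\n"

def spans(pattern):
    """Return the (start, end) span of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]

# \s also matches "\n", so a non-empty match is found besides the empty ones.
print(spans(r"(?m)^\s*?$"))         # [(2, 2), (2, 3), (3, 3)]

# Excluding "\n" from the whitespace class leaves only the empty matches.
print(spans(r"(?m)^[^\S\n]*$"))     # [(2, 2), (3, 3)]

# The variant without the negation matches the non-blank line "a".
print(spans(r"(?m)^[\S\n]*?$")[0])  # (0, 1)
```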

vadmium · 2017-12-02T13:55:37Z

Doc/whatsnew/3.7.rst

+  The result of repeated searching patterns that could match empty and
+  non-empty strings also can be changed.  For example ``r'(?m)^\s*?$'``
+  will match in string ``'a\n\n'`` not only empty strings at positions
+  2 and 3, but also the string ``'\n'`` at positions 2--3.  For matching


IMO it would be better to clearly document the new behaviour, at least on the module page. But without knowing the exact new behaviour, I suggest rewording this paragraph to something like this:

“For patterns that match both empty and non-empty strings, the result of searching for all matches may also be changed in other cases. For example in the string 'a\n\n', the pattern r'(?m)^\s*?$' will not only match empty strings at positions 2 and 3, but also the string '\n' at positions 2–3. To match only blank lines, the pattern should be rewritten as . . .”

Many thanks!
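
For a pattern that matches both empty and non-empty strings, a short sketch of the search behaviour on Python 3.7 or later, where an empty match is accepted directly after a non-empty one. The input string is arbitrary.

```python
import re

# x* matches both the empty string and runs of "x".
found = re.findall(r"x*", "abxd")
replaced = re.sub(r"x*", "-", "abxd")

print(found)     # ['', '', 'x', '', '']
print(replaced)  # -a-b--d-
```

The doubled hyphen in the second result comes from the non-empty match of "x" followed by the empty match at the position right after it.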

bpo-25054, bpo-1647489: Added support of splitting on zerowidth patte…

09ac71a

…rns. Fixed searching patterns that could match an empty string.

serhiy-storchaka added the type-bug and type-feature labels Nov 19, 2017

the-knights-who-say-ni added the CLA signed label Nov 19, 2017

bedevere-bot added the awaiting merge label Nov 19, 2017

Fix doctests.

a445ea3

serhiy-storchaka added 5 commits November 20, 2017 12:31

Document the change in repeated searching of may-be-empty patterns.

33e8373

Use more efficient pattern in doctest.

f162bb3

Merge branch 'master' into re-search-zerowidth

1ad1e67

Merge branch 'master' into re-search-zerowidth

599d54f

Merge branch 'master' into re-search-zerowidth

37055b2

vadmium reviewed Dec 2, 2017

Address review comments.

00cf4b8

serhiy-storchaka mentioned this pull request Dec 2, 2017

bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns (alternate version). #4678

Closed

serhiy-storchaka added 2 commits December 2, 2017 19:54

Update the documentation of findall and finditer.

9c2eb1f

Restore the pre-bpo-732120 wording.

8dda513

serhiy-storchaka merged commit 70d56fb into python:master Dec 4, 2017

bedevere-bot removed the awaiting merge label Dec 4, 2017

serhiy-storchaka deleted the re-search-zerowidth branch December 4, 2017 12:29

back-to mentioned this pull request Jul 26, 2018

URL builder utils streamlink/streamlink#1675

Merged
