-
-
Notifications
You must be signed in to change notification settings - Fork 31.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns. #4471
bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns. #4471
Conversation
…rns. Fixed searching patterns that could match an empty string.
Looks like the tests are failing for some reason. |
Ha ha ha ha!!! Bless you--you are my hero for today! :-) |
Doc/whatsnew/3.7.rst
Outdated
@@ -768,6 +772,22 @@ Changes in the Python API | |||
avoid a warning escape them with a backslash. | |||
(Contributed by Serhiy Storchaka in :issue:`30349`.) | |||
|
|||
* The result of splitting a string on a :mod:`regular expression <re>` | |||
that could match an empty string like has been changed. For example |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop like
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Doc/library/re.rst
Outdated
>>> re.split(r'\b', 'Words, words, words.') | ||
['', 'Words', ', ', 'words', ', ', 'words', '.'] | ||
>>> re.split(r'\W*', 'Words, words, words.') | ||
['', 'W', 'o', 'r', 'd', 's', 'w', 'o', 'r', 'd', 's', 'w', 'o', 'r', 'd', 's', ''] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An example with a shorter result may be easier to understand:
>>> re.split(r'\W*', 'Wo, rd.')
['', 'W', 'o', 'r', 'd', '']
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to use an example similar to the above examples (but with *
instead of +
) for taking the example of the incorrect usage which "worked" in previous versions due to a bug. I'll simplify the example.
Lib/test/test_re.py
Outdated
self.assertEqual(re.split(r"\b", "a::bc"), ['', 'a', '::', 'bc', '']) | ||
self.assertEqual(re.split(r"\b|:+", "a::bc"), ['', 'a', '', 'bc', '']) | ||
self.assertEqual(re.sub(r"\b", "-", "a::bc"), '-a-::-bc-') | ||
self.assertEqual(re.sub(r"\b|:+", "-", "a::bc"), '-a--bc-') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps use a callback to verify that the second match is empty, not the third.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll use the replacement template that includes the matched string.
def test_zerowidth(self): | ||
# Issues 852532, 1647489, 3262, 25054. | ||
self.assertEqual(re.split(r"\b", "a::bc"), ['', 'a', '::', 'bc', '']) | ||
self.assertEqual(re.split(r"\b|:+", "a::bc"), ['', 'a', '', 'bc', '']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps break this down so I can infer what is going on here :)
re.split(r"\b|:", "a:") # How many matches after "a"?
re.split(r"\b|:", ":b") # Is there an empty match before "b"?
re.split(r":??", ":") # Does it match the colon?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\b
matches too much. I'll add separate tests for beginning and ending of words. They are less ambiguous.
But the main purpose of this test is testing that the new behavior differs from the old one. In older Python (2.7 and 3.4) re.split(r"\b|:+", "a::bc")
returns ['a:', 'bc']
that doesn't look sane.
Doc/whatsnew/3.7.rst
Outdated
@@ -364,6 +364,10 @@ The flags :const:`re.ASCII`, :const:`re.LOCALE` and :const:`re.UNICODE` | |||
can be set within the scope of a group. | |||
(Contributed by Serhiy Storchaka in :issue:`31690`.) | |||
|
|||
:func:`re.split` now supports splitting on a pattern that matches an empty | |||
string like ``r'\b'``, ``'^$'`` or ``(?=-)``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a pattern like . . . that matches an empty string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Doc/whatsnew/3.7.rst
Outdated
non-empty strings also can be changed. For example ``r'(?m)^\s*?$'`` | ||
will match in string ``'a\n\n'`` not only empty strings at positions | ||
2 and 3, but also the string ``'\n'`` at positions 2--3. For matching | ||
only blank lines the pattern should be rewritten as ``r'(?m)^[\S\n]*?$'``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This rewritten expression seems wrong. Won’t the \S (uppercase S) match the non-whitespace a at the start of the string, which is not a blank line? The first expression (which you say is wrong) seemed the most obvious; failing that perhaps r"(?m)^[^\S\n]*$". I.e. complement the character set, and no need for the non-greedy *? repetition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Yes, the negation was missed here. The actual pattern in doctest.py
contains it.
There is no difference between greedy and non-greedy repetitions here, but I'll change the pattern to use the greedy repetition as you have suggested.
Doc/whatsnew/3.7.rst
Outdated
The result of repeated searching patterns that could match empty and | ||
non-empty strings also can be changed. For example ``r'(?m)^\s*?$'`` | ||
will match in string ``'a\n\n'`` not only empty strings at positions | ||
2 and 3, but also the string ``'\n'`` at positions 2--3. For matching |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO it would be better to clearly document the new behaviour, at least on the module page. But without knowing the exact new behaviour, I suggest to reword this paragraph something like this:
“For patterns that match both empty and non-empty strings, the result of searching for all matches may also be changed in other cases. For example in the string 'a\n\n', the pattern r'(?m)^\s*?$' will not only match empty strings at positions 2 and 3, but also the string '\n' at positions 2–3. To match only blank lines, the pattern should be rewritten as . . .”
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many thanks!
Also fixed searching patterns that could match an empty string.
This will fix bpo-852532, bpo-1647489, bpo-3262, bpo-25054, and maybe others.
https://bugs.python.org/issue25054