bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns (alternate version). #4678

Closed
52 changes: 21 additions & 31 deletions Doc/library/re.rst
@@ -708,55 +708,45 @@ form.
That way, separator components are always found at the same relative
indices within the result list.

.. note::

:func:`split` doesn't currently split a string on an empty pattern match.
For example::

>>> re.split('x*', 'axbc')
['a', 'bc']
The pattern can match empty strings. ::

Even though ``'x*'`` also matches 0 'x' before 'a', between 'b' and 'c',
and after 'c', currently these matches are ignored. The correct behavior
(i.e. splitting on empty matches too and returning ``['', 'a', 'b', 'c',
'']``) will be implemented in future versions of Python, but since this
is a backward incompatible change, a :exc:`FutureWarning` will be raised
in the meanwhile.

Patterns that can only match empty strings currently never split the
string. Since this doesn't match the expected behavior, a
:exc:`ValueError` will be raised starting from Python 3.5::

>>> re.split("^$", "foo\n\nbar\n", flags=re.M)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
ValueError: split() requires a non-empty pattern match.
>>> re.split(r'\b', 'Words, words, words.')
['', 'Words', ', ', 'words', ', ', 'words', '.']
>>> re.split(r'(\W*)', '...words...')
['', '...', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '']

.. versionchanged:: 3.1
Added the optional flags argument.

.. versionchanged:: 3.5
Splitting on a pattern that could match an empty string now raises
a warning. Patterns that can only match empty strings are now rejected.
.. versionchanged:: 3.7
Added support of splitting on a pattern that could match an empty string.


.. function:: findall(pattern, string, flags=0)

Return all non-overlapping matches of *pattern* in *string*, as a list of
strings. The *string* is scanned left-to-right, and matches are returned in
the order found. If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern has more than
one group. Empty matches are included in the result unless they touch the
beginning of another match.
one group. Empty matches are included in the result only when not adjacent
to a previous match.

.. versionchanged:: 3.7
Non-empty matches can now start just after a previous empty match. Empty
matches adjacent to a previous match are no longer included in the result.


.. function:: finditer(pattern, string, flags=0)

Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
all non-overlapping matches for the RE *pattern* in *string*. The *string*
is scanned left-to-right, and matches are returned in the order found. Empty
matches are included in the result unless they touch the beginning of another
match.
is scanned left-to-right, and matches are returned in the order found.
Empty matches are included in the result only when not adjacent
to a previous match.

.. versionchanged:: 3.7
Non-empty matches can now start just after a previous empty match. Empty
matches adjacent to a previous match are no longer included in the result.


.. function:: sub(pattern, repl, string, count=0, flags=0)
23 changes: 23 additions & 0 deletions Doc/whatsnew/3.7.rst
@@ -364,6 +364,10 @@ The flags :const:`re.ASCII`, :const:`re.LOCALE` and :const:`re.UNICODE`
can be set within the scope of a group.
(Contributed by Serhiy Storchaka in :issue:`31690`.)

:func:`re.split` now supports splitting on a pattern like ``r'\b'``,
``'^$'`` or ``(?=-)`` that matches an empty string.
(Contributed by Serhiy Storchaka in :issue:`25054`.)

string
------

@@ -768,6 +772,25 @@ Changes in the Python API
avoid a warning escape them with a backslash.
(Contributed by Serhiy Storchaka in :issue:`30349`.)

* The result of splitting a string on a :mod:`regular expression <re>`
that could match an empty string has been changed. For example,
splitting on ``r'\s*'`` will now split not only on whitespace as it
did previously, but also between any pair of non-whitespace
characters. The previous behavior can be restored by changing the pattern
to ``r'\s+'``. A :exc:`FutureWarning` has been emitted for such patterns since
Python 3.5.

For patterns that match both empty and non-empty strings, the result of
searching for all matches may also change in other cases. Non-empty
matches can start just after the previous empty match, but empty matches
cannot be found just after the end of the previous match. For example,
in the string ``'a\n\n'``, the pattern ``r'(?m)^\s*?$'`` will match the
empty string at position 2 and the string ``'\n'`` at positions 2--3
instead of two empty strings at positions 2 and 3. To match only blank
lines, the pattern should be rewritten as ``r'(?m)^[^\S\n]*$'``.

(Contributed by Serhiy Storchaka in :issue:`25054`.)

* :class:`tracemalloc.Traceback` frames are now sorted from oldest to most
recent to be more consistent with :mod:`traceback`.
(Contributed by Jesse Bakker in :issue:`32121`.)
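
The ``r'\s*'`` versus ``r'\s+'`` note in the hunk above can be checked with a short sketch. It assumes an interpreter that already splits on empty matches (Python 3.7 or later). The sample string and variable names are illustrative only. The exact placement of empty strings in the second result depends on the empty-match rules of the interpreter, so the sketch only checks that ``'ab'`` is broken apart.

```python
import re

text = "ab c"  # illustrative input, not taken from the PR

# A pattern that cannot match an empty string splits on whitespace runs only.
plus_parts = re.split(r"\s+", text)

# A pattern that can match an empty string also splits between the adjacent
# non-whitespace characters 'a' and 'b' (assumes Python 3.7+).
star_parts = re.split(r"\s*", text)

print(plus_parts)  # ['ab', 'c']
print(star_parts)  # 'a' and 'b' appear as separate items
```

Switching the pattern to ``r'\s+'`` is the migration path the entry recommends for code that relied on the old result.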
2 changes: 1 addition & 1 deletion Lib/doctest.py
@@ -1611,7 +1611,7 @@ def check_output(self, want, got, optionflags):
'', want)
# If a line in got contains only spaces, then remove the
# spaces.
got = re.sub(r'(?m)^\s*?$', '', got)
got = re.sub(r'(?m)^[^\S\n]+$', '', got)
if got == want:
return True
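
A minimal sketch of what the replacement pattern in this hunk does, using a made-up ``got`` string: ``[^\S\n]+`` matches one or more whitespace characters other than a newline, so the pattern can never match an empty string and does not depend on how empty matches are handled.

```python
import re

# Illustrative doctest output with one line of spaces and one line holding a tab.
got = "first\n   \n\t\nlast\n"

# Blank out lines that contain only non-newline whitespace.
cleaned = re.sub(r"(?m)^[^\S\n]+$", "", got)

print(repr(cleaned))  # 'first\n\n\nlast\n'
```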

5 changes: 1 addition & 4 deletions Lib/pprint.py
@@ -260,10 +260,7 @@ def _pprint_str(self, object, stream, indent, allowance, context, level):
chunks.append(rep)
else:
# A list of alternating (non-space, space) strings
parts = re.findall(r'\S*\s*', line)
assert parts
assert not parts[-1]
parts.pop() # drop empty last part
parts = re.findall(r'(?:\S+|\s)\s*', line)
max_width2 = max_width
current = ''
for j, part in enumerate(parts):
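
The new ``findall`` pattern in this hunk never matches an empty string, which is presumably why the ``parts.pop()`` step could be dropped. A small sketch with an illustrative ``line`` value shows the chunks it produces:

```python
import re

line = "alpha  beta gamma"  # illustrative input, not taken from the PR

# Each part is a run of non-space characters (or a single leading whitespace
# character) followed by any trailing whitespace.
parts = re.findall(r"(?:\S+|\s)\s*", line)

print(parts)  # ['alpha  ', 'beta ', 'gamma']
```

Joining the parts reproduces the original line, and no trailing empty part is produced.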
45 changes: 31 additions & 14 deletions Lib/test/test_re.py
@@ -331,21 +331,21 @@ def test_re_split(self):
['', 'a', '', '', 'c'])

for sep, expected in [
(':*', ['', 'a', 'b', 'c']),
('(?::*)', ['', 'a', 'b', 'c']),
('(:*)', ['', ':', 'a', ':', 'b', '::', 'c']),
('(:)*', ['', ':', 'a', ':', 'b', ':', 'c']),
(':*', ['', 'a', 'b', 'c', '']),
('(?::*)', ['', 'a', 'b', 'c', '']),
('(:*)', ['', ':', 'a', ':', 'b', '::', 'c', '', '']),
('(:)*', ['', ':', 'a', ':', 'b', ':', 'c', None, '']),
]:
with self.subTest(sep=sep), self.assertWarns(FutureWarning):
with self.subTest(sep=sep):
self.assertTypedEqual(re.split(sep, ':a:b::c'), expected)

for sep, expected in [
('', [':a:b::c']),
(r'\b', [':a:b::c']),
(r'(?=:)', [':a:b::c']),
(r'(?<=:)', [':a:b::c']),
('', ['', ':', 'a', ':', 'b', ':', ':', 'c', '']),
(r'\b', [':', 'a', ':', 'b', '::', 'c', '']),
(r'(?=:)', ['', ':a', ':b', ':', ':c']),
(r'(?<=:)', [':', 'a:', 'b:', ':', 'c']),
]:
with self.subTest(sep=sep), self.assertRaises(ValueError):
with self.subTest(sep=sep):
self.assertTypedEqual(re.split(sep, ':a:b::c'), expected)

def test_qualified_re_split(self):
@@ -356,9 +356,8 @@ def test_qualified_re_split(self):
['', ':', 'a', ':', 'b::c'])
self.assertEqual(re.split("(:+)", ":a:b::c", maxsplit=2),
['', ':', 'a', ':', 'b::c'])
with self.assertWarns(FutureWarning):
self.assertEqual(re.split("(:*)", ":a:b::c", maxsplit=2),
['', ':', 'a', ':', 'b::c'])
self.assertEqual(re.split("(:*)", ":a:b::c", maxsplit=2),
['', ':', 'a', ':', 'b::c'])

def test_re_findall(self):
self.assertEqual(re.findall(":+", "abc"), [])
@@ -1333,7 +1332,6 @@ def test_bug_581080(self):
def test_bug_817234(self):
iter = re.finditer(r".*", "asdf")
self.assertEqual(next(iter).span(), (0, 4))
self.assertEqual(next(iter).span(), (4, 4))
self.assertRaises(StopIteration, next, iter)

def test_bug_6561(self):
@@ -1751,6 +1749,25 @@ def test_match_repr(self):
"span=(3, 5), match='bb'>" %
(type(second).__module__, type(second).__qualname__))

def test_zerowidth(self):
# Issues 852532, 1647489, 3262, 25054.
self.assertEqual(re.split(r"\b", "a::bc"), ['', 'a', '::', 'bc', ''])
self.assertEqual(re.split(r"\b|:+", "a::bc"), ['', 'a', '', 'bc', ''])
self.assertEqual(re.split(r"(?<!\w)(?=\w)|:+", "a::bc"), ['', 'a', 'bc'])
self.assertEqual(re.split(r"(?<=\w)(?!\w)|:+", "a::bc"), ['a', '', 'bc', ''])

self.assertEqual(re.sub(r"\b", "-", "a::bc"), '-a-::-bc-')
self.assertEqual(re.sub(r"\b|:+", "-", "a::bc"), '-a--bc-')
self.assertEqual(re.sub(r"(\b|:+)", r"[\1]", "a::bc"), '[]a[][::]bc[]')

self.assertEqual(re.findall(r"\b|:+", "a::bc"), ['', '', '::', ''])
self.assertEqual(re.findall(r"\b|\w+", "a::bc"),
['', 'a', '', 'bc'])

self.assertEqual([m.span() for m in re.finditer(r"\b|:+", "a::bc")],
[(0, 0), (1, 1), (1, 3), (5, 5)])
self.assertEqual([m.span() for m in re.finditer(r"\b|\w+", "a::bc")],
[(0, 0), (0, 1), (3, 3), (3, 5)])

def test_bug_2537(self):
# issue 2537: empty submatches
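
The purely zero-width cases from ``test_zerowidth`` can be tried outside the test suite with a sketch like the one below. It assumes Python 3.7 or later, and it covers only the ``r"\b"`` pattern. The mixed patterns such as ``r"\b|:+"`` are left out because their results depend on the empty-match rules of the interpreter in use.

```python
import re

s = "a::bc"  # same sample string as the tests above

# Word boundaries in "a::bc" fall at offsets 0, 1, 3 and 5.
spans = [m.span() for m in re.finditer(r"\b", s)]

# Splitting on a zero-width pattern keeps every character of the input.
pieces = re.split(r"\b", s)

# Substituting at each boundary inserts the replacement without consuming text.
marked = re.sub(r"\b", "-", s)

print(spans)   # [(0, 0), (1, 1), (3, 3), (5, 5)]
print(pieces)  # ['', 'a', '::', 'bc', '']
print(marked)  # -a-::-bc-
```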
@@ -0,0 +1 @@
Added support of splitting on a pattern that could match an empty string.
@@ -0,0 +1,3 @@
Fixed searching for regular expression patterns that could match an empty
string. Non-empty strings can now be correctly found after matching an empty
string.
77 changes: 22 additions & 55 deletions Modules/_sre.c
@@ -446,6 +446,8 @@ state_init(SRE_STATE* state, PatternObject* pattern, PyObject* string,

state->isbytes = isbytes;
state->charsize = charsize;
state->match_all = 0;
state->must_advance = 0;

state->beginning = ptr;

@@ -559,14 +561,14 @@ pattern_dealloc(PatternObject* self)
}

LOCAL(Py_ssize_t)
sre_match(SRE_STATE* state, SRE_CODE* pattern, int match_all)
sre_match(SRE_STATE* state, SRE_CODE* pattern)
{
if (state->charsize == 1)
return sre_ucs1_match(state, pattern, match_all);
return sre_ucs1_match(state, pattern, 1);
if (state->charsize == 2)
return sre_ucs2_match(state, pattern, match_all);
return sre_ucs2_match(state, pattern, 1);
assert(state->charsize == 4);
return sre_ucs4_match(state, pattern, match_all);
return sre_ucs4_match(state, pattern, 1);
}

LOCAL(Py_ssize_t)
@@ -606,7 +608,7 @@ _sre_SRE_Pattern_match_impl(PatternObject *self, PyObject *string,

TRACE(("|%p|%p|MATCH\n", PatternObject_GetCode(self), state.ptr));

status = sre_match(&state, PatternObject_GetCode(self), 0);
status = sre_match(&state, PatternObject_GetCode(self));

TRACE(("|%p|%p|END\n", PatternObject_GetCode(self), state.ptr));
if (PyErr_Occurred()) {
@@ -645,7 +647,8 @@ _sre_SRE_Pattern_fullmatch_impl(PatternObject *self, PyObject *string,

TRACE(("|%p|%p|FULLMATCH\n", PatternObject_GetCode(self), state.ptr));

status = sre_match(&state, PatternObject_GetCode(self), 1);
state.match_all = 1;
status = sre_match(&state, PatternObject_GetCode(self));

TRACE(("|%p|%p|END\n", PatternObject_GetCode(self), state.ptr));
if (PyErr_Occurred()) {
@@ -808,11 +811,8 @@ _sre_SRE_Pattern_findall_impl(PatternObject *self, PyObject *string,
if (status < 0)
goto error;

if (state.ptr == state.start)
state.start = (void*) ((char*) state.ptr + state.charsize);
else
state.start = state.ptr;

state.must_advance = 1;
state.start = state.ptr;
}

state_fini(&state);
@@ -901,17 +901,6 @@ _sre_SRE_Pattern_split_impl(PatternObject *self, PyObject *string,
void* last;

assert(self->codesize != 0);
if (self->code[0] != SRE_OP_INFO || self->code[3] == 0) {
if (self->code[0] == SRE_OP_INFO && self->code[4] == 0) {
PyErr_SetString(PyExc_ValueError,
"split() requires a non-empty pattern match.");
return NULL;
}
if (PyErr_WarnEx(PyExc_FutureWarning,
"split() requires a non-empty pattern match.",
1) < 0)
return NULL;
}

if (!state_init(&state, self, string, 0, PY_SSIZE_T_MAX))
return NULL;
@@ -942,14 +931,6 @@ _sre_SRE_Pattern_split_impl(PatternObject *self, PyObject *string,
goto error;
}

if (state.start == state.ptr) {
if (last == state.end || state.ptr == state.end)
break;
/* skip one character */
state.start = (void*) ((char*) state.ptr + state.charsize);
continue;
}

/* get segment before this match */
item = getslice(state.isbytes, state.beginning,
string, STATE_OFFSET(&state, last),
@@ -974,7 +955,7 @@ _sre_SRE_Pattern_split_impl(PatternObject *self, PyObject *string,
}

n = n + 1;

state.must_advance = 1;
last = state.start = state.ptr;

}
@@ -1101,9 +1082,7 @@ pattern_subx(PatternObject* self, PyObject* ptemplate, PyObject* string,
if (status < 0)
goto error;

} else if (i == b && i == e && n > 0)
/* ignore empty match on latest position */
goto next;
}

if (filter_is_callable) {
/* pass match object through filter */
@@ -1130,16 +1109,8 @@ pattern_subx(PatternObject* self, PyObject* ptemplate, PyObject* string,

i = e;
n = n + 1;

next:
/* move on */
if (state.ptr == state.end)
break;
if (state.ptr == state.start)
state.start = (void*) ((char*) state.ptr + state.charsize);
else
state.start = state.ptr;

state.must_advance = 1;
state.start = state.ptr;
}

/* get segment following last match */
@@ -2450,7 +2421,7 @@ _sre_SRE_Scanner_match_impl(ScannerObject *self)

state->ptr = state->start;

status = sre_match(state, PatternObject_GetCode(self->pattern), 0);
status = sre_match(state, PatternObject_GetCode(self->pattern));
if (PyErr_Occurred())
return NULL;

@@ -2459,12 +2430,10 @@ _sre_SRE_Scanner_match_impl(ScannerObject *self)

if (status == 0)
state->start = NULL;
else if (state->ptr != state->start)
else {
state->must_advance = 1;
state->start = state->ptr;
else if (state->ptr != state->end)
state->start = (void*) ((char*) state->ptr + state->charsize);
else
state->start = NULL;
}

return match;
}
@@ -2499,12 +2468,10 @@ _sre_SRE_Scanner_search_impl(ScannerObject *self)

if (status == 0)
state->start = NULL;
else if (state->ptr != state->start)
else {
state->must_advance = 1;
state->start = state->ptr;
else if (state->ptr != state->end)
state->start = (void*) ((char*) state->ptr + state->charsize);
else
state->start = NULL;
}

return match;
}