bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns. #4471

Merged
merged 10 commits into from
Dec 4, 2017
46 changes: 16 additions & 30 deletions Doc/library/re.rst
@@ -708,55 +708,41 @@ form.
That way, separator components are always found at the same relative
indices within the result list.

-.. note::

-:func:`split` doesn't currently split a string on an empty pattern match.
-For example::

->>> re.split('x*', 'axbc')
-['a', 'bc']
+The pattern can match empty strings. ::

-Even though ``'x*'`` also matches 0 'x' before 'a', between 'b' and 'c',
-and after 'c', currently these matches are ignored. The correct behavior
-(i.e. splitting on empty matches too and returning ``['', 'a', 'b', 'c',
-'']``) will be implemented in future versions of Python, but since this
-is a backward incompatible change, a :exc:`FutureWarning` will be raised
-in the meanwhile.

-Patterns that can only match empty strings currently never split the
-string. Since this doesn't match the expected behavior, a
-:exc:`ValueError` will be raised starting from Python 3.5::

->>> re.split("^$", "foo\n\nbar\n", flags=re.M)
-Traceback (most recent call last):
-File "<stdin>", line 1, in <module>
-...
-ValueError: split() requires a non-empty pattern match.
+>>> re.split(r'\b', 'Words, words, words.')
+['', 'Words', ', ', 'words', ', ', 'words', '.']
+>>> re.split(r'(\W*)', '...words...')
+['', '...', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '']

.. versionchanged:: 3.1
Added the optional flags argument.

-.. versionchanged:: 3.5
-Splitting on a pattern that could match an empty string now raises
-a warning. Patterns that can only match empty strings are now rejected.
+.. versionchanged:: 3.7
+Added support of splitting on a pattern that could match an empty string.


.. function:: findall(pattern, string, flags=0)

Return all non-overlapping matches of *pattern* in *string*, as a list of
strings. The *string* is scanned left-to-right, and matches are returned in
the order found. If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern has more than
-one group. Empty matches are included in the result unless they touch the
-beginning of another match.
+one group. Empty matches are included in the result.
+
+.. versionchanged:: 3.7
+Non-empty matches can now start just after a previous empty match.


.. function:: finditer(pattern, string, flags=0)

Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
all non-overlapping matches for the RE *pattern* in *string*. The *string*
is scanned left-to-right, and matches are returned in the order found. Empty
-matches are included in the result unless they touch the beginning of another
-match.
+matches are included in the result.
+
+.. versionchanged:: 3.7
+Non-empty matches can now start just after a previous empty match.


.. function:: sub(pattern, repl, string, count=0, flags=0)
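
A short sketch of the behavior the documentation change above describes, assuming Python 3.7 or later. The split examples use patterns that can only match an empty string; the `findall` case is taken from the tests in this PR. The variable names are illustrative only.

```python
import re

# Patterns that can only match an empty string now split the input.
words = re.split(r'\b', 'Words, words, words.')
print(words)   # ['', 'Words', ', ', 'words', ', ', 'words', '.']

# A lookahead splits before each '-' without consuming it.
parts = re.split(r'(?=-)', 'a-b-c')
print(parts)   # ['a', '-b', '-c']

# findall(): a non-empty match may start right after an empty match.
found = re.findall(r'\b|\w+', 'a::bc')
print(found)   # ['', 'a', '', '', 'bc', '']
```

On Python 3.5 and 3.6 the two `re.split` calls raise `ValueError` instead, as the removed note describes.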
21 changes: 21 additions & 0 deletions Doc/whatsnew/3.7.rst
@@ -364,6 +364,10 @@ The flags :const:`re.ASCII`, :const:`re.LOCALE` and :const:`re.UNICODE`
can be set within the scope of a group.
(Contributed by Serhiy Storchaka in :issue:`31690`.)

+:func:`re.split` now supports splitting on a pattern like ``r'\b'``,
+``'^$'`` or ``(?=-)`` that matches an empty string.
+(Contributed by Serhiy Storchaka in :issue:`25054`.)

string
------

@@ -768,6 +772,23 @@ Changes in the Python API
avoid a warning escape them with a backslash.
(Contributed by Serhiy Storchaka in :issue:`30349`.)

+* The result of splitting a string on a :mod:`regular expression <re>`
+that could match an empty string has been changed. For example
+splitting on ``r'\s*'`` will now split not only on whitespaces as it
+did previously, but also between any pair of non-whitespace
+characters. The previous behavior can be restored by changing the pattern
+to ``r'\s+'``. A :exc:`FutureWarning` was emitted for such patterns since
+Python 3.5.
+
+For patterns that match both empty and non-empty strings, the result of
+searching for all matches may also be changed in other cases. For example
+in the string ``'a\n\n'``, the pattern ``r'(?m)^\s*?$'`` will not only
+match empty strings at positions 2 and 3, but also the string ``'\n'`` at
+positions 2--3. To match only blank lines, the pattern should be rewritten
+as ``r'(?m)^[^\S\n]*$'``.
+
+(Contributed by Serhiy Storchaka in :issue:`25054`.)

* :class:`tracemalloc.Traceback` frames are now sorted from oldest to most
recent to be more consistent with :mod:`traceback`.
(Contributed by Jesse Bakker in :issue:`32121`.)
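
To make the porting notes above concrete, here is a small sketch, assuming Python 3.7 or later. The sample string `'ab cd'` and the variable names are made up for illustration; the two patterns applied to `'a\n\n'` are the ones quoted in the whatsnew entry.

```python
import re

text = 'ab cd'

# r'\s+' needs at least one whitespace character, so it keeps the
# pre-3.7 result of splitting only on whitespace.
plus_parts = re.split(r'\s+', text)
print(plus_parts)    # ['ab', 'cd']

# r'\s*' can also match the empty string, so on 3.7+ it produces
# additional splits around the non-whitespace characters.
star_parts = re.split(r'\s*', text)
print(star_parts)

# Spans matched in 'a\n\n' by the lazy pattern and by the rewritten one.
lazy = [m.span() for m in re.finditer(r'(?m)^\s*?$', 'a\n\n')]
strict = [m.span() for m in re.finditer(r'(?m)^[^\S\n]*$', 'a\n\n')]
print(lazy)     # [(2, 2), (2, 3), (3, 3)]
print(strict)   # [(2, 2), (3, 3)]
```

The extra `(2, 3)` span in `lazy` is the `'\n'` match the entry warns about; the rewritten pattern excludes it because `[^\S\n]` cannot consume a newline.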
2 changes: 1 addition & 1 deletion Lib/doctest.py
@@ -1611,7 +1611,7 @@ def check_output(self, want, got, optionflags):
'', want)
# If a line in got contains only spaces, then remove the
# spaces.
-got = re.sub(r'(?m)^\s*?$', '', got)
+got = re.sub(r'(?m)^[^\S\n]+$', '', got)
if got == want:
return True
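
As a rough illustration of what the replacement pattern in `check_output` does, the sketch below applies it to a made-up `got` string; the sample text is an assumption, only the pattern comes from the diff.

```python
import re

got = 'first\n   \n\t\nlast\n'

# Strip the whitespace from lines that contain only spaces or tabs,
# leaving the line breaks themselves in place.
cleaned = re.sub(r'(?m)^[^\S\n]+$', '', got)
print(repr(cleaned))   # 'first\n\n\nlast\n'
```

Because `[^\S\n]+` requires at least one non-newline whitespace character, it never produces an empty match and never consumes a newline, so its result does not depend on the new empty-match rules.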

44 changes: 31 additions & 13 deletions Lib/test/test_re.py
@@ -331,21 +331,21 @@ def test_re_split(self):
['', 'a', '', '', 'c'])

for sep, expected in [
-(':*', ['', 'a', 'b', 'c']),
-('(?::*)', ['', 'a', 'b', 'c']),
-('(:*)', ['', ':', 'a', ':', 'b', '::', 'c']),
-('(:)*', ['', ':', 'a', ':', 'b', ':', 'c']),
+(':*', ['', 'a', 'b', 'c', '']),
+('(?::*)', ['', 'a', 'b', 'c', '']),
+('(:*)', ['', ':', 'a', ':', 'b', '::', 'c', '', '']),
+('(:)*', ['', ':', 'a', ':', 'b', ':', 'c', None, '']),
]:
-with self.subTest(sep=sep), self.assertWarns(FutureWarning):
+with self.subTest(sep=sep):
self.assertTypedEqual(re.split(sep, ':a:b::c'), expected)

for sep, expected in [
-('', [':a:b::c']),
-(r'\b', [':a:b::c']),
-(r'(?=:)', [':a:b::c']),
-(r'(?<=:)', [':a:b::c']),
+('', ['', ':', 'a', ':', 'b', ':', ':', 'c', '']),
+(r'\b', [':', 'a', ':', 'b', '::', 'c', '']),
+(r'(?=:)', ['', ':a', ':b', ':', ':c']),
+(r'(?<=:)', [':', 'a:', 'b:', ':', 'c']),
]:
-with self.subTest(sep=sep), self.assertRaises(ValueError):
+with self.subTest(sep=sep):
self.assertTypedEqual(re.split(sep, ':a:b::c'), expected)

def test_qualified_re_split(self):
@@ -356,9 +356,8 @@ def test_qualified_re_split(self):
['', ':', 'a', ':', 'b::c'])
self.assertEqual(re.split("(:+)", ":a:b::c", maxsplit=2),
['', ':', 'a', ':', 'b::c'])
-with self.assertWarns(FutureWarning):
-self.assertEqual(re.split("(:*)", ":a:b::c", maxsplit=2),
-['', ':', 'a', ':', 'b::c'])
+self.assertEqual(re.split("(:*)", ":a:b::c", maxsplit=2),
+['', ':', 'a', ':', 'b::c'])

def test_re_findall(self):
self.assertEqual(re.findall(":+", "abc"), [])
@@ -1751,6 +1750,25 @@ def test_match_repr(self):
"span=(3, 5), match='bb'>" %
(type(second).__module__, type(second).__qualname__))

+def test_zerowidth(self):
+# Issues 852532, 1647489, 3262, 25054.
+self.assertEqual(re.split(r"\b", "a::bc"), ['', 'a', '::', 'bc', ''])
+self.assertEqual(re.split(r"\b|:+", "a::bc"), ['', 'a', '', 'bc', ''])
Member

Perhaps break this down so I can infer what is going on here :)

re.split(r"\b|:", "a:")  # How many matches after "a"?
re.split(r"\b|:", ":b")  # Is there an empty match before "b"?
re.split(r":??", ":")  # Does it match the colon?

Member Author

\b matches too much. I'll add separate tests for the beginnings and ends of words; they are less ambiguous.

But the main purpose of this test is to check that the new behavior differs from the old one. In older Python versions (2.7 and 3.4), re.split(r"\b|:+", "a::bc") returns ['a:', 'bc'], which doesn't look sane.

+self.assertEqual(re.split(r"(?<!\w)(?=\w)|:+", "a::bc"), ['', 'a', 'bc'])
+self.assertEqual(re.split(r"(?<=\w)(?!\w)|:+", "a::bc"), ['a', '', 'bc', ''])
+
+self.assertEqual(re.sub(r"\b", "-", "a::bc"), '-a-::-bc-')
+self.assertEqual(re.sub(r"\b|:+", "-", "a::bc"), '-a--bc-')
+self.assertEqual(re.sub(r"(\b|:+)", r"[\1]", "a::bc"), '[]a[][::]bc[]')
+
+self.assertEqual(re.findall(r"\b|:+", "a::bc"), ['', '', '::', '', ''])
+self.assertEqual(re.findall(r"\b|\w+", "a::bc"),
+['', 'a', '', '', 'bc', ''])
+
+self.assertEqual([m.span() for m in re.finditer(r"\b|:+", "a::bc")],
+[(0, 0), (1, 1), (1, 3), (3, 3), (5, 5)])
+self.assertEqual([m.span() for m in re.finditer(r"\b|\w+", "a::bc")],
+[(0, 0), (0, 1), (1, 1), (3, 3), (3, 5), (5, 5)])

def test_bug_2537(self):
# issue 2537: empty submatches
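
The review thread above mentions separate tests for word beginnings and word ends. The sketch below isolates those two lookaround patterns (without the `|:+` alternative used in the tests) to show where each one matches in `'a::bc'`. It assumes Python 3.7 or later; the constant names `WORD_START` and `WORD_END` are hypothetical.

```python
import re

WORD_START = r'(?<!\w)(?=\w)'   # empty match just before a word
WORD_END = r'(?<=\w)(?!\w)'     # empty match just after a word

text = 'a::bc'

starts = [m.start() for m in re.finditer(WORD_START, text)]
ends = [m.start() for m in re.finditer(WORD_END, text)]
print(starts)   # [0, 3]
print(ends)     # [1, 5]

start_parts = re.split(WORD_START, text)
end_parts = re.split(WORD_END, text)
print(start_parts)   # ['', 'a::', 'bc']
print(end_parts)     # ['a', '::bc', '']
```

Unlike `\b`, each of these patterns matches at only one side of a word, which is why the author calls them less ambiguous.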
@@ -0,0 +1 @@
+Added support of splitting on a pattern that could match an empty string.
@@ -0,0 +1,3 @@
+Fixed searching regular expression patterns that could match an empty
+string. Non-empty string can now be correctly found after matching an empty
+string.
77 changes: 22 additions & 55 deletions Modules/_sre.c
@@ -446,6 +446,8 @@ state_init(SRE_STATE* state, PatternObject* pattern, PyObject* string,

state->isbytes = isbytes;
state->charsize = charsize;
+state->match_all = 0;
+state->must_advance = 0;

state->beginning = ptr;

@@ -559,14 +561,14 @@ pattern_dealloc(PatternObject* self)
}

LOCAL(Py_ssize_t)
-sre_match(SRE_STATE* state, SRE_CODE* pattern, int match_all)
+sre_match(SRE_STATE* state, SRE_CODE* pattern)
{
if (state->charsize == 1)
-return sre_ucs1_match(state, pattern, match_all);
+return sre_ucs1_match(state, pattern, 1);
if (state->charsize == 2)
-return sre_ucs2_match(state, pattern, match_all);
+return sre_ucs2_match(state, pattern, 1);
assert(state->charsize == 4);
-return sre_ucs4_match(state, pattern, match_all);
+return sre_ucs4_match(state, pattern, 1);
}

LOCAL(Py_ssize_t)
@@ -606,7 +608,7 @@ _sre_SRE_Pattern_match_impl(PatternObject *self, PyObject *string,

TRACE(("|%p|%p|MATCH\n", PatternObject_GetCode(self), state.ptr));

-status = sre_match(&state, PatternObject_GetCode(self), 0);
+status = sre_match(&state, PatternObject_GetCode(self));

TRACE(("|%p|%p|END\n", PatternObject_GetCode(self), state.ptr));
if (PyErr_Occurred()) {
@@ -645,7 +647,8 @@ _sre_SRE_Pattern_fullmatch_impl(PatternObject *self, PyObject *string,

TRACE(("|%p|%p|FULLMATCH\n", PatternObject_GetCode(self), state.ptr));

-status = sre_match(&state, PatternObject_GetCode(self), 1);
+state.match_all = 1;
+status = sre_match(&state, PatternObject_GetCode(self));

TRACE(("|%p|%p|END\n", PatternObject_GetCode(self), state.ptr));
if (PyErr_Occurred()) {
@@ -808,11 +811,8 @@ _sre_SRE_Pattern_findall_impl(PatternObject *self, PyObject *string,
if (status < 0)
goto error;

-if (state.ptr == state.start)
-state.start = (void*) ((char*) state.ptr + state.charsize);
-else
-state.start = state.ptr;
-
+state.must_advance = (state.ptr == state.start);
+state.start = state.ptr;
}

state_fini(&state);
@@ -901,17 +901,6 @@ _sre_SRE_Pattern_split_impl(PatternObject *self, PyObject *string,
void* last;

assert(self->codesize != 0);
-if (self->code[0] != SRE_OP_INFO || self->code[3] == 0) {
-if (self->code[0] == SRE_OP_INFO && self->code[4] == 0) {
-PyErr_SetString(PyExc_ValueError,
-"split() requires a non-empty pattern match.");
-return NULL;
-}
-if (PyErr_WarnEx(PyExc_FutureWarning,
-"split() requires a non-empty pattern match.",
-1) < 0)
-return NULL;
-}

if (!state_init(&state, self, string, 0, PY_SSIZE_T_MAX))
return NULL;
@@ -942,14 +931,6 @@ _sre_SRE_Pattern_split_impl(PatternObject *self, PyObject *string,
goto error;
}

-if (state.start == state.ptr) {
-if (last == state.end || state.ptr == state.end)
-break;
-/* skip one character */
-state.start = (void*) ((char*) state.ptr + state.charsize);
-continue;
-}
-
/* get segment before this match */
item = getslice(state.isbytes, state.beginning,
string, STATE_OFFSET(&state, last),
@@ -974,7 +955,7 @@ _sre_SRE_Pattern_split_impl(PatternObject *self, PyObject *string,
}

n = n + 1;
-
+state.must_advance = 1;
last = state.start = state.ptr;

}
@@ -1101,9 +1082,7 @@ pattern_subx(PatternObject* self, PyObject* ptemplate, PyObject* string,
if (status < 0)
goto error;

-} else if (i == b && i == e && n > 0)
-/* ignore empty match on latest position */
-goto next;
+}

if (filter_is_callable) {
/* pass match object through filter */
@@ -1130,16 +1109,8 @@ pattern_subx(PatternObject* self, PyObject* ptemplate, PyObject* string,

i = e;
n = n + 1;
-
-next:
-/* move on */
-if (state.ptr == state.end)
-break;
-if (state.ptr == state.start)
-state.start = (void*) ((char*) state.ptr + state.charsize);
-else
-state.start = state.ptr;
-
+state.must_advance = 1;
+state.start = state.ptr;
}

/* get segment following last match */
@@ -2450,7 +2421,7 @@ _sre_SRE_Scanner_match_impl(ScannerObject *self)

state->ptr = state->start;

-status = sre_match(state, PatternObject_GetCode(self->pattern), 0);
+status = sre_match(state, PatternObject_GetCode(self->pattern));
if (PyErr_Occurred())
return NULL;

@@ -2459,12 +2430,10 @@ _sre_SRE_Scanner_match_impl(ScannerObject *self)

if (status == 0)
state->start = NULL;
-else if (state->ptr != state->start)
+else {
+state->must_advance = (state->ptr == state->start);
state->start = state->ptr;
-else if (state->ptr != state->end)
-state->start = (void*) ((char*) state->ptr + state->charsize);
-else
-state->start = NULL;
+}

return match;
}
@@ -2499,12 +2468,10 @@ _sre_SRE_Scanner_search_impl(ScannerObject *self)

if (status == 0)
state->start = NULL;
-else if (state->ptr != state->start)
+else {
+state->must_advance = (state->ptr == state->start);
state->start = state->ptr;
-else if (state->ptr != state->end)
-state->start = (void*) ((char*) state->ptr + state->charsize);
-else
-state->start = NULL;
+}

return match;
}
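
To illustrate what replacing the "skip one character" logic with `must_advance` changes for callers, here is a pure-Python sketch. It is an emulation written for this note, not the C implementation: the helper `spans_pre_37` is hypothetical and reproduces the removed advancing rule using `Pattern.search` with a start position, while `re.finditer` shows the behavior of Python 3.7 or later.

```python
import re

def spans_pre_37(pattern, string):
    """Emulate the removed loop: after an empty match, skip one character."""
    compiled = re.compile(pattern)
    spans = []
    pos = 0
    while pos <= len(string):
        m = compiled.search(string, pos)
        if m is None:
            break
        spans.append(m.span())
        # Old rule: an empty match forces the scan one character ahead,
        # so a non-empty match starting at the same position is lost.
        pos = m.end() + 1 if m.start() == m.end() else m.end()
    return spans

old = spans_pre_37(r'\b|\w+', 'a::bc')
new = [m.span() for m in re.finditer(r'\b|\w+', 'a::bc')]
print(old)   # [(0, 0), (1, 1), (3, 3), (4, 5), (5, 5)]
print(new)   # [(0, 0), (0, 1), (1, 1), (3, 3), (3, 5), (5, 5)]
```

With the old rule the words `'a'` and `'bc'` are missed or truncated because the scan has already moved past their first character; with `must_advance` the scan stays at the same position and only rejects another empty match there.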
4 changes: 3 additions & 1 deletion Modules/sre.h
@@ -67,18 +67,20 @@ typedef struct {
void* end; /* end of original string */
/* attributes for the match object */
PyObject* string;
-Py_buffer buffer;
Py_ssize_t pos, endpos;
int isbytes;
int charsize; /* character size */
/* registers */
Py_ssize_t lastindex;
Py_ssize_t lastmark;
void** mark;
+int match_all;
+int must_advance;
/* dynamically allocated stuff */
char* data_stack;
size_t data_stack_size;
size_t data_stack_base;
+Py_buffer buffer;
/* current repeat context */
SRE_REPEAT *repeat;
} SRE_STATE;