gh-113304: Add pos/endpos parameters to re module functions #113306

Open: wants to merge 11 commits into main

@adamsilkey adamsilkey commented Dec 20, 2023

Summary

This commit adds the pos and endpos parameters to the following
top-level re module functions:

  • re.match()
  • re.fullmatch()
  • re.search()
  • re.findall()
  • re.finditer()

Prior to this commit, the pos and endpos parameters were only
available after first compiling a pattern with re.compile().
Adding these optional arguments makes the module-level functions
consistent with the Pattern methods, so users no longer have to
compile a pattern just to use pos/endpos.

Additionally, this commit:

  • Adds tests to cover keyword argument combinations
  • Updates re.rst documentation as follows:
    • Adds String Indexing Arguments section covering pos/endpos
    • Adds a Special Characters section to improve discoverability
    • Updates the docs to reflect the actual Pattern method signatures

Rationale

Several methods of the re.Pattern class accept optional string
indexing parameters (pos/endpos):

  • Pattern.match(string[, pos[, endpos]])
  • Pattern.fullmatch(string[, pos[, endpos]])
  • Pattern.search(string[, pos[, endpos]])
  • Pattern.findall(string[, pos[, endpos]])
  • Pattern.finditer(string[, pos[, endpos]])
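
As a quick illustration of what these parameters do on a compiled pattern, here is a minimal sketch using only the existing Pattern methods. The sample strings and variable names are made up for the example.

```python
import re

text = "aXbXcX"
pattern = re.compile(r"[a-z]X")

# pos/endpos restrict the search window without copying the string.
all_hits = pattern.findall(text)           # scans the whole string
window_hits = pattern.findall(text, 2, 4)  # scans only indices 2..3

# Match offsets stay relative to the original string, not the window.
span = pattern.search(text, 2).span()

# Unlike slicing, pos does not move the "start of string" seen by '^'.
anchored = re.compile(r"^b")
via_pos = anchored.search("ab", 1)     # index 1 is not the real start
via_slice = anchored.search("ab"[1:])  # the slice itself starts with 'b'

print(all_hits, window_hits, span, via_pos, via_slice)
```

The last two calls show why pos is not simply shorthand for slicing: the anchor is evaluated against the full string.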

Additionally, Python provides access to these Pattern methods as
top-level convenience functions in the module itself:

  • re.match()
  • re.fullmatch()
  • re.search()
  • re.findall()
  • re.finditer()

However, these top-level convenience functions do not accept pos or
endpos. Anyone who wants to use them must first compile a pattern
with re.compile() and then call the corresponding Pattern method.

Yet all the top-level convenience functions do is compile the pattern
and then call the matching Pattern method, as seen here:

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

In the underlying C code, each of these methods initializes pos and
endpos to 0 and PY_SSIZE_T_MAX respectively. The values are only
overwritten if the argument parser finds pos or endpos among the
arguments.

Here is an example from the match function, indentation adjusted
for readability:

static PyObject *
_sre_SRE_Pattern_match(PatternObject *self, PyTypeObject *cls, PyObject
*const *args, Py_ssize_t nargs, PyObject *kwnames)
{
    (...)
    Py_ssize_t pos = 0;
    Py_ssize_t endpos = PY_SSIZE_T_MAX;
    (...)
    pos = ival;
    (...)
    endpos = ival;
    (...)
    return_value = _sre_SRE_Pattern_match_impl(self, cls, string, pos, endpos);
}

This commit adds pos=0 and endpos=sys.maxsize as default arguments
to the top-level functions, matching the internal behavior of the
underlying C code.
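
The idea can be sketched in pure Python as follows. This is an illustrative stand-in, not the PR's actual diff: the name search_with_bounds is hypothetical, and it calls the public re.compile() rather than the private _compile() helper used inside the module.

```python
import re
import sys


def search_with_bounds(pattern, string, flags=0, pos=0, endpos=sys.maxsize):
    """Hypothetical wrapper: a top-level search() that forwards pos/endpos.

    The defaults mirror the C-level defaults (0 and PY_SSIZE_T_MAX), so
    omitting them behaves the same as calling Pattern.search(string).
    """
    # re.compile() caches compiled patterns, much like the module functions do.
    return re.compile(pattern, flags).search(string, pos, endpos)


default_match = search_with_bounds("abc", "012abc678")
bounded_match = search_with_bounds("abc", "012abc678", pos=3, endpos=6)
no_match = search_with_bounds("abc", "012abc678", pos=4)

print(default_match, bounded_match, no_match)
```

Because the defaults cover the whole string, existing callers that pass only a pattern and a string would see no change in behavior under this sketch.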

Additional Documentation Updates

Add Special Characters section to re docs

Add a Special Characters section to the docs so that the list of
special characters appears in the table of contents and is easier
to find.

Update Pattern method signatures to reflect actual behavior

The current docs for Pattern.search/match/fullmatch/finditer/findall
imply that pos/endpos are positional-only arguments. In fact, they
can also be passed by keyword, as seen here:

>>> import re
>>> pattern = re.compile('abc')
>>> pattern.search('012abc678', pos=3)
<re.Match object; span=(3, 6), match='abc'>
>>> pattern.search('012abc678', endpos=6)
<re.Match object; span=(3, 6), match='abc'>
>>> pattern.search('012abc678', pos=3, endpos=6)
<re.Match object; span=(3, 6), match='abc'>

The interactive help also shows this:

>>> help(pattern.search)
Help on built-in function search:

search(string, pos=0, endpos=9223372036854775807) method of re.Pattern instance
    Scan through string looking for a match, and return a corresponding
    match object instance.

    Return None if no position in the string matches.
(END)
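
The same information can be checked programmatically. The following is a minimal sketch, assuming CPython exposes the method signature to the inspect module (as the help() output above suggests); it is not part of the PR.

```python
import inspect
import re
import sys

pattern = re.compile("abc")

# Introspect the bound method's signature instead of reading help() output.
params = inspect.signature(pattern.search).parameters
names = list(params)

# Each parameter kind says whether keyword passing is allowed.
keyword_ok = all(
    p.kind is inspect.Parameter.POSITIONAL_OR_KEYWORD for p in params.values()
)
pos_default = params["pos"].default
endpos_default = params["endpos"].default

print(names, keyword_ok, pos_default, endpos_default == sys.maxsize)
```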

📚 Documentation preview 📚: https://cpython-previews--113306.org.readthedocs.build/

ghost commented Dec 20, 2023

All commit authors signed the Contributor License Agreement.
CLA signed

bedevere-app bot commented Dec 20, 2023

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.
