WIP: Various updates to the Regex HOWTO #107825
…ing. Remove use of undefined jargon 'cooked'.
Doc/howto/regex.rst
Outdated
To specify them in the pattern, you can write them as an embedded
modifier at the start of the pattern that uses the short one-letter
form: `(?i)` for a single flag or `(?mxi)` to enable multiple flags.
It is worth mentioning "modifier spans" like `(?i:...)`. They are more powerful than global flags and modifiers.
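To illustrate the modifier spans mentioned above, here is a minimal sketch. The sample strings are made up for illustration and do not come from the HOWTO. It contrasts a scoped `(?i:...)` group, which applies case-insensitivity only inside the group, with a leading `(?i)`, which applies to the whole pattern.

```python
import re

# Scoped flag: only "hello" is matched case-insensitively.
scoped = re.compile(r'(?i:hello) World')
# Global inline flag: the whole pattern ignores case.
global_flag = re.compile(r'(?i)hello World')

# "HELLO" is accepted inside the span, and "World" matches exactly.
scoped_upper = scoped.fullmatch('HELLO World')
# "world" is outside the span, so it stays case-sensitive and fails.
scoped_lower = scoped.fullmatch('HELLO world')
# With the global flag, the same string matches.
global_lower = global_flag.fullmatch('HELLO world')

print(scoped_upper is not None, scoped_lower is not None, global_lower is not None)
```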
I think so, but it's also okay to do that in a separate PR. We can iterate and work incrementally.
Doc/howto/regex.rst
Outdated
For example, the following RE detects doubled words in a string. ::

>>> p = re.compile(r'\b(\w+)\s+\1\b')
>>> p = re.compile(r'\b(\w+)\b\s+\1\b')
The second `\b` was removed intentionally. It is not needed here.
It would also be worth using possessive qualifiers here.
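A quick way to check the redundancy claim is to run both variants of the pattern over a few strings and compare the results. This is only a sketch: the sample strings are invented for illustration, and agreement on a handful of inputs is a sanity check, not a proof.

```python
import re

# Variant without and with the \b between the group and the whitespace.
without_inner_b = re.compile(r'\b(\w+)\s+\1\b')
with_inner_b = re.compile(r'\b(\w+)\b\s+\1\b')

# Hypothetical sample inputs, chosen to include hits and near misses.
samples = [
    'Paris in the the spring',
    'is is it it',
    'then the',
    'the then',
    'no doubles here',
]

# findall() returns the captured word for every doubled word found.
results_without = [without_inner_b.findall(s) for s in samples]
results_with = [with_inner_b.findall(s) for s in samples]

print(results_without == results_with)
```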
But it's fine to keep the second `\b`, and when modifying the example for some other context it might be useful. So I'd be fine with keeping it too. (Note that it's mentioned in the text below also.)
(Also, what's a possessive qualifier?)
Not exactly this example, but see the conversation in #21420 about redundant `\b`.
This example was fixed in #4443. It was incorrect without `\b` at the end, but `\b` between `\w` and `\s` is redundant by definition.
Sorry, not "possessive qualifier" but "possessive quantifier" (although in some documents they are named "qualifiers"). A possessive quantifier is a quantifier without backtracking. It is written by adding `+` to the quantifier (just as non-greedy quantifiers are written by adding `?`). For example, when trying to match the pattern with greedy quantifiers, `\b(\w+)\s+\1\b`, in "then the", a dumb backtracking engine will try to match "then then", fail, backtrack, and try to match "the ", "th ", "t " in succession until it gives up. But with possessive quantifiers, `\b(\w++)\s++\1\b`, it will not backtrack and will fail more quickly. It is a new feature in Python 3.11. Even though it is supported in most modern RE engines, it is relatively little known, because it was not initially supported in old RE engines.
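As a rough illustration of the two patterns quoted above, the sketch below runs both on a matching and a non-matching string. The helper `first_doubled_word` and the sample strings are hypothetical, introduced only for this example. The possessive pattern is only tried on Python 3.11 or later, since earlier versions reject the `++` syntax. The sketch shows that both forms give the same answers on these inputs; it does not measure how much backtracking each one performs.

```python
import re
import sys

def first_doubled_word(pattern, text):
    """Return the first doubled word found by pattern, or None."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

GREEDY = r'\b(\w+)\s+\1\b'
POSSESSIVE = r'\b(\w++)\s++\1\b'   # requires Python 3.11+
HAS_POSSESSIVE = sys.version_info >= (3, 11)

greedy_hit = first_doubled_word(GREEDY, 'Paris in the the spring')
greedy_miss = first_doubled_word(GREEDY, 'then the')

if HAS_POSSESSIVE:
    possessive_hit = first_doubled_word(POSSESSIVE, 'Paris in the the spring')
    possessive_miss = first_doubled_word(POSSESSIVE, 'then the')
else:
    # Older interpreters cannot compile the possessive pattern.
    possessive_hit = possessive_miss = None

print(greedy_hit, greedy_miss, possessive_hit, possessive_miss)
```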
OK, I've removed the second \b and edited the text below a bit.
It would be nice to add more about possessive qualifiers and atomic grouping. Modifier spans are also underrated.
Hi Andrew! Here are some small suggestions. I recommend merging this rather than sitting on it for much longer. If there are improvements you're still planning to make but don't feel you have time for right now, feel free to open another PR. I promise to review and merge quickly -- this looks like almost everything is uncontroversial.
Co-authored-by: Guido van Rossum <gvanrossum@gmail.com>
…o update-regex-howto
OK, I've applied a bunch of suggested revisions and also added comments listing future topics such as possessive quantifiers and spanning modifiers. Let's work on those in future PRs, since this one has already taken long enough! 🕙
As people sent me comments over the years, I've been collecting user feedback on the Regex HOWTO. This PR will contain the resulting set of changes. It is currently still work-in-progress; I have a lengthy list of changes that I'm making.
I'll try very hard to keep each commit completely and logically separated, so you may want to proofread commit-by-commit. Feel free to cherry-pick particular commits into main if you like while other commits get worked on; I can rebase or merge and try to keep things coherent.
📚 Documentation preview 📚: https://cpython-previews--107825.org.readthedocs.build/