Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-31672: string: Use re.A | re.I flag for identifier pattern #3872

Merged
merged 7 commits into from
Oct 13, 2017

Conversation

methane
Copy link
Member

@methane methane commented Oct 3, 2017

As documented, identifier should be ASCII.
Since we forgot re.A flag, it matched to some non ASCII characters.

For backward compatibility, we need to remove re.A flag after
pattern is compiled.

https://bugs.python.org/issue31672

@methane methane changed the title bpo-31672: strings.Template should use re.A flag bpo-31672: string: Use re.A | re.I flag for identifier pattern Oct 3, 2017
As documented, identifier should be ASCII.
Since we forgot re.A flag, it matched to some non ASCII characters.

For backward compatibility, we need to remove re.A flag after
pattern is compiled.
@methane methane force-pushed the string-template-ascii branch from b056576 to c43ec46 Compare October 3, 2017 16:35
Copy link
Member

@warsaw warsaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a little weird, but with comments I think it could be okay. I guess I'm a +0 so I'd like to get some other opinions.

Lib/string.py Outdated
@@ -81,7 +81,7 @@ class Template(metaclass=_TemplateMetaclass):
delimiter = '$'
idpattern = r'[_a-z][_a-z0-9]*'
braceidpattern = None
flags = _re.IGNORECASE
flags = _re.IGNORECASE | _re.ASCII
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a comment here too, since a person reading the code may not notice the restored flag below.

Lib/string.py Outdated
@@ -157,6 +157,10 @@ def convert(mo):
return self.pattern.sub(convert, self.template)


# We use re.I | re.A when compiling Template.idpattern, but restore old flag
# for backward compatibility.
Template.flags = _re.IGNORECASE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

We use re.I | re.A while compiling Template.idpattern in the metaclass above, but since
flags is part of the public API, we restore its original documented value for backward
compatibility.

?

@warsaw
Copy link
Member

warsaw commented Oct 4, 2017

Do you think it's worth waiting to see how bpo-31690 resolves? I'd definitely prefer the inline 'a' flag approach instead.

@methane
Copy link
Member Author

methane commented Oct 4, 2017

Do you think it's worth waiting to see how bpo-31690 resolves? I'd definitely prefer the inline 'a' flag approach instead.

I agree with you. Pending this pull request until bpo-31690.

Copy link
Member

@warsaw warsaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth adding a NEWS entry instead of a skip-news label, to pass the bevedere test. Other than that, this looks great, thanks!

@methane
Copy link
Member Author

methane commented Oct 12, 2017

@warsaw I added NEWS entry and updated the doc. Would you review it too?

Copy link
Member

@warsaw warsaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks! I noticed a few misspellings, and made some documentation change suggestions.

Also, I'm noticing the tests are failing.

``[_a-z][_a-z0-9]*``. If this is given and *braceidpattern* is ``None``
this pattern will also apply to braced placeholders.
``(?-i:[_a-zA-Z][_a-zA-Z0-9]*)``. Since default *flags* is
``re.IGNORECASE``, ``[a-z]``Without local flag ``-i``, is used to avoid to match with non ASCII characters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a space before the W, but I also feel like the sentence starting with "Since default..." should just be moved to, and consolidated with, the note:: section that just follows.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I missed removing this sentence after writing note section.

.. note::

Default *flags* is ``re.IGNORECASE``. So the pattern ``[a-z]`` can match
with some non ASCII characters. That's why We use local ``-i`` flag here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"non-ASCII"

Also s/We/we/

with some non ASCII characters. That's why We use local ``-i`` flag here.

When overrinding this class, please consider overriding *flags* with ``0``
or ``re.IGNORECASE | re.ASCII``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about: "When subclassing, please..."

Why? Or in other words, can you add a short explanation of why they should consider this?

@@ -0,0 +1,2 @@
``idpattern`` in ``string.Template`` matched some non ASCII characters. Now
it uses ``-i`` regular expression local flag to avoid non ASCII characters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"non-ASCII" in two places.

@@ -270,6 +270,10 @@ def test_invalid_placeholders(self):
raises(ValueError, s.substitute, dict(who='tim'))
s = Template('$who likes $100')
raises(ValueError, s.substitute, dict(who='tim'))
# Template.idpattern should match to only ASCII characters.
# https://bugs.python.org/issue31672
s = Template("$who likes $ı") # (0x131, DOTLESS I)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test also 'İ' (0x130). 'İ'.lower() == 'i'. In older Python versions [a-z] didn't match 'ı', but matched 'İ'.

``None`` this pattern will also apply to braced placeholders.

.. note::

Default *flags* is ``re.IGNORECASE``. So the pattern ``[a-z]`` can match
with some non ASCII characters. That's why We use local ``-i`` flag here.
with some non-ASCII characters. That's why We use local ``-i`` flag here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We -> we

@methane methane merged commit b22273e into python:master Oct 13, 2017
@methane
Copy link
Member Author

methane commented Oct 13, 2017

Should I backport this to 3.6?

It seems minor enough to not backport.
But it seems safe enough to backport too.

@serhiy-storchaka
Copy link
Member

We have spent so much time on this issue only to find 3.6-compatible solution. I think that in 3.7-only solution we could break backward compatibility and remove the re.IGNORECASE flag if not found more compatible solution.

@miss-islington
Copy link
Contributor

Thanks @methane for the PR 🌮🎉.. I'm working now to backport this PR to: 3.6.
🐍🍒⛏🤖

@miss-islington
Copy link
Contributor

Sorry, @methane, I could not cleanly backport this to 3.6 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker b22273ec5d1992b0cbe078b887427ae9977dfb78 3.6

methane added a commit to methane/cpython that referenced this pull request Oct 13, 2017
…dentifiers (pythonGH-3872)

Pattern `[a-z]` with `IGNORECASE` flag can match to some non-ASCII characters.

Straightforward solution for this is using `IGNORECASE | ASCII` flag.
But users may subclass `Template` and override only `idpattern`. So we want to
avoid changing `Template.flags`.

So this commit uses local flag `-i` for `idpattern` and change `[a-z]` to `[a-zA-Z]`..
(cherry picked from commit b22273e)
@methane methane deleted the string-template-ascii branch October 13, 2017 07:31
@bedevere-bot
Copy link

GH-3982 is a backport of this pull request to the 3.6 branch.

methane added a commit that referenced this pull request Oct 14, 2017
…iers (GH-3872)

Pattern `[a-z]` with `IGNORECASE` flag can match to some non-ASCII characters.

Straightforward solution for this is using `IGNORECASE | ASCII` flag.
But users may subclass `Template` and override only `idpattern`. So we want to
avoid changing `Template.flags`.

So this commit uses local flag `-i` for `idpattern` and change `[a-z]` to `[a-zA-Z]`.
(cherry picked from commit b22273e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants