bpo-31672: string: Use `re.A | re.I` flag for identifier pattern #3872

methane · 2017-10-03T16:27:53Z

As documented, identifiers should be ASCII.
Because the re.A flag was missing, the pattern also matched some non-ASCII characters.

For backward compatibility, we need to remove the re.A flag after
the pattern is compiled.

https://bugs.python.org/issue31672
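
For illustration, the mismatch can be reproduced with plain `re` calls. This is a minimal sketch, not code from the patch: the helper `matches_identifier` and the `SAMPLES` list are hypothetical names, and the four sample characters are the non-ASCII letters that the `re` documentation lists as matching `[a-z]` under `IGNORECASE`.

```python
import re

IDPATTERN = r'[_a-z][_a-z0-9]*'  # same text as Template.idpattern before this change

# Non-ASCII letters whose case mapping lands on an ASCII letter:
# U+0130 (I with dot above), U+0131 (dotless i), U+017F (long s), U+212A (Kelvin sign)
SAMPLES = ['\u0130', '\u0131', '\u017f', '\u212a']


def matches_identifier(text, flags):
    """Hypothetical helper: True if the whole text matches IDPATTERN."""
    return re.fullmatch(IDPATTERN, text, flags) is not None


for ch in SAMPLES:
    print(
        f'U+{ord(ch):04X}',
        'IGNORECASE:', matches_identifier(ch, re.IGNORECASE),
        'IGNORECASE|ASCII:', matches_identifier(ch, re.IGNORECASE | re.ASCII),
    )
```

With `re.IGNORECASE` alone each sample is accepted as an identifier; adding `re.ASCII` rejects them while still accepting ordinary ASCII names.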

warsaw

It seems a little weird, but with comments I think it could be okay. I guess I'm a +0 so I'd like to get some other opinions.

warsaw · 2017-10-04T00:16:23Z

Lib/string.py

@@ -81,7 +81,7 @@ class Template(metaclass=_TemplateMetaclass):
    delimiter = '$'
    idpattern = r'[_a-z][_a-z0-9]*'
    braceidpattern = None
-    flags = _re.IGNORECASE
+    flags = _re.IGNORECASE | _re.ASCII


I would add a comment here too, since a person reading the code may not notice the restored flag below.

warsaw · 2017-10-04T00:18:07Z

Lib/string.py

@@ -157,6 +157,10 @@ def convert(mo):
        return self.pattern.sub(convert, self.template)


+# We use re.I | re.A when compiling Template.idpattern, but restore old flag
+# for backward compatibility.
+Template.flags = _re.IGNORECASE


How about:

We use re.I | re.A while compiling Template.idpattern in the metaclass above, but since
flags is part of the public API, we restore its original documented value for backward
compatibility.

?
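
A toy model of the compile-then-restore idea may make the trade-off easier to see. This is only a sketch under stated assumptions: `_DemoMeta`, `Demo` and `Child` are hypothetical stand-ins, not the real `_TemplateMetaclass` or `Template`, and the real pattern is more elaborate.

```python
import re


class _DemoMeta(type):
    """Hypothetical stand-in for a metaclass that compiles the pattern."""

    def __init__(cls, name, bases, dct):
        super().__init__(name, bases, dct)
        # The pattern is compiled once, when the class object is created.
        cls.pattern = re.compile(cls.idpattern, cls.flags)


class Demo(metaclass=_DemoMeta):
    idpattern = r'[_a-z][_a-z0-9]*'
    flags = re.IGNORECASE | re.ASCII  # only in effect while Demo.pattern is compiled


# Restore the documented public value; Demo.pattern is already compiled.
Demo.flags = re.IGNORECASE


class Child(Demo):
    idpattern = r'[a-z]+'  # compiled later, with the restored flags


print(Demo.pattern.fullmatch('\u0131'))   # compiled with re.ASCII
print(Child.pattern.fullmatch('\u0131'))  # compiled without re.ASCII
```

In this model the base class pattern is ASCII-only, while a subclass that overrides only `idpattern` is still compiled with plain `re.IGNORECASE`, which is the existing behaviour the restore step is meant to preserve.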

warsaw · 2017-10-04T14:11:54Z

Do you think it's worth waiting to see how bpo-31690 resolves? I'd definitely prefer the inline 'a' flag approach instead.

methane · 2017-10-04T14:49:59Z

> Do you think it's worth waiting to see how bpo-31690 resolves? I'd definitely prefer the inline 'a' flag approach instead.

I agree with you. I'll leave this pull request pending until bpo-31690 is resolved.
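
As a rough sketch of what the inline `a` flag approach could look like: this assumes an interpreter that accepts `a` as a group-local flag (Python 3.7 and later), and `CANDIDATE` is a hypothetical name for illustration, not text taken from this pull request.

```python
import re

# Hypothetical idpattern that scopes the ASCII flag to the identifier group.
CANDIDATE = r'(?a:[_a-z][_a-z0-9]*)'

# The outer flags stay at the documented public value, re.IGNORECASE.
compiled = re.compile(CANDIDATE, re.IGNORECASE)

for name in ['who', 'Who', '\u0130', '\u0131']:
    print(ascii(name), compiled.fullmatch(name) is not None)
```

The group still inherits `IGNORECASE` from the outer flags, so `Who` is accepted, but case folding inside the group is limited to ASCII letters.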

warsaw

I think it's worth adding a NEWS entry instead of a skip-news label, to pass the bedevere test. Other than that, this looks great, thanks!

methane · 2017-10-12T00:24:03Z

@warsaw I added a NEWS entry and updated the docs. Could you review them too?

warsaw

Looks great, thanks! I noticed a few misspellings, and made some documentation change suggestions.

Also, I'm noticing the tests are failing.

warsaw · 2017-10-12T13:34:03Z

Doc/library/string.rst

-  ``[_a-z][_a-z0-9]*``.  If this is given and *braceidpattern* is ``None``
-  this pattern will also apply to braced placeholders.
+  ``(?-i:[_a-zA-Z][_a-zA-Z0-9]*)``. Since default *flags* is
+  ``re.IGNORECASE``, ``[a-z]``Without local flag ``-i``, is used to avoid to match with non ASCII characters.


There should be a space before the W, but I also feel like the sentence starting with "Since default..." should just be moved to, and consolidated with, the note:: section that just follows.

Oh, I forgot to remove this sentence after writing the note section.

warsaw · 2017-10-12T13:34:27Z

Doc/library/string.rst

+  .. note::
+
+     Default *flags* is ``re.IGNORECASE``.  So the pattern ``[a-z]`` can match
+     with some non ASCII characters.  That's why We use local ``-i`` flag here.


"non-ASCII"

Also s/We/we/
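
To make the effect of the local `-i` flag concrete, here is a small comparison of the old and new pattern text, both compiled with the default `re.IGNORECASE`. It is a sketch for illustration; the names `OLD`, `NEW`, `old` and `new` are hypothetical, and the pattern strings are copied from the diff hunks in this review.

```python
import re

OLD = r'[_a-z][_a-z0-9]*'
NEW = r'(?-i:[_a-zA-Z][_a-zA-Z0-9]*)'

# Both are compiled with the unchanged default flags.
old = re.compile(OLD, re.IGNORECASE)
new = re.compile(NEW, re.IGNORECASE)

for name in ['who', 'Who', '_x1', '\u0130', '\u0131']:
    print(
        ascii(name),
        'old:', old.fullmatch(name) is not None,
        'new:', new.fullmatch(name) is not None,
    )
```

Because `-i` switches case-insensitive matching off inside the group, the explicit `A-Z` ranges are what keep upper-case ASCII identifiers working.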

warsaw · 2017-10-12T13:35:41Z

Doc/library/string.rst

+     with some non ASCII characters.  That's why We use local ``-i`` flag here.
+
+     When overrinding this class, please consider overriding *flags* with ``0``
+     or ``re.IGNORECASE | re.ASCII``.


How about: "When subclassing, please..."

Why? Or in other words, can you add a short explanation of why they should consider this?
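
One possible way to show the reason is a pair of subclasses. This sketch assumes `Template.flags` is `re.IGNORECASE` at runtime, as documented; `LooseTemplate`, `StrictTemplate` and `mapping` are hypothetical names used only for illustration.

```python
import re
from string import Template


class LooseTemplate(Template):
    # Overrides idpattern only, so it inherits flags = re.IGNORECASE.
    idpattern = r'[a-z]+'


class StrictTemplate(Template):
    # Same pattern, but the subclass also opts in to ASCII-only matching.
    idpattern = r'[a-z]+'
    flags = re.IGNORECASE | re.ASCII


mapping = {'\u0131': 'dotless'}

# The inherited IGNORECASE flag lets [a-z] accept U+0131 as an identifier.
loose = LooseTemplate('$\u0131').substitute(mapping)

try:
    StrictTemplate('$\u0131').substitute(mapping)
    strict_rejected = False
except ValueError:
    # With re.ASCII the placeholder is not a valid identifier.
    strict_rejected = True

print(ascii(loose), strict_rejected)
```

A subclass that supplies its own `idpattern` does not benefit from the local `-i` flag in the default pattern, so it has to set *flags* itself to get ASCII-only identifiers.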

warsaw · 2017-10-12T13:36:33Z

Misc/NEWS.d/next/Library/2017-10-12-02-47-16.bpo-31672.DaOkVd.rst

@@ -0,0 +1,2 @@
+``idpattern`` in ``string.Template`` matched some non ASCII characters. Now
+it uses ``-i`` regular expression local flag to avoid non ASCII characters.


"non-ASCII" in two places.

serhiy-storchaka · 2017-10-12T13:57:02Z

Lib/test/test_string.py

@@ -270,6 +270,10 @@ def test_invalid_placeholders(self):
        raises(ValueError, s.substitute, dict(who='tim'))
        s = Template('$who likes $100')
        raises(ValueError, s.substitute, dict(who='tim'))
+        # Template.idpattern should match to only ASCII characters.
+        # https://bugs.python.org/issue31672
+        s = Template("$who likes $ı")  # (0x131, DOTLESS I)


Also test 'İ' (0x130). Its simple lowercase mapping, which case-insensitive matching in `re` uses, is 'i'. In older Python versions [a-z] didn't match 'ı', but matched 'İ'.
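
A standalone sketch of a check covering both characters, outside the `unittest` harness, could look like the following. The helper `rejects` is hypothetical, and the expectation only holds on an interpreter that includes this fix; on an unfixed interpreter `substitute()` may raise `KeyError` instead, which this helper deliberately does not catch.

```python
from string import Template


def rejects(template_text):
    """Hypothetical helper: True if substitute() raises ValueError."""
    try:
        Template(template_text).substitute(who='tim')
    except ValueError:
        return True
    return False


# U+0131 is the dotless i, U+0130 is the capital I with dot above.
for text in ['$who likes $\u0131', '$who likes $\u0130']:
    print(ascii(text), rejects(text))
```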

serhiy-storchaka · 2017-10-13T05:23:58Z

Doc/library/string.rst

  ``None`` this pattern will also apply to braced placeholders.

  .. note::

     Default *flags* is ``re.IGNORECASE``.  So the pattern ``[a-z]`` can match
-     with some non ASCII characters.  That's why We use local ``-i`` flag here.
+     with some non-ASCII characters.  That's why We use local ``-i`` flag here.


methane · 2017-10-13T07:03:09Z

Should I backport this to 3.6?

It seems minor enough to not backport.
But it seems safe enough to backport too.

serhiy-storchaka · 2017-10-13T07:17:08Z

We have spent so much time on this issue only to find a 3.6-compatible solution. I think that in a 3.7-only solution we could break backward compatibility and remove the re.IGNORECASE flag if no more compatible solution were found.

miss-islington · 2017-10-13T07:26:50Z

Thanks @methane for the PR 🌮🎉.. I'm working now to backport this PR to: 3.6.
🐍🍒⛏🤖

miss-islington · 2017-10-13T07:26:53Z

Sorry, @methane, I could not cleanly backport this to 3.6 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker b22273ec5d1992b0cbe078b887427ae9977dfb78 3.6

…dentifiers (pythonGH-3872) Pattern `[a-z]` with `IGNORECASE` flag can match to some non-ASCII characters. Straightforward solution for this is using `IGNORECASE | ASCII` flag. But users may subclass `Template` and override only `idpattern`. So we want to avoid changing `Template.flags`. So this commit uses local flag `-i` for `idpattern` and change `[a-z]` to `[a-zA-Z]`.. (cherry picked from commit b22273e)

bedevere-bot · 2017-10-13T07:33:12Z

GH-3982 is a backport of this pull request to the 3.6 branch.

…iers (GH-3872) Pattern `[a-z]` with `IGNORECASE` flag can match to some non-ASCII characters. Straightforward solution for this is using `IGNORECASE | ASCII` flag. But users may subclass `Template` and override only `idpattern`. So we want to avoid changing `Template.flags`. So this commit uses local flag `-i` for `idpattern` and change `[a-z]` to `[a-zA-Z]`. (cherry picked from commit b22273e)

bedevere-bot added the awaiting merge label Oct 3, 2017

the-knights-who-say-ni added the CLA signed label Oct 3, 2017

methane changed the title ~~bpo-31672: strings.Template should use re.A flag~~ bpo-31672: string: Use re.A | re.I flag for identifier pattern Oct 3, 2017

bpo-31672: strings.Template should use re.A flag

c43ec46

As documented, identifier should be ASCII. Since we forgot re.A flag, it matched to some non ASCII characters. For backward compatibility, we need to remove re.A flag after pattern is compiled.

methane force-pushed the string-template-ascii branch from b056576 to c43ec46 Compare October 3, 2017 16:35

warsaw reviewed Oct 4, 2017

Update comment

3862b33

Use -i local flag, as suggested by Serhiy

586ef34

warsaw approved these changes Oct 11, 2017

Add NEWS and update document

8b2f429

warsaw reviewed Oct 12, 2017

serhiy-storchaka reviewed Oct 12, 2017

Update based on review.

32b765b

serhiy-storchaka reviewed Oct 13, 2017

methane added 2 commits October 13, 2017 15:37

Update string.rst

1e4bb62

Update string.rst

961a206

serhiy-storchaka approved these changes Oct 13, 2017

methane merged commit b22273e into python:master Oct 13, 2017

bedevere-bot removed the awaiting merge label Oct 13, 2017

methane added the needs backport to 3.6 label Oct 13, 2017

methane deleted the string-template-ascii branch October 13, 2017 07:31

bedevere-bot removed the needs backport to 3.6 label Oct 13, 2017
