New bug (wrong behavior) in regex (script) properties in Julia 1.6

Hello.

It seems the latest Julia release (nice job, by the way!) introduced a bug in regex Unicode properties, specifically with script properties:

```julia
text = "aa bb"
text |> collect
# 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
# 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
# ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
# 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
# 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)

pattern = r"\p{Ll}+"
eachmatch(pattern, text) |> collect
# Good: 2-element Vector{RegexMatch}: RegexMatch("aa") RegexMatch("bb")

pattern = r"[\p{Ll}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("aa bb")

text = "壹貳 叁"
text |> collect
# '壹': Unicode U+58F9 (category Lo: Letter, other)
# '貳': Unicode U+8CB3 (category Lo: Letter, other)
# ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
# '叁': Unicode U+53C1 (category Lo: Letter, other)

pattern = r"[\p{Han}]+"
eachmatch(pattern, text) |> collect
# Good: 2-element Vector{RegexMatch}: RegexMatch("壹貳") RegexMatch("叁")

pattern = r"[\p{Han}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good on Julia 1.5 and 1.7 dev: 1-element Vector{RegexMatch}: RegexMatch("壹貳 叁")
# Bad on Julia 1.6: 2-element Vector{RegexMatch}: RegexMatch("壹貳") RegexMatch("叁")

pattern = r"[\p{Han} ]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("壹貳 叁")

text = "カ メ"
text |> collect
# 'カ': Unicode U+30AB (category Lo: Letter, other)
# ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
# 'メ': Unicode U+30E1 (category Lo: Letter, other)

pattern = r"[\p{L}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("カ メ")

pattern = r"[\p{Katakana} ]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("カ メ")

pattern = r"[\p{Katakana}\s]+"
eachmatch(pattern, text) |> collect
# Good on Julia 1.5 and 1.7 dev: 1-element Vector{RegexMatch}: RegexMatch("カ メ")
# Bad on Julia 1.6: 2-element Vector{RegexMatch}: RegexMatch("カ") RegexMatch("メ")

pattern = r"[\p{Katakana}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good on Julia 1.5 and 1.7 dev: 1-element Vector{RegexMatch}: RegexMatch("カ メ")
# Bad on Julia 1.6: 2-element Vector{RegexMatch}: RegexMatch("カ") RegexMatch("メ")
```
The Letter property works fine, but script properties (such as Han or Katakana) fail when combined in a character class with the space separator property (`\p{Zs}`) or with `\s`, whereas combining them with a literal space character works as expected.

The problem seems to come from Julia rather than from PCRE2, because using [PCRE2](https://www.pcre.org/current/doc/html/pcre2syntax.html) 10.35 directly (apparently the same version Julia 1.6 uses) works fine:

```text
PCRE2 version 10.35 2020-05-09
  re> "[\p{Ll}\p{Zs}]+"
data> "aa bb"
 0: aa bb

PCRE2 version 10.35 2020-05-09
  re> "(*UTF)[\p{Han}\p{Zs}]+"
data> "壹貳 叁"
 0: \x{58f9}\x{8cb3} \x{53c1}
data> "(*UTF)壹貳 叁"
 0: \x{58f9}\x{8cb3} \x{53c1}
```

I have not been able to identify where this problem comes from (which commit), but I am glad it works on Julia 1.7 (at least for the moment). I could not find out whether the problem was reported and fixed directly in the Julia repository, or whether it was fixed indirectly.

Even if it works on 1.7 (master), this is a real breaking change: the behavior is wrong and unexpected, so it can introduce bugs into existing code after an upgrade to 1.6.

Understanding what the problem is (or was) would make it possible to write a more accurate set of tests and prevent a similar regression in the future.
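As a starting point, here is a minimal sketch of what such tests could look like, built only from the cases reported above. The helper `matched` is a hypothetical name introduced for this example (it is not part of Base), and the test set is only an illustration, not the way Julia's own test suite is organized.

```julia
using Test

# Hypothetical helper (not part of Base): return the matched substrings.
matched(pattern, text) = [m.match for m in eachmatch(pattern, text)]

@testset "script properties combined with whitespace classes" begin
    # Cases that return two matches instead of one on Julia 1.6
    @test matched(r"[\p{Han}\p{Zs}]+", "壹貳 叁") == ["壹貳 叁"]
    @test matched(r"[\p{Katakana}\s]+", "カ メ") == ["カ メ"]
    @test matched(r"[\p{Katakana}\p{Zs}]+", "カ メ") == ["カ メ"]

    # Control cases that behave correctly on every version tested
    @test matched(r"[\p{Han} ]+", "壹貳 叁") == ["壹貳 叁"]
    @test matched(r"[\p{Katakana} ]+", "カ メ") == ["カ メ"]
    @test matched(r"[\p{L}\p{Zs}]+", "カ メ") == ["カ メ"]
    @test matched(r"[\p{Ll}\p{Zs}]+", "aa bb") == ["aa bb"]
end
```

Going by the results reported above, the first three tests should fail on Julia 1.6 and pass on 1.5 and 1.7 dev, while the control cases should pass everywhere.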

Sincerely.

[Discourse link](https://discourse.julialang.org/t/did-julia-1-6-introduced-a-regression-for-regex-properties/58076).


New bug (wrong behavior) in regex (script) properties in Julia 1.6 #40231

BenjaminGalliot opened on Mar 27, 2021
