Description
Hello.
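The latest Julia release (nice job, by the way!) seems to have introduced a bug in regex Unicode properties, specifically when a script property such as Han or Katakana is combined with a whitespace class inside a character class: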
text = "aa bb"
text |> collect
# 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
# 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
# ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
# 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
# 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
pattern = r"\p{Ll}+"
eachmatch(pattern, text) |> collect
# Good: 2-element Vector{RegexMatch}: RegexMatch("aa") RegexMatch("bb")
pattern = r"[\p{Ll}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("aa bb")
text = "壹貳 叁"
text |> collect
# '壹': Unicode U+58F9 (category Lo: Letter, other)
# '貳': Unicode U+8CB3 (category Lo: Letter, other)
# ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
# '叁': Unicode U+53C1 (category Lo: Letter, other)
pattern = r"[\p{Han}]+"
eachmatch(pattern, text) |> collect
# Good: 2-element Vector{RegexMatch}: RegexMatch("壹貳") RegexMatch("叁")
pattern = r"[\p{Han}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good on Julia 1.5 and 1.7 dev: 1-element Vector{RegexMatch}: RegexMatch("壹貳 叁")
# Bad on Julia 1.6: 2-element Vector{RegexMatch}: RegexMatch("壹貳") RegexMatch("叁")
pattern = r"[\p{Han} ]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("壹貳 叁")
text = "カ メ"
text |> collect
# 'カ': Unicode U+30AB (category Lo: Letter, other)
# ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
# 'メ': Unicode U+30E1 (category Lo: Letter, other)
pattern = r"[\p{L}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("カ メ")
pattern = r"[\p{Katakana} ]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("カ メ")
pattern = r"[\p{Katakana}\s]+"
eachmatch(pattern, text) |> collect
# Good on Julia 1.5 and 1.7 dev: 1-element Vector{RegexMatch}: RegexMatch("カ メ")
# Bad on Julia 1.6: 2-element Vector{RegexMatch}: RegexMatch("カ") RegexMatch("メ")
pattern = r"[\p{Katakana}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good on Julia 1.5 and 1.7 dev: 1-element Vector{RegexMatch}: RegexMatch("カ メ")
# Bad on Julia 1.6: 2-element Vector{RegexMatch}: RegexMatch("カ") RegexMatch("メ")
The Letter property works fine, but script properties (Han, Katakana, etc.) fail when combined in a character class with the space property \p{Zs} or with \s, whereas combining them with a literal space character works.
The problem seems to come from Julia rather than from PCRE2, because using PCRE2 10.35 directly (the version Julia 1.6 appears to use) works fine:
PCRE2 version 10.35 2020-05-09
re> "[\p{Ll}\p{Zs}]+"
data> "aa bb"
0: aa bb
PCRE2 version 10.35 2020-05-09
re> "(*UTF)[\p{Han}\p{Zs}]+"
data> "壹貳 叁"
0: \x{58f9}\x{8cb3} \x{53c1}
data> "(*UTF)壹貳 叁"
0: \x{58f9}\x{8cb3} \x{53c1}
I have not been able to find which commit introduced the problem, but I am glad it works on Julia 1.7 (at least for the moment). I could not determine whether it was identified and fixed deliberately in the Julia repository or resolved indirectly.
Even though it works on 1.7 (master), this is effectively a breaking change: existing code can silently produce wrong results after upgrading to 1.6, because the behavior is incorrect and unexpected.
Understanding what the problem was would make it possible to add more targeted tests and prevent a similar regression in the future.
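As a starting point, the cases above could be turned into a regression test along these lines. This is only a sketch: `matches` is a hypothetical helper I made up for readability, not an existing function, and the expected values are simply the "Good" results reported above.

```julia
using Test

# Hypothetical helper (not part of Base): collect the matched substrings.
matches(pattern, text) = [m.match for m in eachmatch(pattern, text)]

@testset "script properties combined with whitespace classes" begin
    # Cases reported as failing on Julia 1.6
    @test matches(r"[\p{Han}\p{Zs}]+", "壹貳 叁") == ["壹貳 叁"]
    @test matches(r"[\p{Katakana}\s]+", "カ メ") == ["カ メ"]
    @test matches(r"[\p{Katakana}\p{Zs}]+", "カ メ") == ["カ メ"]

    # Control cases reported as working on every version tested
    @test matches(r"\p{Ll}+", "aa bb") == ["aa", "bb"]
    @test matches(r"[\p{Ll}\p{Zs}]+", "aa bb") == ["aa bb"]
    @test matches(r"[\p{Han} ]+", "壹貳 叁") == ["壹貳 叁"]
    @test matches(r"[\p{L}\p{Zs}]+", "カ メ") == ["カ メ"]
end
```

If the report above is accurate, the first three tests should fail on Julia 1.6 and pass on 1.5 and 1.7 dev, while the control cases should pass everywhere.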
Sincerely.