New bug (wrong behavior) in regex (script) properties in Julia 1.6 #40231

Closed

Description

Hello.

It seems the latest Julia release (nice job, by the way!) introduced a bug in regex Unicode properties, specifically with script properties:

text = "aa bb"
text |> collect
# 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
# 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
# ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
# 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
# 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)

pattern = r"\p{Ll}+"
eachmatch(pattern, text) |> collect
# Good: 2-element Vector{RegexMatch}: RegexMatch("aa") RegexMatch("bb")

pattern = r"[\p{Ll}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("aa bb")

text = "壹貳 叁"
text |> collect
# '壹': Unicode U+58F9 (category Lo: Letter, other)
# '貳': Unicode U+8CB3 (category Lo: Letter, other)
# ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
# '叁': Unicode U+53C1 (category Lo: Letter, other)

pattern = r"[\p{Han}]+"
eachmatch(pattern, text) |> collect
# Good: 2-element Vector{RegexMatch}: RegexMatch("壹貳") RegexMatch("叁")

pattern = r"[\p{Han}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good on Julia 1.5 and 1.7 dev: 1-element Vector{RegexMatch}: RegexMatch("壹貳 叁")
# Bad on Julia 1.6: 2-element Vector{RegexMatch}: RegexMatch("壹貳") RegexMatch("叁")

pattern = r"[\p{Han} ]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("壹貳 叁")

text = "カ メ"
text |> collect
# 'カ': Unicode U+30AB (category Lo: Letter, other)
# ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
# 'メ': Unicode U+30E1 (category Lo: Letter, other)

pattern = r"[\p{L}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("カ メ")

pattern = r"[\p{Katakana} ]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("カ メ")

pattern = r"[\p{Katakana}\s]+"
eachmatch(pattern, text) |> collect
# Good on Julia 1.5 and 1.7 dev: 1-element Vector{RegexMatch}: RegexMatch("カ メ")
# Bad on Julia 1.6: 2-element Vector{RegexMatch}: RegexMatch("カ") RegexMatch("メ")

pattern = r"[\p{Katakana}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good on Julia 1.5 and 1.7 dev: 1-element Vector{RegexMatch}: RegexMatch("カ メ")
# Bad on Julia 1.6: 2-element Vector{RegexMatch}: RegexMatch("カ") RegexMatch("メ")

The general Letter property works fine, but script properties (Han, Katakana, etc.) fail when they are combined in a character class with a whitespace class such as \p{Zs} or \s. Combining them with a literal space character works.

This seems to come from Julia rather than from PCRE2, because using PCRE2 10.35 directly (the same version Julia 1.6 seems to use) works fine:

PCRE2 version 10.35 2020-05-09
  re> "[\p{Ll}\p{Zs}]+"
data> "aa bb"
 0: aa bb

PCRE2 version 10.35 2020-05-09
  re> "(*UTF)[\p{Han}\p{Zs}]+"
data> "壹貳 叁"
 0: \x{58f9}\x{8cb3} \x{53c1}
data> "(*UTF)壹貳 叁"
 0: \x{58f9}\x{8cb3} \x{53c1}

I have not been able to identify which commit introduced this problem, but I am glad it works on Julia 1.7 dev (at least for the moment). I could not find whether it was already reported and fixed in the Julia repository, or whether it was resolved indirectly.

Even though it works on 1.7 (master), this is a real breaking change: the behavior is wrong and unexpected, so existing code can break after upgrading to 1.6.

Understanding what the problem is (or was) would allow a more targeted set of tests to prevent a similar regression in the future.
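
As a starting point, the cases above could be turned into a small regression test along these lines. This is only a sketch using the standard `Test` library. The helper name `matched` is my own (hypothetical), and the expected values are simply the "Good" results reported above, so the testset is expected to fail on an affected 1.6 build.

```julia
using Test

# Hypothetical helper (not part of Julia): collect the matched substrings
# so they can be compared against plain strings.
matched(pattern, text) = [m.match for m in eachmatch(pattern, text)]

@testset "script properties combined with whitespace classes" begin
    # Cases reported as wrong on Julia 1.6 (two matches instead of one).
    @test matched(r"[\p{Han}\p{Zs}]+", "壹貳 叁") == ["壹貳 叁"]
    @test matched(r"[\p{Katakana}\s]+", "カ メ") == ["カ メ"]
    @test matched(r"[\p{Katakana}\p{Zs}]+", "カ メ") == ["カ メ"]

    # Control cases reported as correct on every version tested.
    @test matched(r"[\p{Han} ]+", "壹貳 叁") == ["壹貳 叁"]
    @test matched(r"[\p{L}\p{Zs}]+", "カ メ") == ["カ メ"]
    @test matched(r"[\p{Ll}\p{Zs}]+", "aa bb") == ["aa bb"]
end
```

Keeping the control cases next to the failing ones should make it easier to see whether a future failure is specific to script properties or affects Unicode properties in general.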

Sincerely.

Discourse link.

Labels: regression (regression in behavior compared to a previous version)