regexp: confusing behavior on invalid utf-8 sequences #11185

Closed
@dvyukov

The following program:

package main

import "regexp"

func main() {
    re := regexp.MustCompile(".")
    println(re.MatchString("\xd1"))
    println(re.MatchString("\xd1\x84"))
    println(re.MatchString("\xd1\xd1"))
    re = regexp.MustCompile("..")
    println(re.MatchString("\xd1"))
    println(re.MatchString("\xd1\x84"))
    println(re.MatchString("\xd1\xd1"))
}

prints:

true
true
true
false
false
true

While the following C++ program:

#include <stdio.h>
#include <re2/re2.h>

int main() {
    RE2 re1(".");
    printf("%d\n", RE2::PartialMatch("\xd1", re1));
    printf("%d\n", RE2::PartialMatch("\xd1\x84", re1));
    printf("%d\n", RE2::PartialMatch("\xd1\xd1", re1));
    RE2 re2(".");
    printf("%d\n", RE2::PartialMatch("\xd1", re2));
    printf("%d\n", RE2::PartialMatch("\xd1\x84", re2));
    printf("%d\n", RE2::PartialMatch("\xd1\xd1", re2));
}

prints:

0
1
0
0
1
0

This raises two questions:

  1. Why does the behavior differ between regexp and RE2? (RE2 seems to be more consistent.)
  2. Why does "\xd1\xd1" match both "." and ".."? I could understand it matching one or the other, but not both. Is it one character or two?
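
One way to probe the second question is to count how many runes Go's UTF-8 decoder sees in each input and to repeat the match with anchored patterns. The following is only a sketch: `countRunes` and `exactlyN` are hypothetical helper names introduced here, and it assumes that regexp steps through invalid input the way `unicode/utf8` does, treating each invalid byte as one `utf8.RuneError` of width 1.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// countRunes (hypothetical helper) reports how many runes the utf8
// package sees in s. Each byte of an invalid sequence counts as a
// single rune of width 1.
func countRunes(s string) int {
	return utf8.RuneCountInString(s)
}

// exactlyN (hypothetical helper) reports whether the whole of s is
// matched by exactly n occurrences of ".". The anchors stop the
// engine from matching only a prefix, and (?s) lets "." match "\n".
func exactlyN(s string, n int) bool {
	re := regexp.MustCompile(fmt.Sprintf(`(?s)^.{%d}$`, n))
	return re.MatchString(s)
}

func main() {
	for _, s := range []string{"\xd1", "\xd1\x84", "\xd1\xd1"} {
		fmt.Printf("%q: valid=%v runes=%d exactly1=%v exactly2=%v\n",
			s, utf8.ValidString(s), countRunes(s),
			exactlyN(s, 1), exactlyN(s, 2))
	}
}
```

Under that assumption, the rune counts are 1, 1 and 2, so "\xd1\xd1" is two characters to the matcher. Because `MatchString` is unanchored, "." then matches the first of those two characters and ".." matches both, which would explain why both patterns report true.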

go version devel +b0532a9 Mon Jun 8 05:13:15 2015 +0000 linux/amd64
