POSIX submatch semantics is broken in two (rare) cases #2

Open
@neongreen

ChrisKuklewicz/regex-tdfa#12, originally reported by @skvadrik


As I've been working on the implementation of POSIX-compliant capturing groups in RE2C (http://re2c.org), I discovered a couple of bugs in Regex-TDFA. I found them when fuzz-testing my implementation against Regex-TDFA-1.2.2. The algorithm used in RE2C is described in detail in the following paper:

http://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf

It is a slightly modified version of Laurikari's algorithm. The POSIX submatch semantics is due to Kuklewicz: https://wiki.haskell.org/index.php?title=Regular_expressions/Bounded_space_proposal&oldid=11475
I made an attempt at formalizing Kuklewicz's algorithm, which is also described in the paper. The reported bugs are rare (the fuzzer found them approximately once in 50000 runs), so they are probably caused by some mis-optimization.

The first bug can be triggered by the regular expression (((a*)|b)|b)+ and the input string ab: Regex-TDFA returns an incorrect submatch result for the second capturing group ((a*)|b) (no match instead of b at offset 1). Some alternative regular expressions that cause the same error: (((a*)|b)|b){1,2}, ((b|(a*))|b)+.

$ ghci
GHCi, version 8.0.2: http://www.haskell.org/ghc/  :? for help
Prelude> import Text.Regex.TDFA as T

The error:

Prelude T> "ab"  T.=~ "^(((a*)|b)|b)+" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0))]]

But not with * (the example below works correctly!):

Prelude T> "ab"  T.=~ "^(((a*)|b)|b)*" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(1,1)),(3,(-1,0))]]
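
For readers less used to the MatchArray output: each entry maps a group index to an (offset, length) pair, and an offset of -1 means the group did not participate in the match. The following is a minimal sketch, not part of Regex-TDFA, that decodes the two arrays above into substrings so the difference in group 2 is easier to see. The names Groups, decode, decodeAll, plusResult and starResult are hypothetical helpers introduced only for this illustration; the arrays are copied by hand from the GHCi output, so the sketch needs only Data.Array and not the regex library itself.

```haskell
import Data.Array (Array, listArray, assocs)

-- Same shape as a MatchArray: group index -> (offset, length).
-- (Hypothetical alias, so that the sketch does not depend on regex-tdfa.)
type Groups = Array Int (Int, Int)

-- Decode one group against the input; offset -1 means "no match".
decode :: String -> (Int, Int) -> Maybe String
decode input (off, len)
  | off < 0   = Nothing
  | otherwise = Just (take len (drop off input))

-- Decode every group, keeping the group indices.
decodeAll :: String -> Groups -> [(Int, Maybe String)]
decodeAll input groups = [ (i, decode input g) | (i, g) <- assocs groups ]

-- The two arrays reported above for the input "ab", copied by hand.
plusResult, starResult :: Groups
plusResult = listArray (0, 3) [(0, 2), (1, 1), (-1, 0), (-1, 0)]  -- with +
starResult = listArray (0, 3) [(0, 2), (1, 1), (1, 1), (-1, 0)]   -- with *

main :: IO ()
main = do
  print (decodeAll "ab" plusResult)
  print (decodeAll "ab" starResult)
```

With +, group 2 decodes to Nothing, while with * it decodes to Just "b", which is the expected POSIX result in both cases.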

The same error:

Prelude T> "ab"  T.=~ "^((b|(a*))|b)+" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0))]]

Again, the same error:

Prelude T> "ab"  T.=~ "^(((a*)|b)|b){1,2}" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0))]]

The second bug can be triggered by the regular expression ((a?)(())*|a)+ and the input string aa. The results are incorrect for the second group (a?) (no match instead of a at offset 1) and for the third group (()) and the fourth group () (no match instead of an empty match at offset 2). An alternative variant that also fails: ((a?()?)|a)+.

The error:

Prelude T> "aa"  T.=~ "^((a?)(())*|a)+" :: [MatchArray]
[array (0,4) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0)),(4,(-1,0))]]

But not with * (the example below works correctly!):

Prelude T> "aa"  T.=~ "^((a?)(())*|a)*" :: [MatchArray]
[array (0,4) [(0,(0,2)),(1,(1,1)),(2,(1,1)),(3,(2,0)),(4,(2,0))]]

With {1,2}, the output recorded here matches the correct result:

Prelude T> "aa"  T.=~ "^((a?)(())*|a){1,2}" :: [MatchArray]
[array (0,4) [(0,(0,2)),(1,(1,1)),(2,(1,1)),(3,(2,0)),(4,(2,0))]]

The same error:

Prelude T> "aa"  T.=~ "^((a?()?)|a)+" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0))]]
