<regex>: regex_search() sometimes incorrectly matches capturing groups #6118

@muellerj2

Describe the bug

regex_search() fails to reset the capturing group state between match attempts at successive start positions. As a result, it can report capturing groups as matched even though they did not participate in the successful match.

Test case

#include <iostream>
#include <regex>
#include <string>

using namespace std;

int main() {
    smatch captures;
    regex re("a|(b)c");
    string input("ba");

    auto result = regex_search(input, captures, re);

    cout << "search succeeded: " << result << '\n';

    if (result) {
        cout << "matched character sequence: " << captures[0].str() << '\n';
        cout << "capturing group 1 matched: " << captures[1].matched << '\n';
        cout << "contents of capturing group 1: " << captures[1].str();
    }

    return 0;
}

Godbolt link: https://godbolt.org/z/Wsb6aEjTf

This program produces the following output:

search succeeded: 1
matched character sequence: a
capturing group 1 matched: 1
contents of capturing group 1: b

Expected behavior

The program should produce the following output:

search succeeded: 1
matched character sequence: a
capturing group 1 matched: 0
contents of capturing group 1: 
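
For use in a regression test, the expected behavior above can be condensed into a single check. This is only a sketch: the helper name `reports_expected_captures` is hypothetical and not part of any library. It uses the same pattern and input as the test case.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of the STL): returns true if regex_search()
// reports the results listed under "Expected behavior" for the test case.
bool reports_expected_captures() {
    const std::regex re("a|(b)c");
    const std::string input("ba");
    std::smatch captures;

    if (!std::regex_search(input, captures, re)) {
        return false; // the search itself should succeed
    }

    // The overall match should be "a", and group 1 should not be matched,
    // because the successful alternative does not contain it.
    return captures[0].str() == "a"
        && !captures[1].matched
        && captures[1].str().empty();
}
```

On an implementation affected by this bug, the helper should return false, because captures[1].matched is reported as true.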

STL version

This bug appears to have been introduced in MSVC Build Tools 19.50 and still reproduces on current head.

Additional context

The setup code for the matcher in _Matcher(x)::_Match() has never contained explicit code to reset the capturing group state. But until recently, the matcher tried to reset captures when it encountered an _N_capture node:

STL/stl/inc/regex

Lines 3605 to 3608 in 713dd95

// CodeQL [SM02323] Comparing unchanging unsigned int _Node->_Idx to decreasing size_t _Idx is safe.
for (size_t _Idx = _Tgt_state._Grp_valid.size(); _Node->_Idx < _Idx;) {
    _Tgt_state._Grp_valid[--_Idx] = false;
}

However, this loop was an inadequate attempt to implement ECMAScript's capturing group reset rule and a major source of bugs in this area. #5456 changed the matcher to reset capturing groups according to the ECMAScript standard, and removed this loop as part of that change. This had a subtle consequence: capturing groups are no longer reset when the _N_capture node for group 0 is encountered, so capturing groups matched during a prior failed match attempt can now spill over into subsequent match attempts.
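
The spill-over can be illustrated with a toy model. This is a heavily simplified sketch, not the actual code in `<regex>`: all names (`ToyState`, `toy_match_at`, `toy_search`) are hypothetical, the pattern `a|(b)c` is hand-coded, and the model assumes that a failed attempt leaves its group flags behind in a state object shared across start positions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model only; none of these names exist in the STL.
struct ToyState {
    std::vector<bool> grp_valid; // grp_valid[i]: group i is marked as matched
};

struct ToyResult {
    bool found;
    std::size_t pos;     // start position of the match
    bool group1_matched; // what the caller would see in captures[1].matched
};

// Hand-coded attempt to match "a|(b)c" at position pos.
// Group 1 is recorded as soon as "(b)" matches, before it is known
// whether the rest of the alternative succeeds.
bool toy_match_at(const std::string& s, std::size_t pos, ToyState& st) {
    if (pos < s.size() && s[pos] == 'a') {
        return true; // first alternative: "a"
    }
    if (pos < s.size() && s[pos] == 'b') {
        st.grp_valid[1] = true; // "(b)" matched
        if (pos + 1 < s.size() && s[pos + 1] == 'c') {
            return true; // second alternative: "(b)c"
        }
        // The attempt fails here, but grp_valid[1] stays set.
    }
    return false;
}

// Tries every start position in turn, like a search would.
// reset_between_attempts chooses whether the group flags are cleared
// before each attempt (assumed correct behavior) or carried over.
ToyResult toy_search(const std::string& s, bool reset_between_attempts) {
    ToyState st{std::vector<bool>(2, false)};
    for (std::size_t pos = 0; pos <= s.size(); ++pos) {
        if (reset_between_attempts) {
            st.grp_valid.assign(st.grp_valid.size(), false);
        }
        if (toy_match_at(s, pos, st)) {
            const bool group1 = st.grp_valid[1];
            return {true, pos, group1};
        }
    }
    return {false, s.size(), false};
}
```

In this model, `toy_search("ba", false)` finds the match at position 1 with group 1 still flagged from the failed attempt at position 0, which mirrors the output reported above. `toy_search("ba", true)` finds the same match with group 1 unmatched.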
