Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<regex>: Implementation divergence for capture group behavior #5365

Open
StephanTLavavej opened this issue Mar 26, 2025 · 3 comments
Open

<regex>: Implementation divergence for capture group behavior #5365

StephanTLavavej opened this issue Mar 26, 2025 · 3 comments
Labels
bug Something isn't working regex meow is a substring of homeowner

Comments

@StephanTLavavej
Copy link
Member

C:\Temp>type meow.cpp
#include <cstddef>
#include <exception>
#include <print>
#include <string>

#ifdef USE_BOOST
#include <boost/regex.hpp>
namespace std_or_boost = boost;
constexpr auto library = "boost";
#else
#include <regex>
namespace std_or_boost = std;
constexpr auto library = "std";
#endif

using std::exception, std::print, std::println, std::size_t, std::string;

int main() {
    try {
        println("Library: {}", library);
        const string s{"acbd"};
        const std_or_boost::regex r{"(?:(a)|(b)|(c)|(d))+"};
        std_or_boost::smatch m;
        const bool result{std_or_boost::regex_search(s, m, r)};
        println("regex_search: {}", result);
        for (size_t i = 0; i < m.size(); ++i) {
            print("m[{}]", i);
            if (m[i].matched) {
                println(R"(.str(): "{}")", m[i].str());
            } else {
                println(".matched: false");
            }
        }
    } catch (const exception& e) {
        println("Exception: {}", e.what());
    }
}

VS 2022 17.14 Preview 2 prints:

C:\Temp>cl /EHsc /nologo /W4 /std:c++latest /MTd /Od meow.cpp && meow
meow.cpp
Library: std
regex_search: true
m[0].str(): "acbd"
m[1].str(): "a"
m[2].str(): "b"
m[3].matched: false
m[4].str(): "d"

microsoft/STL main prints the exact same thing (as of f2a2933 with @muellerj2's amazing #5218 merged), so we haven't regressed or improved. #5218 did fix several other long-standing bugs in our internal database, so I was surprised to see that this one remained.

And we have implementation divergence! See: https://godbolt.org/z/cjz8PWaf7

libstdc++ 14.2 and Boost 1.87.0 agree, differing only in their Library output:

Library: std
regex_search: true
m[0].str(): "acbd"
m[1].str(): "a"
m[2].str(): "b"
m[3].str(): "c"
m[4].str(): "d"
Library: boost
regex_search: true
m[0].str(): "acbd"
m[1].str(): "a"
m[2].str(): "b"
m[3].str(): "c"
m[4].str(): "d"

But libc++ 20.1 says:

Library: std
regex_search: true
m[0].str(): "acbd"
m[1].matched: false
m[2].matched: false
m[3].matched: false
m[4].str(): "d"

Originally reported as VSO-110491 / AB#110491 (in 2014 or earlier via the now-defunct Microsoft Connect). The original user expected libstdc++/Boost's behavior.

@StephanTLavavej StephanTLavavej added bug Something isn't working regex meow is a substring of homeowner labels Mar 26, 2025
@Alcaro
Copy link
Contributor

Alcaro commented Mar 26, 2025

Firefox:

> "abcd".match(/(?:(a)|(b)|(c)|(d))+/)
Array(5) [ "abcd", undefined, undefined, undefined, "d" ]

https://262.ecma-international.org/#sec-runtime-semantics-repeatmatcher-abstract-operation 22.2.2.3.1 RepeatMatcher

While continuation passing style is rather hard to read, it's clear that there's no loop around matching the contents (steps 8, 9c and 10) that doesn't also include clearing the matches (step 4).

This is also explicitly called out further down:

Note 3 | Step 4 of the RepeatMatcher clears Atom's captures each time Atom is repeated.

libc++ is right, libstdc++ is wrong (and ms-stl is so wrong I don't even need to check the spec).

I agree that libstdc++/boost is the more intuitive behavior, but the spec is clear.

@muellerj2
Copy link
Contributor

I think this is a difference between ECMAScript and POSIX regexes:

  • Disregarding the fact that there are no pipes in basic regexes, I think libstdc++ and Boost are correct when following the rules for basic regexes. The standard implies this behavior by the following sentence in Section 9.3.6: "When a subexpression matches more than one string, a backreference expression corresponding to the subexpression shall refer to the last matched string." (The standard doesn't know the concept of a capture group, so the backreference is the next best thing. And other parts of the standard, e.g., the specification of sed, use the content of the backreference where you would normally refer to capture groups.)
  • Strictly speaking, the POSIX standard doesn't define the notion of backreference for extended regexes, because they don't support backreferences. But I think it is implied by the specification of sed (in newer standard versions), which states that the replacement \n shall be the content of the n-th backreference even when matching with extended regexes. So the standard assumes in the specification of sed that the concept of backreference exists for extended regexes, too, and I think the only reasonable inference is that the rules for basic regexes are applied then as well. (And GNU sed agrees that the same rules should be used, e.g., sed - E 's/((a)|(b)|(c)|(d))+/\2\3\4\5/' prints "abcd".)
  • libc++ is correct under ECMAScript rules, as Alcaro pointed out.

@muellerj2
Copy link
Contributor

muellerj2 commented Mar 27, 2025

The weird matching result produced by MSVC STL's regex is due to this loop:

STL/stl/inc/regex

Lines 3626 to 3628 in f2a2933

for (size_t _Idx = _Tgt_state._Grp_valid.size(); _Node->_Idx < _Idx;) {
_Tgt_state._Grp_valid[--_Idx] = false;
}

When matching a new capture group starts, this loop unmatches all capture groups with a greater index. In the test case, the capture group "(c)" has a greater index than capture group "(b)", so when "b" is matched in input "acbd", the prior matching of "c" gets unmatched.

I think the assumption underlying the loop is the following: In a repetition containing several capture groups, there is one capture group that surrounds all the others. If so, this loop will reset all capture groups in the repetition when matching the outermost capture group starts.

However, this ignores that there are non-capturing groups as well. These can result in repetitions that do not have a single outermost capture group, but several outermost capture groups (as in the test case). The result is that matching another repetition might unmatch none, some or all of these capture groups depending on the input.

It's probably sufficient to remove (or disable) this loop to get POSIX semantics, but I'm not sure yet what the appropriate changes are for ECMAScript semantics. Perhaps this loop needs to go for ECMAScript semantics as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working regex meow is a substring of homeowner
Projects
None yet
Development

No branches or pull requests

3 participants