<regex>: Implementation divergence for capture group behavior #5365
Comments
Firefox:
https://262.ecma-international.org/#sec-runtime-semantics-repeatmatcher-abstract-operation
22.2.2.3.1 RepeatMatcher
While continuation-passing style is rather hard to read, it's clear that there's no loop around matching the contents (steps 8, 9c and 10) that doesn't also include clearing the matches (step 4). This is also explicitly called out further down:
libc++ is right, libstdc++ is wrong (and ms-stl is so wrong I don't even need to check the spec). I agree that the libstdc++/Boost behavior is the more intuitive one, but the spec is clear.
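To check which side of this divergence a given standard library falls on, a small probe along the following lines may help. This is only a sketch: the pattern `(?:(a)|(b)|(c)|(d))+` and the helper `probe_captures` are assumptions chosen to fit the "acbd" input discussed in this thread, not necessarily the issue's exact repro.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the issue): records, for every sub-match,
// whether it participated in the final match and which text it holds.
struct CaptureState {
    bool matched;
    std::string text;
};

std::vector<CaptureState> probe_captures(const std::string& pattern,
                                         const std::string& input) {
    // ECMAScript is the default grammar; it is spelled out here for clarity.
    const std::regex re(pattern, std::regex::ECMAScript);
    std::smatch m;
    std::vector<CaptureState> out;
    if (std::regex_match(input, m, re)) {
        for (std::size_t i = 0; i < m.size(); ++i) {
            out.push_back({m[i].matched, m[i].str()});
        }
    }
    return out;  // empty when the whole input does not match
}
```

Calling `probe_captures("(?:(a)|(b)|(c)|(d))+", "acbd")` and printing each entry shows how the library treats groups that matched only in earlier iterations. Group 0 and the group matched in the final iteration should look the same everywhere; the groups from earlier iterations are where the implementations are reported to disagree.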
I think this is a difference between ECMAScript and POSIX regexes:
The weird matching result produced by MSVC STL's regex is due to this loop: Lines 3626 to 3628 in f2a2933
When matching of a new capture group starts, this loop unmatches all capture groups with a greater index. In the test case, the capture group "(c)" has a greater index than capture group "(b)", so when "b" is matched in input "acbd", the prior match of "c" gets unmatched.
I think the assumption underlying the loop is the following: in a repetition containing several capture groups, there is one capture group that surrounds all the others. If so, this loop resets all capture groups in the repetition when matching of the outermost capture group starts.
However, this ignores that there are non-capturing groups as well. These can result in repetitions that do not have a single outermost capture group, but several outermost capture groups (as in the test case). The result is that matching another repetition might unmatch none, some or all of these capture groups, depending on the input.
It's probably sufficient to remove (or disable) this loop to get POSIX semantics, but I'm not sure yet what the appropriate changes are for ECMAScript semantics. Perhaps this loop needs to go for ECMAScript semantics as well.
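The three behaviors discussed in this thread can be compared with a toy model. The sketch below is not library code: it hard-codes a hypothetical pattern `(?:(a)|(b)|(c)|(d))+`, assumes that side effects of failed alternatives are undone by backtracking (so only the alternative that succeeds changes capture state), and the names `ResetPolicy` and `simulate` are made up for illustration.

```cpp
#include <cstddef>
#include <string>

// Toy model of capture handling for the hypothetical pattern
// (?:(a)|(b)|(c)|(d))+ ; group k+1 matches the single literal "abcd"[k].
enum class ResetPolicy {
    KeepAll,          // captures persist across iterations (POSIX-style)
    ClearEachRound,   // all groups in the repetition are cleared per iteration
                      // (ECMAScript RepeatMatcher, step 4)
    ClearHigherIndex  // entering group k unmatches every group with a greater
                      // index (the loop described above)
};

// Returns one character per capture group, '-' meaning "unmatched".
// Returns an empty string if the input does not match the pattern at all.
std::string simulate(const std::string& input, ResetPolicy policy) {
    const std::string literals = "abcd";
    std::string caps(literals.size(), '-');
    if (input.empty()) return "";  // '+' requires at least one iteration
    for (char ch : input) {
        const std::size_t k = literals.find(ch);
        if (k == std::string::npos) return "";  // no alternative matches
        if (policy == ResetPolicy::ClearEachRound) {
            caps.assign(literals.size(), '-');
        } else if (policy == ResetPolicy::ClearHigherIndex) {
            for (std::size_t j = k + 1; j < caps.size(); ++j) caps[j] = '-';
        }
        caps[k] = ch;  // the group that matched in this iteration
    }
    return caps;
}
```

For the input "acbd", this model yields "abcd" under `KeepAll`, "---d" under `ClearEachRound`, and "ab-d" under `ClearHigherIndex`, the last one reflecting that matching "b" discards the earlier "c". Under `ClearHigherIndex`, the input "abcd" loses nothing while "dcba" ends as "a---", which illustrates the "none, some or all" dependence on the input.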
VS 2022 17.14 Preview 2 prints:
microsoft/STL main prints the exact same thing (as of f2a2933 with @muellerj2's amazing #5218 merged), so we haven't regressed or improved. #5218 did fix several other long-standing bugs in our internal database, so I was surprised to see that this one remained.
And we have implementation divergence! See: https://godbolt.org/z/cjz8PWaf7
libstdc++ 14.2 and Boost 1.87.0 agree, differing only in their Library output:
But libc++ 20.1 says:
Originally reported as VSO-110491 / AB#110491 (in 2014 or earlier via the now-defunct Microsoft Connect). The original user expected libstdc++/Boost's behavior.