Commit 9eff829

[GR-64899] Fix binary Regexp compilation to TRegex
* See #3858
* We need to pass a java.lang.String to TRegex, in this case we can pass it as raw bytes since we also pass the encoding name to TRegex.
* Remove the UnsupportedCharsetException catch clause as no Charset should be involved in this conversion since the migration to TruffleString.
1 parent 21550be commit 9eff829

2 files changed: +7 -8 lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ Compatibility:
 
 Performance:
 
+* Use TRegex for binary Regexps with non-US-ASCII characters in the pattern like `/[\x80-\xff]/n` (#3858, @eregon).
 
 Changes:
 

src/main/java/org/truffleruby/core/regexp/TRegexCache.java

Lines changed: 6 additions & 8 deletions
@@ -9,8 +9,6 @@
  */
 package org.truffleruby.core.regexp;
 
-import java.nio.charset.UnsupportedCharsetException;
-
 import com.oracle.truffle.api.CompilerDirectives;
 import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary;
 import com.oracle.truffle.api.interop.InteropLibrary;
@@ -155,12 +153,12 @@ private static Object compileTRegex(RubyContext context, RubyRegexp regexp, bool
         var tstring = tstringBuilder.toTString();
         try {
             processedRegexpSource = TStringUtils.toJavaStringOrThrow(tstring, tstringBuilder.getRubyEncoding());
-        } catch (CannotConvertBinaryRubyStringToJavaString | UnsupportedCharsetException e) {
-            // Some strings cannot be converted to Java strings, e.g. strings with the
-            // BINARY encoding containing characters higher than 127.
-            // Also, some charsets might not be supported on the JVM and therefore
-            // a conversion to j.l.String might be impossible.
-            return null;
+        } catch (CannotConvertBinaryRubyStringToJavaString e) {
+            // A BINARY regexp with non-US-ASCII bytes, pass it as "raw bytes" instead.
+            // TRegex knows how to interpret those bytes correctly as we pass the encoding name as well.
+            var latin1string = tstring.forceEncodingUncached(Encodings.BINARY.tencoding,
+                    Encodings.ISO_8859_1.tencoding);
+            processedRegexpSource = TStringUtils.toJavaStringOrThrow(latin1string, Encodings.ISO_8859_1);
         }
 
         String flags = optionsToFlags(regexp.options, atStart);
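
The "raw bytes" approach in the new catch clause relies on ISO-8859-1 mapping every byte value 0x00..0xFF to the Unicode code point with the same value, so a byte sequence can be carried inside a java.lang.String without loss. The standalone sketch below illustrates that property with plain JDK APIs only. It does not use TruffleString or TRegex, the class and method names (`Latin1RoundTrip`, `bytesToLatin1String`, `latin1StringToBytes`) are hypothetical, and the byte array is just an illustrative stand-in for a binary pattern such as `/[\x80-\xff]/n`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    /** Decodes each byte as the code point with the same numeric value (0x00..0xFF). */
    public static String bytesToLatin1String(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** Recovers the original bytes from a string produced by bytesToLatin1String. */
    public static byte[] latin1StringToBytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Illustrative bytes only: '[', 0x80, '-', 0xFF, ']'
        byte[] pattern = { '[', (byte) 0x80, '-', (byte) 0xFF, ']' };

        String asString = bytesToLatin1String(pattern);

        // One char per byte, and each char has the same numeric value as its byte.
        System.out.println(asString.length() == pattern.length);
        System.out.println((int) asString.charAt(1) == 0x80);
        System.out.println((int) asString.charAt(3) == 0xFF);

        // The round trip restores the original bytes exactly.
        System.out.println(Arrays.equals(pattern, latin1StringToBytes(asString)));

        // For contrast: decoding the same bytes as UTF-8 is lossy, because 0x80 and 0xFF
        // are malformed on their own and get replaced with U+FFFD.
        String asUtf8 = new String(pattern, StandardCharsets.UTF_8);
        System.out.println(asUtf8.indexOf('\uFFFD') >= 0);
    }
}
```

Because the string only transports byte values, the receiver still needs to be told the real encoding separately, which is the role the commit message assigns to the encoding name passed to TRegex.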
