Skip to content

Support the same Regexps to be Regexp.linear_time? as CRuby #3858

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
eregon opened this issue May 9, 2025 · 8 comments
Open

Support the same Regexps to be Regexp.linear_time? as CRuby #3858

eregon opened this issue May 9, 2025 · 8 comments
Assignees

Comments

@eregon
Copy link
Member

eregon commented May 9, 2025

We noticed in ruby/net-imap#470 that some Regexp are not Regexp.linear_time? on TruffleRuby, but they should be.
See the output in CI of that PR/repo, e.g. https://github.com/ruby/net-imap/actions/runs/14917675208/job/41906831286#step:5:54

One specific I noticed is /[\x80-\xff\r\n]/n, which I have a fix for.
(cc @nevans)

@eregon eregon self-assigned this May 9, 2025
graalvmbot pushed a commit that referenced this issue May 9, 2025
* See #3858
* We need to pass a java.lang.String to TRegex, in this case we can pass it as
  raw bytes since we also pass the encoding name to TRegex.
* Remove the UnsupportedCharsetException catch clause as no Charset should be
  involved in this conversion since the migration to TruffleString.
@eregon
Copy link
Member Author

eregon commented May 12, 2025

Need to look at a new run now that #3859 got merged

@nevans
Copy link

nevans commented May 14, 2025

@eregon Great! That brings it from 28 pending tests (23 Regexp.linear_time?, 5 other) down to 12 pending tests (7 Regexp.linear_time?, 5 other).

@nevans
Copy link

nevans commented May 14, 2025

All seven of the remaining pending Regexp.linear_time? tests are using Unicode properties:

  • /(?-mix:(?-mix:\p{^AGE=3.2})|(?-mix:[\u{06dd 070f 1680 180e 3000 feff e0001}\u{0000}...
    (it's very long... ruby's regexp engine doesn't support the \p{Bidi=*} properties)
  • /\p{^AGE=3.2}/
  • /\p{private use}/
  • /\p{noncharacter code point}/
  • /[\p{in ideographic description characters}&&\p{AGE=3.2}]/
  • /[\p{in specials}&&\p{AGE=3.2}&&\p{^NChar}]/
  • /[\p{in Tags}&&\p{AGE=3.2}]/

@eregon
Copy link
Member Author

eregon commented May 15, 2025

Thank you for checking! (I was planning to check soon but this is very helpful)

The 7 remaining ones are all Unicode properties currently unsupported by TRegex:

$ ruby --log.level=FINE -e '/\p{private use}/ =~ "a"'
...
[regex::TotalCompilationTime] FINE: Total compilation time: 0.233076ms, matcher: bailout, regex: /\\p{private use}/
[regex::BailoutMessages] FINE: TRegex: unsupported Unicode property private use: /\p{private use}/

I'll check with the TRegex folks if it would make sense to implement them in TRegex or not so much.

These Regexps are from:

Net::IMAP::StringPrep::SASLprep::PROHIBITED_STORED
Net::IMAP::StringPrep::Tables::IN_A_1
Net::IMAP::StringPrep::Tables::IN_C_3
Net::IMAP::StringPrep::Tables::IN_C_4
Net::IMAP::StringPrep::Tables::IN_C_7
Net::IMAP::StringPrep::Tables::IN_C_6
Net::IMAP::StringPrep::Tables::IN_C_9

https://github.com/ruby/net-imap/blob/master/lib/net/imap/stringprep/saslprep_tables.rb
https://github.com/ruby/net-imap/blob/master/lib/net/imap/stringprep/tables.rb

@nevans Do you know if these are used by typical net-imap usages, or test-only or something else? Given their location I supposed they are production/non-test code but just wanted to confirm.

@eregon
Copy link
Member Author

eregon commented May 15, 2025

  • (it's very long... ruby's regexp engine doesn't support the \p{Bidi=*} properties)

FWIW, I found there is /\p{Bidi_Control}/ (works on TruffleRuby & CRuby).
There is also /\p{Bidi_Mirrored}/ in TRegex, but Onigmo and JOni preprocessing don't recognize it.
Would \p{Bidi=*} be the union of both maybe?

EDIT: @jirkamarsik said:

As for \p{Bidi=*}, I don't think it's just some alias for a union of Bidi_Control and Bidi_Mirrored. There is a non-binary (not boolean-valued) Unicode property name Bidi Class with values like L, R, BN, ON...
https://www.unicode.org/Public/16.0.0/ucd/extracted/DerivedBidiClass.txt

@eregon
Copy link
Member Author

eregon commented May 15, 2025

I'll check with the TRegex folks if it would make sense to implement them in TRegex or not so much.

There are two cases:

For /\p{private use}/ and /\p{noncharacter code point}/ variants with _ instead of spaces work, but we need to check they are fully equivalent/correct.
EDIT: yes, CRuby removes _- : https://github.com/ruby/ruby/blob/186e60cb68fe8e49455cffc1955637cabd270c81/enc/unicode.c#L235 and https://github.com/ruby/ruby/blob/186e60cb68fe8e49455cffc1955637cabd270c81/tool/enc-unicode.rb#L304

AGE like /\p{^AGE=3.2}/ is not supported by TRegex currently, more info here: https://github.com/oracle/graal/blob/35647e1d4a8db30feb6392633498553c0b9514b0/regex/src/com.oracle.truffle.regex/src/com/oracle/truffle/regex/tregex/parser/flavors/RubyFlavor.java#L74-L82

@nevans
Copy link

nevans commented May 19, 2025

@nevans Do you know if these are used by typical net-imap usages, or test-only or something else? Given their location I supposed they are production/non-test code but just wanted to confirm.

@eregon All of the tables and regexps are needed for a complete stringprep (RFC 3454) implementation, but Net::IMAP only needs the SASLprep (RFC 4013) and "trace" profiles (RFC 4505). The "trace" profile is only used by the ANONYMOUS SASL mechanism, and SASLprep is only automatically applied for the SCRAM-* family of SASL mechanisms (SCRAM-SHA-1 and SCRAM-SHA-256).

Although these are SASL mechanisms are not supported by the most popular publicly hosted IMAP servers (e.g: Gmail, Office365, Yahoo), they are supported by open source servers like Dovecot and Cyrus and by many smaller email hosts like Fastmail. For the providers that do support them, SCRAM-SHA-256 is the recommended mechanism for password-based authentication. These mechanisms are also more popular for other (non-IMAP) SASL applications (e.g: postgresql uses SCRAM-SHA-256). At some point, I'd like to extract the Stringprep and SASL implementations from Net::IMAP into their own gems, so it can be used by net-pop, net-smtp, and other gems. It doesn't make sense for each ruby gem that needs SASL or stringprep to roll their own version of them!

For what it's worth: stringprep has been obsoleted by PRECIS (RFC 8264). But in practice, each stringprep application and protocol needs to explicitly release updated specifications stating that stringprep has been replaced by PRECIS. And, as far as I know, none of the updated IMAP/SASL specifications have converted over. Also, I think PRECIS might lean even more heavily on Unicode character properties than stringprep did.

@nevans
Copy link

nevans commented May 19, 2025

Also, yes, @jirkamarsik is correct. I should've written \p{Bidi_Class=*}.

I'm guessing the BIDI Regexps could be much more succinctly compressed as:

BIDI_R_AL     = /[\p{Bidi_Class=R}\p{Bidi_Class=AL}&&\p{AGE=3.2}]/u
BIDI_NOT_R_AL = /[^\p{Bidi_Class=R}\p{Bidi_Class=AL}&&\p{AGE=3.2}]/u
BIDI_L        = /[\p{Bidi_Class=L}&&\p{AGE=3.2}]/u

# If a string contains any RandALCat character, the string MUST NOT
# contain any LCat character.
BIDI_FAILS_REQ2 = Regexp.union(
  /#{BIDI_R_AL}.*?#{BIDI_L}/mu, # RandALCat followed by LCat
  /#{BIDI_L}.*?#{BIDI_R_AL}/mu, # RandALCat preceded by LCat
)

# If a string contains any RandALCat character, a RandALCat
# character MUST be the first character of the string, and a
# RandALCat character MUST be the last character of the string.
BIDI_FAILS_REQ3 = 
  # contains RandALCat:
  Regexp.union(
    /\A#{BIDI_NOT_R_AL}.*?#{BIDI_R_AL}/mu, # but doesn't start with RandALCat
    /#{BIDI_R_AL}.*?#{BIDI_NOT_R_AL}\z/mu, # but doesn't end   with RandALCat
  )

BIDI_FAILS = Regexp.union(BIDI_FAILS_REQ2, BIDI_FAILS_REQ3)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants