Whitespace terminal rules are always matched first #1828

Open
dsogari opened this issue Mar 8, 2025 · 7 comments
Labels
as designed The feature in question is working as designed

Comments

@dsogari

dsogari commented Mar 8, 2025

Langium version: latest
Package name: langium

Steps To Reproduce

  1. Open the playground page.

  2. Copy and paste the following grammar into the Grammar pane:

    grammar Test
    
    entry Doc: Node;
    
    Node: name=NAME WS TEXT;
    
    terminal NAME:  /\w+/;
    terminal WS:    /\s/; // matches whitespace
    terminal TEXT:  /.+/; // matches whitespace
    
  3. Copy and paste the following content into the Content pane (notice the trailing spaces):

    abc   
    

Link to code example: playground example

The current behavior

[Image: playground screenshot, not preserved]

The expected behavior

The content should be parsed and the following syntax tree should appear in the Syntax tree pane:

    {
      $type: "Node",
      name: "abc"
    }

Additional notes

I think this behaviour is caused by the following piece of code in token-builder.ts:

    const pattern = terminalToken.PATTERN;
    if (typeof pattern === 'object' && pattern && 'test' in pattern && isWhitespace(pattern)) {
        tokens.unshift(terminalToken);
    } else {
        tokens.push(terminalToken);
    }

In other words, whitespace-matching lexer rules are given priority, which changes the order specified in the grammar and alters the expected behavior. This is especially frustrating when, for instance, we need a single whitespace character as a delimiter/separator for specific parser rules.
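The reordering described above can be reproduced in isolation. The following is a minimal sketch, not Langium's actual code: `TerminalToken`, `matchesWhitespace`, and `orderTokens` are hypothetical names, and `matchesWhitespace` only approximates the `isWhitespace` check by testing the pattern against a few whitespace characters.

```typescript
// Simplified stand-in for a terminal token: a name plus its regular expression.
interface TerminalToken {
    name: string;
    pattern: RegExp;
}

// Hypothetical approximation of `isWhitespace`: a pattern counts as
// whitespace-matching if it accepts at least one whitespace character.
const WHITESPACE_SAMPLES = [' ', '\t', '\n', '\r'];

function matchesWhitespace(pattern: RegExp): boolean {
    return WHITESPACE_SAMPLES.some(ch => pattern.test(ch));
}

// Same unshift/push decision as the snippet quoted from token-builder.ts.
function orderTokens(terminals: TerminalToken[]): TerminalToken[] {
    const tokens: TerminalToken[] = [];
    for (const terminal of terminals) {
        if (matchesWhitespace(terminal.pattern)) {
            tokens.unshift(terminal); // moved to the front of the lexer order
        } else {
            tokens.push(terminal); // keeps its relative grammar order
        }
    }
    return tokens;
}

// Terminals in the order they appear in the grammar above.
const grammarOrder: TerminalToken[] = [
    { name: 'NAME', pattern: /\w+/ },
    { name: 'WS', pattern: /\s/ },
    { name: 'TEXT', pattern: /.+/ },
];

const lexerOrder = orderTokens(grammarOrder).map(t => t.name);
console.log(lexerOrder.join(', ')); // prints "TEXT, WS, NAME"
```

Under this approximation, `TEXT` also counts as whitespace-matching because `/.+/` accepts a space, so it ends up ahead of both `WS` and `NAME`.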

@dsogari dsogari added the bug Something isn't working label Mar 8, 2025
@msujew
Member

msujew commented Mar 8, 2025

Hey @dsogari,

This is actually working as intended, although we might want to refactor parts of that code anyway soon. Adopters of Langium can always override the DefaultTokenBuilder to prevent the unshift on whitespace tokens.

Note that I'm not sure your language will parse as expected even with this adjustment. We use an LL parser with separate lexing and parsing phases (see chevrotain). The lexing phase is not context-aware and will therefore likely lead to unintended behavior when it comes to the TEXT terminal. That is, any text that appears after the WS token in the Node rule and could match the NAME token will produce a NAME token (and not a TEXT token as expected here). This is just a limitation of how context-free language parsers generally work. What you're building is not a context-free language and is therefore not properly representable in Langium.
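To illustrate the point about a lexing phase that is not context-aware, here is a small standalone sketch of a first-match lexer that keeps the terminals in grammar order. It does not use Langium or Chevrotain; `Rule`, `LexedToken`, and `lexFirstMatch` are hypothetical names chosen for this example.

```typescript
interface Rule {
    name: string;
    pattern: RegExp;
}

interface LexedToken {
    type: string;
    image: string;
}

// Tries the rules in order at each offset and takes the first one that
// matches, without any knowledge of which parser rule is being parsed.
function lexFirstMatch(input: string, rules: Rule[]): LexedToken[] {
    const tokens: LexedToken[] = [];
    let offset = 0;
    while (offset < input.length) {
        let matched = false;
        for (const rule of rules) {
            // The sticky flag anchors the match at the current offset.
            const sticky = new RegExp(rule.pattern.source, 'y');
            sticky.lastIndex = offset;
            const match = sticky.exec(input);
            if (match && match[0].length > 0) {
                tokens.push({ type: rule.name, image: match[0] });
                offset += match[0].length;
                matched = true;
                break;
            }
        }
        if (!matched) {
            throw new Error(`Unexpected character at offset ${offset}`);
        }
    }
    return tokens;
}

// Terminals in grammar order, i.e. without moving whitespace rules forward.
const rulesInGrammarOrder: Rule[] = [
    { name: 'NAME', pattern: /\w+/ },
    { name: 'WS', pattern: /\s/ },
    { name: 'TEXT', pattern: /.+/ },
];

const spaced = lexFirstMatch('abc def', rulesInGrammarOrder).map(t => t.type);
console.log(spaced.join(' ')); // prints "NAME WS NAME"

// Three trailing spaces, as in the content from the issue description.
const trailing = lexFirstMatch('abc   ', rulesInGrammarOrder).map(t => t.type);
console.log(trailing.join(' ')); // prints "NAME WS WS WS"
```

In both inputs no `TEXT` token is produced, so a parser rule of the form `NAME WS TEXT` would still not match, even though the whitespace rule was not moved to the front.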

@msujew msujew added the as designed The feature in question is working as designed label Mar 8, 2025
@dsogari
Author

dsogari commented Mar 8, 2025

Oh, I see it now. Thanks for the response!

I guess I was biased from having used Jison in the past, where we could push and pop lexer states. So in that case the grammar is actually context-aware.

@msujew
Member

msujew commented Mar 8, 2025

> I guess I was biased from having used Jison in the past, in which we could push and pop lexer states. So in that case the grammar actually has a context.

This is also possible, but again through the TokenBuilder API. The Langium grammar language isn't designed to expose such fine grained details about the parsing/tokenization.

@dsogari
Author

dsogari commented Mar 8, 2025

@msujew I think what I'm looking for is the Lexer Modes feature of Chevrotain. Is there a way to use that within Langium? If so, could you point me to a tutorial or example code that demonstrates it?

@dsogari
Author

dsogari commented Mar 8, 2025

Sorry for the noise, but for anyone who might follow this discussion, here are some notes from Chevrotain's tutorial summarizing the issues I've been facing:

  • The order of Token definitions passed to the Lexer is important. The first PATTERN to match will be chosen not the longest.
  • The lexer is context unaware, it lexes each token (pattern) individually.
    • If you need to distinguish between different contexts during the lexing phase, take a look at Lexer Modes.
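The idea behind lexer modes can be sketched without any library. The example below is a hand-written illustration of pushing and popping modes, not the Chevrotain or Langium API; `ModalRule`, `ModalToken`, `lexWithModes`, and the mode names `header` and `text` are all assumptions made for this sketch.

```typescript
interface ModalRule {
    name: string;
    pattern: RegExp;
    pushMode?: string; // enter this mode after the token has been matched
    popMode?: boolean; // return to the previous mode after the token has been matched
}

interface ModalToken {
    type: string;
    image: string;
}

// First-match lexer that only considers the rules of the currently active mode.
function lexWithModes(
    input: string,
    modes: Record<string, ModalRule[]>,
    defaultMode: string
): ModalToken[] {
    const tokens: ModalToken[] = [];
    const modeStack: string[] = [defaultMode];
    let currentMode = defaultMode;
    let offset = 0;
    while (offset < input.length) {
        const activeRules = modes[currentMode] ?? [];
        let matched = false;
        for (const rule of activeRules) {
            const sticky = new RegExp(rule.pattern.source, 'y');
            sticky.lastIndex = offset;
            const match = sticky.exec(input);
            if (match && match[0].length > 0) {
                tokens.push({ type: rule.name, image: match[0] });
                offset += match[0].length;
                if (rule.popMode && modeStack.length > 1) {
                    modeStack.pop();
                }
                if (rule.pushMode) {
                    modeStack.push(rule.pushMode);
                }
                currentMode = modeStack[modeStack.length - 1] ?? defaultMode;
                matched = true;
                break;
            }
        }
        if (!matched) {
            throw new Error(`Unexpected character at offset ${offset} in mode ${currentMode}`);
        }
    }
    return tokens;
}

// After the single WS separator, switch to a mode where only TEXT exists.
const modes: Record<string, ModalRule[]> = {
    header: [
        { name: 'NAME', pattern: /\w+/ },
        { name: 'WS', pattern: /\s/, pushMode: 'text' },
    ],
    text: [
        { name: 'TEXT', pattern: /.+/, popMode: true },
    ],
};

// "abc" followed by three spaces: one becomes WS, the other two become TEXT.
const modalTokens = lexWithModes('abc   ', modes, 'header');
console.log(modalTokens.map(t => t.type).join(' ')); // prints "NAME WS TEXT"
```

Because `TEXT` is only active after the separator, it no longer competes with `NAME` or `WS`, and the token sequence matches the shape `NAME WS TEXT` from the grammar in the issue description.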

@msujew
Copy link
Member

msujew commented Mar 8, 2025

See here (from eclipse-langium/langium-website#132), we simply expose the Chevrotain API directly.

@dsogari
Copy link
Author

dsogari commented Mar 8, 2025

Thanks a lot! I'll take a look at it.

EDIT: that is exactly what I was looking for. :)

@spoenemann spoenemann removed the bug Something isn't working label Mar 26, 2025