Whitespace terminal rules are always matched first #1828

dsogari · 2025-03-08T11:17:08Z

Langium version: latest
Package name: langium

Steps To Reproduce

Open the playground page.

Copy and paste the following grammar into the Grammar pane:

```
grammar Test

entry Doc: Node;

Node: name=NAME WS TEXT;

terminal NAME:  /\w+/;
terminal WS:    /\s/; // matches whitespace
terminal TEXT:  /.+/; // matches whitespace
```

Copy and paste the following content into the Content pane (notice the trailing spaces):
```
abc   
```

Link to code example: playground example

The current behavior

The expected behavior

The content should be parsed and the following syntax tree should appear in the Syntax tree pane:

```
{
  $type: "Node",
  name: "abc"
}
```

Additional notes

I think this behaviour is caused by the following piece of code in token-builder.ts:

```
const pattern = terminalToken.PATTERN;
if (typeof pattern === 'object' && pattern && 'test' in pattern && isWhitespace(pattern)) {
    tokens.unshift(terminalToken);
} else {
    tokens.push(terminalToken);
}
```

In other words, whitespace-matching lexer rules are given priority, which changes the order specified in the grammar and alters the expected behavior. This is especially frustrating when, for instance, a single whitespace character is needed as a delimiter/separator for specific parser rules.
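To make the reordering concrete, here is a minimal, self-contained sketch of the effect. It does not use Langium or Chevrotain: `TokenDef`, `matchesWhitespace`, and `orderTokens` are hypothetical names, and the whitespace check is an assumption (a pattern counts as whitespace if it matches a single space), which may differ from what `isWhitespace` really does.

```typescript
// Hypothetical stand-in for a lexer token definition; not Langium's or Chevrotain's type.
interface TokenDef {
    name: string;
    pattern: RegExp;
}

// Assumption: a pattern is treated as "whitespace" if it matches a single space.
// The real isWhitespace helper may use different criteria.
function matchesWhitespace(pattern: RegExp): boolean {
    return pattern.test(' ');
}

// Mirrors the unshift/push logic quoted above: whitespace-matching
// terminals go to the front, everything else is appended.
function orderTokens(terminals: TokenDef[]): TokenDef[] {
    const tokens: TokenDef[] = [];
    for (const terminal of terminals) {
        if (matchesWhitespace(terminal.pattern)) {
            tokens.unshift(terminal);
        } else {
            tokens.push(terminal);
        }
    }
    return tokens;
}

// Terminals in the order they are declared in the grammar.
const grammarOrder: TokenDef[] = [
    { name: 'NAME', pattern: /\w+/ },
    { name: 'WS', pattern: /\s/ },
    { name: 'TEXT', pattern: /.+/ },
];

const lexerOrder: string[] = orderTokens(grammarOrder).map(t => t.name);
console.log(lexerOrder); // [ 'TEXT', 'WS', 'NAME' ]
```

Under that assumption, both `WS` and `TEXT` are moved ahead of `NAME`, because `/.+/` also matches a space. `TEXT` is processed last, so it ends up first in the list.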


msujew · 2025-03-08T14:32:16Z

Hey @dsogari,

This is actually working as intended, although we might want to refactor parts of that code anyway soon. Adopters of Langium can always override the DefaultTokenBuilder to prevent the unshift on whitespace tokens.

Note that I'm not entirely sure your language will parse as expected even with this adjustment. We use an LL parser that features separate lexing and parsing phases (see Chevrotain). The lexing phase is not context-aware, and will therefore likely lead to unintended behavior when it comes to the TEXT terminal. I.e. any text that appears after the WS token in the Node rule and could theoretically match the NAME token will generate a NAME token (and not a TEXT token as expected here). This is just a limitation of how context-free language parsers generally work. What you're building is not a context-free language and therefore not properly representable in Langium.
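As an illustration of that limitation, the following sketch implements a deliberately simplified first-match lexer with no knowledge of the parser rule being processed. It is not Chevrotain's implementation; `TokenDef`, `Token`, and `lex` are hypothetical names used only for this example.

```typescript
// Hypothetical types for illustration; not the Chevrotain API.
interface TokenDef {
    name: string;
    pattern: RegExp;
}

interface Token {
    type: string;
    image: string;
}

// Simplified context-unaware lexer: at each offset, the first definition
// in the list that matches wins, regardless of what the parser expects.
function lex(input: string, defs: TokenDef[]): Token[] {
    const tokens: Token[] = [];
    let offset = 0;
    while (offset < input.length) {
        let matched = false;
        for (const def of defs) {
            // The sticky flag anchors the match at exactly `offset`.
            const regex = new RegExp(def.pattern.source, 'y');
            regex.lastIndex = offset;
            const match = regex.exec(input);
            if (match && match[0].length > 0) {
                tokens.push({ type: def.name, image: match[0] });
                offset += match[0].length;
                matched = true;
                break;
            }
        }
        if (!matched) {
            throw new Error(`Unexpected character at offset ${offset}`);
        }
    }
    return tokens;
}

// Terminals in grammar order, i.e. without any whitespace reordering.
const defs: TokenDef[] = [
    { name: 'NAME', pattern: /\w+/ },
    { name: 'WS', pattern: /\s/ },
    { name: 'TEXT', pattern: /.+/ },
];

const tokenTypes: string[] = lex('abc def', defs).map(t => t.type);
console.log(tokenTypes); // [ 'NAME', 'WS', 'NAME' ]
```

Even with the grammar order preserved, the word after the separator is tokenized as `NAME`, so a rule expecting `NAME WS TEXT` would not match this input.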

dsogari · 2025-03-08T20:13:37Z

Oh, I see it now. Thanks for the response!

I guess I was biased from having used Jison in the past, in which we could push and pop lexer states. So in that case the grammar actually ~~has a context~~ is context-aware.

msujew · 2025-03-08T20:28:48Z

> I guess I was biased from having used Jison in the past, in which we could push and pop lexer states. So in that case the grammar actually has a context.

This is also possible, but again through the TokenBuilder API. The Langium grammar language isn't designed to expose such fine grained details about the parsing/tokenization.

dsogari · 2025-03-08T21:06:53Z

@msujew I think what I'm looking for is the Lexer Modes feature of Chevrotain. Is there a way to use that within Langium? If so, could you point me to a tutorial or example code that demonstrates it?

dsogari · 2025-03-08T22:43:52Z

Sorry for the noise, but for anyone who might follow this discussion, here are some notes from Chevrotain's tutorial summarizing the issues I've been facing:

The order of Token definitions passed to the Lexer is important. The first PATTERN to match will be chosen not the longest.

The lexer is context unaware, it lexes each token (pattern) individually.

If you need to distinguish between different contexts during the lexing phase, take a look at Lexer Modes.
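For readers unfamiliar with the idea, here is a rough, library-free sketch of what a mode stack buys you. It only simulates the concept: `ModalTokenDef`, `pushMode`, `popMode`, and `lexWithModes` are hypothetical names, not the Chevrotain or Langium API, and the mode names `head` and `text` are made up for this example.

```typescript
// Hypothetical types for illustration; consult the Chevrotain docs for the real API.
interface ModalTokenDef {
    name: string;
    pattern: RegExp;
    pushMode?: string;  // mode to enter after this token is matched
    popMode?: boolean;  // leave the current mode after this token is matched
}

interface Token {
    type: string;
    image: string;
}

// First-match lexer that only considers the definitions of the mode
// currently on top of the mode stack.
function lexWithModes(
    input: string,
    modes: Record<string, ModalTokenDef[]>,
    defaultMode: string
): Token[] {
    const tokens: Token[] = [];
    const modeStack: string[] = [defaultMode];
    let offset = 0;
    while (offset < input.length) {
        const currentDefs = modes[modeStack[modeStack.length - 1]];
        let matched = false;
        for (const def of currentDefs) {
            const regex = new RegExp(def.pattern.source, 'y');
            regex.lastIndex = offset;
            const match = regex.exec(input);
            if (match && match[0].length > 0) {
                tokens.push({ type: def.name, image: match[0] });
                offset += match[0].length;
                // Never pop the default mode off the stack.
                if (def.popMode && modeStack.length > 1) {
                    modeStack.pop();
                }
                if (def.pushMode) {
                    modeStack.push(def.pushMode);
                }
                matched = true;
                break;
            }
        }
        if (!matched) {
            throw new Error(`Unexpected character at offset ${offset}`);
        }
    }
    return tokens;
}

// In "head" mode only NAME and WS exist; WS switches to "text" mode,
// where TEXT is the only candidate and consumes the rest of the line.
const modes: Record<string, ModalTokenDef[]> = {
    head: [
        { name: 'NAME', pattern: /\w+/ },
        { name: 'WS', pattern: /\s/, pushMode: 'text' },
    ],
    text: [
        { name: 'TEXT', pattern: /.+/, popMode: true },
    ],
};

const modalTypes: string[] = lexWithModes('abc def', modes, 'head').map(t => t.type);
console.log(modalTypes); // [ 'NAME', 'WS', 'TEXT' ]
```

Because `NAME` is not a candidate in the `text` mode of this sketch, the text after the separator becomes a `TEXT` token, which is the context-dependent behavior that a single flat token list cannot express.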

msujew · 2025-03-08T22:44:19Z

See here (from eclipse-langium/langium-website#132), we simply expose the Chevrotain API directly.

dsogari · 2025-03-08T22:45:58Z

Thanks a lot! I'll take a look at it.

EDIT: that is exactly what I was looking for. :)

dsogari added the "bug" label (Something isn't working) on Mar 8, 2025

msujew added the "as designed" label (The feature in question is working as designed) on Mar 8, 2025

spoenemann removed the "bug" label (Something isn't working) on Mar 26, 2025
