Skip to content

Support infix operator rules #1836

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Support infix operator rules #1836

wants to merge 3 commits into from

Conversation

msujew
Copy link
Member

@msujew msujew commented Mar 11, 2025

Closes #1065

Adds support for a special infix operator syntax that aims to improve Langium on multiple fronts:

  1. Improves readability/maintainability of Langium grammars
  2. Simplify specifying precedence for infix binary expressions
  3. Improves performance by parsing the binary expression in a flat list - the parser will automatically transform the resulting object into a tree structure.

Note that the same AST node is constructed as if you were to write the parser rules manually. Also note that this PR assumes that all binary operators are left-associative (i.e. A + B + C is parsed as (A + B) + C). We can revisit this decision in case we find a case for right-associative grammar rules. Note that this only requires changes in the tree restructuring routine.

@msujew msujew added the parser Parser related issue label Mar 11, 2025
@msujew msujew requested a review from spoenemann March 11, 2025 14:41
@aabounegm
Copy link
Member

Amazing effort on a long-requested feature! I just have a question: how would this work if I want to create a different AST node type for each operator (like here)? Would something like this work?

Expr: Equal;

infix Equal on Multiplication: '==';
infix Multiplication on Addition: '*';
infix Addition on Primary: '+';

Primary infers Expr: '(' Expression ')' | NumberLiteral;

@msujew
Copy link
Member Author

msujew commented Mar 11, 2025

@aabounegm Good question: Splitting the infix operator rule into multiple rules would result in decreased performance - we get a boost by reducing the amount of parser rules (and therefore subrule calls) we need to create.

Instead, I would propose something like this perhaps?

Expr: BinaryExpr;

infix BinaryExpr on Primary: {infer Equal} '==' > {infer Multiplication} '*' > {infer Addition} '+';

Primary infers Expr: '(' Expression ')' | NumberLiteral;

I think that should work pretty well. However, I would attempt to implement such a feature in a separate step/PR. This PR is complex enough as it is.

@aabounegm
Copy link
Member

In my grammar (linked above), I already have multiple rules since the language is for educational purposes. Do you mean the performance will be even worse or the same as the way it is currently defined in my grammar?

@msujew
Copy link
Member Author

msujew commented Mar 11, 2025

Do you mean the performance will be even worse or the same as the way it is currently defined in my grammar?

It will be the same (maybe very slightly worse, due to the additional transformation step done to convert the flat list into an expression tree), but you won't gain anything from it. This PR allows to eliminate multiple levels of precedence in order to massively boost performance when parsing expressions.

@ydaveluy
Copy link
Contributor

How does it works when types are not infered (using pre-defined interfaces)?

interface Expression {
}
interface BinaryOperation extends Expression {
	leftOperand: Expression
	rightOperand: Expression
	feature: string
}
interface UnaryOperation extends Expression {
	operand: Expression
	feature: string
}
// how to map leftOperand, rightOperand and feature ?
infix BinaryExpression on UnaryOperation: '%' > '^' > '*' | '/' > '+' | '-';

UnaryOperation returns Expression:
    {UnaryOperation} feature=("!" | "-" | "+" | "~")  operand=Literal
;

@msujew
Copy link
Member Author

msujew commented Mar 11, 2025

@ydaveluy Currently not supported, but I already plan to address this in a separate PR.

@szarnekow
Copy link

in order to massively boost performance when parsing expressions.

can you provide some numbers?

@ydaveluy
Copy link
Contributor

@msujew The infix rule is very concise but contrary to all other rules it is not possible to infer the names of the attributes for the left operand, feature and right operand. If a future PR provides a syntax using pre-defined types it will mean maybe syntax very different.

Maybe a common syntax handling both cases would be more approriate, something like this for inferred types:

BinaryExpression: 
    left=BinaryExpression feature=( '%' > '^' > '*' | '/' > '+' | '-') right=BinaryExpression | PrimaryExpression;

and this for pre-defined types:

interface Expression {
}
interface BinaryExpression extends Expression {
	leftOperand: Expression
	rightOperand: Expression
	feature: string
}
BinaryExpression returns Expression: 
  {BinaryExpression}  left=BinaryExpression feature=( '%' > '^' > '*' | '/' > '+' | '-') right=BinaryExpression | PrimaryExpression;

I don't know if it is technically possible.

@msujew
Copy link
Member Author

msujew commented Mar 12, 2025

can you provide some numbers?

@szarnekow Sure, I just ran the branch through ~4100 lines of the arithmetics example code (parser only). On main, this currently takes 430ms, whereas this branch only requires 260ms. So we're talking about 40% improved performance wrt to expression parsing. I used this code and placed in the packages/arithmetics/test directory:

import { describe, expect, test } from 'vitest';
import { createArithmeticsServices } from '../src/language-server/arithmetics-module.js';
import { EmptyFileSystem } from 'langium';

describe('Test the arithmetics parsing', () => {

    const services = createArithmeticsServices(EmptyFileSystem);
    const parser = services.arithmetics.parser.LangiumParser;

    test('Simple expression', () => {
        let text = '1 + 2 ^ 3 * 4 - 5 / 6;';
        for (let i = 0; i < 12; i++) {
            text = text + text;
        }
        console.time('parse');
        const parseResult = parser.parse('module test ' + text);
        console.timeEnd('parse');
        expect(parseResult.parserErrors).toHaveLength(0);
    });
});

I don't know if it is technically possible.

@ydaveluy I'm not sure either. Your proposal still includes building a tree-like structure for infix operators - however, that's exactly what we want to prevent. Tranforming that into the expected flat structure we need for parsing is maybe possible, although you have so many degrees of freedom that it likely just becomes a bottomless hole. Instead I thought that something like this could work:

interface Expression {
}
interface BinaryExpression extends Expression {
	leftOperand: Expression
	rightOperand: Expression
	feature: string
}
infix BinaryExpression on Primary returns BinaryExpression {
  left: leftOperand,
  right: rightOperand,
  operator: feature
}: '%' > '^' > '*' | '/' > '+' | '-';

Or alternatively for inferred types:

infix BinaryExpression on Primary infers BinaryExpression {
  left: leftOperand,
  right: rightOperand,
  operator: feature
}: '%' > '^' > '*' | '/' > '+' | '-';

@aabounegm
Copy link
Member

@ydaveluy's suggestion had me wondering, isn't the new infix Binary on Primary: '+'; syntax basically syntax sugar for Binary: Primary | Binary '+' Binary; I understand that left-recursion in general is not yet supported, but this infix rule seems to be a specific case of left-recursion. If my understanding is correct, would it also not be technically possible to just detect this specific pattern (Rule: Base | lhs=Rule op='+' rhs=Rule;) and perform the same transformation that this PR does to the infix rule, rather than introduce new syntax?


Modulo infers Expression:
PrimaryExpression ({infer BinaryExpression.left=current} operator='%' right=PrimaryExpression)*;
infix BinaryExpression on PrimaryExpression: '%' > '^' > '*' | '/' > '+' | '-';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally the precedence of > is greater than the one of |. It reads a bit unnatural.
Adding parentheses could be an option to enumerate all alternatives.
Or maybe a equal sign instead of the |.
WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And why not using the comma operator instead of '>' ?

'%' , '^' , '*' | '/' , '+' | '-';

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‚>‘ implies an order between the operators, that makes totally sense to me. Having a comma, you do not know what comes first: low or high precedence. But you can assume some convention, true.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can mitigate this by writing each group in its own line:

infix BinaryExpression on PrimaryExpression:
    '%'
    > '^'
    > '*' | '/'
    > '+' | '-';

This is how I'd format such expressions for readability.

@@ -63,7 +71,7 @@ export interface BaseParser {
/**
* Adds a new parser rule to the parser
*/
rule(rule: ParserRule, impl: RuleImpl): RuleResult;
rule(rule: ParserRule | InfixRule, impl: RuleImpl): RuleResult;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ParserRule | InfixRule occur together quite often. Maybe add a common type and use that one instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would also help for isParserRule(rule) || isInfixRule(rule), which I've seen at least twice.

@msujew msujew force-pushed the msujew/binary-operators branch from ea74ce1 to ca70591 Compare March 17, 2025 13:48
Copy link
Contributor

@spoenemann spoenemann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great feature! I have a few conceptual proposals.


InfixRuleOperators: precedences+=InfixRuleOperatorList ('>' precedences+=InfixRuleOperatorList)*;

InfixRuleOperatorList: operators+=Keyword ('|' operators+=Keyword)*;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the associativity is left-to-right?

Shall we add an option to change that?

InfixRuleOperatorList:
    (associativity=Associativity 'assoc')?
    operators+=Keyword ('|' operators+=Keyword)*;

Associativity returns string: 'left' | 'right';

If omitted, the default should be left. With this, you can realize assignments as operators:

infix BinaryExpression on PrimaryExpression:
    '*' | '/'
    > '+' | '-'
    > right assoc '=' | '*=' | '/=' | '+=' | '-=';

Maybe also allow no assoc to forbid more than two operands on the same level?

@@ -74,6 +74,12 @@ ParserRule:
(definesHiddenTokens?='hidden' '(' (hiddenTokens+=[AbstractRule:ID] (',' hiddenTokens+=[AbstractRule:ID])*)? ')')? ':'
definition=Alternatives ';';

InfixRule: 'infix' RuleNameAndParams 'on' call=RuleCall ':' operators=InfixRuleOperators ';';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should allow infers and returns as usual. If a declared type is given, it must have the properties left, right, and operator with respective types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah 👉 #1836 (comment) – 👍 for a separate PR, but let's not forget that.

@@ -16,7 +16,7 @@ import { inRange } from './cst-utils.js';
* Link the `$container` and other related properties of every AST node that is directly contained
* in the given `node`.
*/
export function linkContentToContainer(node: AstNode): void {
export function linkContentToContainer(node: AstNode, deep = false): void {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd try to avoid boolean parameters and use an options object instead.

@spoenemann
Copy link
Contributor

To this discussion: #1836 (comment) (too bad GitHub doesn't have threads 😞)

I actually see value in standardizing the structure of BinaryOperator nodes. That can be very useful for other features that we might add later. So I vote for not giving the option to change the property names.

If you already have custom code relying on different property names, I'd use TypeScript rename refactoring to change it.

@ydaveluy
Copy link
Contributor

ydaveluy commented Apr 6, 2025

If I understand correctly, this PR involves a post-parsing refactoring for binary expressions.

I wonder if this case could be handled with a more generic approach:

Write the BinaryExpression rule without precedence:
BinaryExpression: lhs=BinaryExpression op=('*'|'/'|'-'|'+'|'|'|'&') rhs=BinaryExpression | PrimaryExpression;

Provide an API to register refactoring actions for the parser (similar to what is done for validation):

        actions: RefactoringAction<ast.TestAstType> = {
            BinaryExpression: refactorBinaryExpression
        };
    parser.register(actions);
    refactorBinaryExpression(expr: ast.BinaryExpression): void {
       if(ast.isBinaryExpression(expr.lhs) && isBinaryExpression(expr.rhs)) {
           // do refactoring
       }
    }

Provide a generic action to restructure the AST for binary expressions with operator precedence.

This would be easily customizable by the user and could allow other types of refactoring.

@msujew
Copy link
Member Author

msujew commented Apr 6, 2025

If I understand correctly, this PR involves a post-parsing refactoring for binary expressions.

Right, but that's not the only advantage. I mainly created this change to speed up expression parsing performance (between 40-50% in my tests). This is done by converting the infix parser rule into a flat list of expressions and then constructing a parse tree. Your proposed change still constructs a parse tree, which is then getting rewritten. This would unfortunately eliminate the performance benefits of this PR.

BinaryExpression: lhs=BinaryExpression op=('*'|'/'|'-'|'+'|'|'|'&') rhs=BinaryExpression | PrimaryExpression;

Note that this syntax is left recursive, and we don't support left recursion (due to using an LL parser).

@ydaveluy
Copy link
Contributor

ydaveluy commented Apr 6, 2025

Sorry the syntax was wrong...

Expression: PrimaryExpression ({BinaryExpression.lhs=current} operator=('*'|'/'|'-'|'+'|'|'|'&') rhs=PrimaryExpression)*;

@msujew
Copy link
Member Author

msujew commented Apr 6, 2025

Hm, I quite like the idea. I'll iterate on this during the course of the week :)

Edit: Thinking further about this, it becomes rather difficult to use this syntax to use the tree rewriting, mainly because it becomes impossible to decide which expressions are part of the "current" list of expressions, and which are nested (via usage of parenthesis, or similar).

@spoenemann spoenemann added this to the v4.0.0 milestone Apr 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parser Parser related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a special syntax for binary operators
6 participants