Skip to content

Parser performance with ambiguity #1514

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
sternj opened this issue Mar 2, 2025 · 16 comments
Open

Parser performance with ambiguity #1514

sternj opened this issue Mar 2, 2025 · 16 comments
Labels

Comments

@sternj
Copy link

sternj commented Mar 2, 2025

I'm trying to write a parser that deals with some syntactic elements inside of Markdown without affecting the rest of the document. I've been dealing with a lot of issues surrounding the performance of the grammar. In trying to convert it into LALR(1), I've found that there are a lot of ambiguities. The root of these is that a lot of the syntax uses characters that are legal in other contexts as well-- for instance, the phrase

asdf `{some syntax}`

could (in theory) be validly parsed as either just a string or as one string token asdf and then the rule corresponding to {some syntax}. There are also some multiline constructs and some list handling has to happen. Further complicating things is the fact that I want to preserve whitespace outside of the syntax, so ignoring it isn't really viable

So, the grammar's slow-- there are a good few similar issues to the above. When I tried to change the parser to LALR(1), as a consequence, it did things like regarding a NEWLINE as part of the preceding paragraph rather than as the boundary between two list items. I'm kinda at my wit's end here-- I have a grammar that works well, but is incredibly slow in Earley and is unusable with other parsers. I'd appreciate any guidance on making it run at any speed.

@sternj sternj added the question label Mar 2, 2025
@MegaIng
Copy link
Member

MegaIng commented Mar 2, 2025

markdown is not context free. Use a dedicated, hand written, parser for it, lark will never be able to handle it well.

@erezsh
Copy link
Member

erezsh commented Mar 2, 2025

I think that's a bit too dramatic. It might be possible to create an adequate solution, if he uses Lark through its interactive-parser interface.

@sternj
Copy link
Author

sternj commented Mar 2, 2025 via email

@MegaIng
Copy link
Member

MegaIng commented Mar 2, 2025

Hm, yeah, that restriction might be just about possible, especially if you can disallow ` appearing elsewhere. What normally causes trouble is the actually ambigous stuff about *, **, _, __ and the nestings thereof. (since it's very easy to make this require arbitrary lookahead: ___Hello_ World__ = Hello World vs ___Hello__ World_= Hello World)

@sternj
Copy link
Author

sternj commented Mar 2, 2025 via email

@sternj
Copy link
Author

sternj commented Mar 2, 2025

Alright, here's a sanitized version of the grammar I'm working with. I did have to change some details, so there might be some transcription errors. I know the explicit use of WS_INLINE is a bit suboptimal, but in most of these cases I'm trying to preserve whitespace in the document. There are also some productions of text that I needed to ensure were discrete tokens, so all in all it is a mess of a grammar.

start: content
content: (( simple_unary | intermediate_text | list_construct | codeblock | nothing_in_middle |   text|  inline_item) WS_INLINE* NEWLINE?)+
simple_unary.4: "[[" _simple_unary_item "]]" 

_simple_unary_item.4: SIMPLE_UNARY_ITEM ("|" SIMPLE_UNARY_ITEM ("," SIMPLE_UNARY_ITEM)*)?
SIMPLE_UNARY_ITEM: /[^\[\]]+/ 

intermediate_text.5: "`[intermediate" WS_INLINE+ _some_name? WS_INLINE+ "=" WS_INLINE* "`" WS* content WS* "`intermediate]`"
_some_name: IDENTIFIER
_bullet_list: ((list_item | codeblock) NEWLINE+)+
_CODE_HEAD: "```@lang" WS_INLINE* NEWLINE 
_CODE_TAIL:  "```" NEWLINE+
_LIST_CONSTRUCT_HEAD: _CODE_HEAD "list_construct" NEWLINE _CODE_TAIL

list_construct.3: _LIST_CONSTRUCT_HEAD _bullet_list _LIST_CONSTRUCT_TAIL
_LIST_CONSTRUCT_TAIL: _CODE_HEAD "end_list_construct" NEWLINE _CODE_TAIL
!_bullet:  WS_INLINE* "-" WS_INLINE+
inline_item.3: "[%" /.+?(?=%])/ "%]" | "[%" NEWLINE? _bullet_list  NEWLINE? "%]"
list_item.4: _bullet content (NEWLINE+ WS_INLINE* content)*

nothing_in_middle.4: "`[" WS_INLINE* "nothing_in_middle" WS_INLINE+ IDENTIFIER WS_INLINE* "]`"
IDENTIFIER: /[a-zA-Z_][a-zA-Z_0-9]*/

!text: (_TEXT | "[" | "]" | "`" | "=")+
_TEXT: /[^\n\[\]`=]+/
codeblock.2: "```@lang" WS_INLINE* NEWLINE /[^`]+/ "```"

%import common.NEWLINE
%import common.WS_INLINE
%import common.WS

@erezsh
Copy link
Member

erezsh commented Mar 2, 2025

Your grammar is very inefficient. Many of your rules can be joined into a single terminal.

e.g.

!text: (_TEXT | "[" | "]" | "`" | "=")+

Can be written as a single regexp, and it would performance MUCH better than as a rule. Perhaps some of these fragments are needed to handle ambiguity, but I'd bet many of them don't.

@erezsh
Copy link
Member

erezsh commented Mar 2, 2025

Also you should share how you're running Lark, i.e. what parameters you're using.

@sternj
Copy link
Author

sternj commented Mar 2, 2025

A decent few of them are needed for the ambiguity resolution-- the one you pointed out is one of those. Previous versions of that rule led the tokenizer to have a TEXT token ending with something random here[ and then complain that it expected a [.

As for parameters-- Earley parser, dynamic lexer (all defaults).

@erezsh
Copy link
Member

erezsh commented Mar 2, 2025

Any reason you can't do /[|]|=|...?

@sternj
Copy link
Author

sternj commented Mar 2, 2025

That's a good idea, I'll put that in and see what that gets me. I'm also seeing _bullet as a rule and nothing_in_middle as a rule that don't need to be, I'll see whether making simple_unary a terminal still lets me distinguish simple_unary_items. Anything else jump out at you?

@erezsh
Copy link
Member

erezsh commented Mar 3, 2025

The biggest ones are WS_INLINE+ WS_INLINE* and all the rest. They need to be a terminal, not a rule.

@sternj
Copy link
Author

sternj commented Mar 3, 2025 via email

@erezsh
Copy link
Member

erezsh commented Mar 3, 2025

Yes, exactly. WS_INLINE+ can be WS_INLINE_REPEAT and WS_INLINE* as WS_INLINE_REPEAT?

@sternj
Copy link
Author

sternj commented Mar 4, 2025

Thank you for all the help there, I'm a few hours into reworking this and have run into a confounding issue (vaguely related).

I have this grammar

start: _LIST_CONSTRUCT_TAIL
TRIPLE_BACKTICK: "`" "`" "`"
_CODE_HEAD: TRIPLE_BACKTICK "@lol" WS_INLINE* NEWLINE 
_CODE_TAIL:  TRIPLE_BACKTICK NEWLINE+ 

_LIST_CONSTRUCT_TAIL: _CODE_HEAD "end_list_construct" NEWLINE _CODE_TAIL

%import common.WS_INLINE
%import common.NEWLINE

(which I have tried with _CODE_HEAD in various forms including

```@lang

and

TRIPLE_BACKTICK: "```"

on this input:

```@lang
end_list_construct
`` `
(note that is not input: space before last backtick to get Markdown to work right on Github) 

and consistently get the error

```@lang

^
Expected one of: 
	* _LIST_CONSTRUCT_TAIL

Any ideas here?

@sternj
Copy link
Author

sternj commented Mar 6, 2025

Another ping on this because I think it's probably a PIBKAC, but I can't for the life of me tell what it is

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants