-
-
Notifications
You must be signed in to change notification settings - Fork 436
Parser performance with ambiguity #1514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
markdown is not context free. Use a dedicated, hand written, parser for it, lark will never be able to handle it well. |
I think that's a bit too dramatic. It might be possible to create an adequate solution, if he uses Lark through its interactive-parser interface. |
Hm, yeah, that restriction might be just about possible, especially if you can disallow |
Alright, here's a sanitized version of the grammar I'm working with. I did have to change some details, so there might be some transcription errors. I know the explicit use of
|
Your grammar is very inefficient. Many of your rules can be joined into a single terminal. e.g. !text: (_TEXT | "[" | "]" | "`" | "=")+ Can be written as a single regexp, and it would performance MUCH better than as a rule. Perhaps some of these fragments are needed to handle ambiguity, but I'd bet many of them don't. |
Also you should share how you're running Lark, i.e. what parameters you're using. |
A decent few of them are needed for the ambiguity resolution-- the one you pointed out is one of those. Previous versions of that rule led the tokenizer to have a As for parameters-- Earley parser, dynamic lexer (all defaults). |
Any reason you can't do |
That's a good idea, I'll put that in and see what that gets me. I'm also seeing _bullet as a rule and nothing_in_middle as a rule that don't need to be, I'll see whether making simple_unary a terminal still lets me distinguish simple_unary_items. Anything else jump out at you? |
The biggest ones are |
Yes, exactly. |
Thank you for all the help there, I'm a few hours into reworking this and have run into a confounding issue (vaguely related). I have this grammar
(which I have tried with
and
on this input:
and consistently get the error
Any ideas here? |
Another ping on this because I think it's probably a PIBKAC, but I can't for the life of me tell what it is |
I'm trying to write a parser that deals with some syntactic elements inside of Markdown without affecting the rest of the document. I've been dealing with a lot of issues surrounding the performance of the grammar. In trying to convert it into LALR(1), I've found that there are a lot of ambiguities. The root of these is that a lot of the syntax uses characters that are legal in other contexts as well-- for instance, the phrase
could (in theory) be validly parsed as either just a string or as one string token
asdf
and then the rule corresponding to{some syntax}
. There are also some multiline constructs and some list handling has to happen. Further complicating things is the fact that I want to preserve whitespace outside of the syntax, so ignoring it isn't really viableSo, the grammar's slow-- there are a good few similar issues to the above. When I tried to change the parser to LALR(1), as a consequence, it did things like regarding a NEWLINE as part of the preceding paragraph rather than as the boundary between two list items. I'm kinda at my wit's end here-- I have a grammar that works well, but is incredibly slow in Earley and is unusable with other parsers. I'd appreciate any guidance on making it run at any speed.
The text was updated successfully, but these errors were encountered: