-
-
Notifications
You must be signed in to change notification settings - Fork 414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ability to search for parseable substrings #1371
Comments
Lark isn't made for scanning text. However that question has been asked many times. To my recollection, there are two ways about this -
Similar issues: |
Ah, apologies. I should have searched previous issues for duplicates. I don't think either of those are likely to be viable for me unfortunately, as I need this to work on quite large inputs, so I think I'm back to implementing my own. Thanks for the quick response! |
If all your terminals in the grammar are non-conflicting (i.e. using |
I created a small branch of This small examples works:
(it extracts templates from wikitext) |
@MegaIng Btw, if you tidy up that implementation a bit and show that it passes a few stress tests, I'm inclined to add it as an official API. I'm also curious if in the search implementation, maybe doing |
Will do at some point.
I think it would be better to avoid creating copies of the input string. An non-absurd usecase for this feature is search in log files that might be multiple 100 MB large. Instead I would like to add |
Suggestion
I would be like to be able to use Lark to search for all substrings of a string matching some Lark grammar. e.g. something along the lines of the following method on the Lark class:
(I'm not wed to this exact interface)
Describe alternatives you've considered
Currently my least bad alternative for something like this is to write my own specialised parser tool, which I'd rather not do.
The more likely solution I'd adopt is just use regex and hope I can work around the cases where I need a better grammar than that supports.
I've tried doing this with Lark using a heavily ambiguous grammar that has just arbitrary amounts of noise at the beginning and end and ambiguous parse forests. It doesn't seem to work very reliably (and I expect is highly inefficient as a solution), but I may investigate this option further.
Additional context
If you're curious the specific use case I have for this is in implementing test-case reduction (delta debugging, C-reduce, etc). The idea is to have lots of little grammars that define common structures that might appear in various file formats (e.g. arithmetic expressions) and then parse out the parts of the file that match those grammars and make transformations to only that region, based on the parse. This means that I need to be able to parse arbitrary unknown inputs, while only caring about the bits of them that match a particular grammar.
I don't really have a good sense of Lark internals (though I've happily used it a few times, and am... certainly not great at parsing theory, but also not completely ignorant), so I don't know whether where what I'm asking falls on a spectrum of "basically trivial" to "literally impossible with the current implementation". I'm happy to have a go at the implementation work if there's no other interest in working on this, but would appreciate some pointers as to where to start and how difficult it's likely to be.
The text was updated successfully, but these errors were encountered: