
Ability to search for parseable substrings #1371

Open
DRMacIver opened this issue Nov 20, 2023 · 6 comments

@DRMacIver

Suggestion

I would like to be able to use Lark to search for all substrings of a string that match some Lark grammar, e.g. via something along the lines of the following method on the Lark class:

def search(self, input):
    """
    Yields triples `(start, end, parse_result)` for all regions of `input`
    containing a parseable string, together with their parse result.
    """

(I'm not wedded to this exact interface.)

Describe alternatives you've considered

Currently my least bad alternative for something like this is to write my own specialised parser tool, which I'd rather not do.

The more likely solution I'd adopt is to just use regex and hope I can work around the cases where I need a better grammar than regex supports.

I've tried doing this with Lark using a heavily ambiguous grammar that simply allows arbitrary amounts of noise at the beginning and end, and produces ambiguous parse forests. It doesn't seem to work very reliably (and I expect it's highly inefficient as a solution), but I may investigate this option further. A minimal sketch of what I mean follows.
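For concreteness, here is a minimal sketch of that approach, with a purely illustrative toy grammar (not from the issue). Because a digit can lex either as NOISE or as part of NUMBER, Earley produces an ambiguous forest that still has to be searched for the interesting subtrees:

import lark

# Toy grammar: any input parses, with the "real" expr rule buried
# between arbitrary single-character NOISE tokens.
grammar = r"""
start: NOISE* expr NOISE*
?expr: expr "+" NUMBER | NUMBER
NUMBER: /[0-9]+/
NOISE: /./s
"""
parser = lark.Lark(grammar, parser="earley", ambiguity="explicit")

# The _ambig nodes in the output show how many competing readings the
# forest holds, even for a tiny input.
print(parser.parse("garbage 1+2+3 garbage").pretty())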

Additional context

If you're curious, the specific use case I have for this is implementing test-case reduction (delta debugging, C-Reduce, etc.). The idea is to have lots of little grammars that define common structures that might appear in various file formats (e.g. arithmetic expressions), then parse out the parts of the file that match those grammars and make transformations to only those regions, based on the parse. This means that I need to be able to parse arbitrary unknown inputs, while only caring about the bits of them that match a particular grammar.

I don't really have a good sense of Lark's internals (though I've happily used it a few times, and am... certainly not great at parsing theory, but also not completely ignorant), so I don't know where what I'm asking for falls on the spectrum from "basically trivial" to "literally impossible with the current implementation". I'm happy to have a go at the implementation work if there's no other interest in working on this, but would appreciate some pointers as to where to start and how difficult it's likely to be.

@erezsh
Member

erezsh commented Nov 20, 2023

Lark isn't made for scanning text. However, that question has been asked many times. To my recollection, there are two ways to go about this:

  1. Utilize Earley with a prefix and postfix of /.*/. As you mentioned, this is very inefficient for large inputs. However, it's relatively simple to implement.

  2. Attempt to parse at every possible starting position, until a successful parse is achieved. Can be done with either LALR or Earley. Not too difficult to do either, and can be reasonably efficient if there's a low number of "false positives". (A sketch of this approach follows below.)
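A minimal sketch of option 2, under the assumption that a full parse of the remainder is acceptable (the function name is illustrative; nothing like it exists in Lark's API):

import lark

def find_first_match(parser: lark.Lark, text: str):
    """Retry a full parse at every offset; return the first success.

    Caveat: Lark.parse must consume the entire remainder, so this naive
    version only finds matches that extend to the end of the input. A
    real implementation needs partial-parse support (e.g. the LALR
    interactive parser) to stop at the end of a match.
    """
    for start in range(len(text)):
        try:
            # text[start:] copies the tail -- the very cost discussed
            # further down this thread.
            return start, parser.parse(text[start:])
        except lark.exceptions.LarkError:
            continue
    return None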


@DRMacIver
Author

Ah, apologies. I should have searched previous issues for duplicates.

I don't think either of those is likely to be viable for me, unfortunately, as I need this to work on quite large inputs, so I think I'm back to implementing my own.

Thanks for the quick response!

@MegaIng
Member

MegaIng commented Nov 20, 2023

If all the terminals in your grammar are non-conflicting (i.e. using lexer='basic' works), then it should be very doable to implement a quite efficient search method.
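If I understand the idea correctly (an assumption on my part, not the actual branch shown below), non-conflicting terminals mean that a single alternation regex over all terminal patterns can cheaply locate every offset where a token could begin, so full parse attempts are only made at those candidates:

import re
import lark

def candidate_starts(parser: lark.Lark, text: str):
    """Yield offsets where some terminal of the grammar matches.

    Illustrative sketch: these are the only offsets worth attempting a
    parse from (ideally filtered further to terminals that can begin
    the start rule). Combine with a retry loop like the one sketched
    earlier. Assumes the terminal regexes compose safely under "|".
    """
    combined = re.compile("|".join(
        t.pattern.to_regexp() for t in parser.terminals))
    pos = 0
    while (m := combined.search(text, pos)) is not None:
        yield m.start()
        pos = m.start() + 1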

@MegaIng
Member

MegaIng commented Nov 20, 2023

I created a small branch of Lark that includes a .scan method that might be able to handle what you want. If it generally does what you want, we could look into polishing it up a bit. Most importantly, it currently creates copies of the input text: https://github.com/MegaIng/lark/tree/add-scanning

This small example works:

grammar = r"""
template: "{{" TEXT ("|" argument)* "}}"
argument: TEXT "=" TEXT -> named_argument
        | TEXT -> numbered_argument

TEXT: /[^={}|]+/
"""
parser = lark.Lark(grammar, start="template", parser="lalr")

for result in parser.scan(open("python.wiki", encoding="utf-8").read()):
    print(result)

for result in parser.scan("{{a}}{{b}}"):
    print(result)

(it extracts templates from wikitext)

@erezsh
Member

erezsh commented Feb 3, 2024

@MegaIng Btw, if you tidy up that implementation a bit and show that it passes a few stress tests, I'm inclined to add it as an official API.

I'm also curious whether, in the search implementation, doing text = text[:best] might improve the performance.

@MegaIng
Member

MegaIng commented Feb 3, 2024

> @MegaIng Btw, if you tidy up that implementation a bit and show that it passes a few stress tests, I'm inclined to add it as an official API.

Will do at some point.

> I'm also curious whether, in the search implementation, doing text = text[:best] might improve the performance.

I think it would be better to avoid creating copies of the input string. A non-absurd use case for this feature is searching log files that might be several hundred MB in size. Instead, I would like to add start and end parameters all over our parse APIs, which also fits easily with regex, since those APIs all accept such arguments as well. Potentially it might even make sense to expose our LineCounter class and make start and end accept it as a parameter, so that the line numbers are correct.
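For reference, this is the pattern Python's own re module already uses: compiled patterns accept pos and endpos, so a caller can restrict the scan window without slicing, and hence without copying, the string:

import re

# pos/endpos restrict where the pattern may match, with no slicing and
# therefore no copy -- the behaviour start/end on Lark's parse APIs
# would mirror. (Toy data; imagine a multi-hundred-MB log instead.)
log = "INFO ok\nERROR disk full\nINFO ok\nERROR timeout\n"
pattern = re.compile(r"ERROR [a-z ]+")

window_start = log.index("ERROR timeout")
match = pattern.search(log, window_start, len(log))
print(match.group() if match else None)  # -> "ERROR timeout"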
