-
-
Notifications
You must be signed in to change notification settings - Fork 153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to reuse parsers without recreating them? #501
Comments
I've tried the simple approach above: store the parser required in a struct in advance so I create it only once, and then use it multiple times. Unfortunately I'm running into difficulties there too. To store the parser in the struct I am using The input passed into the I don't know how to express the lifetimes so that the lifetime of the input isn't baked into things. |
Unfortunately this is difficult to support. The current implementation has the lifetime of the input be part of the trait Parser {
type Input<'a>: Input<'a>;
type Output<'a>;
fn parse<'src>(&self, input: Self::Input<'src>) -> Self::Output<'src> { ... }
} But unfortunately the assumed invariance of GAT parameters means this doesn't generalise right now. It might be that you simply must recreate the parser each time (alternatively you can unsafely transmute the parser to whatever lifetime you require, provided you make sure that the input and output lifetimes match). That said, recreating the parser each time isn't going to be that expensive. It's not the sort of thing you should be doing in a very hot loop, but recreating it for each file you parse or each input in a REPL isn't going to have a meaningful performance impact. |
Thanks for your answer! That's unfortunate. I am currently using the Chumsky parser to parse strings at runtime in an interpreter implementation: parsing datetime formats and the like. It could totally happen in a hot loop, unfortunately. It sounds then like Chumsky isn't a good fit for parsing input like that. I didn't want to use regexes as the rules are somewhat involved and I think the chumsky code is more readable, but it sounds like I should reconsider that. In another case I am reusing the language level parser to also parse the same thing during runtime and I didn't want to implement the same rules twice. Would it be possible to support a cache in Chumsky itself? That is, Chumsky support memoization of parser functions? Since it's not possible to get this right outside of Chumsky, perhaps Chumsky should offer explicit support for this use case, if possible. If not, I think it would be useful to discuss this limitation in the docs: do not use Chumsky if you want to parse at runtime during a hot loop. |
If you're not using |
Also, what do you mean by the 'cache' point here? Chumsky already supports memoisation (see |
I am using boxed as I am using a parser multiple times and therefore need to be able to clone it. I am unclear about when parsers are Concerning memoized parsers, my understanding is that #[cache]
fn my_parser() -> impl Parser<...> {
...
} If you call Without this, storing parsers is impossible and there are pitfalls if you reuse one (avoid hot loops, or avoid |
So I was going to talk about how caching parsers in this way was not feasible in general... until I did some thinking and realised that, actually, it should be possible to design an API that allows it to be done safely and without sacrificing much in the way of versatility. I've implemented an initial swing at such an API in the latest commit (01b96cd) and you can see an example of its usage in the tests here. It's currently behind the The crux of it is: you implement Does this do the job for you? |
Thank you so much, this looks like it would do the job! A few comments:
|
I'd have a use-case where I think the same solution would be very useful. I have a table-driven Pratt parser, where the used symbols can be chosen at runtime. For example, one might want a math parser that only understands simple high school math. Meanwhile a different use case calls for a math parser that also understands units.(physics) Which means that every once in a while, my code will construct a new parser at runtime. And then, that parser is used to parse anywhere between 1 to a few hundred separate formulas. |
I've discussed this with CraftSpider and we're both pretty happy that this is sound, so all good on that front (MIRI also likes it).
Specifying the parser type would allow one to skip boxing the parser, at the slight inconvenience that the parser type must be specified.
Thanks! I'll fix that.
Yes, I don't see an easy way to avoid this (without losing the generality of the solution, anyway). Sadly it's not possible to use HRTBs in this case since you'd need both higher-ranked and higher-kinded types to generalise over the input and output type lifetimes (unless you were to use a trait for each, but that's just moving the problem).
This definitely sounds like a good use-case! |
I'd love to understand this better. Is there a Rust issue that I could check out, or would you have any recommendations on what to read to understand advanced trait and lifetime topics better? |
The thing to search for is 'Generic Associated Types'. |
I've finally found some time to try this out. The bad news first: I tried this on a big parser and it panics:
The parser indeed uses recursive parsers, but I don't understand why caching it should have this effect. Then a few implementation notes, perhaps useful to help improve the documentation: My parser uses tokens defined by a where I: ValueInput<'a, Token = Token<'a>, Span = Span> This caused a bit of a struggle. I want to define a struct MyParserCache;
struct Parsers {
my_parser_cache: chumsky::cache::Cache<MyParserCache>
} But impl chumsky::cache::Cached for MyParserCache {
type Parser<'src> = BoxedParser<
'src,
SpannedInput<
Token<'src>,
SimpleSpan,
chumsky::input::BoxedStream<'src, (Token<'src>, SimpleSpan)>,
>,
ParsedAST,
>;
fn make_parser<'src>(self) -> Self::Parser<'src> {
Parser::boxed(my_parser())
}
} For non-spanned scenario using But unfortunately after this I'm getting this crash. Looking at the source it appears |
Is your repo public by chance? Without being able to see your code, there’s no way for me to be able to test it and probe around properly |
Unfortunately not (yet); and it's a huge codebase anyhow. I can try to create a replication - what I will try is using parser recursion at all to trigger, but if it's more complicated than that it will take a while. |
It would be great if you could get a MRE of the issue with the parser going, that way we are taking more than just some stabs in the dark. |
Okay, I just tried a simple recursive parser with caching and it doesn't crash. So unfortunately it looks like I'm going to do more work to reproduce this behavior. |
All right, in the process of stripping down this massive real world parser in the hope to find a minimal reproducing example, I realized I was doing a pretty hacky thing with let mut another_parser = None
let parser = recursive(|parser| {
let internal_parser = ...
another_parser = Some(internal_parser);
// stuff
}); I did this because I needed both the |
Okay, I think I triggered a memory leak somewhere. When I use a cached parser (and thus do not recreate it), the memory usage of the process rises alarmingly (by gigabytes) and eventually it's killed off by Linux. This is when I'm using a large cached parser on the order of 10k times. Strangely enough if I recreate the cache for each run (and thus no caching should take place), the memory leak also happen; just the fact that the cache is in use seems to be enough. When I don't use the parser caching in the same run the memory usage stays below 30 megabytes. P.S. Again a reproducible example is probably quite a bit of effort, as my parser is significantly complex. |
Without knowing how your parser looks and purely based on your comment, I wonder if you’ve accidentally created mutually recursive parsers that, in an attempt to define the first, attempts to define the second until that loop results in Linux killing it. Akin to: “to parse a Rust statement I must know how to parse an expression, but to parse a block expression I must know how to parse a Rust statement” |
I don't think so. The parser definition is the same. The parser works - it parses. It works thousands of times, even. It's just that if I use the caching mechanism (and only then) the memory grows unbounded. |
@Zij-IT, you got me thinking and do some extra testing. If I just make the change to use I will note that the docs claim that |
That explains it then! So this means I am stuck: I can't cache this parser until I can create this type of recursive structure without a leak. |
I'm going to close this issue since the specific case here ( |
I have a use case where throughout the lifetime of the program I want to invoke the same parser many times. The naive approach is to invoke the
parser()
function each time, and then useparse()
on the result. But as far as I understand, this recreates the parser each time, which means there is overhead that's not required.I tried using
std::sync::OnceLock
with a boxed parser, but this doesn't appear to work because, I think, Boxed doesn't providesync
.The alternative is to create the parser at program startup and store it in a struct so that I can later pass it along whenever needed.
But parsing like this ideally has a simple function-like invocation, so that you can use chumsky to implement things like the
FromStr
trait and the like.Should this be supported? If so, what is the best approach to tackle this use case, and does it need documentation?
Note: I'm using chumsky 1.0a4
The text was updated successfully, but these errors were encountered: