How to reuse parsers without recreating them? #501
Closed · faassen opened this issue Aug 17, 2023 · 25 comments

faassen commented Aug 17, 2023

I have a use case where, throughout the lifetime of the program, I want to invoke the same parser many times. The naive approach is to invoke the parser() function each time and then call parse() on the result. But as far as I understand, this recreates the parser each time, which introduces overhead that isn't required.

I tried using std::sync::OnceLock with a boxed parser, but this doesn't appear to work because, I think, Boxed doesn't implement Sync.

The alternative is to create the parser at program startup and store it in a struct so that I can later pass it along whenever needed.

But parsing like this ideally has a simple function-like invocation, so that you can use chumsky to implement things like the FromStr trait.

Should this be supported? If so, what is the best approach to tackle this use case, and does it need documentation?

Note: I'm using chumsky 1.0a4
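
For illustration, a minimal sketch of this naive approach (the Digits type and the digits_parser function are invented for the example, not taken from the original code):

use chumsky::prelude::*;
use std::str::FromStr;

// A hypothetical value type we want to parse.
struct Digits(Vec<char>);

// The naive approach: a constructor function, so the parser is rebuilt on
// every call.
fn digits_parser<'src>() -> impl Parser<'src, &'src str, Vec<char>> {
    any()
        .filter(|c: &char| c.is_ascii_digit())
        .repeated()
        .at_least(1)
        .collect::<Vec<char>>()
}

impl FromStr for Digits {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        digits_parser() // reconstructed on every call
            .parse(s)
            .into_result()
            .map(Digits)
            .map_err(|errs| format!("{} parse error(s)", errs.len()))
    }
}

fn main() {
    let digits: Digits = "2023".parse().expect("valid digits");
    assert_eq!(digits.0.len(), 4);
}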


faassen commented Aug 18, 2023

I've tried the simple approach above: store the required parser in a struct in advance so that it's created only once and can then be used multiple times. Unfortunately, I'm running into difficulties there too.

To store the parser in the struct I am using chumsky::Boxed. This takes two lifetimes, 'a and 'b: 'a is the lifetime of the input, and 'b appears to be the lifetime of the DynParser underlying the boxed parser, so I think the lifetime of the storage.

The input passed into the parse function needs to live at least as long as 'a, but this 'a is now baked into the stored parser, and any actual input will live far more briefly.

I don't know how to express the lifetimes so that the lifetime of the input isn't baked into things.
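
For concreteness, a minimal sketch of the layout being described (the struct, field, and digit parser are invented for illustration, and the Boxed generics are assumed to be the two lifetimes plus input, output, and extra types):

use chumsky::prelude::*;
use chumsky::Boxed;

// The input lifetime 'src is a parameter of the struct itself, so every &str
// handed to the stored parser must outlive the whole Parsers value.
struct Parsers<'src> {
    digits: Boxed<'src, 'src, &'src str, String, extra::Default>,
}

fn main() {
    let parsers = Parsers {
        digits: any()
            .filter(|c: &char| c.is_ascii_digit())
            .repeated()
            .collect::<String>()
            .boxed(),
    };

    // Fine here because the input and the struct live in the same scope; the
    // problem arises when Parsers is long-lived (e.g. created at startup) and
    // must parse many short-lived inputs.
    assert!(parsers.digits.parse("123").into_result().is_ok());
}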

zesterer (Owner) commented

Unfortunately this is difficult to support. The current implementation has the lifetime of the input be part of the Parser trait (i.e: each parser is effectively tied to a specific input data lifetime and can't easily be reused for different data lifetimes). The ideal would be an API that looks something like this:

trait Parser {
    type Input<'a>: Input<'a>;
    type Output<'a>;

    fn parse<'src>(&self, input: Self::Input<'src>) -> Self::Output<'src> { ... }
}

But unfortunately the assumed invariance of GAT parameters means this doesn't generalise right now.

It might be that you simply must recreate the parser each time (alternatively you can unsafely transmute the parser to whatever lifetime you require, provided you make sure that the input and output lifetimes match).

That said, recreating the parser each time isn't going to be that expensive. It's not the sort of thing you should be doing in a very hot loop, but recreating it for each file you parse or each input in a REPL isn't going to have a meaningful performance impact.


faassen commented Aug 19, 2023

Thanks for your answer! That's unfortunate.

I am currently using the Chumsky parser to parse strings at runtime in an interpreter implementation: parsing datetime formats and the like. It could totally happen in a hot loop, unfortunately. It sounds then like Chumsky isn't a good fit for parsing input like that. I didn't want to use regexes as the rules are somewhat involved and I think the chumsky code is more readable, but it sounds like I should reconsider that.

In another case I am reusing the language level parser to also parse the same thing during runtime and I didn't want to implement the same rules twice.

Would it be possible to support a cache in Chumsky itself? That is, Chumsky support memoization of parser functions? Since it's not possible to get this right outside of Chumsky, perhaps Chumsky should offer explicit support for this use case, if possible.

If not, I think it would be useful to discuss this limitation in the docs: do not use Chumsky if you want to parse at runtime during a hot loop.

zesterer (Owner) commented

If you're not using recursive or boxed, no allocation needs to occur during parser creation and the process of creating a parser becomes effectively free (or, at least, the compiler is able to statically dispatch everything and optimise it as it pleases). If your syntax can be expressed via regex, none of the aforementioned combinators should be required anyway!

zesterer (Owner) commented

Also, what do you mean by the 'cache' point here? Chumsky already supports memoisation (see .memoized()), but it sounds like this might not be what you're looking for?


faassen commented Aug 20, 2023

I am using boxed because I use a parser multiple times and therefore need to be able to clone it. I am unclear about when parsers are Clone or Copy; they aren't always, especially when I compose parsers using separate functions. Maybe I can avoid it. In any case, this requires a lot of knowledge of implementation details just to know whether it's safe to use a parser in a hot loop.

Concerning memoized parsers, my understanding is that memoized memoizes parser output for a given input. I am talking about memoizing the parser-constructing function itself. So:

#[cache]
fn my_parser() -> impl Parser<...> {
   ...
}

If you call my_parser twice, the second time it returns the parser from cache. This way I can do whatever I want during parser construction and have it be fast.

Without this, storing parsers is impossible, and there are pitfalls if you reuse one (avoid hot loops, avoid boxed, etc.). With this, it's easy.


zesterer commented Aug 20, 2023

So I was going to talk about how caching parsers in this way was not feasible in general... until I did some thinking and realised that, actually, it should be possible to design an API that allows it to be done safely and without sacrificing much in the way of versatility.

I've implemented an initial swing at such an API in the latest commit (01b96cd) and you can see an example of its usage in the tests here. It's currently behind the unstable feature flag since this is just an early attempt and is likely to change later (or perhaps it'll turn out to be unsound, I'm not sure).

The crux of it is: you implement Cached for a type (any old type will do; the type is just there to allow the trait to be implemented). When implementing the trait, you specify the input, output, and extra (i.e. error, context, etc.) types, along with a method that actually creates the parser. You then get to use Cache to carry your parser around any way you like, with the ability to use the parser by calling .get().

Does this do the job for you?
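
For illustration, a minimal sketch of this API as described above (the DigitsParser marker type and the digit grammar are invented, and the Cache constructor shown is an assumption; see the linked test for the exact construction):

use chumsky::prelude::*;
use chumsky::cache::{Cache, Cached};
use chumsky::Boxed;

// Marker type: it exists only so that Cached can be implemented for it.
struct DigitsParser;

impl Cached for DigitsParser {
    // Boxing keeps the associated parser type nameable.
    type Parser<'src> = Boxed<'src, 'src, &'src str, Vec<char>, extra::Default>;

    fn make_parser<'src>(self) -> Self::Parser<'src> {
        any()
            .filter(|c: &char| c.is_ascii_digit())
            .repeated()
            .at_least(1)
            .collect::<Vec<char>>()
            .boxed()
    }
}

fn main() {
    // The cache owns one instance of the parser; get() hands it back at
    // whatever input lifetime each call site needs.
    let cache = Cache::new(DigitsParser); // assumed constructor
    assert!(cache.get().parse("12345").into_result().is_ok());
}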


faassen commented Aug 21, 2023

Thank you so much, this looks like it would do the job!

A few comments:

  • I don't feel capable of judging the soundness of this yet, so I can't offer feedback there.

  • There's a bit of commented-out code in the Cached trait (the type Parser<'a>: line)

  • The comment of Cache::get has a sentence that trails off

  • The implementation of a cached type is a bit verbose, given that all the information that goes into it is already in the signature of the parser function. I take it that this is necessary in order to let get() have the appropriate lifetimes. I imagine devising a macro that generates this automatically from a parser function would be tricky to implement, as it would effectively need to extract the type information from the returned parser, correct?


stefnotch commented Aug 21, 2023

I'd have a use-case where I think the same solution would be very useful.

I have a table-driven Pratt parser where the symbols in use can be chosen at runtime. For example, one might want a math parser that only understands simple high-school math, while a different use case calls for a math parser that also understands (physics) units.

Which means that every once in a while, my code will construct a new parser at runtime.

And then that parser is used to parse anywhere from one to a few hundred separate formulas.

zesterer (Owner) commented

  • I don't feel capable of judging the soundness of this yet, so I can't offer feedback there.

I've discussed this with CraftSpider and we're both pretty happy that this is sound, so all good on that front (MIRI also likes it).

  • There's a bit of commented-out code in the Cached trait (the type Parser<'a>: line)

Specifying the parser type there would allow one to skip boxing the parser, at the slight inconvenience that the full parser type must be written out.

  • The comment of Cache::get has a sentence that trails off

Thanks! I'll fix that.

  • The implementation of a cached type is a bit verbose, given that all the information that goes into it is already in the signature of the parser function. I take it that this is necessary in order to let get() have the appropriate lifetimes. I imagine devising a macro that generates this automatically from a parser function would be tricky to implement, as it would effectively need to extract the type information from the returned parser, correct?

Yes, I don't see an easy way to avoid this (without losing the generality of the solution, anyway). Sadly it's not possible to use HRTBs in this case since you'd need both higher-ranked and higher-kinded types to generalise over the input and output type lifetimes (unless you were to use a trait for each, but that's just moving the problem).

I'd have a use-case where I think the same solution would be very useful.

This definitely sounds like a good use-case!

stefnotch (Contributor) commented

Regarding the earlier explanation that the input lifetime is part of the Parser trait, and that "the assumed invariance of GAT parameters means this doesn't generalise right now":

I'd love to understand this better. Is there a Rust issue that I could check out, or would you have any recommendations on what to read to understand advanced trait and lifetime topics better?

zesterer (Owner) commented

The thing to search for is 'Generic Associated Types'.


faassen commented Sep 21, 2023

I've finally found some time to try this out. The bad news first: I tried this on a big parser and it panics:

panicked at 'Recursive parser used before being defined', /home/faassen/.cargo/git/checkouts/chumsky-651fb76526ac39e3/b6c5df6/src/recursive.rs:155:18

The parser indeed uses recursive parsers, but I don't understand why caching it should have this effect.

Then a few implementation notes, perhaps useful to help improve the documentation:

My parser uses tokens defined by a logos lexer. In the parser, I used this generic parameter a lot:

where I: ValueInput<'a, Token = Token<'a>, Span = Span>

This caused a bit of a struggle. I want to define a Parsers struct that contains all the cached parsers, along these lines:

struct MyParserCache;

struct Parsers {
   my_parser_cache: chumsky::cache::Cache<MyParserCache>
}

But Parsers, and thus MyParserCache, cannot take a generic I parameter, otherwise we get the whole "source must outlive the duration of the program" lifetime issue we started with. So what now? I landed on the following:

impl chumsky::cache::Cached for MyParserCache {
    type Parser<'src> = BoxedParser<
        'src,
        SpannedInput<
            Token<'src>,
            SimpleSpan,
            chumsky::input::BoxedStream<'src, (Token<'src>, SimpleSpan)>,
        >,
        ParsedAST,
    >;

    fn make_parser<'src>(self) -> Self::Parser<'src> {
        Parser::boxed(my_parser())
    }
}

For a non-spanned scenario, using BoxedStream with BoxedParser is sufficient, but here I needed to wrap the whole thing in SpannedInput again. It took quite a bit of puzzling to get the types right.

But unfortunately after this I'm getting this crash. Looking at the source it appears RecursiveInner::Unowned cannot be upgraded for some reason.


Zij-IT commented Sep 21, 2023


Is your repo public by chance? Without being able to see your code, there's no way for me to test it and probe around properly.


faassen commented Sep 21, 2023

Unfortunately not (yet), and it's a huge codebase anyhow. I can try to create a reproduction - the first thing I'll try is whether using parser recursion at all is enough to trigger it, but if it's more complicated than that it will take a while.


Zij-IT commented Sep 21, 2023

It would be great if you could get an MRE of the issue going; that way we'd be doing more than just taking stabs in the dark.


faassen commented Sep 22, 2023

Okay, I just tried a simple recursive parser with caching and it doesn't crash. So unfortunately it looks like I'm going to have to do more work to reproduce this behavior.


faassen commented Sep 22, 2023

All right, in the process of stripping down this massive real-world parser in the hope of finding a minimal reproducing example, I realized I was doing a pretty hacky thing with recursive:

let mut another_parser = None;

let parser = recursive(|parser| {
    let internal_parser = ...;
    another_parser = Some(internal_parser);
    // stuff
});

I did this because I needed both internal_parser and parser, but obviously it's quite hacky. I rewrote the code to use Recursive::declare and Recursive::define and the crash went away.
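
For reference, a minimal sketch of the declare/define pattern that replaced the hack, using an invented toy grammar (nested parenthesised digits) rather than the real parser:

use chumsky::prelude::*;

fn main() {
    // Declare the recursive parser up front so it can be referenced before it
    // has been defined...
    let mut expr = Recursive::declare();

    // ...which also makes it possible to hold on to sub-parsers built from it,
    // which is what the closure-based hack above was trying to achieve.
    let parenthesised = expr.clone().delimited_by(just('('), just(')'));

    // Supply the real definition afterwards. The turbofish on any() just pins
    // the extra (error) type so inference has something to work with.
    expr.define(
        any::<_, extra::Default>()
            .filter(|c: &char| c.is_ascii_digit())
            .or(parenthesised),
    );

    assert!(expr.parse("((7))").into_result().is_ok());
}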


faassen commented Sep 22, 2023

Okay, I think I've triggered a memory leak somewhere. When I use a cached parser (and thus do not recreate it), the memory usage of the process rises alarmingly (by gigabytes) and eventually it's killed off by Linux. This happens when I'm using a large cached parser on the order of 10k times. Strangely enough, if I recreate the cache for each run (so no caching should take place), the memory leak also happens; just the fact that the cache is in use seems to be enough.

When I don't use the parser caching in the same run the memory usage stays below 30 megabytes.

P.S. Again, a reproducible example will probably take quite a bit of effort, as my parser is fairly complex.


Zij-IT commented Sep 22, 2023


Without knowing how your parser looks, and going purely on your comment, I wonder if you've accidentally created mutually recursive parsers, where the attempt to define the first triggers an attempt to define the second, and that loop results in Linux killing it.

Akin to: “to parse a Rust statement I must know how to parse an expression, but to parse a block expression I must know how to parse a Rust statement”


faassen commented Sep 22, 2023

I wonder if you've accidentally created mutually recursive parsers, where the attempt to define the first triggers an attempt to define the second, and that loop results in Linux killing it.

I don't think so. The parser definition is the same. The parser works - it parses. It works thousands of times, even. It's just that if I use the caching mechanism (and only then) the memory grows unbounded.


faassen commented Sep 22, 2023

@Zij-IT, you got me thinking and doing some extra testing. If I just make the change to use Recursive::declare(), the parser's memory usage explodes, without any use of caching whatsoever. So it's not due to the caching logic but to the switch to Recursive::declare(). But I switched to that because, without it, my hack makes the caching system crash. And to repeat: this parser works, it's just extremely memory intensive.

I will note that the docs claim that recursive is a convenience wrapper around declare and define, but looking at the source, that doesn't appear to be what's actually happening.


Zij-IT commented Sep 22, 2023

@faassen For whatever reason I thought that #494 was already merged into main, but that's not the case yet. As #486 shows, there is a leak when using declare and define.


faassen commented Sep 22, 2023

That explains it then! So this means I am stuck: I can't cache this parser until I can create this type of recursive structure without a leak.

zesterer (Owner) commented

I'm going to close this issue, since the specific case here (recursive being leaky under some circumstances) is a distinct issue from the original now that the cache system has been merged.
