Info on the fastest Earley impl #26
Thank you for letting me know about your implementation. I'll play with it when I have spare time.
Sorry, I cannot build your parser because the Rust compiler reports errors on your sources. Also, according to my benchmarking, YAEP is 2.5-6 times slower than YACC on parsing big C files (60K and 600K C lines). That would mean your parser is faster than YACC, and this makes me suspect that something is wrong with your measurements or your measurement approach.
My algorithm has three passes, like Jeffrey Kegler's. I am eliminating nullable symbols. I added some highly effective optimizations with a binary heap. Besides that, it is similar to Jeffrey Kegler's. I am not caching Earley sets like YAEP does. My algorithm does not have Leo's optimization; if I implement it, it will be via a rewrite from right-recursive to left-recursive rules. I haven't implemented error recovery in my parser yet. You can build my parser using the nightly compiler; it probably does not work on stable. You can switch toolchains with rustup. I believe my results are correct, because 602 ns per token is not too little.
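The rewrite mentioned above can be sketched in a few lines. This is a hedged illustration, not gearley's actual code; the `Sym` and `Rule` types are invented for the example. Both rule forms derive one or more `Item`s, but the left-recursive form avoids the long completion chains that Leo's optimization targets.

```rust
#[derive(Clone, Debug, PartialEq)]
enum Sym {
    List,
    Item,
}

// A rule is (lhs, rhs).
type Rule = (Sym, Vec<Sym>);

// If a rule has the shape `A -> x A`, return `A -> A x`;
// otherwise return it unchanged. Handles only this simple case.
fn rewrite_right_recursion(rule: Rule) -> Rule {
    let (lhs, rhs) = rule;
    if rhs.len() == 2 && rhs[1] == lhs {
        let x = rhs[0].clone();
        (lhs.clone(), vec![lhs, x])
    } else {
        (lhs, rhs)
    }
}

fn main() {
    // List -> Item List becomes List -> List Item.
    let rewritten = rewrite_right_recursion((Sym::List, vec![Sym::Item, Sym::List]));
    assert_eq!(rewritten, (Sym::List, vec![Sym::List, Sym::Item]));
}
```

Note that this simple swap is only language-preserving for repetition rules like this one; a general rewrite has to be more careful.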
Do the YAEP/YACC measurements include the definition and processing of a grammar, i.e. building the LALR(1) tables? My benchmarks include processing of the grammar.
Thank you for your parser description.
OK.
I just ran the same test you use, on an i7-8700K under Fedora 28:
So the speed of the parsers is pretty much the same (31 ms vs 35 ms).
I don't build any LALR(k) tables; the grammar is taken as a yacc-like file. The measurement I've just posted does not include processing the grammar (yaep_parse_grammar). Actually, after running the test 10 times, the minimal time I get is:
In any case, the speed can vary a bit depending on what we take into consideration, but it is not the 7-times difference you posted.
Oh okay. The similar times and memory usage make a lot of sense. I will try to figure out why YAEP was so slow in my runs. To optimize, I would have to cut down on the number of completions; that's the only way to optimize. My code spends about 45% of its time doing completions. I also have an idea of writing a parser in assembly, and I wonder if SIMD could help with performance. Could you try my benchmark without grammar processing, so that it is an apples-to-apples comparison? You can do that by moving the grammar processing out of the timed section.
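The completion-caching idea floated here can be sketched as a memo table. This is a hedged illustration, not gearley's or YAEP's actual code; all names (`CompletionCache`, the toy `Item` tuple) are invented. The premise is that the items produced by completing a symbol depend only on the (origin set, symbol) pair, so repeated completions with the same key can reuse the first result.

```rust
use std::collections::HashMap;

type SetId = u32;
type SymId = u16;
// (rule id, dot position, origin set): a toy item representation.
type Item = (u16, u16, SetId);

struct CompletionCache {
    map: HashMap<(SetId, SymId), Vec<Item>>,
}

impl CompletionCache {
    fn new() -> Self {
        CompletionCache { map: HashMap::new() }
    }

    // `compute` stands in for the real scan over the origin set;
    // it runs only on a cache miss.
    fn complete<F>(&mut self, origin: SetId, sym: SymId, compute: F) -> &[Item]
    where
        F: FnOnce() -> Vec<Item>,
    {
        self.map.entry((origin, sym)).or_insert_with(compute)
    }
}

fn main() {
    let mut cache = CompletionCache::new();
    let mut misses = 0;
    for _ in 0..3 {
        cache.complete(0, 1, || {
            misses += 1;
            vec![(4, 1, 0)]
        });
    }
    // Three lookups, one actual computation.
    assert_eq!(misses, 1);
}
```

Whether this pays off depends on how often keys repeat; the 1000 unique completions per 10K lines mentioned below suggests heavy repetition.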
I didn't mean to say you use LALR(k); I meant to say YACC does. If I incorporated LL(1) into my parser, would you consider it faster, or would you consider that cheating? I guess I might cache completions, because I detected 1000 unique completions for these 10K lines.
Sorry, I probably did not understand you (your messages seem too vague to me). YAEP uses YACC to generate a parser for the YAEP grammar description; this grammar parser's work is not included in the measurements. You can see how the YAEP ./a.out mentioned above is built and how the time is measured in compare_parser.tst.in.
I don't understand what you are trying to say. I have no intention to compete with you. Let the user decide what to use: gearley, MARPA, or YAEP. The more implementations, the better. I will probably add your implementation to the comparison script when it is stabilized and released, as I did for MARPA.
Good, that is what my parser does. It can speed up your implementation too.
My attempts at optimization have failed so far. I would like to optimize by caching all computations for an Earley set: I will hash each Earley set individually and compare hashes of the sets in use. Are you doing this in YAEP?
I wrote YAEP about 20 years ago, so my current knowledge may not be accurate, and I don't remember the terminology well either. I tried to cache full LR-sets, but it did not work. The problem is the distances: they vary a lot. So I divide LR-sets into LR-situations and a separate vector of distances. LR-set situations without distances are repeated many times, so in the working set I keep only a reference to the LR-set and a vector of distances for the corresponding LR-situations in that LR-set. Moreover, I keep only a minimal part of the LR-set situations, as the other LR-set situations can be built from them. Zero distances for the built LR-set situations can be omitted too. For detailed info, you probably need to look at the source code. The code for building the working set should be pretty straightforward; the code for building parse trees is more complicated. I hope this info can help you.
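The scheme described above can be sketched as an interner: the dotted-rule part of a set (the "LR-situations") repeats across the parse, so it is stored once and shared, while only the per-instance distance vector lives in the working set. This is a hedged sketch with invented names (`Situation`, `SetCore`, `WorkSet`), not YAEP's actual data structures.

```rust
use std::collections::HashMap;
use std::rc::Rc;

#[derive(Hash, PartialEq, Eq, Clone)]
struct Situation {
    rule: u16,
    dot: u16,
}

struct SetCore {
    situations: Vec<Situation>,
}

#[derive(Default)]
struct Interner {
    table: HashMap<Vec<Situation>, Rc<SetCore>>,
}

impl Interner {
    // Return the shared core for this situation list,
    // creating it on first sight.
    fn intern(&mut self, sits: Vec<Situation>) -> Rc<SetCore> {
        self.table
            .entry(sits.clone())
            .or_insert_with(|| Rc::new(SetCore { situations: sits }))
            .clone()
    }
}

// A working-set entry: a reference to the shared core plus this
// instance's distances, one per situation.
struct WorkSet {
    core: Rc<SetCore>,
    distances: Vec<u32>,
}

fn main() {
    let mut interner = Interner::default();
    let sits = vec![Situation { rule: 3, dot: 1 }];
    let a = WorkSet { core: interner.intern(sits.clone()), distances: vec![0] };
    let b = WorkSet { core: interner.intern(sits), distances: vec![7] };
    // Same situations, different distances: the core is stored once.
    assert!(Rc::ptr_eq(&a.core, &b.core));
    assert_eq!(b.distances.len(), a.core.situations.len());
}
```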
Thank you. I'm currently implementing this new optimization with set hashing and a Patricia trie. I found another small optimization: Earley items can be shrunk by a third by tying the ordering of SPPF links to the ordering of item origins. (I sort items by origin in two places.) Then you can remove the origins and use the SPPF links instead. Now Earley items take 8 bytes. It remains to be seen whether this results in a performance and/or memory-use improvement.
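A plausible layout for such an item looks like this. The field layout is invented for illustration, not gearley's actual struct; the point is only that dropping the explicit origin field, and recovering origins from the ordering of SPPF links, brings the item down to 8 bytes.

```rust
#[repr(C)]
#[derive(Clone, Copy, Debug)]
struct CompactItem {
    dotted_rule: u32, // rule id and dot position, packed together
    sppf_link: u32,   // index into the SPPF node arena
}

fn main() {
    // With a separate 4-byte origin field this would be 12 bytes,
    // hence the one-third reduction described above.
    assert_eq!(std::mem::size_of::<CompactItem>(), 8);
}
```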
What you call distances, I call item origins, after Jeffrey Kegler. I found that distances can be removed with very little complication to the algorithm. In a small test, I found that I can retrieve 7000 out of 8500 Earley items from the cache; the other 1500 need unique computations to produce.
Sorry for the slow response; I was (and still am) a bit busy. Thank you for the update. Although I have not worked on this topic for a long time, it is still interesting to me. I am not familiar with Jeffrey Kegler's implementation, although he did a lot for the recent popularization of Earley parsing and I should have paid more attention to his approach. Considering the meaning of origin, I suspect distance is a bit different. Probably the origin is the absolute number of the set (in the working list), while distances are relative. As a parsed string may have pretty similar substrings, the distances can be the same while the origins are not. So a vector of distances has a good chance of occurring many times and can be stored as a single copy. But maybe I am wrong and his origins are the same as my distances.
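Under the reading proposed above, the two notions are interchangeable with trivial arithmetic: an origin is the absolute index of the Earley set where an item started, while a distance is relative to the current set. A minimal sketch, assuming that reading:

```rust
// Distance of an item from the current set, given its origin.
fn distance(current_set: u32, origin: u32) -> u32 {
    current_set - origin
}

// Recover the origin from the current set and a distance.
fn origin(current_set: u32, distance: u32) -> u32 {
    current_set - distance
}

fn main() {
    // An item that started at set 2, viewed from set 5:
    assert_eq!(distance(5, 2), 3);
    assert_eq!(origin(5, 3), 2);
}
```

This is also why similar substrings can share one distance vector even though their origins differ: the relative values repeat while the absolute ones do not.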
New measurements show that only 610 out of 12600 Earley sets can be cached on the C benchmark. The cache is huge, so it is probably not worth using. I will try to cache completions instead.
My benchmarks include processing the grammar. The new measurements were wrong: I can skip processing for 7000 out of 8500 Earley items. I'm having some issues with the Patricia trie implementation, but I will resolve them and get new benchmarks running within a few days.
I made an Earley engine that is much faster than YAEP in benchmarks: gearley. Here are the results for 88248 tokens (10K lines of C).
Memory use
So the Earley sets take up only 4600 bytes: one byte for every 19 tokens. Gearley with a compact forest takes only 17.4% of the memory YAEP does; however, the compact forest slows down the benchmark by 34 milliseconds. With the normal forest, the parse takes 14% of the time of a parse with YAEP, and the same amount of memory.
The compact forest takes 25 bytes per token. Gearley with the default forest takes 602 ns per token.
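The per-token arithmetic behind these figures can be checked directly (numbers taken from the text above):

```rust
fn main() {
    let tokens = 88248u32; // 10K lines of C
    let earley_set_bytes = 4600u32;
    // One byte of Earley-set storage per ~19 tokens:
    assert_eq!(tokens / earley_set_bytes, 19);
    // Compact forest at 25 bytes per token, about 2.2 MB total:
    let compact_forest_bytes = 25 * tokens;
    println!("{} bytes", compact_forest_bytes);
}
```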
I invite you to make your own benchmarks.