Description
Hi!
Writing this for the history. Maybe these results will be interesting to someone who is trying to achieve better performance with tokenizers, since the project cares about performance.
I test Profile-Guided Optimization (PGO) on different kinds of software - the current results are available here (with a lot of other PGO-related information). That's why I tried to optimize tokenizers with PGO too.
Test environment
I performed tests on my Linux-based machine.
Linux:
- Fedora 39
- Linux kernel 6.6.9
- AMD Ryzen 9 5900X
- 48 GiB RAM
- SSD Samsung 980 Pro 2 TiB
- Compiler - Rustc 1.75
- Tokenizers version: the latest at the time of writing, from the main branch at commit f1c23b868006ee27acdd31796677f82fa10d6bd7
- Disabled Turbo boost (for more stable results across runs)
Benchmarks
As a benchmark, I use the built-in benchmarks, run with the cargo bench -- --verbose command from the Makefile (if you want to reproduce my results, please check #1425 first). For the PGO training phase, I use cargo-pgo and run the same benchmarks with cargo pgo bench -- --verbose. For the PGO optimization phase, I use cargo pgo optimize bench -- --verbose.
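For anyone reproducing this, the three runs can be sketched as a small script. This is only an illustration of the order of the steps, not the exact script I used: the helper names step and pgo_workflow and the RUN variable are made up for this example, and the setup commands in the comments are my assumption about a typical cargo-pgo installation. By default the script only prints the commands; set RUN=1 to execute them from the directory that contains the benchmarks.

```shell
#!/usr/bin/env bash
# Sketch of the baseline, PGO training, and PGO-optimized benchmark runs.
# Assumed one-time setup (not part of the runs themselves):
#   cargo install cargo-pgo
#   rustup component add llvm-tools-preview
set -euo pipefail

# Hypothetical helper: print the command, and execute it only when RUN=1.
step() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  fi
}

# Hypothetical wrapper listing the three runs in order.
pgo_workflow() {
  # 1. Baseline: regular Release benchmarks.
  step cargo bench -- --verbose
  # 2. Training: instrumented build, benchmarks collect the profiles.
  step cargo pgo bench -- --verbose
  # 3. Optimization: rebuild using the collected profiles and benchmark again.
  step cargo pgo optimize bench -- --verbose
}

pgo_workflow
```

The training run has to come before the optimized run, because cargo pgo optimize relies on the profiles gathered during the instrumented benchmarks.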
Results
I got the following results:
- Release: https://gist.github.com/zamazan4ik/e06dfed470e94bb6e47134b1c58513fb
- PGO-optimized compared to Release: https://gist.github.com/zamazan4ik/5e4b58395d71f5d2a1c2bb27293737ab
- (just for reference) PGO-instrumented compared to Release: https://gist.github.com/zamazan4ik/5440096a8a9b3265b402d9481eab3e10
As you can see, in general, Tokenizers' performance can be improved with PGO. I think this information could be added somewhere in the documentation, so users will be aware of PGO's effect on Tokenizers' performance and can decide whether to apply PGO to their own Tokenizers builds.
I already see some PGO mentions in the CI scripts, but it's not clear whether the Tokenizers packages are PGO-optimized or not. As far as I can tell from the build scripts, they are not (but I could be wrong, so please correct me if that's the case).
Please treat this issue as just a benchmark report - it's not an actual error, crash, or anything like that.