The change in PR #5035, applied to `imatrix`. Also added a `--no-ppl` command-line option to skip computing perplexity altogether.

In the end, it is a minor optimization: on my RTX-4080, the time for 100 chunks using Mistral-7B decreases from 47 seconds to 40 seconds.
In comparison, a perplexity calculation for 100 chunks takes 15 seconds. That is, most of the time is now spent running the single-threaded code that collects the importance matrix data on the CPU. When I first added the `imatrix` tool it only worked on the CPU, so the time for collecting the imatrix was very small compared to the time taken to evaluate the compute graph. I also did not anticipate that people would be using the tool to run `imatrix` calculations for millions of tokens. So I guess it would make sense to parallelize the data gathering as well, but I'm leaving that for a separate PR.
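
For illustration only, here is a rough sketch of what parallelizing the data gathering could look like. It is not code from this PR. It assumes the collection step amounts to accumulating, for each column of an activation matrix, the sum of squared values; `accumulate_sq` and its parameters are hypothetical names. Each worker owns a disjoint range of columns, so no locking is needed.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not from the PR): add the per-column sum of squared
// activations into `sums`, splitting the columns across worker threads.
// `x` is row-major with n_rows rows and n_cols columns; `sums` must already
// hold n_cols entries.
void accumulate_sq(const float * x, size_t n_rows, size_t n_cols,
                   std::vector<float> & sums, unsigned n_threads) {
    n_threads = std::max(1u, n_threads);
    // Columns handled by each worker, rounded up.
    const size_t per_thread = (n_cols + n_threads - 1) / n_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        const size_t begin = std::min(n_cols, t * per_thread);
        const size_t end   = std::min(n_cols, begin + per_thread);
        if (begin == end) {
            break; // no columns left for this or any later worker
        }
        // Each worker writes only to sums[begin..end), so there are no races.
        workers.emplace_back([=, &sums]() {
            for (size_t r = 0; r < n_rows; ++r) {
                const float * row = x + r * n_cols;
                for (size_t c = begin; c < end; ++c) {
                    sums[c] += row[c] * row[c];
                }
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
}
```

Splitting by column rather than by row avoids per-thread partial buffers and a final reduction, at the cost of each worker striding through every row. Whether that trade-off suits the actual collection code would need measuring.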