
Slightly faster imatrix #5050

Merged: 2 commits merged into master on Jan 21, 2024

Conversation

ikawrakow
Contributor

The change in PR #5035 applied to imatrix.

Also added --no-ppl command line option to skip computing perplexity altogether.

In the end, this is a minor optimization: on my RTX-4080, the time for 100 chunks with Mistral-7B decreases from 47 seconds to 40 seconds.

By comparison, a perplexity calculation over the same 100 chunks takes 15 seconds. I.e., most of the time is now spent in the single-threaded code that collects the importance matrix data on the CPU. When I first added the imatrix tool it only worked on the CPU, so the time for collecting the imatrix was very small compared to the time taken for evaluating the compute graph. I also did not anticipate that people would be using the tool to run imatrix calculations over millions of tokens. So I guess it would make sense to parallelize the data gathering as well, but I'm leaving that for a separate PR.

@ikawrakow ikawrakow merged commit 726c0fa into master Jan 21, 2024
44 of 47 checks passed
@ikawrakow ikawrakow deleted the ik/faster_imatrix branch January 21, 2024 06:01
@Nexesenex
Contributor

Just an idea, @ikawrakow:
Since we are all still exploring imatrix to find the best compromise, and since you already implemented an autosave every 10 calculated chunks, would it be possible to write a backup of the currently computed matrix to a separate file every given number of chunks (configurable via a command-line parameter), with the number of chunks processed appended as a suffix (_500, _1000, etc.)?
That way we would get a file every 50, 100, or however many chunks until the specified total is reached.

@ikawrakow
Contributor Author

@Nexesenex Great idea! See PR #5077

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* imatrix: speedup by avoiding unnecessary allocations and copies

* imatrix: add --no-ppl option to skip PPL calculations altogether

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* imatrix: speedup by avoiding unnecessary allocations and copies

* imatrix: add --no-ppl option to skip PPL calculations altogether

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
@bartowski1182
Contributor

Hey @ikawrakow, sorry to resurrect an old PR, but I want to make sure I understand something.

What is the effect of adding --no-ppl to an imatrix calculation: does it just skip printing the perplexity results, or does it affect the quality of the matrix?

It LOOKS largely cosmetic, which is why it surprised me that my Mixtral 8x22B imatrix calculation went from 6 hours to 1.5 hours (with only 7 layers offloaded).

Just want to make sure I'm not doing something terrible and silly by adding it!
