perplexity: avoid unnecessary allocations and logit copies #5035
Conversation
Great. Btw, on M2 Ultra I don't see much difference:
make -j perplexity && ./perplexity -m models/llama-7b-v2/ggml-model-f16.gguf -f ./build/wikitext-2-raw/wiki.test.raw --chunks 64
# master
system_info: n_threads = 16 / 24 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 524.672 ms
perplexity: calculating perplexity over 64 chunks, batch_size=512
perplexity: 0.38 seconds per pass - ETA 0.40 minutes
[1]4.1672,[2]4.6879,[3]5.3354,[4]5.9055,[5]6.0324,[6]5.9499,[7]6.1214,[8]6.2105,[9]6.5348,[10]6.7147,[11]6.9313,[12]6.9794,[13]6.9035,[14]6.9786,[15]7.2016,[16]6.8633,[17]6.7471,[18]6.7377,[19]6.4191,[20]6.4129,[21]6.3387,[22]6.1702,[23]6.1408,[24]6.0507,[25]6.0387,[26]5.8833,[27]5.7017,[28]5.6024,[29]5.5209,[30]5.3714,[31]5.3338,[32]5.3533,[33]5.3097,[34]5.3386,[35]5.3543,[36]5.3790,[37]5.3737,[38]5.3724,[39]5.3859,[40]5.4353,[41]5.4569,[42]5.4941,[43]5.4564,[44]5.5108,[45]5.5202,[46]5.4975,[47]5.5212,[48]5.5007,[49]5.5010,[50]5.4683,[51]5.4688,[52]5.4591,[53]5.5063,[54]5.4939,[55]5.4793,[56]5.5086,[57]5.5277,[58]5.5556,[59]5.5769,[60]5.6259,[61]5.6227,[62]5.6839,[63]5.7190,[64]5.7258,
Final estimate: PPL = 5.7258 +/- 0.10085
llama_print_timings: load time = 435.30 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 23032.03 ms / 32768 tokens ( 0.70 ms per token, 1422.71 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 23781.72 ms / 32769 tokens
# PR
system_info: n_threads = 16 / 24 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 526.708 ms
perplexity: calculating perplexity over 64 chunks, batch_size=512
perplexity: 0.37 seconds per pass - ETA 0.38 minutes
[1]4.1672,[2]4.6879,[3]5.3354,[4]5.9055,[5]6.0324,[6]5.9499,[7]6.1214,[8]6.2105,[9]6.5348,[10]6.7147,[11]6.9313,[12]6.9794,[13]6.9035,[14]6.9786,[15]7.2016,[16]6.8633,[17]6.7471,[18]6.7377,[19]6.4191,[20]6.4129,[21]6.3387,[22]6.1702,[23]6.1408,[24]6.0507,[25]6.0387,[26]5.8833,[27]5.7017,[28]5.6024,[29]5.5209,[30]5.3714,[31]5.3338,[32]5.3533,[33]5.3097,[34]5.3386,[35]5.3543,[36]5.3790,[37]5.3737,[38]5.3724,[39]5.3859,[40]5.4353,[41]5.4569,[42]5.4941,[43]5.4564,[44]5.5108,[45]5.5202,[46]5.4975,[47]5.5212,[48]5.5007,[49]5.5010,[50]5.4683,[51]5.4688,[52]5.4591,[53]5.5063,[54]5.4939,[55]5.4793,[56]5.5086,[57]5.5277,[58]5.5556,[59]5.5769,[60]5.6259,[61]5.6227,[62]5.6839,[63]5.7190,[64]5.7258,
Final estimate: PPL = 5.7258 +/- 0.10085
llama_print_timings: load time = 433.06 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 23046.12 ms / 32768 tokens ( 0.70 ms per token, 1421.84 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 23701.89 ms / 32769 tokens
Never mind - I was looking at the wrong timings 👍
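For reference, the `Final estimate` line above reports perplexity in its usual form, the exponential of the mean per-token negative log-likelihood over the N evaluated tokens; the ± term is an estimate of the statistical uncertainty of that mean:

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(t_i \mid t_{<i})\right)$$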
Considering imatrix shares a lot of code with perplexity, could a similar optimisation also apply to the former?
Good point. I'll do it in a bit.
…v#5035) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This speeds up perplexity calculations by a large margin, e.g. when computing the perplexity of `wiki.test.raw` for an `fp16` Mistral-7B on an RTX-4080 and a 32-core Ryzen 5975WX CPU.
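The pattern behind the change can be illustrated with a short, self-contained C++ sketch (assumed names and toy data, not the actual llama.cpp code): the per-token negative log-likelihood is accumulated directly from the buffer the model evaluation fills, instead of first copying the logits into a freshly allocated `std::vector` on every chunk.

```cpp
// Minimal sketch of "no per-chunk allocation, no logit copy" (hypothetical
// names and toy data; not the real llama.cpp implementation).
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Negative log-likelihood of token `tok` under one row of raw logits,
// computed via a numerically stable log-softmax.
static double token_nll(const float * row, int n_vocab, int tok) {
    const float max_logit = *std::max_element(row, row + n_vocab);
    double sum_exp = 0.0;
    for (int i = 0; i < n_vocab; ++i) {
        sum_exp += std::exp(row[i] - max_logit);
    }
    return -(row[tok] - max_logit - std::log(sum_exp));
}

int main() {
    const int n_vocab  = 4;
    const int n_tokens = 3;

    // Stand-in for the buffer a model evaluation would fill:
    // n_tokens rows of n_vocab logits each.
    const std::vector<float> logits = {
        2.0f, 0.5f, 0.1f, 0.1f,   // row 0: predicts token 0
        0.1f, 3.0f, 0.2f, 0.1f,   // row 1: predicts token 1
        0.1f, 0.2f, 2.5f, 0.3f,   // row 2: predicts token 2
    };
    const std::vector<int> targets = { 0, 1, 2 };

    // Before the change, each pass effectively did something like
    //   std::vector<float> copy(logits.begin(), logits.end());
    // i.e. a fresh allocation plus a full copy of n_tokens * n_vocab floats
    // per chunk. Here the NLL is accumulated directly from the buffer.
    double nll = 0.0;
    for (int i = 0; i < n_tokens; ++i) {
        nll += token_nll(logits.data() + i * n_vocab, n_vocab, targets[i]);
    }
    std::printf("PPL = %.4f\n", std::exp(nll / n_tokens));
    return 0;
}
```

Dropping the copy saves an allocation plus a memcpy of roughly n_tokens × n_vocab floats per chunk; at batch size 512 with a 32000-entry Mistral-7B vocabulary, that is on the order of 65 MB of float data per pass.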