perplexity: avoid unnecessary allocations and logit copies #5035

Merged — 1 commit merged into master from ik/faster_ppl on Jan 19, 2024

Conversation

ikawrakow (Contributor) commented:
This speeds up perplexity calculations by a large margin. For example, when computing the perplexity of wiki.test.raw with fp16 Mistral-7B on an RTX 4080 and a 32-core Ryzen 5975WX CPU (a rough sketch of the general idea follows after the list):

  • Time goes down from 70 seconds to 48 seconds for a context of 512 (46% speedup)
  • Time goes down from 114 seconds to 76 seconds for a context of 4096 (49% speedup)
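
The following is not the actual diff in this PR, only a minimal, self-contained C++ sketch of the general idea named in the title: compute each target token's negative log-likelihood directly from the row of raw logits that the backend already holds, rather than first copying every vocabulary-sized row into a freshly allocated std::vector. All names, shapes, and the toy data below are illustrative assumptions.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// NLL of `token` given one row of `n_vocab` raw logits, evaluated in place
// via a numerically stable log-sum-exp (no temporary copy of the row).
static double nll_from_logits(const float * logits, int n_vocab, int token) {
    float max_logit = logits[0];
    for (int i = 1; i < n_vocab; ++i) {
        max_logit = std::max(max_logit, logits[i]);
    }
    double sum_exp = 0.0;
    for (int i = 0; i < n_vocab; ++i) {
        sum_exp += std::exp(logits[i] - max_logit);
    }
    // -log p(token) = log(sum_exp) - (logits[token] - max_logit)
    return std::log(sum_exp) - (logits[token] - max_logit);
}

int main() {
    // Toy stand-in for one evaluated batch: n_tokens rows of n_vocab logits.
    const int n_vocab  = 8;
    const int n_tokens = 4;
    std::vector<float> batch_logits(n_vocab * n_tokens, 0.0f);
    std::vector<int>   targets = {1, 3, 5, 7};
    for (int t = 0; t < n_tokens; ++t) {
        batch_logits[t * n_vocab + targets[t]] = 2.0f; // make each target likely
    }

    // Accumulate NLL by indexing into the existing buffer row by row;
    // no per-token, vocabulary-sized allocation or copy is made.
    double nll = 0.0;
    for (int t = 0; t < n_tokens; ++t) {
        nll += nll_from_logits(batch_logits.data() + t * n_vocab, n_vocab, targets[t]);
    }
    printf("toy perplexity = %.4f\n", std::exp(nll / n_tokens));
    return 0;
}

In the real perplexity tool the logit rows come from the evaluated model context rather than a toy vector, but the pattern the PR title describes is presumably the same: read from the buffer that already holds the logits instead of duplicating it per token or per chunk.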

@ikawrakow added the "performance (Speed related topics)" label on Jan 19, 2024
@ggerganov (Owner) left a comment:

Great. Btw, on M2 Ultra I don't see much difference:

make -j perplexity && ./perplexity -m models/llama-7b-v2/ggml-model-f16.gguf -f ./build/wikitext-2-raw/wiki.test.raw --chunks 64
# master
system_info: n_threads = 16 / 24 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 524.672 ms
perplexity: calculating perplexity over 64 chunks, batch_size=512
perplexity: 0.38 seconds per pass - ETA 0.40 minutes
[1]4.1672,[2]4.6879,[3]5.3354,[4]5.9055,[5]6.0324,[6]5.9499,[7]6.1214,[8]6.2105,[9]6.5348,[10]6.7147,[11]6.9313,[12]6.9794,[13]6.9035,[14]6.9786,[15]7.2016,[16]6.8633,[17]6.7471,[18]6.7377,[19]6.4191,[20]6.4129,[21]6.3387,[22]6.1702,[23]6.1408,[24]6.0507,[25]6.0387,[26]5.8833,[27]5.7017,[28]5.6024,[29]5.5209,[30]5.3714,[31]5.3338,[32]5.3533,[33]5.3097,[34]5.3386,[35]5.3543,[36]5.3790,[37]5.3737,[38]5.3724,[39]5.3859,[40]5.4353,[41]5.4569,[42]5.4941,[43]5.4564,[44]5.5108,[45]5.5202,[46]5.4975,[47]5.5212,[48]5.5007,[49]5.5010,[50]5.4683,[51]5.4688,[52]5.4591,[53]5.5063,[54]5.4939,[55]5.4793,[56]5.5086,[57]5.5277,[58]5.5556,[59]5.5769,[60]5.6259,[61]5.6227,[62]5.6839,[63]5.7190,[64]5.7258,
Final estimate: PPL = 5.7258 +/- 0.10085

llama_print_timings:        load time =     435.30 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   23032.03 ms / 32768 tokens (    0.70 ms per token,  1422.71 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   23781.72 ms / 32769 tokens
# PR
system_info: n_threads = 16 / 24 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 526.708 ms
perplexity: calculating perplexity over 64 chunks, batch_size=512
perplexity: 0.37 seconds per pass - ETA 0.38 minutes
[1]4.1672,[2]4.6879,[3]5.3354,[4]5.9055,[5]6.0324,[6]5.9499,[7]6.1214,[8]6.2105,[9]6.5348,[10]6.7147,[11]6.9313,[12]6.9794,[13]6.9035,[14]6.9786,[15]7.2016,[16]6.8633,[17]6.7471,[18]6.7377,[19]6.4191,[20]6.4129,[21]6.3387,[22]6.1702,[23]6.1408,[24]6.0507,[25]6.0387,[26]5.8833,[27]5.7017,[28]5.6024,[29]5.5209,[30]5.3714,[31]5.3338,[32]5.3533,[33]5.3097,[34]5.3386,[35]5.3543,[36]5.3790,[37]5.3737,[38]5.3724,[39]5.3859,[40]5.4353,[41]5.4569,[42]5.4941,[43]5.4564,[44]5.5108,[45]5.5202,[46]5.4975,[47]5.5212,[48]5.5007,[49]5.5010,[50]5.4683,[51]5.4688,[52]5.4591,[53]5.5063,[54]5.4939,[55]5.4793,[56]5.5086,[57]5.5277,[58]5.5556,[59]5.5769,[60]5.6259,[61]5.6227,[62]5.6839,[63]5.7190,[64]5.7258,
Final estimate: PPL = 5.7258 +/- 0.10085

llama_print_timings:        load time =     433.06 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   23046.12 ms / 32768 tokens (    0.70 ms per token,  1421.84 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   23701.89 ms / 32769 tokens

@ggerganov (Owner)

Nevermind - I was looking at the wrong timings 👍

@ikawrakow merged commit 993fba8 into master on Jan 19, 2024
39 of 47 checks passed
@ikawrakow deleted the ik/faster_ppl branch on January 19, 2024 at 09:02
@Artefact2 (Collaborator)

Considering imatrix shares a lot of code with perplexity, could a similar optimisation also apply to the former?

@ikawrakow (Contributor, Author)

Considering imatrix shares a lot of code with perplexity, could a similar optimisation also apply to the former?

Good point. I'll do it in a bit.

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
…v#5035)

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
…v#5035)

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Labels: performance (Speed related topics)
4 participants