Description
I'm making an issue for this to make sure it isn't forgotten about. I've been able to work around this, but it seems like a bug to me.
ref #5631 (comment)
Steps to Reproduce
- Download the safetensors model from https://huggingface.co/google/gemma-7b
- Check out llama.cpp commit 15499eb (master should reproduce this as well)
- Convert the model and build the perplexity tool:

```
./convert-hf-to-gguf.py gemma-7b --outfile gemma-7b.f16.gguf --outtype f16
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLAMA_CUBLAS=ON
make -C build perplexity
```

- Run perplexity on a Tesla P40 with `-ngl 2` or above:
```
$ CUDA_VISIBLE_DEVICES=0 build/bin/perplexity -f wiki.test.raw -c 2048 -m gemma-7b.f16.gguf -ngl 99
<snip>
perplexity: tokenizing the input ..
perplexity: tokenization took 974.102 ms
perplexity: calculating perplexity over 142 chunks, batch_size=512
perplexity: 6.52 seconds per pass - ETA 15.43 minutes
[1]nan,
```
There's no point in running it longer than that; once a chunk comes out as NaN, the running average stays NaN.
This also occurs with a model quantized to pure F16 from the official GGUF provided by Google.
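
(For anyone trying to reproduce that variant: a pure F16 requantization of the official GGUF can be made with llama.cpp's quantize tool, roughly as sketched below; the input filename is a placeholder for wherever the official GGUF was saved, not the exact command used here.)

```
make -C build quantize
./build/bin/quantize gemma-7b-official.gguf gemma-7b.requant-f16.gguf F16
```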
BUT these NaNs do not occur with `-ngl 1` or with `--no-kv-offload`, so it has something to do with offloading of the KV cache.
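
For comparison, these are the two invocations (same model and test data as above) that did not produce NaNs here; they differ from the failing run only in the flags called out above:

```
# offloading a single layer: no NaNs
CUDA_VISIBLE_DEVICES=0 build/bin/perplexity -f wiki.test.raw -c 2048 -m gemma-7b.f16.gguf -ngl 1

# full offload, but KV cache kept on the host: no NaNs
CUDA_VISIBLE_DEVICES=0 build/bin/perplexity -f wiki.test.raw -c 2048 -m gemma-7b.f16.gguf -ngl 99 --no-kv-offload
```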
cc @JohannesGaessler in case you haven't seen this yet.