Description
I'm making an issue for this to make sure it isn't forgotten about. I've been able to work around this, but it seems like a bug to me.
ref #5631 (comment)
Steps to Reproduce
- Download the safetensors model from https://huggingface.co/google/gemma-7b
- Check out llama.cpp commit 15499eb (master should reproduce this as well)
- Convert the model and build the perplexity tool:

```
./convert-hf-to-gguf.py gemma-7b --outfile gemma-7b.f16.gguf --outtype f16
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLAMA_CUBLAS=ON
make -C build perplexity
```

- Run perplexity on a Tesla P40 with `-ngl 2` or above:
```
$ CUDA_VISIBLE_DEVICES=0 build/bin/perplexity -f wiki.test.raw -c 2048 -m gemma-7b.f16.gguf -ngl 99
<snip>
perplexity: tokenizing the input ..
perplexity: tokenization took 974.102 ms
perplexity: calculating perplexity over 142 chunks, batch_size=512
perplexity: 6.52 seconds per pass - ETA 15.43 minutes
[1]nan,
```
There's no point in running it longer than that; once a chunk comes out as NaN, the running average stays NaN.
This also occurs with a model quantized to pure F16 from the official GGUF provided by Google.
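
(For anyone trying to reproduce that variant: a pure F16 requantization of the official GGUF can be made with llama.cpp's quantize tool, roughly as sketched below; the input filename is a placeholder for wherever the official GGUF was saved, not the exact command used here.)

```
make -C build quantize
./build/bin/quantize gemma-7b-official.gguf gemma-7b.requant-f16.gguf F16
```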
BUT these NaNs do not occur with `-ngl 1` or with `--no-kv-offload`, so it has something to do with offloading of the KV cache.
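
For comparison, these are the two invocations (same model and test data as above) that did not produce NaNs here; they differ from the failing run only in the flags called out above:

```
# offloading a single layer: no NaNs
CUDA_VISIBLE_DEVICES=0 build/bin/perplexity -f wiki.test.raw -c 2048 -m gemma-7b.f16.gguf -ngl 1

# full offload, but KV cache kept on the host: no NaNs
CUDA_VISIBLE_DEVICES=0 build/bin/perplexity -f wiki.test.raw -c 2048 -m gemma-7b.f16.gguf -ngl 99 --no-kv-offload
```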
cc @JohannesGaessler in case you haven't seen this yet.