
cuda: NaN perplexity with some models on some GPUs (Gemma, MPT) #5817

Closed
@cebtenzzre

Description

I'm making an issue for this to make sure it isn't forgotten about. I've been able to work around this, but it seems like a bug to me.

ref #5631 (comment)

Steps to Reproduce

  1. Download safetensors model from https://huggingface.co/google/gemma-7b
  2. Check out llama.cpp commit 15499eb (master should reproduce this as well)
  3. ./convert-hf-to-gguf.py gemma-7b --outfile gemma-7b.f16.gguf --outtype f16
  4. cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLAMA_CUBLAS=ON
  5. make -C build perplexity
  6. Run perplexity on a Tesla P40. Use -ngl 2 or above.
$ CUDA_VISIBLE_DEVICES=0 build/bin/perplexity -f wiki.test.raw -c 2048 -m gemma-7b.f16.gguf -ngl 99
<snip>
perplexity: tokenizing the input ..
perplexity: tokenization took 974.102 ms
perplexity: calculating perplexity over 142 chunks, batch_size=512
perplexity: 6.52 seconds per pass - ETA 15.43 minutes
[1]nan,

There's no point in running it any further than that, because the running average will stay NaN.

This also occurs with a model quantized to pure F16 from the official GGUF provided by Google.
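
For reference, a pure-F16 requant of the official GGUF can be produced with the stock quantize tool, roughly like this (the file names here are just placeholders):

$ build/bin/quantize gemma-7b-official.gguf gemma-7b.pure-f16.gguf f16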

But these NaNs do not occur with -ngl 1 or with --no-kv-offload, so it has something to do with offloading of the KV cache.
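
For example, the same repro command with KV cache offloading disabled (all other flags identical) does not produce NaNs on this GPU:

$ CUDA_VISIBLE_DEVICES=0 build/bin/perplexity -f wiki.test.raw -c 2048 -m gemma-7b.f16.gguf -ngl 99 --no-kv-offload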

cc @JohannesGaessler in case you haven't seen this yet.
