unknown dtype for tensor (BF16?) #663
Comments
@oldgithubman yes, this is the problem. Please see huggingface/candle#2387. This will enable support for BF16 and more descriptive errors!
@oldgithubman given that the Candle PR hasn't been merged, I have mirrored my changes onto our Candle fork so we can proceed. Please see #691, which should enable this to work. To test:
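A minimal sketch of how a PR branch like this is typically tested; the `gguf` subcommand, its `-m`/`-f` flags, and the feature flag are assumptions based on the mistral.rs CLI of this era, so check `--help` for your build:

```bash
# Fetch and check out the PR branch (standard GitHub pull-ref syntax).
git fetch origin pull/691/head:pr-691
git checkout pr-691

# Build the server; the backend feature depends on your hardware (CUDA assumed here).
cargo build --release --features cuda

# Point the gguf pipeline at the quant whose output/embedding tensors are BF16.
# Subcommand and flag names are assumptions; verify with `mistralrs-server gguf --help`.
./target/release/mistralrs-server gguf \
    -m /path/to/model/dir \
    -f athene-70b-q8_0-bf16.gguf
```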
@oldgithubman thanks, that should be fixed now if you pull the latest changes on the branch.
@oldgithubman can you please run with `RUST_BACKTRACE=1`?
That was run with `RUST_BACKTRACE=full`. Do you still want me to do it with 1?
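For reference, a minimal sketch of the two invocations being discussed; the binary path and arguments are assumptions carried over from the test sketch above:

```bash
# Short backtrace: the panic message plus the frames nearest the panic site.
RUST_BACKTRACE=1 ./target/release/mistralrs-server gguf -m /path/to/model/dir -f athene-70b-q8_0-bf16.gguf

# Full backtrace: every frame, including runtime and standard-library internals.
RUST_BACKTRACE=full ./target/release/mistralrs-server gguf -m /path/to/model/dir -f athene-70b-q8_0-bf16.gguf
```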
Ah ok, thanks, I'll take a look.
@oldgithubman I just updated the branch to correctly set up the QMatMul (#691).
Works!
@oldgithubman thanks for confirming! I just merged #691, so this feature is available on the master branch.
@oldgithubman closing this issue as it works, please feel free to reopen!
Describe the bug
My Q8_0 quant of Athene-70B loads fine. I have another quant that is identical except that the output and embedding tensors are BF16; loading it fails with the "unknown dtype for tensor" error in the title.
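For context, a mixed quant like this is typically produced with llama.cpp's quantize tool; the binary name and flag names below are assumptions based on recent llama.cpp builds, so verify against `llama-quantize --help`:

```bash
# Quantize the bulk of the model to Q8_0 while keeping the output and
# token-embedding tensors in BF16 (flag names are assumptions).
./llama-quantize \
    --output-tensor-type bf16 \
    --token-embedding-type bf16 \
    athene-70b-f16.gguf athene-70b-q8_0-bf16.gguf q8_0
```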
Latest commit or version
0.2.4