
Unhandled Exception trying to use Vulkan #644

@Thellton

Description


I'm not sure whether this is the correct place to post this issue, as it could be an upstream problem, but here's hoping.

Hardware used

CPU: Ryzen 5 5600G
GPU: RX6600XT (Driver Version: 23.30.13.01-231128a-398226C-AMD-Software-Adrenalin-Edition)
RAM: 47.9GB of DDR4 at 2133MHz
Motherboard: Gigabyte B450M Aorus Elite

Hopefully the above is useful, but the output below should definitely be. I tried a full offload with my usual setting of 41 layers, tried a second time with only 10 layers, and then tried a third time with 33 layers, i.e. equal to the model's actual number of layers. Something odd I did notice is that ...
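For reference, the three runs correspond roughly to the command lines below, reconstructed from the Namespace dumps in the logs. The exact way I launched may have differed, so treat these as an approximation rather than the literal invocation:

rem Run 1 (reconstructed, approximate): full offload with my usual 41 layers
koboldcpp_vulkan.exe --usevulkan --gpulayers 41 --contextsize 16384 --blasbatchsize 2048 --threads 5 --smartcontext --noshift --highpriority --ropeconfig 1.0 32000.0 --host 192.168.68.111 --port 6681 --launch "D:/AI-Art-tools/Models/Text_Generation/neuralbeagle14-7b.Q4_K_S.gguf"
rem Run 2: same flags, but with --gpulayers 10
rem Run 3: same flags, but with --gpulayers 33 (matching the model's layer count)

All three runs fail the same way, as shown in the logs below.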

CMD output

...\Kobold AI>koboldcpp_vulkan.exe
***
Welcome to KoboldCpp - Version 1.56
For command line arguments, please refer to --help
***
Setting process to Higher Priority - Use Caution
High Priority for Windows Set: Priority.NORMAL_PRIORITY_CLASS to Priority.HIGH_PRIORITY_CLASS
Attempting to use Vulkan library for faster prompt ingestion. A compatible Vulkan will be required.
Initializing dynamic library: koboldcpp_vulkan.dll
==========
Namespace(bantokens=None, blasbatchsize=2048, blasthreads=5, config=None, contextsize=16384, debugmode=0, forceversion=0, foreground=False, gpulayers=41, highpriority=True, hordeconfig=None, host='192.168.68.111', launch=True, lora=None, model=None, model_param='D:/AI-Art-tools/Models/Text_Generation/neuralbeagle14-7b.Q4_K_S.gguf', multiuser=0, noavx2=False, noblas=False, nommap=False, noshift=True, onready='', port=5001, port_param=6681, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[1.0, 32000.0], skiplauncher=False, smartcontext=True, ssl=None, tensor_split=None, threads=5, useclblast=None, usecublas=None, usemlock=False, usevulkan=0)
==========
Loading model: D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf
[Threads: 5, BlasThreads: 5, SmartContext: True, ContextShift: False]

The reported GGUF Arch is: llama

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using Custom RoPE scaling (scale:1.000, base:32000.0).
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_vulkan: Using AMD Radeon RX 6600 XT
ggml_vulkan: 16-bit enabled
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.86 GiB (4.57 BPW)
llm_load_print_meta: general.name     = mlabonne_neuralbeagle14-7b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors:     Vulkan buffer size =  3877.55 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: freq_base  = 32000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:     Vulkan KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: Vulkan_Host input buffer size   =   160.08 MiB
Traceback (most recent call last):
  File "koboldcpp.py", line 2580, in <module>
  File "koboldcpp.py", line 2426, in main
  File "koboldcpp.py", line 328, in load_model
OSError: [WinError -1073741569] Windows Error 0xc00000ff
[14452] Failed to execute script 'koboldcpp' due to unhandled exception!

D:\AI-Art-tools\Kobold AI>koboldcpp_vulkan.exe
***
Welcome to KoboldCpp - Version 1.56
For command line arguments, please refer to --help
***
Setting process to Higher Priority - Use Caution
High Priority for Windows Set: Priority.NORMAL_PRIORITY_CLASS to Priority.HIGH_PRIORITY_CLASS
Attempting to use Vulkan library for faster prompt ingestion. A compatible Vulkan will be required.
Initializing dynamic library: koboldcpp_vulkan.dll
==========
Namespace(bantokens=None, blasbatchsize=2048, blasthreads=5, config=None, contextsize=16384, debugmode=0, forceversion=0, foreground=False, gpulayers=10, highpriority=True, hordeconfig=None, host='192.168.68.111', launch=True, lora=None, model=None, model_param='D:/AI-Art-tools/Models/Text_Generation/neuralbeagle14-7b.Q4_K_S.gguf', multiuser=0, noavx2=False, noblas=False, nommap=False, noshift=True, onready='', port=5001, port_param=6681, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[1.0, 32000.0], skiplauncher=False, smartcontext=True, ssl=None, tensor_split=None, threads=5, useclblast=None, usecublas=None, usemlock=False, usevulkan=0)
==========
Loading model: D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf
[Threads: 5, BlasThreads: 5, SmartContext: True, ContextShift: False]

The reported GGUF Arch is: llama

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using Custom RoPE scaling (scale:1.000, base:32000.0).
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_vulkan: Using AMD Radeon RX 6600 XT
ggml_vulkan: 16-bit enabled
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.86 GiB (4.57 BPW)
llm_load_print_meta: general.name     = mlabonne_neuralbeagle14-7b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 10 repeating layers to GPU
llm_load_tensors: offloaded 10/33 layers to GPU
llm_load_tensors:        CPU buffer size =  3947.87 MiB
llm_load_tensors:     Vulkan buffer size =  1170.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: freq_base  = 32000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Vulkan_Host KV buffer size =  1408.00 MiB
llama_kv_cache_init:     Vulkan KV buffer size =   640.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: Vulkan_Host input buffer size   =   160.08 MiB
Traceback (most recent call last):
  File "koboldcpp.py", line 2580, in <module>
  File "koboldcpp.py", line 2426, in main
  File "koboldcpp.py", line 328, in load_model
OSError: [WinError -1073741569] Windows Error 0xc00000ff
[10284] Failed to execute script 'koboldcpp' due to unhandled exception!

D:\AI-Art-tools\Kobold AI>koboldcpp_vulkan.exe
***
Welcome to KoboldCpp - Version 1.56
For command line arguments, please refer to --help
***
Setting process to Higher Priority - Use Caution
High Priority for Windows Set: Priority.NORMAL_PRIORITY_CLASS to Priority.HIGH_PRIORITY_CLASS
Attempting to use Vulkan library for faster prompt ingestion. A compatible Vulkan will be required.
Initializing dynamic library: koboldcpp_vulkan.dll
==========
Namespace(bantokens=None, blasbatchsize=2048, blasthreads=5, config=None, contextsize=16384, debugmode=0, forceversion=0, foreground=False, gpulayers=33, highpriority=True, hordeconfig=None, host='192.168.68.111', launch=True, lora=None, model=None, model_param='D:/AI-Art-tools/Models/Text_Generation/neuralbeagle14-7b.Q4_K_S.gguf', multiuser=0, noavx2=False, noblas=False, nommap=False, noshift=True, onready='', port=5001, port_param=6681, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[1.0, 32000.0], skiplauncher=False, smartcontext=True, ssl=None, tensor_split=None, threads=5, useclblast=None, usecublas=None, usemlock=False, usevulkan=0)
==========
Loading model: D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf
[Threads: 5, BlasThreads: 5, SmartContext: True, ContextShift: False]

The reported GGUF Arch is: llama

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using Custom RoPE scaling (scale:1.000, base:32000.0).
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_vulkan: Using AMD Radeon RX 6600 XT
ggml_vulkan: 16-bit enabled
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.86 GiB (4.57 BPW)
llm_load_print_meta: general.name     = mlabonne_neuralbeagle14-7b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors:     Vulkan buffer size =  3877.55 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: freq_base  = 32000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:     Vulkan KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: Vulkan_Host input buffer size   =   160.08 MiB
Traceback (most recent call last):
  File "koboldcpp.py", line 2580, in <module>
  File "koboldcpp.py", line 2426, in main
  File "koboldcpp.py", line 328, in load_model
OSError: [WinError -1073741569] Windows Error 0xc00000ff
[8336] Failed to execute script 'koboldcpp' due to unhandled exception!
