Unhandled Exception trying to use Vulkan #644

Closed
Thellton opened this issue Jan 28, 2024 · 14 comments

@Thellton

I'm not sure if this is the correct place to post this issue, as it could be an upstream issue, but here's hoping.

Hardware used

CPU: Ryzen 5 5600G
GPU: RX6600XT (Driver Version: 23.30.13.01-231128a-398226C-AMD-Software-Adrenalin-Edition)
RAM: 47.9GB of DDR4 at 2133MHz
Motherboard: Gigabyte B450M Aorus Elite

Hopefully the above is useful, but the below should absolutely be. I tried a full offload as below using my normal specification of 41 layers, tried a second time with only 10 layers specified, then tried a third time with 33 layers specified, i.e. equal to the model's actual number of layers. Something odd I did notice is that
 

CMD output

...\Kobold AI>koboldcpp_vulkan.exe
***
Welcome to KoboldCpp - Version 1.56
For command line arguments, please refer to --help
***
Setting process to Higher Priority - Use Caution
High Priority for Windows Set: Priority.NORMAL_PRIORITY_CLASS to Priority.HIGH_PRIORITY_CLASS
Attempting to use Vulkan library for faster prompt ingestion. A compatible Vulkan will be required.
Initializing dynamic library: koboldcpp_vulkan.dll
==========
Namespace(bantokens=None, blasbatchsize=2048, blasthreads=5, config=None, contextsize=16384, debugmode=0, forceversion=0, foreground=False, gpulayers=41, highpriority=True, hordeconfig=None, host='192.168.68.111', launch=True, lora=None, model=None, model_param='D:/AI-Art-tools/Models/Text_Generation/neuralbeagle14-7b.Q4_K_S.gguf', multiuser=0, noavx2=False, noblas=False, nommap=False, noshift=True, onready='', port=5001, port_param=6681, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[1.0, 32000.0], skiplauncher=False, smartcontext=True, ssl=None, tensor_split=None, threads=5, useclblast=None, usecublas=None, usemlock=False, usevulkan=0)
==========
Loading model: D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf
[Threads: 5, BlasThreads: 5, SmartContext: True, ContextShift: False]

The reported GGUF Arch is: llama

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using Custom RoPE scaling (scale:1.000, base:32000.0).
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_vulkan: Using AMD Radeon RX 6600 XT
ggml_vulkan: 16-bit enabled
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.86 GiB (4.57 BPW)
llm_load_print_meta: general.name     = mlabonne_neuralbeagle14-7b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors:     Vulkan buffer size =  3877.55 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: freq_base  = 32000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:     Vulkan KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: Vulkan_Host input buffer size   =   160.08 MiB
Traceback (most recent call last):
  File "koboldcpp.py", line 2580, in <module>
  File "koboldcpp.py", line 2426, in main
  File "koboldcpp.py", line 328, in load_model
OSError: [WinError -1073741569] Windows Error 0xc00000ff
[14452] Failed to execute script 'koboldcpp' due to unhandled exception!

D:\AI-Art-tools\Kobold AI>koboldcpp_vulkan.exe
***
Welcome to KoboldCpp - Version 1.56
For command line arguments, please refer to --help
***
Setting process to Higher Priority - Use Caution
High Priority for Windows Set: Priority.NORMAL_PRIORITY_CLASS to Priority.HIGH_PRIORITY_CLASS
Attempting to use Vulkan library for faster prompt ingestion. A compatible Vulkan will be required.
Initializing dynamic library: koboldcpp_vulkan.dll
==========
Namespace(bantokens=None, blasbatchsize=2048, blasthreads=5, config=None, contextsize=16384, debugmode=0, forceversion=0, foreground=False, gpulayers=10, highpriority=True, hordeconfig=None, host='192.168.68.111', launch=True, lora=None, model=None, model_param='D:/AI-Art-tools/Models/Text_Generation/neuralbeagle14-7b.Q4_K_S.gguf', multiuser=0, noavx2=False, noblas=False, nommap=False, noshift=True, onready='', port=5001, port_param=6681, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[1.0, 32000.0], skiplauncher=False, smartcontext=True, ssl=None, tensor_split=None, threads=5, useclblast=None, usecublas=None, usemlock=False, usevulkan=0)
==========
Loading model: D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf
[Threads: 5, BlasThreads: 5, SmartContext: True, ContextShift: False]

The reported GGUF Arch is: llama

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using Custom RoPE scaling (scale:1.000, base:32000.0).
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_vulkan: Using AMD Radeon RX 6600 XT
ggml_vulkan: 16-bit enabled
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.86 GiB (4.57 BPW)
llm_load_print_meta: general.name     = mlabonne_neuralbeagle14-7b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 10 repeating layers to GPU
llm_load_tensors: offloaded 10/33 layers to GPU
llm_load_tensors:        CPU buffer size =  3947.87 MiB
llm_load_tensors:     Vulkan buffer size =  1170.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: freq_base  = 32000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Vulkan_Host KV buffer size =  1408.00 MiB
llama_kv_cache_init:     Vulkan KV buffer size =   640.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: Vulkan_Host input buffer size   =   160.08 MiB
Traceback (most recent call last):
  File "koboldcpp.py", line 2580, in <module>
  File "koboldcpp.py", line 2426, in main
  File "koboldcpp.py", line 328, in load_model
OSError: [WinError -1073741569] Windows Error 0xc00000ff
[10284] Failed to execute script 'koboldcpp' due to unhandled exception!

D:\AI-Art-tools\Kobold AI>koboldcpp_vulkan.exe
***
Welcome to KoboldCpp - Version 1.56
For command line arguments, please refer to --help
***
Setting process to Higher Priority - Use Caution
High Priority for Windows Set: Priority.NORMAL_PRIORITY_CLASS to Priority.HIGH_PRIORITY_CLASS
Attempting to use Vulkan library for faster prompt ingestion. A compatible Vulkan will be required.
Initializing dynamic library: koboldcpp_vulkan.dll
==========
Namespace(bantokens=None, blasbatchsize=2048, blasthreads=5, config=None, contextsize=16384, debugmode=0, forceversion=0, foreground=False, gpulayers=33, highpriority=True, hordeconfig=None, host='192.168.68.111', launch=True, lora=None, model=None, model_param='D:/AI-Art-tools/Models/Text_Generation/neuralbeagle14-7b.Q4_K_S.gguf', multiuser=0, noavx2=False, noblas=False, nommap=False, noshift=True, onready='', port=5001, port_param=6681, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[1.0, 32000.0], skiplauncher=False, smartcontext=True, ssl=None, tensor_split=None, threads=5, useclblast=None, usecublas=None, usemlock=False, usevulkan=0)
==========
Loading model: D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf
[Threads: 5, BlasThreads: 5, SmartContext: True, ContextShift: False]

The reported GGUF Arch is: llama

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using Custom RoPE scaling (scale:1.000, base:32000.0).
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_vulkan: Using AMD Radeon RX 6600 XT
ggml_vulkan: 16-bit enabled
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.86 GiB (4.57 BPW)
llm_load_print_meta: general.name     = mlabonne_neuralbeagle14-7b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors:     Vulkan buffer size =  3877.55 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: freq_base  = 32000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:     Vulkan KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: Vulkan_Host input buffer size   =   160.08 MiB
Traceback (most recent call last):
  File "koboldcpp.py", line 2580, in <module>
  File "koboldcpp.py", line 2426, in main
  File "koboldcpp.py", line 328, in load_model
OSError: [WinError -1073741569] Windows Error 0xc00000ff
[8336] Failed to execute script 'koboldcpp' due to unhandled exception!
@LostRuins
Owner

Does it work if you select 0 layers?

Right now there are a few known issues with Vulkan:

  • Not working on Mixtral
  • If it OOMs it will just segfault silently

@Thellton
Author

Thellton commented Jan 28, 2024

No, it doesn't work; I just tested it with 0 layers as requested.

CMD output for 0 layers

\Kobold AI>koboldcpp_vulkan_1.56.exe
***
Welcome to KoboldCpp - Version 1.56
For command line arguments, please refer to --help
***
Setting process to Higher Priority - Use Caution
High Priority for Windows Set: Priority.NORMAL_PRIORITY_CLASS to Priority.HIGH_PRIORITY_CLASS
Attempting to use Vulkan library for faster prompt ingestion. A compatible Vulkan will be required.
Initializing dynamic library: koboldcpp_vulkan.dll
==========
Namespace(bantokens=None, blasbatchsize=2048, blasthreads=5, config=None, contextsize=16384, debugmode=0, forceversion=0, foreground=False, gpulayers=0, highpriority=True, hordeconfig=None, host='192.168.68.111', launch=True, lora=None, model=None, model_param='D:/AI-Art-tools/Models/Text_Generation/neuralbeagle14-7b.Q4_K_S.gguf', multiuser=0, noavx2=False, noblas=False, nommap=False, noshift=True, onready='', port=5001, port_param=6681, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[1.0, 32000.0], skiplauncher=False, smartcontext=True, ssl=None, tensor_split=None, threads=5, useclblast=None, usecublas=None, usemlock=False, usevulkan=0)
==========
Loading model: D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf
[Threads: 5, BlasThreads: 5, SmartContext: True, ContextShift: False]

The reported GGUF Arch is: llama

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using Custom RoPE scaling (scale:1.000, base:32000.0).
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_vulkan: Using AMD Radeon RX 6600 XT
ggml_vulkan: 16-bit enabled
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.86 GiB (4.57 BPW)
llm_load_print_meta: general.name     = mlabonne_neuralbeagle14-7b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  3947.87 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: freq_base  = 32000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Vulkan_Host KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: Vulkan_Host input buffer size   =   160.08 MiB
Traceback (most recent call last):
  File "koboldcpp.py", line 2580, in <module>
  File "koboldcpp.py", line 2426, in main
  File "koboldcpp.py", line 328, in load_model
OSError: [WinError -1073741569] Windows Error 0xc00000ff
[2648] Failed to execute script 'koboldcpp' due to unhandled exception!

@Thellton
Author

Thellton commented Jan 28, 2024

So, something odd is going on. The above four tests were done by changing the inference back-end variable in the GUI launcher of a previously saved set of settings from prior to 1.56, one that was set up for OpenCL with a custom RoPE config, a RoPE config that I'm realising might not have been set up correctly (I don't think it was actually extending the context at all...). Anyway, I just unchecked the custom RoPE config setting in the GUI when using that same settings file, and it's now launching properly.

Maybe there is something wrong with my RoPE config that Vulkan didn't like and wasn't handling gracefully, unlike OpenCL, hence line 328 of koboldcpp.py being pointed to in the traceback.

@LostRuins
Owner

LostRuins commented Jan 28, 2024

Okay, just to clarify, I'm not sure if it's specific to the NeuralBeagle model. Could you try a known-good model with Vulkan, like this one: https://huggingface.co/TheBloke/airoboros-mistral2.2-7B-GGUF/blob/main/airoboros-mistral2.2-7b.Q2_K.gguf

Edit: Yeah, or it could be the RoPE config.

@Thellton
Author

Thellton commented Jan 28, 2024

Airoboros-mistral2.2-7b.Q4_K_M works fine, using the same settings with the custom RoPE config disabled.

@Thellton
Author

Thellton commented Jan 28, 2024

Scratch that, something's still very much not right. I shut down KoboldCpp and restarted it, and it's gone back to failing as above, with both of the models I tried earlier reporting the same error. I think either that report of functionality was a fluke, or I dreamed I had it set to Vulkan.

Regardless, I've tested using Vulkan to load the smallest GGUF I have, 'TinyMistral-248m_Q8', which did successfully load, considering it's only 258,000 KB, and is outputting precisely what I'd expect of it (absolute garbage, since it's not really that well trained), but it did load.

I think this might be that segfault issue.

Edit: I will add that the OSError that is thrown is apparently a known open issue with the llama-cpp-python bindings?

@LostRuins
Owner

Nah, the OSError is just a generic message; it could be anything. If it works with the other backends, then it could be some issue relating to the Vulkan implementation. Since it's still an early work in progress, it's probably best to just try it again next version.
You can try a few other llama-based models and see if they work.
Or maybe Phi.
What about this one: https://huggingface.co/afrideva/phi-2-uncensored-GGUF/resolve/main/phi-2-uncensored.q4_k_m.gguf

@Thellton
Author

Thellton commented Jan 28, 2024

Just tested with the Phi-2 model you recommended; it's repeating itself badly under Vulkan but outputs coherently under OpenCL. Looks like I'll probably have to wait for the next revision, as you concluded. C'est la vie, as they say.

@LostRuins
Owner

But it doesn't crash? That's the important part. So whatever issues you were facing were model-specific rather than device-specific.

Phi-2 incoherence is a known issue on Vulkan - ggerganov#2059 (comment)

@Thellton
Author

Thellton commented Jan 29, 2024

It didn't crash five hours ago, and it was incoherent, as I said, when I ran it. I just repeated some of the tests with Phi-2 Q4_K_M, which just now failed each time in the full offload, partial offload, and zero offload states. The cause of this crashing is that the automatic RoPE config was engaging, as I had thoughtlessly set the context to 16k. When I turned the context length down to 2k, Phi-2 Q4_K_M loaded as expected and was incoherent as expected with all GPU offload strategies tested.

Failed Phi-2 Launch

This was at my normal requested context length of 16k (derived from the normal settings for NeuralBeagle14 with OpenCL).

D:\AI-Art-tools\Kobold AI>koboldcpp.exe
***
Welcome to KoboldCpp - Version 1.56
For command line arguments, please refer to --help
***
Setting process to Higher Priority - Use Caution
High Priority for Windows Set: Priority.NORMAL_PRIORITY_CLASS to Priority.HIGH_PRIORITY_CLASS
Attempting to use Vulkan library for faster prompt ingestion. A compatible Vulkan will be required.
Initializing dynamic library: koboldcpp_vulkan.dll
==========
Namespace(bantokens=None, blasbatchsize=2048, blasthreads=5, config=None, contextsize=16384, debugmode=0, forceversion=0, foreground=False, gpulayers=41, highpriority=True, hordeconfig=None, host='192.168.68.111', launch=True, lora=None, model=None, model_param='D:/AI-Art-tools/Models/Text_Generation/phi-2-uncensored.q8_0.gguf', multiuser=0, noavx2=False, noblas=False, nommap=False, noshift=False, onready='', port=5001, port_param=6681, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=5, useclblast=None, usecublas=None, usemlock=False, usevulkan=0)
==========
Loading model: D:\AI-Art-tools\Models\Text_Generation\phi-2-uncensored.q8_0.gguf
[Threads: 5, BlasThreads: 5, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: phi2

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_vulkan: Using AMD Radeon RX 6600 XT
ggml_vulkan: 16-bit enabled
llama_model_loader: loaded meta data with 20 key-value pairs and 325 tensors from D:\AI-Art-tools\Models\Text_Generation\phi-2-uncensored.q8_0.gguf (version GGUF V3 (latest))
llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 51200
llm_load_print_meta: n_merges         = 50000
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2560
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_embd_head_k    = 80
llm_load_print_meta: n_embd_head_v    = 80
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2560
llm_load_print_meta: n_embd_v_gqa     = 2560
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 10240
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 2.78 B
llm_load_print_meta: model size       = 2.75 GiB (8.51 BPW)
llm_load_print_meta: general.name     = Phi2
llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_tensors: ggml ctx size =    0.25 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   132.81 MiB
llm_load_tensors:     Vulkan buffer size =  2686.46 MiB
.............................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:200000.0).
llama_new_context_with_model: n_ctx      = 16464
llama_new_context_with_model: freq_base  = 200000.0
llama_new_context_with_model: freq_scale = 1
Traceback (most recent call last):
  File "koboldcpp.py", line 2580, in <module>
  File "koboldcpp.py", line 2426, in main
  File "koboldcpp.py", line 328, in load_model
OSError: [WinError -1073741569] Windows Error 0xc00000ff
[1512] Failed to execute script 'koboldcpp' due to unhandled exception!

Successful but incoherent Phi-2 Launch

This was after I adjusted the context length in the quick start settings to 2k.

Kobold AI>koboldcpp.exe
***
Welcome to KoboldCpp - Version 1.56
For command line arguments, please refer to --help
***
Setting process to Higher Priority - Use Caution
High Priority for Windows Set: Priority.NORMAL_PRIORITY_CLASS to Priority.HIGH_PRIORITY_CLASS
Attempting to use Vulkan library for faster prompt ingestion. A compatible Vulkan will be required.
Initializing dynamic library: koboldcpp_vulkan.dll
==========
Namespace(bantokens=None, blasbatchsize=2048, blasthreads=5, config=None, contextsize=2048, debugmode=0, forceversion=0, foreground=False, gpulayers=41, highpriority=True, hordeconfig=None, host='192.168.68.111', launch=True, lora=None, model=None, model_param='D:/AI-Art-tools/Models/Text_Generation/phi-2-uncensored.q8_0.gguf', multiuser=0, noavx2=False, noblas=False, nommap=False, noshift=False, onready='', port=5001, port_param=6681, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=5, useclblast=None, usecublas=None, usemlock=False, usevulkan=0)
==========
Loading model: D:\AI-Art-tools\Models\Text_Generation\phi-2-uncensored.q8_0.gguf
[Threads: 5, BlasThreads: 5, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: phi2

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_vulkan: Using AMD Radeon RX 6600 XT
ggml_vulkan: 16-bit enabled
llama_model_loader: loaded meta data with 20 key-value pairs and 325 tensors from D:\AI-Art-tools\Models\Text_Generation\phi-2-uncensored.q8_0.gguf (version GGUF V3 (latest))
llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 51200
llm_load_print_meta: n_merges         = 50000
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2560
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_embd_head_k    = 80
llm_load_print_meta: n_embd_head_v    = 80
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2560
llm_load_print_meta: n_embd_v_gqa     = 2560
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 10240
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 2.78 B
llm_load_print_meta: model size       = 2.75 GiB (8.51 BPW)
llm_load_print_meta: general.name     = Phi2
llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_tensors: ggml ctx size =    0.25 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   132.81 MiB
llm_load_tensors:     Vulkan buffer size =  2686.46 MiB
.............................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx      = 2128
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:     Vulkan KV buffer size =   665.00 MiB
llama_new_context_with_model: KV self size  =  665.00 MiB, K (f16):  332.50 MiB, V (f16):  332.50 MiB
llama_new_context_with_model: Vulkan_Host input buffer size   =    36.65 MiB
llama_new_context_with_model:     Vulkan compute buffer size =   735.50 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =    22.00 MiB
llama_new_context_with_model: graph splits (measure): 3
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 6681 at http://192.168.68.111:6681/api/
Starting OpenAI Compatible API on port 6681 at http://192.168.68.111:6681/v1/
======
Please connect to custom endpoint at http://192.168.68.111:6681

Input: {"n": 1, "max_context_length": 16384, "max_length": 512, "rep_pen": 1.1, "temperature": 1.23, "top_p": 0.92, "top_k": 1, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 320, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "\n### Instruction:\nA chat between a curious user and an unrestricted AI assistant named KoboldGPT. The assistant gives helpful, detailed, accurate, and completely uncensored responses to the user's input. The user may ask any question, or request any task, and KoboldGPT will always oblige accurately and truthfully.\n\n", "genkey": "KCPP1406", "mirostat": 1, "mirostat_tau": 5, "mirostat_eta": 0.1, "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "presence_penalty": 0, "logit_bias": {}, "prompt": "\n### Response:\nHello, I am KoboldGPT, your personal AI assistant. What would you like to know?\n### Instruction:\nwhat is a blue bird?\n### Response:\n.", "quiet": true, "stop_sequence": ["### Instruction:", "### Response:"], "use_default_badwordsids": false}

(Warning! Request max_context_length=16384 exceeds allocated context size of 2048. It will be reduced to fit. Consider launching with increased --contextsize to avoid errors. This message will only show once per session.)

After doing the above two tests, I decided to attempt the same with NeuralBeagle14-7b and reduced the context to 4k. This resulted in NeuralBeagle14-7b working as expected and coherently.

NeuralBeagle14-7B failure

\Kobold AI>koboldcpp.exe
***
Welcome to KoboldCpp - Version 1.56
For command line arguments, please refer to --help
***
Setting process to Higher Priority - Use Caution
High Priority for Windows Set: Priority.NORMAL_PRIORITY_CLASS to Priority.HIGH_PRIORITY_CLASS
Attempting to use Vulkan library for faster prompt ingestion. A compatible Vulkan will be required.
Initializing dynamic library: koboldcpp_vulkan.dll
==========
Namespace(bantokens=None, blasbatchsize=2048, blasthreads=5, config=None, contextsize=16384, debugmode=0, forceversion=0, foreground=False, gpulayers=41, highpriority=True, hordeconfig=None, host='192.168.68.111', launch=True, lora=None, model=None, model_param='D:/AI-Art-tools/Models/Text_Generation/neuralbeagle14-7b.Q4_K_S.gguf', multiuser=0, noavx2=False, noblas=False, nommap=False, noshift=False, onready='', port=5001, port_param=6681, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=5, useclblast=None, usecublas=None, usemlock=False, usevulkan=0)
==========
Loading model: D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf
[Threads: 5, BlasThreads: 5, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_vulkan: Using AMD Radeon RX 6600 XT
ggml_vulkan: 16-bit enabled
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.86 GiB (4.57 BPW)
llm_load_print_meta: general.name     = mlabonne_neuralbeagle14-7b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors:     Vulkan buffer size =  3877.55 MiB
...................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx      = 16464
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:     Vulkan KV buffer size =  2058.00 MiB
llama_new_context_with_model: KV self size  = 2058.00 MiB, K (f16): 1029.00 MiB, V (f16): 1029.00 MiB
llama_new_context_with_model: Vulkan_Host input buffer size   =   160.70 MiB
Traceback (most recent call last):
  File "koboldcpp.py", line 2580, in <module>
  File "koboldcpp.py", line 2426, in main
  File "koboldcpp.py", line 328, in load_model
OSError: [WinError -1073741569] Windows Error 0xc00000ff
[9316] Failed to execute script 'koboldcpp' due to unhandled exception!

NeuralBeagle14-7b success

Kobold AI>koboldcpp.exe
***
Welcome to KoboldCpp - Version 1.56
For command line arguments, please refer to --help
***
Setting process to Higher Priority - Use Caution
High Priority for Windows Set: Priority.NORMAL_PRIORITY_CLASS to Priority.HIGH_PRIORITY_CLASS
Attempting to use Vulkan library for faster prompt ingestion. A compatible Vulkan will be required.
Initializing dynamic library: koboldcpp_vulkan.dll
==========
Namespace(bantokens=None, blasbatchsize=2048, blasthreads=5, config=None, contextsize=4096, debugmode=0, forceversion=0, foreground=False, gpulayers=41, highpriority=True, hordeconfig=None, host='192.168.68.111', launch=True, lora=None, model=None, model_param='D:/AI-Art-tools/Models/Text_Generation/neuralbeagle14-7b.Q4_K_S.gguf', multiuser=0, noavx2=False, noblas=False, nommap=False, noshift=False, onready='', port=5001, port_param=6681, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=5, useclblast=None, usecublas=None, usemlock=False, usevulkan=0)
==========
Loading model: D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf
[Threads: 5, BlasThreads: 5, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_vulkan: Using AMD Radeon RX 6600 XT
ggml_vulkan: 16-bit enabled
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.86 GiB (4.57 BPW)
llm_load_print_meta: general.name     = mlabonne_neuralbeagle14-7b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors:     Vulkan buffer size =  3877.55 MiB
...................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx      = 4176
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:     Vulkan KV buffer size =   522.00 MiB
llama_new_context_with_model: KV self size  =  522.00 MiB, K (f16):  261.00 MiB, V (f16):  261.00 MiB
llama_new_context_with_model: Vulkan_Host input buffer size   =    64.66 MiB
llama_new_context_with_model:     Vulkan compute buffer size =  1289.90 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =    35.20 MiB
llama_new_context_with_model: graph splits (measure): 3
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 6681 at http://192.168.68.111:6681/api/
Starting OpenAI Compatible API on port 6681 at http://192.168.68.111:6681/v1/
======
Please connect to custom endpoint at http://192.168.68.111:6681

I should add that I tried with a 6k context, and KoboldCpp crashed in the same fashion as every other failure before it. I suspect that this might be a manifestation of that silent segfault you mentioned much earlier, so this might help you and Occ4m?

@LostRuins
Owner

I am inclined to think that it's crashing simply because it is unable to allocate enough memory (running OOM), and there's just no indication that that's the reason. Regardless, if it works fine at smaller contexts but not larger ones, you can try offloading fewer layers first.
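
For example, from the command line (the flag spellings here are assumed from the keys shown in the Namespace dumps earlier in this thread, and the layer count and context size are only illustrative starting points, not tuned values):

koboldcpp_vulkan.exe --usevulkan 0 --gpulayers 20 --contextsize 4096 --model D:/AI-Art-tools/Models/Text_Generation/neuralbeagle14-7b.Q4_K_S.gguf

If that loads, you can walk the layer count and context back up until it fails again.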

@Thellton
Author

Thellton commented Jan 29, 2024

It seems it is silently OOMing, as you've hypothesised. I'm sitting at 7.2 GB of VRAM usage in Vulkan with 33 layers and 4k context, whilst with OpenCL I'm sitting at 5.9 GB of VRAM usage with 33 layers and 16k context. Reducing the number of layers allows the context size to be increased, up to a maximum of 12k with 0 layers on the GPU. When trying for 16k context with 33 layers in VRAM under Vulkan and watching the VRAM usage chart in Task Manager, it shows a sudden spike to roughly 6 GB of VRAM before crashing, but no doubt it's spiking higher for a moment, causing the OOM. I think this is rather open and shut now, in conclusion.
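
For what it's worth, the KV-cache figures in the logs line up with the standard f16 KV-cache formula, which supports the OOM reading: at 16k context the cache alone is 2 GiB, on top of roughly 3.9 GiB of weights in the Vulkan buffer, leaving little headroom on an 8 GB card once compute buffers are added. A quick sketch of that arithmetic (plain Python, not KoboldCpp code; the parameter values are the ones printed by llm_load_print_meta above):

def kv_cache_mib(n_ctx, n_layer=32, n_embd_kv=1024, bytes_per_elem=2):
    # K and V each hold n_ctx * n_embd_kv f16 values per layer (2 bytes each).
    return 2 * n_layer * n_ctx * n_embd_kv * bytes_per_elem / (1024 ** 2)

print(kv_cache_mib(16384))  # 2048.0 -> matches "KV self size = 2048.00 MiB" at 16k
print(kv_cache_mib(4176))   # 522.0  -> matches the successful 4k (4176-token) launch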

So, thank you for your help, LostRuins... I kind of feel this has been an unhelpful wild goose chase over an already-known issue :/

@jojorne

jojorne commented Feb 4, 2024

I am inclined to think that it's crashing simply because it is unable to allocate enough memory (running OOM), and there's just no indication that that's the reason. Regardless, if it works fine at smaller contexts but not larger ones, you can try offloading fewer layers first.

On my phone it said exactly that, but on Windows it just says "Windows Error 0xc00000ff". lol

It works with:

  • 0 gpulayers
  • 2048 contextsize

@LostRuins
Owner

Should be fixed now in v1.57; if it fails because of OOM, it should now announce it instead of crashing silently.
