I'm not sure if this is the correct place to post this issue, as it could be an upstream issue, but here's hoping.
Hardware used
CPU: Ryzen 5 5600G
GPU: RX6600XT (Driver Version: 23.30.13.01-231128a-398226C-AMD-Software-Adrenalin-Edition)
RAM: 47.9GB of DDR4 at 2133MHz
Motherboard: Gigabyte B450M Aorus Elite
Hopefully the above is useful, but the below should absolutely be useful. I tried a full offload as below using my normal specification of 41 layers, tried a second time with only 10 layers specified, then tried a third time with 33 layers specified, i.e. equal to the actual model's number of layers. Something odd I did notice is that
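For reference, the equivalent command line should look roughly like the following. I've reconstructed it from the Namespace dump in the output below rather than copying my actual invocation, so treat the exact flag spellings as approximate:

:: Run 1 - full offload attempt (settings taken from the Namespace dump below; flag names inferred, so approximate)
koboldcpp_vulkan.exe --model "D:/AI-Art-tools/Models/Text_Generation/neuralbeagle14-7b.Q4_K_S.gguf" ^
  --usevulkan 0 --gpulayers 41 --contextsize 16384 --blasbatchsize 2048 ^
  --threads 5 --blasthreads 5 --highpriority --smartcontext --noshift ^
  --ropeconfig 1.0 32000.0 --host 192.168.68.111 --port 6681 --launch
:: Runs 2 and 3 were identical except for --gpulayers 10 and --gpulayers 33 respectively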
CMD output
...\Kobold AI>koboldcpp_vulkan.exe
***
Welcome to KoboldCpp - Version 1.56
For command line arguments, please refer to --help
***
Setting process to Higher Priority - Use Caution
High Priority for Windows Set: Priority.NORMAL_PRIORITY_CLASS to Priority.HIGH_PRIORITY_CLASS
Attempting to use Vulkan library for faster prompt ingestion. A compatible Vulkan will be required.
Initializing dynamic library: koboldcpp_vulkan.dll
==========
Namespace(bantokens=None, blasbatchsize=2048, blasthreads=5, config=None, contextsize=16384, debugmode=0, forceversion=0, foreground=False, gpulayers=41, highpriority=True, hordeconfig=None, host='192.168.68.111', launch=True, lora=None, model=None, model_param='D:/AI-Art-tools/Models/Text_Generation/neuralbeagle14-7b.Q4_K_S.gguf', multiuser=0, noavx2=False, noblas=False, nommap=False, noshift=True, onready='', port=5001, port_param=6681, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[1.0, 32000.0], skiplauncher=False, smartcontext=True, ssl=None, tensor_split=None, threads=5, useclblast=None, usecublas=None, usemlock=False, usevulkan=0)
==========
Loading model: D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf
[Threads: 5, BlasThreads: 5, SmartContext: True, ContextShift: False]
The reported GGUF Arch is: llama
---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using Custom RoPE scaling (scale:1.000, base:32000.0).
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_vulkan: Using AMD Radeon RX 6600 XT
ggml_vulkan: 16-bit enabled
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 3.86 GiB (4.57 BPW)
llm_load_print_meta: general.name = mlabonne_neuralbeagle14-7b
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: Vulkan buffer size = 3877.55 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: freq_base = 32000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Vulkan KV buffer size = 2048.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: Vulkan_Host input buffer size = 160.08 MiB
Traceback (most recent call last):
File "koboldcpp.py", line 2580, in <module>
File "koboldcpp.py", line 2426, in main
File "koboldcpp.py", line 328, in load_model
OSError: [WinError -1073741569] Windows Error 0xc00000ff
[14452] Failed to execute script 'koboldcpp' due to unhandled exception!
D:\AI-Art-tools\Kobold AI>koboldcpp_vulkan.exe
***
Welcome to KoboldCpp - Version 1.56
For command line arguments, please refer to --help
***
Setting process to Higher Priority - Use Caution
High Priority for Windows Set: Priority.NORMAL_PRIORITY_CLASS to Priority.HIGH_PRIORITY_CLASS
Attempting to use Vulkan library for faster prompt ingestion. A compatible Vulkan will be required.
Initializing dynamic library: koboldcpp_vulkan.dll
==========
Namespace(bantokens=None, blasbatchsize=2048, blasthreads=5, config=None, contextsize=16384, debugmode=0, forceversion=0, foreground=False, gpulayers=10, highpriority=True, hordeconfig=None, host='192.168.68.111', launch=True, lora=None, model=None, model_param='D:/AI-Art-tools/Models/Text_Generation/neuralbeagle14-7b.Q4_K_S.gguf', multiuser=0, noavx2=False, noblas=False, nommap=False, noshift=True, onready='', port=5001, port_param=6681, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[1.0, 32000.0], skiplauncher=False, smartcontext=True, ssl=None, tensor_split=None, threads=5, useclblast=None, usecublas=None, usemlock=False, usevulkan=0)
==========
Loading model: D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf
[Threads: 5, BlasThreads: 5, SmartContext: True, ContextShift: False]
The reported GGUF Arch is: llama
---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using Custom RoPE scaling (scale:1.000, base:32000.0).
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_vulkan: Using AMD Radeon RX 6600 XT
ggml_vulkan: 16-bit enabled
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 3.86 GiB (4.57 BPW)
llm_load_print_meta: general.name = mlabonne_neuralbeagle14-7b
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 10 repeating layers to GPU
llm_load_tensors: offloaded 10/33 layers to GPU
llm_load_tensors: CPU buffer size = 3947.87 MiB
llm_load_tensors: Vulkan buffer size = 1170.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: freq_base = 32000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Vulkan_Host KV buffer size = 1408.00 MiB
llama_kv_cache_init: Vulkan KV buffer size = 640.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: Vulkan_Host input buffer size = 160.08 MiB
Traceback (most recent call last):
File "koboldcpp.py", line 2580, in <module>
File "koboldcpp.py", line 2426, in main
File "koboldcpp.py", line 328, in load_model
OSError: [WinError -1073741569] Windows Error 0xc00000ff
[10284] Failed to execute script 'koboldcpp' due to unhandled exception!
D:\AI-Art-tools\Kobold AI>koboldcpp_vulkan.exe
***
Welcome to KoboldCpp - Version 1.56
For command line arguments, please refer to --help
***
Setting process to Higher Priority - Use Caution
High Priority for Windows Set: Priority.NORMAL_PRIORITY_CLASS to Priority.HIGH_PRIORITY_CLASS
Attempting to use Vulkan library for faster prompt ingestion. A compatible Vulkan will be required.
Initializing dynamic library: koboldcpp_vulkan.dll
==========
Namespace(bantokens=None, blasbatchsize=2048, blasthreads=5, config=None, contextsize=16384, debugmode=0, forceversion=0, foreground=False, gpulayers=33, highpriority=True, hordeconfig=None, host='192.168.68.111', launch=True, lora=None, model=None, model_param='D:/AI-Art-tools/Models/Text_Generation/neuralbeagle14-7b.Q4_K_S.gguf', multiuser=0, noavx2=False, noblas=False, nommap=False, noshift=True, onready='', port=5001, port_param=6681, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[1.0, 32000.0], skiplauncher=False, smartcontext=True, ssl=None, tensor_split=None, threads=5, useclblast=None, usecublas=None, usemlock=False, usevulkan=0)
==========
Loading model: D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf
[Threads: 5, BlasThreads: 5, SmartContext: True, ContextShift: False]
The reported GGUF Arch is: llama
---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using Custom RoPE scaling (scale:1.000, base:32000.0).
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_vulkan: Using AMD Radeon RX 6600 XT
ggml_vulkan: 16-bit enabled
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from D:\AI-Art-tools\Models\Text_Generation\neuralbeagle14-7b.Q4_K_S.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 3.86 GiB (4.57 BPW)
llm_load_print_meta: general.name = mlabonne_neuralbeagle14-7b
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: Vulkan buffer size = 3877.55 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: freq_base = 32000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Vulkan KV buffer size = 2048.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: Vulkan_Host input buffer size = 160.08 MiB
Traceback (most recent call last):
File "koboldcpp.py", line 2580, in <module>
File "koboldcpp.py", line 2426, in main
File "koboldcpp.py", line 328, in load_model
OSError: [WinError -1073741569] Windows Error 0xc00000ff
[8336] Failed to execute script 'koboldcpp' due to unhandled exception!
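In case it helps narrow things down: the 2048.00 MiB KV cache the log reports at 16384 context is consistent with the printed model parameters, so the numbers themselves look sane. A minimal sketch of that arithmetic (my own check in Python, not code from koboldcpp.py):

# Sanity check of "KV self size = 2048.00 MiB" from the log, using the printed parameters
n_ctx, n_layer = 16384, 32            # llama_new_context_with_model / llm_load_print_meta
n_embd_k_gqa = n_embd_v_gqa = 1024    # llm_load_print_meta
bytes_f16 = 2                         # K and V are stored as f16
kv_bytes = n_ctx * (n_embd_k_gqa + n_embd_v_gqa) * n_layer * bytes_f16
print(kv_bytes / (1024 ** 2))         # -> 2048.0 MiB, matching the log

Combined with the ~3877 MiB Vulkan weights buffer in the full-offload runs, that should still fit on the card, but I wanted to note it in case the failure turns out to be memory-related.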