Describe the Issue
Row-splitting experts in 1.100.1 fails with GGML_ASSERT(id >= 0 && id < n_expert). Running the same command without rowsplit works fine.
tl;dr: I'm trying to split the expert tensors across my 4 GPUs, and since combining --overridetensors with --tensor_split is broken (#1794) I resorted to a similar method that @LostRuins suggested in the #1794 thread. However, this fails with GGML_ASSERT(id >= 0 && id < n_expert) during model warm-up, right after load. The intended layout is as follows (the override regexes are sanity-checked in the sketch after this list):
CUDA0: P40 (24 GB) - tensor_split 50% of the attn/dense layers (--gpulayers maxed) plus the expert tensors of layers 3-4
CUDA1: P40 (24 GB) - tensor_split 50% of the attn/dense layers (--gpulayers maxed) plus the expert tensors of layers 5-6
CUDA2: P40 (24 GB) - no attn/dense layers, expert tensors of layers 7-14
CUDA3: 4060 Ti (16 GB) - expert tensors of layers 15-19
CPU: expert tensors of all remaining layers (20-91)
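For reference, here is a small standalone Python sketch (not part of koboldcpp, just a check I can describe) that applies the same override regexes to the expert tensor names and prints which device each layer resolves to. It assumes first-match-wins precedence, which is what the load log below shows (layers 3-19 land on CUDA0-CUDA3 even though the catch-all "exps=CPU" also matches them). The patterns resolve to the intended layout, which suggests the failure is not in the pattern matching itself.

```python
import re

# Same patterns as passed to --overridetensors, in the same order.
overrides = [
    (r"blk\.([3-4])\.ffn_(up|gate|down)_exps", "CUDA0"),
    (r"blk\.([5-6])\.ffn_(up|gate|down)_exps", "CUDA1"),
    (r"blk\.([7-9]|[1][0-4])\.ffn_(up|gate|down)_exps", "CUDA2"),
    (r"blk\.([1][5-9])\.ffn_(up|gate|down)_exps", "CUDA3"),
    (r"exps", "CPU"),
]

placement = {}
for layer in range(3, 92):                 # GLM-4.6 expert layers 3..91
    for part in ("up", "gate", "down"):
        name = f"blk.{layer}.ffn_{part}_exps.weight"
        for pattern, device in overrides:  # assumption: first match wins
            if re.search(pattern, name):
                placement.setdefault(device, set()).add(layer)
                break

for device, layers in placement.items():
    print(device, sorted(layers))
# Expected: CUDA0 [3, 4], CUDA1 [5, 6], CUDA2 [7..14], CUDA3 [15..19], CPU [20..91]
```

This matches the "buffer type overridden to ..." lines in the log below, so the placement itself happens as intended; the assert only fires once inference starts during warm-up.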
Additional Information:
Windows, E5-2690v4, 320 GB RAM, 3x P40, 1x 4060 Ti 16 GB
F:\gpt\koboldcpp>"koboldcpp - 1.100.1.exe" --tensor_split 1 1 0 0 --usecublas rowsplit --gpulayers 9999 --contextsize 32768 --overridetensors "blk\.([3-4])\.ffn_(up|gate|down)_exps=CUDA0,blk\.([5-6])\.ffn_(up|gate|down)_exps=CUDA1,blk\.([7-9]|[1][0-4])\.ffn_(up|gate|down)_exps=CUDA2,blk\.([1][5-9])\.ffn_(up|gate|down)_exps=CUDA3,exps=CPU"
***
Welcome to KoboldCpp - Version 1.100.1
For command line arguments, please refer to --help
***
Loading Chat Completions Adapter: C:\Users\{user}\AppData\Local\Temp\_MEI256922\kcpp_adapters\AutoGuess.json
Chat Completions Adapter Loaded
System: Windows 10.0.19045 AMD64 Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
Detected Available GPU Memory: 16380 MB
Detected Available RAM: 311691 MB
Initializing dynamic library: koboldcpp_cublas.dll
==========
Namespace(admin=False, admindir='', adminpassword=None, analyze='', benchmark=None, blasbatchsize=512, blasthreads=0, chatcompletionsadapter='AutoGuess', cli=False, config=None, contextsize=32768, debugmode=0, defaultgenamt=768, draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel='', embeddingsgpu=False, embeddingsmaxctx=0, embeddingsmodel='', enableguidance=False, exportconfig='', exporttemplate='', failsafe=False, flashattention=False, forceversion=0, foreground=False, genlimit=0, gpulayers=9999, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, launch=False, lora=None, loramult=1.0, lowvram=False, maingpu=-1, maxrequestsize=32, mmproj='', mmprojcpu=False, model=[], model_param='F:/gpt/text-generation-webui/models/Try/GLM4.6/GLM-4.6-UD-Q5_K_XL-00001-of-00006.gguf', moecpu=0, moeexperts=-1, multiplayer=False, multiuser=1, noavx2=False, noblas=False, nobostoken=False, nocertify=False, nofastforward=False, nommap=False, nomodel=False, noshift=False, onready='', overridekv='', overridenativecontext=0, overridetensors='blk\\.([3-4])\\.ffn_(up|gate|down)_exps=CUDA0,blk\\.([5-6])\\.ffn_(up|gate|down)_exps=CUDA1,blk\\.([7-9]|[1][0-4])\\.ffn_(up|gate|down)_exps=CUDA2,blk\\.([1][5-9])\\.ffn_(up|gate|down)_exps=CUDA3,exps=CPU', password=None, port=5001, port_param=5001, preloadstory='', prompt='', quantkv=0, quiet=False, ratelimit=0, remotetunnel=False, ropeconfig=[0.0, 10000.0], savedatafile='', sdclamped=0, sdclampedsoft=0, sdclip1='', sdclip2='', sdclipgpu=False, sdconfig=None, sdconvdirect='off', sdflashattention=False, sdgendefaults='', sdlora='', sdloramult=1.0, sdmodel='', sdnotile=False, sdoffloadcpu=False, sdphotomaker='', sdquant=0, sdt5xxl='', sdthreads=0, sdtiledvae=768, sdvae='', sdvaeauto=False, sdvaecpu=False, showgui=False, singleinstance=False, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=[1.0, 1.0, 0.0, 0.0], threads=8, ttsgpu=False, ttsmaxlen=4096, ttsmodel='', ttsthreads=0, ttswavtokenizer='', unpack='', useclblast=None, usecpu=False, usecuda=['rowsplit'], usemlock=False, usemmap=False, useswa=False, usevulkan=None, version=False, visionmaxres=1024, websearch=False, whispermodel='')
==========
Loading Text Model: F:\gpt\text-generation-webui\models\Try\GLM4.6\GLM-4.6-UD-Q5_K_XL-00001-of-00006.gguf
The reported GGUF Arch is: glm4moe
Arch Category: 9
---
Identified as GGUF model.
Attempting to Load...
---
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
CUDA MMQ: True
Applying Tensor Split...
---
Initializing CUDA/HIP, please wait, the following step may take a few minutes (only for first launch)...
---
ggml_cuda_init: found 4 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: no
Device 1: Tesla P40, compute capability 6.1, VMM: no
Device 2: Tesla P40, compute capability 6.1, VMM: no
Device 3: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
Handling Override Tensors for backends: CUDA0 CUDA1 CUDA2 CUDA3 CPU
Override Tensor: blk\.([3-4])\.ffn_(up|gate|down)_exps to CUDA0
Override Tensor: blk\.([5-6])\.ffn_(up|gate|down)_exps to CUDA1
Override Tensor: blk\.([7-9]|[1][0-4])\.ffn_(up|gate|down)_exps to CUDA2
Override Tensor: blk\.([1][5-9])\.ffn_(up|gate|down)_exps to CUDA3
Override Tensor: exps to CPU
llama_model_load_from_file_impl: using device CUDA0 (Tesla P40) (0000:02:00.0) - 24319 MiB free
llama_model_load_from_file_impl: using device CUDA1 (Tesla P40) (0000:03:00.0) - 24319 MiB free
llama_model_load_from_file_impl: using device CUDA2 (Tesla P40) (0000:81:00.0) - 24319 MiB free
llama_model_load_from_file_impl: using device CUDA3 (NVIDIA GeForce RTX 4060 Ti) (0000:82:00.0) - 15225 MiB free
llama_model_loader: additional 5 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 57 key-value pairs and 1759 tensors from F:\gpt\text-generation-webui\models\Try\GLM4.6\GLM-4.6-UD-Q5_K_XL-00001-of-00006.gguf (version GGUF V3 (latest))
print_info: file format = GGUF V3 (latest)
print_info: file size = 234.74 GiB (5.65 BPW)
init_tokenizer: initializing tokenizer for type 2
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 151329 ('<|endoftext|>')
load: - 151336 ('<|user|>')
load: - 151338 ('<|observation|>')
load: special tokens cache size = 36
load: token to piece cache size = 0.9713 MB
print_info: arch = glm4moe
print_info: vocab_only = 0
print_info: n_ctx_train = 202752
print_info: n_embd = 5120
print_info: n_layer = 93
print_info: n_head = 96
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 12
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 160
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 202752
print_info: rope_finetuned = unknown
print_info: model type = 355B.A32B
print_info: model params = 356.79 B
print_info: general.name = Glm-4.6
print_info: vocab type = BPE
print_info: n_vocab = 151552
print_info: n_merges = 318088
print_info: BOS token = 151331 '[gMASK]'
print_info: EOS token = 151329 '<|endoftext|>'
print_info: EOT token = 151336 '<|user|>'
print_info: EOM token = 151338 '<|observation|>'
print_info: UNK token = 151329 '<|endoftext|>'
print_info: PAD token = 151330 '[MASK]'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151347 '<|code_prefix|>'
print_info: FIM SUF token = 151349 '<|code_suffix|>'
print_info: FIM MID token = 151348 '<|code_middle|>'
print_info: EOG token = 151329 '<|endoftext|>'
print_info: EOG token = 151336 '<|user|>'
print_info: EOG token = 151338 '<|observation|>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = false)
tensor blk.3.ffn_gate_exps.weight (675 MiB q4_K) buffer type overridden to CUDA0
tensor blk.3.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA0
tensor blk.3.ffn_up_exps.weight (675 MiB q4_K) buffer type overridden to CUDA0
tensor blk.4.ffn_gate_exps.weight (675 MiB q4_K) buffer type overridden to CUDA0
tensor blk.4.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA0
tensor blk.4.ffn_up_exps.weight (675 MiB q4_K) buffer type overridden to CUDA0
tensor blk.5.ffn_gate_exps.weight (675 MiB q4_K) buffer type overridden to CUDA1
tensor blk.5.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA1
tensor blk.5.ffn_up_exps.weight (675 MiB q4_K) buffer type overridden to CUDA1
tensor blk.6.ffn_gate_exps.weight (675 MiB q4_K) buffer type overridden to CUDA1
tensor blk.6.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA1
tensor blk.6.ffn_up_exps.weight (675 MiB q4_K) buffer type overridden to CUDA1
tensor blk.7.ffn_gate_exps.weight (675 MiB q4_K) buffer type overridden to CUDA2
tensor blk.7.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.7.ffn_up_exps.weight (675 MiB q4_K) buffer type overridden to CUDA2
tensor blk.8.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.8.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA2
tensor blk.8.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.9.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.9.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA2
tensor blk.9.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.10.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.10.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA2
tensor blk.10.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.11.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.11.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.11.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.12.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.12.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.12.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.13.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.13.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA2
tensor blk.13.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.14.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.14.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.14.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.15.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.15.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.15.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.16.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.16.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA3
tensor blk.16.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.17.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.17.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.17.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.18.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.18.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.18.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.19.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.19.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA3
tensor blk.19.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.20.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.20.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.20.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.21.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.21.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.21.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.22.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.22.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.22.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.23.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.23.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.23.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.24.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.24.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.24.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.25.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.25.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.25.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.26.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.26.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.26.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.27.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.27.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.27.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.28.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.28.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.28.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.29.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.29.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.29.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.30.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.30.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.30.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.31.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.31.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.31.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.32.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.32.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.32.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.33.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.33.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.33.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.34.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.34.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.34.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.35.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.35.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.35.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.36.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.36.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.36.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.37.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.37.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.37.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.38.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.38.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.38.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.39.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.39.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.39.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.40.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.40.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.40.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.41.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.41.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.41.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.42.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.42.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.42.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.43.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.43.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.43.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.44.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.44.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.44.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.45.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.45.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.45.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.46.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.46.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.46.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.47.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.47.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.47.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.48.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.48.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.48.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.49.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.49.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.49.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.50.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.50.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.50.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.51.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.51.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.51.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.52.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.52.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.52.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.53.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.53.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.53.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.54.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.54.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.54.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.55.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.55.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.55.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.56.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.56.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.56.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.57.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.57.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.57.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.58.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.58.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.58.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.59.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.59.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.59.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.60.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.60.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.60.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.61.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.61.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.61.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.62.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.62.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.62.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.63.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.63.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.63.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.64.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.64.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.64.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.65.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.65.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.65.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.66.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.66.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.66.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.67.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.67.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.67.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.68.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.68.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.68.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.69.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.69.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.69.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.70.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.70.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.70.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.71.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.71.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.71.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.72.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.72.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.72.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.73.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.73.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.73.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.74.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.74.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.74.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.75.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.75.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.75.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.76.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.76.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.76.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.77.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.77.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.77.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.78.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.78.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.78.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.79.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.79.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.79.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.80.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.80.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.80.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.81.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.81.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.81.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.82.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.82.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.82.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.83.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.83.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.83.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.84.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.84.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.84.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.85.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.85.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.85.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.86.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.86.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.86.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.87.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.87.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.87.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.88.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.88.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.88.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.89.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.89.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.89.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.90.ffn_gate_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.90.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.90.ffn_up_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.91.ffn_gate_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.91.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.91.ffn_up_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
model has unused tensor blk.92.attn_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.attn_q.weight (size = 43253760 bytes) -- ignoring
model has unused tensor blk.92.attn_k.weight (size = 3604480 bytes) -- ignoring
model has unused tensor blk.92.attn_v.weight (size = 4300800 bytes) -- ignoring
model has unused tensor blk.92.attn_q.bias (size = 49152 bytes) -- ignoring
model has unused tensor blk.92.attn_k.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_v.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_output.weight (size = 43253760 bytes) -- ignoring
model has unused tensor blk.92.attn_q_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.attn_k_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.post_attention_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_inp.weight (size = 3276800 bytes) -- ignoring
model has unused tensor blk.92.exp_probs_b.bias (size = 640 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_exps.weight (size = 865075200 bytes) -- ignoring
model has unused tensor blk.92.ffn_down_exps.weight (size = 1032192000 bytes) -- ignoring
model has unused tensor blk.92.ffn_up_exps.weight (size = 865075200 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_shexp.weight (size = 8355840 bytes) -- ignoring
model has unused tensor blk.92.ffn_down_shexp.weight (size = 8355840 bytes) -- ignoring
model has unused tensor blk.92.ffn_up_shexp.weight (size = 8355840 bytes) -- ignoring
model has unused tensor blk.92.nextn.eh_proj.weight (size = 36044800 bytes) -- ignoring
model has unused tensor blk.92.nextn.enorm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.nextn.hnorm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.nextn.shared_head_norm.weight (size = 20480 bytes) -- ignoring
load_tensors: relocated tensors: 1002 of 1736
ggml_cuda_host_malloc: failed to allocate 183778.12 MiB of pinned memory: out of memory
load_tensors: offloading 93 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 94/94 layers to GPU
load_tensors: CUDA1_Split model buffer size = 5892.46 MiB
load_tensors: CUDA0_Split model buffer size = 5863.98 MiB
load_tensors: CPU model buffer size = 508.75 MiB
load_tensors: CPU model buffer size = 183778.12 MiB
load_tensors: CUDA0 model buffer size = 4354.48 MiB
load_tensors: CUDA1 model buffer size = 4354.31 MiB
load_tensors: CUDA2 model buffer size = 20137.50 MiB
load_tensors: CUDA3 model buffer size = 12693.75 MiB
..................................................................................load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
..load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1
..load_all_data: using async uploads for device CUDA2, buffer type CUDA2, backend CUDA2
........load_all_data: using async uploads for device CUDA3, buffer type CUDA3, backend CUDA3
......
Automatic RoPE Scaling: Using model internal value.
llama_init_from_model: model default pooling_type is [0], but [-1] was specified
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 32896
llama_context: n_ctx_per_seq = 32896
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = disabled
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (32896) < n_ctx_train (202752) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: CUDA_Host output buffer size = 0.58 MiB
create_memory: n_ctx = 32896 (padded)
llama_kv_cache: layer 92: does not have KV cache
llama_kv_cache: CUDA0 KV buffer size = 6039.50 MiB
llama_kv_cache: CUDA1 KV buffer size = 5782.50 MiB
llama_kv_cache: size = 11822.00 MiB ( 32896 cells, 92 layers, 1/1 seqs), K (f16): 5911.00 MiB, V (f16): 5911.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 5
llama_context: max_nodes = 13888
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
llama_context: CUDA0 compute buffer size = 6314.26 MiB
llama_context: CUDA1 compute buffer size = 6314.26 MiB
llama_context: CUDA2 compute buffer size = 162.33 MiB
llama_context: CUDA3 compute buffer size = 162.33 MiB
llama_context: CUDA_Host compute buffer size = 78.26 MiB
llama_context: graph nodes = 6988
llama_context: graph splits = 294 (with bs=512), 177 (with bs=1)
Threadpool set to 8 threads and 8 blasthreads...
attach_threadpool: call
GLM-4 will have no automatic BOS token.
Starting model warm up, please wait a moment...
D:\a\koboldcpp\koboldcpp\ggml\src\ggml-backend.cpp:1496: GGML_ASSERT(id >= 0 && id < n_expert) failed