GGML_ASSERT(id >= 0 && id < n_expert) failed when using rowsplit across experts #1821

@gpez-git

Description

Describe the Issue
Row-splitting across experts in 1.100.1 fails with GGML_ASSERT(id >= 0 && id < n_expert). Running the same command without rowsplit works fine.

tl;dr I'm trying to split the expert tensors across my 4 GPUs, and since combining tensor overrides with tensor split is broken (#1794), I resorted to a similar method that @LostRuins suggested in the #1794 thread. However, this fails with GGML_ASSERT(id >= 0 && id < n_expert) on model load. The intended layout is listed below, with a short sketch of how the override patterns resolve right after it.

CUDA0: P40 (24 GB) - tensor_split 50% of the attn/dense layers (--gpulayers maxed) plus the expert tensors of the first two overridden layers (blk.3 and blk.4)
CUDA1: P40 (24 GB) - tensor_split 50% of the attn/dense layers (--gpulayers maxed) plus the expert tensors of the next two layers (blk.5 and blk.6)
CUDA2: P40 (24 GB) - no attn/dense layers, expert tensors for 8 layers (blk.7 through blk.14)
CUDA3: 4060 Ti (16 GB) - expert tensors for 5 layers (blk.15 through blk.19)
CPU: expert tensors for all remaining layers (blk.20 through blk.91)
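
For clarity, here is a small sketch (plain Python, not koboldcpp code) of how each --overridetensors regex in the command below is meant to map an expert tensor name to a device. The patterns are copied from the command line; the sample tensor names are made up for illustration.

```python
import re

# Patterns copied from the --overridetensors argument below; the assumption
# here is that the first matching pattern wins. The trailing "exps" entry is
# the catch-all that keeps every remaining expert tensor on the CPU.
overrides = [
    (r"blk\.([3-4])\.ffn_(up|gate|down)_exps", "CUDA0"),
    (r"blk\.([5-6])\.ffn_(up|gate|down)_exps", "CUDA1"),
    (r"blk\.([7-9]|[1][0-4])\.ffn_(up|gate|down)_exps", "CUDA2"),
    (r"blk\.([1][5-9])\.ffn_(up|gate|down)_exps", "CUDA3"),
    (r"exps", "CPU"),
]

def device_for(tensor_name: str) -> str:
    """Return the device of the first pattern that matches the tensor name."""
    for pattern, device in overrides:
        if re.search(pattern, tensor_name):
            return device
    return "default split"

for name in ("blk.3.ffn_up_exps.weight",      # -> CUDA0
             "blk.14.ffn_down_exps.weight",   # -> CUDA2
             "blk.20.ffn_gate_exps.weight"):  # -> CPU (catch-all)
    print(f"{name} -> {device_for(name)}")
```

The load log further down confirms this mapping (blk.3-4 to CUDA0, blk.5-6 to CUDA1, blk.7-14 to CUDA2, blk.15-19 to CUDA3, blk.20-91 to CUDA_Host), so the overrides resolve as intended and the assert only fires later, during warm-up.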

Additional Information:
Windows, Xeon E5-2690 v4, 320 GB RAM, 3x Tesla P40 24 GB, 1x RTX 4060 Ti 16 GB

F:\gpt\koboldcpp>"koboldcpp - 1.100.1.exe" --tensor_split 1 1 0 0 --usecublas rowsplit --gpulayers 9999 --contextsize 32768 --overridetensors "blk\.([3-4])\.ffn_(up|gate|down)_exps=CUDA0,blk\.([5-6])\.ffn_(up|gate|down)_exps=CUDA1,blk\.([7-9]|[1][0-4])\.ffn_(up|gate|down)_exps=CUDA2,blk\.([1][5-9])\.ffn_(up|gate|down)_exps=CUDA3,exps=CPU"
***
Welcome to KoboldCpp - Version 1.100.1
For command line arguments, please refer to --help
***
Loading Chat Completions Adapter: C:\Users\{user}\AppData\Local\Temp\_MEI256922\kcpp_adapters\AutoGuess.json
Chat Completions Adapter Loaded
System: Windows 10.0.19045 AMD64 Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
Detected Available GPU Memory: 16380 MB
Detected Available RAM: 311691 MB
Initializing dynamic library: koboldcpp_cublas.dll
==========
Namespace(admin=False, admindir='', adminpassword=None, analyze='', benchmark=None, blasbatchsize=512, blasthreads=0, chatcompletionsadapter='AutoGuess', cli=False, config=None, contextsize=32768, debugmode=0, defaultgenamt=768, draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel='', embeddingsgpu=False, embeddingsmaxctx=0, embeddingsmodel='', enableguidance=False, exportconfig='', exporttemplate='', failsafe=False, flashattention=False, forceversion=0, foreground=False, genlimit=0, gpulayers=9999, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, launch=False, lora=None, loramult=1.0, lowvram=False, maingpu=-1, maxrequestsize=32, mmproj='', mmprojcpu=False, model=[], model_param='F:/gpt/text-generation-webui/models/Try/GLM4.6/GLM-4.6-UD-Q5_K_XL-00001-of-00006.gguf', moecpu=0, moeexperts=-1, multiplayer=False, multiuser=1, noavx2=False, noblas=False, nobostoken=False, nocertify=False, nofastforward=False, nommap=False, nomodel=False, noshift=False, onready='', overridekv='', overridenativecontext=0, overridetensors='blk\\.([3-4])\\.ffn_(up|gate|down)_exps=CUDA0,blk\\.([5-6])\\.ffn_(up|gate|down)_exps=CUDA1,blk\\.([7-9]|[1][0-4])\\.ffn_(up|gate|down)_exps=CUDA2,blk\\.([1][5-9])\\.ffn_(up|gate|down)_exps=CUDA3,exps=CPU', password=None, port=5001, port_param=5001, preloadstory='', prompt='', quantkv=0, quiet=False, ratelimit=0, remotetunnel=False, ropeconfig=[0.0, 10000.0], savedatafile='', sdclamped=0, sdclampedsoft=0, sdclip1='', sdclip2='', sdclipgpu=False, sdconfig=None, sdconvdirect='off', sdflashattention=False, sdgendefaults='', sdlora='', sdloramult=1.0, sdmodel='', sdnotile=False, sdoffloadcpu=False, sdphotomaker='', sdquant=0, sdt5xxl='', sdthreads=0, sdtiledvae=768, sdvae='', sdvaeauto=False, sdvaecpu=False, showgui=False, singleinstance=False, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=[1.0, 1.0, 0.0, 0.0], threads=8, ttsgpu=False, ttsmaxlen=4096, ttsmodel='', ttsthreads=0, ttswavtokenizer='', unpack='', useclblast=None, usecpu=False, usecuda=['rowsplit'], usemlock=False, usemmap=False, useswa=False, usevulkan=None, version=False, visionmaxres=1024, websearch=False, whispermodel='')
==========
Loading Text Model: F:\gpt\text-generation-webui\models\Try\GLM4.6\GLM-4.6-UD-Q5_K_XL-00001-of-00006.gguf

The reported GGUF Arch is: glm4moe
Arch Category: 9

---
Identified as GGUF model.
Attempting to Load...
---
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
CUDA MMQ: True

Applying Tensor Split...
---
Initializing CUDA/HIP, please wait, the following step may take a few minutes (only for first launch)...
---
ggml_cuda_init: found 4 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1, VMM: no
  Device 1: Tesla P40, compute capability 6.1, VMM: no
  Device 2: Tesla P40, compute capability 6.1, VMM: no
  Device 3: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
Handling Override Tensors for backends: CUDA0 CUDA1 CUDA2 CUDA3 CPU

Override Tensor: blk\.([3-4])\.ffn_(up|gate|down)_exps to CUDA0
Override Tensor: blk\.([5-6])\.ffn_(up|gate|down)_exps to CUDA1
Override Tensor: blk\.([7-9]|[1][0-4])\.ffn_(up|gate|down)_exps to CUDA2
Override Tensor: blk\.([1][5-9])\.ffn_(up|gate|down)_exps to CUDA3
Override Tensor: exps to CPU
llama_model_load_from_file_impl: using device CUDA0 (Tesla P40) (0000:02:00.0) - 24319 MiB free
llama_model_load_from_file_impl: using device CUDA1 (Tesla P40) (0000:03:00.0) - 24319 MiB free
llama_model_load_from_file_impl: using device CUDA2 (Tesla P40) (0000:81:00.0) - 24319 MiB free
llama_model_load_from_file_impl: using device CUDA3 (NVIDIA GeForce RTX 4060 Ti) (0000:82:00.0) - 15225 MiB free
llama_model_loader: additional 5 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 57 key-value pairs and 1759 tensors from F:\gpt\text-generation-webui\models\Try\GLM4.6\GLM-4.6-UD-Q5_K_XL-00001-of-00006.gguf (version GGUF V3 (latest))
print_info: file format = GGUF V3 (latest)
print_info: file size   = 234.74 GiB (5.65 BPW)
init_tokenizer: initializing tokenizer for type 2
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 151329 ('<|endoftext|>')
load:   - 151336 ('<|user|>')
load:   - 151338 ('<|observation|>')
load: special tokens cache size = 36
load: token to piece cache size = 0.9713 MB
print_info: arch             = glm4moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 202752
print_info: n_embd           = 5120
print_info: n_layer          = 93
print_info: n_head           = 96
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 12
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 12288
print_info: n_expert         = 160
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 202752
print_info: rope_finetuned   = unknown
print_info: model type       = 355B.A32B
print_info: model params     = 356.79 B
print_info: general.name     = Glm-4.6
print_info: vocab type       = BPE
print_info: n_vocab          = 151552
print_info: n_merges         = 318088
print_info: BOS token        = 151331 '[gMASK]'
print_info: EOS token        = 151329 '<|endoftext|>'
print_info: EOT token        = 151336 '<|user|>'
print_info: EOM token        = 151338 '<|observation|>'
print_info: UNK token        = 151329 '<|endoftext|>'
print_info: PAD token        = 151330 '[MASK]'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151347 '<|code_prefix|>'
print_info: FIM SUF token    = 151349 '<|code_suffix|>'
print_info: FIM MID token    = 151348 '<|code_middle|>'
print_info: EOG token        = 151329 '<|endoftext|>'
print_info: EOG token        = 151336 '<|user|>'
print_info: EOG token        = 151338 '<|observation|>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = false)
tensor blk.3.ffn_gate_exps.weight (675 MiB q4_K) buffer type overridden to CUDA0
tensor blk.3.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA0
tensor blk.3.ffn_up_exps.weight (675 MiB q4_K) buffer type overridden to CUDA0
tensor blk.4.ffn_gate_exps.weight (675 MiB q4_K) buffer type overridden to CUDA0
tensor blk.4.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA0
tensor blk.4.ffn_up_exps.weight (675 MiB q4_K) buffer type overridden to CUDA0
tensor blk.5.ffn_gate_exps.weight (675 MiB q4_K) buffer type overridden to CUDA1
tensor blk.5.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA1
tensor blk.5.ffn_up_exps.weight (675 MiB q4_K) buffer type overridden to CUDA1
tensor blk.6.ffn_gate_exps.weight (675 MiB q4_K) buffer type overridden to CUDA1
tensor blk.6.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA1
tensor blk.6.ffn_up_exps.weight (675 MiB q4_K) buffer type overridden to CUDA1
tensor blk.7.ffn_gate_exps.weight (675 MiB q4_K) buffer type overridden to CUDA2
tensor blk.7.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.7.ffn_up_exps.weight (675 MiB q4_K) buffer type overridden to CUDA2
tensor blk.8.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.8.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA2
tensor blk.8.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.9.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.9.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA2
tensor blk.9.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.10.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.10.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA2
tensor blk.10.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.11.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.11.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.11.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.12.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.12.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.12.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.13.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.13.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA2
tensor blk.13.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.14.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.14.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.14.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA2
tensor blk.15.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.15.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.15.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.16.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.16.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA3
tensor blk.16.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.17.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.17.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.17.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.18.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.18.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.18.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.19.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.19.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA3
tensor blk.19.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA3
tensor blk.20.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.20.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.20.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.21.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.21.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.21.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.22.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.22.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.22.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.23.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.23.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.23.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.24.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.24.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.24.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.25.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.25.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.25.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.26.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.26.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.26.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.27.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.27.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.27.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.28.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.28.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.28.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.29.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.29.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.29.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.30.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.30.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.30.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.31.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.31.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.31.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.32.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.32.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.32.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.33.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.33.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.33.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.34.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.34.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.34.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.35.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.35.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.35.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.36.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.36.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.36.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.37.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.37.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.37.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.38.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.38.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.38.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.39.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.39.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.39.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.40.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.40.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.40.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.41.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.41.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.41.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.42.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.42.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.42.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.43.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.43.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.43.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.44.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.44.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.44.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.45.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.45.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.45.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.46.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.46.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.46.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.47.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.47.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.47.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.48.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.48.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.48.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.49.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.49.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.49.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.50.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.50.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.50.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.51.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.51.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.51.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.52.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.52.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.52.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.53.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.53.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.53.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.54.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.54.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.54.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.55.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.55.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.55.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.56.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.56.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.56.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.57.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.57.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.57.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.58.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.58.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.58.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.59.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.59.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.59.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.60.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.60.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.60.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.61.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.61.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.61.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.62.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.62.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.62.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.63.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.63.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.63.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.64.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.64.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.64.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.65.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.65.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.65.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.66.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.66.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.66.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.67.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.67.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.67.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.68.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.68.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.68.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.69.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.69.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.69.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.70.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.70.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.70.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.71.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.71.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.71.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.72.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.72.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.72.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.73.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.73.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.73.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.74.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.74.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.74.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.75.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.75.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.75.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.76.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.76.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.76.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.77.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.77.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.77.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.78.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.78.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.78.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.79.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.79.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.79.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.80.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.80.ffn_down_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.80.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.81.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.81.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.81.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.82.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.82.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.82.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.83.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.83.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.83.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.84.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.84.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.84.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.85.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.85.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.85.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.86.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.86.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.86.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.87.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.87.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.87.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.88.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.88.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.88.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.89.ffn_gate_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.89.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.89.ffn_up_exps.weight (825 MiB q5_K) buffer type overridden to CUDA_Host
tensor blk.90.ffn_gate_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.90.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.90.ffn_up_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.91.ffn_gate_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.91.ffn_down_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
tensor blk.91.ffn_up_exps.weight (984 MiB q6_K) buffer type overridden to CUDA_Host
model has unused tensor blk.92.attn_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.attn_q.weight (size = 43253760 bytes) -- ignoring
model has unused tensor blk.92.attn_k.weight (size = 3604480 bytes) -- ignoring
model has unused tensor blk.92.attn_v.weight (size = 4300800 bytes) -- ignoring
model has unused tensor blk.92.attn_q.bias (size = 49152 bytes) -- ignoring
model has unused tensor blk.92.attn_k.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_v.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_output.weight (size = 43253760 bytes) -- ignoring
model has unused tensor blk.92.attn_q_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.attn_k_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.post_attention_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_inp.weight (size = 3276800 bytes) -- ignoring
model has unused tensor blk.92.exp_probs_b.bias (size = 640 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_exps.weight (size = 865075200 bytes) -- ignoring
model has unused tensor blk.92.ffn_down_exps.weight (size = 1032192000 bytes) -- ignoring
model has unused tensor blk.92.ffn_up_exps.weight (size = 865075200 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_shexp.weight (size = 8355840 bytes) -- ignoring
model has unused tensor blk.92.ffn_down_shexp.weight (size = 8355840 bytes) -- ignoring
model has unused tensor blk.92.ffn_up_shexp.weight (size = 8355840 bytes) -- ignoring
model has unused tensor blk.92.nextn.eh_proj.weight (size = 36044800 bytes) -- ignoring
model has unused tensor blk.92.nextn.enorm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.nextn.hnorm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.nextn.shared_head_norm.weight (size = 20480 bytes) -- ignoring
load_tensors: relocated tensors: 1002 of 1736
ggml_cuda_host_malloc: failed to allocate 183778.12 MiB of pinned memory: out of memory
load_tensors: offloading 93 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 94/94 layers to GPU
load_tensors:  CUDA1_Split model buffer size =  5892.46 MiB
load_tensors:  CUDA0_Split model buffer size =  5863.98 MiB
load_tensors:          CPU model buffer size =   508.75 MiB
load_tensors:          CPU model buffer size = 183778.12 MiB
load_tensors:        CUDA0 model buffer size =  4354.48 MiB
load_tensors:        CUDA1 model buffer size =  4354.31 MiB
load_tensors:        CUDA2 model buffer size = 20137.50 MiB
load_tensors:        CUDA3 model buffer size = 12693.75 MiB
..................................................................................load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
..load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1
..load_all_data: using async uploads for device CUDA2, buffer type CUDA2, backend CUDA2
........load_all_data: using async uploads for device CUDA3, buffer type CUDA3, backend CUDA3
......
Automatic RoPE Scaling: Using model internal value.
llama_init_from_model: model default pooling_type is [0], but [-1] was specified
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 32896
llama_context: n_ctx_per_seq = 32896
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = disabled
llama_context: kv_unified    = true
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (32896) < n_ctx_train (202752) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
create_memory: n_ctx = 32896 (padded)
llama_kv_cache: layer  92: does not have KV cache
llama_kv_cache:      CUDA0 KV buffer size =  6039.50 MiB
llama_kv_cache:      CUDA1 KV buffer size =  5782.50 MiB
llama_kv_cache: size = 11822.00 MiB ( 32896 cells,  92 layers,  1/1 seqs), K (f16): 5911.00 MiB, V (f16): 5911.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 5
llama_context: max_nodes = 13888
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
llama_context:      CUDA0 compute buffer size =  6314.26 MiB
llama_context:      CUDA1 compute buffer size =  6314.26 MiB
llama_context:      CUDA2 compute buffer size =   162.33 MiB
llama_context:      CUDA3 compute buffer size =   162.33 MiB
llama_context:  CUDA_Host compute buffer size =    78.26 MiB
llama_context: graph nodes  = 6988
llama_context: graph splits = 294 (with bs=512), 177 (with bs=1)
Threadpool set to 8 threads and 8 blasthreads...
attach_threadpool: call
GLM-4 will have no automatic BOS token.
Starting model warm up, please wait a moment...
D:\a\koboldcpp\koboldcpp\ggml\src\ggml-backend.cpp:1496: GGML_ASSERT(id >= 0 && id < n_expert) failed
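
For context (not part of the original log): the failed assertion at ggml-backend.cpp:1496 is a bounds check on an expert index against the model's expert count, which is 160 for this model (see n_expert in the metadata above). A minimal, hypothetical sketch of the invariant being enforced, not the actual ggml source:

```python
# Illustrative only: every expert id consumed during graph evaluation must
# fall inside [0, n_expert). n_expert comes from the model metadata above;
# the id values here are hypothetical.
n_expert = 160

def check_expert_ids(ids: list[int]) -> None:
    for expert_id in ids:
        # Equivalent of GGML_ASSERT(id >= 0 && id < n_expert)
        assert 0 <= expert_id < n_expert, \
            f"expert id {expert_id} out of range [0, {n_expert})"

check_expert_ids([3, 17, 159])   # passes
# check_expert_ids([160])        # would trip the same assertion
```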
