Description
Expected Behavior
A model split across 2x RTX 3090 plus one or two P40s loads and runs (the load call is sketched after the log below):
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 2192.00 MB
llama_new_context_with_model: kv self size = 2192.00 MB
llama_build_graph: non-view tensors processed: 3155/3155
llama_new_context_with_model: compute buffer total size = 574.63 MB
llama_new_context_with_model: VRAM scratch buffer: 568.00 MB
llama_new_context_with_model: total VRAM used: 65972.68 MB (model: 63212.67 MB, context: 2760.00 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
2023-11-16 12:33:35 INFO:Loaded the model in 136.40 seconds.
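For reference, the load goes through llama-cpp-python roughly like this; the model path and tensor split below are placeholders rather than my exact values:

```python
from llama_cpp import Llama

# Placeholder path/split; the real run is a 120B/180B GGUF spread across
# 2x RTX 3090 + 1-2x P40 with every layer offloaded, as the log shows.
llm = Llama(
    model_path="/models/placeholder-180b.gguf",  # hypothetical path
    n_ctx=4096,                          # matches n_ctx in the log
    n_gpu_layers=200,                    # > 140, so all layers are offloaded
    tensor_split=[0.3, 0.3, 0.2, 0.2],   # rough 3090/3090/P40/P40 split
    mul_mat_q=True,                      # MMQ kernels
    verbose=True,
)
print(llm("Hello", max_tokens=8)["choices"][0]["text"])
```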
Current Behavior
The same model fails to load with an error:
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required = 141.08 MB
llm_load_tensors: offloading 137 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 140/140 layers to GPU
llm_load_tensors: VRAM used: 63212.67 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 2192.00 MB
llama_new_context_with_model: kv self size = 2192.00 MB
ggml_new_object: not enough space in the context's memory pool (needed 1638880, available 1638544)
Segmentation fault (core dumped)
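The failing allocation misses by only a few hundred bytes, which looks more like the context's memory pool being sized slightly too small for this many layers than a genuine shortage; a quick check of the numbers from the log:

```python
# Numbers taken directly from the ggml_new_object error above.
needed = 1_638_880
available = 1_638_544
print(f"shortfall: {needed - available} bytes")  # shortfall: 336 bytes
```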
Failure Information (for bugs)
I'm mainly using the Python bindings: v2.17 works and v2.18 doesn't, with the same settings. When I try to load a 180B or 120B model, this is what I get. I have more than enough VRAM, but for some reason it dies on CPU RAM even though the model has already been loaded.
I tried NUMA and mlock to no avail. This build uses the MMQ kernels, so nothing there should have changed.
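The mlock/NUMA attempts were just variations on the same load call, roughly like this (placeholder model path again):

```python
from llama_cpp import Llama

variants = [
    {},                                  # baseline
    {"use_mlock": True},                 # pin model pages in host RAM
    {"numa": True},                      # NUMA-aware allocation
    {"use_mlock": True, "numa": True},
]
for extra in variants:
    # On v2.18 each of these still hits the memory-pool error / segfault.
    Llama(
        model_path="/models/placeholder-180b.gguf",  # hypothetical path
        n_ctx=4096,
        n_gpu_layers=200,  # full offload
        **extra,
    )
```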
The last commit it was working on was df9d129.
I tried reverting 1cf2850 manually, but that wasn't it.
I will also try today's commits and update this issue with the results. I ruled out the Python wrapper as the cause by building the v2.18 wrapper against a llama.cpp revision that is known to work.
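To narrow down the offending commit between df9d129 and the current tip, something like git bisect run with a small loader script should do it. This is only a sketch, assuming llama-cpp-python is rebuilt against each checked-out llama.cpp revision; it runs the load in a child interpreter so a segfault becomes a clean "bad" exit code instead of killing the bisect driver:

```python
#!/usr/bin/env python3
"""bisect_check.py - exit 0 if the model loads, 1 if it errors or crashes.

Usage (after rebuilding the bindings against the checked-out revision):
    git bisect start <bad-commit> df9d129
    git bisect run python bisect_check.py
"""
import subprocess
import sys

MODEL = "/models/placeholder-180b.gguf"  # hypothetical path

CHILD = f"""
from llama_cpp import Llama
Llama(model_path={MODEL!r}, n_ctx=4096, n_gpu_layers=200)
"""

# Load in a separate interpreter: a segfault there shows up as a negative
# return code here, which we map to 1 so `git bisect run` treats the
# revision as bad instead of aborting on an unexpected exit status.
result = subprocess.run([sys.executable, "-c", CHILD])
sys.exit(0 if result.returncode == 0 else 1)
```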