
Model doesn't load on >2 GPUs anymore. Says ggml_new_object: not enough space in the context's memory pool #4114

Closed
@Ph0rk0z

Description


Expected Behavior

Model loaded to 2x3090 + 1 or 2 P40 loads and functions:

llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 2192.00 MB
llama_new_context_with_model: kv self size  = 2192.00 MB
llama_build_graph: non-view tensors processed: 3155/3155
llama_new_context_with_model: compute buffer total size = 574.63 MB
llama_new_context_with_model: VRAM scratch buffer: 568.00 MB
llama_new_context_with_model: total VRAM used: 65972.68 MB (model: 63212.67 MB, context: 2760.00 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
2023-11-16 12:33:35 INFO:Loaded the model in 136.40 seconds.

Current Behavior

Model fails with an error:


ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required  =  141.08 MB
llm_load_tensors: offloading 137 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 140/140 layers to GPU
llm_load_tensors: VRAM used: 63212.67 MB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 2192.00 MB
llama_new_context_with_model: kv self size  = 2192.00 MB
ggml_new_object: not enough space in the context's memory pool (needed 1638880, available 1638544)
Segmentation fault (core dumped)

Failure Information (for bugs)

I'm mainly using the Python bindings: v2.17 works and v2.18 doesn't, with the same settings. I try to load a 180B or 120B model and this is what I get. I have more than enough VRAM, but for some reason it dies in CPU RAM despite the model already being loaded. Note how close the numbers are: the pool falls short by only 336 bytes (1638880 needed vs. 1638544 available), so this looks like an allocation-size miscalculation rather than an actual out-of-memory condition.
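For context, the load is just the standard constructor call. A minimal sketch, assuming the llama-cpp-python `Llama` API; the model path, `tensor_split` values, and prompt are placeholders, not my exact settings:

```python
from llama_cpp import Llama

# Minimal sketch of the failing load, assuming the llama-cpp-python API.
# Path and split are placeholders, not the reporter's exact settings.
llm = Llama(
    model_path="/models/placeholder-180b.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,                         # matches n_ctx in the logs above
    n_gpu_layers=-1,                    # offload all layers, as in the logs
    tensor_split=[0.3, 0.3, 0.2, 0.2],  # 2x 3090 + 2x P40 (illustrative split)
)
print(llm("Hello", max_tokens=8))
```

With v2.17 this call completes; with v2.18 it dies during context creation with the pool error above.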

I tried numa and mlock to no avail. This is using the MMQ kernels, so nothing there should have changed.
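Those attempts look like this — the same sketch as above, assuming the bindings expose these options as the `use_mlock` and `numa` keyword arguments (the exact parameter names are an assumption):

```python
# Same placeholder setup as above, with the attempted workarounds added;
# neither flag changed the outcome.
llm = Llama(
    model_path="/models/placeholder-180b.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,
    n_gpu_layers=-1,
    tensor_split=[0.3, 0.3, 0.2, 0.2],
    use_mlock=True,  # lock the model's pages in RAM
    numa=True,       # enable NUMA optimizations
)
```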

The last commit it was working on was df9d129.

I tried reverting 1cf2850 manually, but that wasn't it.

I will also try with today's commits and update this issue with what happens. I eliminated the Python wrapper's own code as a cause by running the v2.18 wrapper against a llama.cpp revision that was known to work.
