Description
Expected Behavior
A model split across 2x RTX 3090 plus one or two P40s loads and runs (the load call is sketched after the log below):
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 2192.00 MB
llama_new_context_with_model: kv self size = 2192.00 MB
llama_build_graph: non-view tensors processed: 3155/3155
llama_new_context_with_model: compute buffer total size = 574.63 MB
llama_new_context_with_model: VRAM scratch buffer: 568.00 MB
llama_new_context_with_model: total VRAM used: 65972.68 MB (model: 63212.67 MB, context: 2760.00 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
2023-11-16 12:33:35 INFO:Loaded the model in 136.40 seconds.
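For reference, the load goes through llama-cpp-python roughly like this; the model path and tensor split below are placeholders rather than my exact values:

```python
from llama_cpp import Llama

# Placeholder path/split; the real run is a 120B/180B GGUF spread across
# 2x RTX 3090 + 1-2x P40 with every layer offloaded, as the log shows.
llm = Llama(
    model_path="/models/placeholder-180b.gguf",  # hypothetical path
    n_ctx=4096,                          # matches n_ctx in the log
    n_gpu_layers=200,                    # > 140, so all layers are offloaded
    tensor_split=[0.3, 0.3, 0.2, 0.2],   # rough 3090/3090/P40/P40 split
    mul_mat_q=True,                      # MMQ kernels
    verbose=True,
)
print(llm("Hello", max_tokens=8)["choices"][0]["text"])
```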
Current Behavior
The same model fails to load with an error:
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required = 141.08 MB
llm_load_tensors: offloading 137 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 140/140 layers to GPU
llm_load_tensors: VRAM used: 63212.67 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 2192.00 MB
llama_new_context_with_model: kv self size = 2192.00 MB
ggml_new_object: not enough space in the context's memory pool (needed 1638880, available 1638544)
Segmentation fault (core dumped)
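The failing allocation misses by only a few hundred bytes, which looks more like the context's memory pool being sized slightly too small for this many layers than a genuine shortage; a quick check of the numbers from the log:

```python
# Numbers taken directly from the ggml_new_object error above.
needed = 1_638_880
available = 1_638_544
print(f"shortfall: {needed - available} bytes")  # shortfall: 336 bytes
```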
Failure Information (for bugs)
I'm mainly using the Python bindings: v2.17 works and v2.18 doesn't, with the same settings. When I try to load a 180B or 120B model, this is what I get. I have more than enough VRAM, but for some reason it dies on CPU RAM even though the model has already been loaded.
I tried NUMA and mlock to no avail. This build uses the MMQ kernels, so nothing there should have changed.
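The mlock/NUMA attempts were just variations on the same load call, roughly like this (placeholder model path again):

```python
from llama_cpp import Llama

variants = [
    {},                                  # baseline
    {"use_mlock": True},                 # pin model pages in host RAM
    {"numa": True},                      # NUMA-aware allocation
    {"use_mlock": True, "numa": True},
]
for extra in variants:
    # On v2.18 each of these still hits the memory-pool error / segfault.
    Llama(
        model_path="/models/placeholder-180b.gguf",  # hypothetical path
        n_ctx=4096,
        n_gpu_layers=200,  # full offload
        **extra,
    )
```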
The last commit it was working on was df9d129.
I tried reverting 1cf2850 manually, but that wasn't it.
I will also try today's commits and update this issue with the results. I ruled out the Python wrapper as the cause by building the v2.18 wrapper against a llama.cpp revision that is known to work.
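To narrow down the offending commit between df9d129 and the current tip, something like git bisect run with a small loader script should do it. This is only a sketch, assuming llama-cpp-python is rebuilt against each checked-out llama.cpp revision; it runs the load in a child interpreter so a segfault becomes a clean "bad" exit code instead of killing the bisect driver:

```python
#!/usr/bin/env python3
"""bisect_check.py - exit 0 if the model loads, 1 if it errors or crashes.

Usage (after rebuilding the bindings against the checked-out revision):
    git bisect start <bad-commit> df9d129
    git bisect run python bisect_check.py
"""
import subprocess
import sys

MODEL = "/models/placeholder-180b.gguf"  # hypothetical path

CHILD = f"""
from llama_cpp import Llama
Llama(model_path={MODEL!r}, n_ctx=4096, n_gpu_layers=200)
"""

# Load in a separate interpreter: a segfault there shows up as a negative
# return code here, which we map to 1 so `git bisect run` treats the
# revision as bad instead of aborting on an unexpected exit status.
result = subprocess.run([sys.executable, "-c", CHILD])
sys.exit(0 if result.returncode == 0 else 1)
```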