
High RAM usage when offloading to GPU layers. #35

Closed
Mradr opened this issue Jun 27, 2023 · 2 comments

Mradr commented Jun 27, 2023

Just updated my GPU from a 2080 to a 3090, and man does it make things go brrrr lol.

Anyways, I noticed a new strange behavior when I did. Instead of the model + GPU taking close to what the model alone took in system RAM, it now takes almost double the system RAM. When increasing the offload from, say, 8 to 100 GPU layers with the model wizardLM-13B-Uncensored.ggmlv3.q4_0.bin, I jump from 6-7 GB to almost 12-14 GB of system RAM, and even more as I increase the number of GPU layers further. I was under the impression that the more GPU layers I offload, the less system memory it should be using, not more?

    # uses ctransformers: from ctransformers import AutoModelForCausalLM
    def load_chat_model(self, model="wizardLM-13B-Uncensored.ggmlv3.q4_0.bin"):
        self.gptj = AutoModelForCausalLM.from_pretrained(
            f'models/{model}',
            model_type='llama',        # ggml model type (e.g. 'mpt', 'llama')
            reset=True,                # reset model state before each generation
            threads=1,                 # CPU threads
            gpu_layers=100,            # number of layers to offload to the GPU
            context_length=2048,       # context window (e.g. 8192, 2048)
            batch_size=2048,           # prompt-processing batch size
            temperature=0.65,
            repetition_penalty=1.1,
        )

While I have the RAM for it, it just seems very strange that it's now using even more system RAM than before.

While not 100% related (I could simply be doing something wrong with the settings), I had another issue back when I offloaded to the GPU on my 2080: things were slow. The fix was to increase batch_size, and that did improve performance even on just 8 layers. Changing the batch size in this case doesn't seem to change memory usage much; only gpu_layers seems to be the issue.
As noted in #27, I don't seem to get "out of memory" errors when I increase the GPU layers - it will just OOM if I go past too many layers for my GPU's VRAM, relying on the system threads instead.

ctransformers 0.2.10
Windows 11
3090
CUDA supported
Python 3.10
32GB of RAM

| Config | RAM after load + message | System RAM before loading | Difference |
| --- | --- | --- | --- |
| CPU only (threads = 8) | 14.0 GB | 7.3 GB | ~7 GB |
| 1 thread + GPU (threads = 1, gpu_layers = 50) | 20.9 GB | 7.3 GB | ~13 GB |

With a little more testing, I see it scales up by roughly an extra 5 GB of system RAM before it caps out, increasing a little per layer between 1 and 50 layers. It almost seems like it's not releasing the "work load" that it was planning on sending to the GPU.
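
Here is a minimal sketch of how such a measurement could be reproduced (an illustration only, not part of the original report; it assumes psutil is installed and uses the same load settings as above, ideally testing each gpu_layers value in a fresh process):

    # hypothetical measurement script: record process RSS before/after loading
    import psutil
    from ctransformers import AutoModelForCausalLM

    def rss_gb():
        # resident set size of this process, in GB
        return psutil.Process().memory_info().rss / 1024**3

    before = rss_gb()
    llm = AutoModelForCausalLM.from_pretrained(
        "models/wizardLM-13B-Uncensored.ggmlv3.q4_0.bin",
        model_type="llama",
        threads=1,
        gpu_layers=50,          # vary this (e.g. 0, 8, 25, 50) across runs
        context_length=2048,
        batch_size=2048,
    )
    llm("Hello", max_new_tokens=8)   # one short message so buffers get allocated
    print(f"RSS: {before:.1f} GB -> {rss_gb():.1f} GB")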


marella commented Jul 3, 2023

Can you please try running the same config using the latest llama.cpp binary?

Also please try with a specific commit of llama.cpp:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout b24c304

# build and run
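# for example (these exact commands are an assumption, not from the original comment):
# build with cuBLAS, then run main with settings matching the ctransformers config
make LLAMA_CUBLAS=1
./main -m models/wizardLM-13B-Uncensored.ggmlv3.q4_0.bin \
    -t 1 -ngl 50 -c 2048 -b 2048 -p "Hello"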

Please let me know whether you are seeing a similar issue with the binary.


marella commented Aug 7, 2023

There have been many changes to llama.cpp in the past few weeks, so hopefully this issue is resolved now.
Please try with the latest version, and if you are still facing the issue, feel free to re-open.
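
For example, an upgrade along the lines of:

pip install --upgrade ctransformers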

marella closed this as completed on Aug 7, 2023