
High RAM usage when offloading to GPU layers. #35

Closed
Mradr opened this issue Jun 27, 2023 · 2 comments

Mradr commented Jun 27, 2023

Just updated my GPU from a 2080 to a 3090, and man does it make things go brrrr lol.

Anyways, I noticed a new strange behavior when I did. Instead of the model + GPU taking close to what the model alone took in system RAM, it now takes almost double the system RAM. When increasing the offload from, say, 8 to 100 GPU layers with the model wizardLM-13B-Uncensored.ggmlv3.q4_0.bin, I jump from 6-7 GB to almost 12-14 GB of system RAM, and even more as I increase the number of GPU layers further. I was under the impression that the more GPU layers I offload, the less system memory it should be using, not more?

    # uses ctransformers: from ctransformers import AutoModelForCausalLM
    def load_chat_model(self, model="wizardLM-13B-Uncensored.ggmlv3.q4_0.bin"):
        self.gptj = AutoModelForCausalLM.from_pretrained(
            f'models/{model}',
            model_type='llama',        # ggml model type (e.g. 'mpt', 'llama')
            reset=True,                # reset model state before each generation
            threads=1,                 # CPU threads
            gpu_layers=100,            # number of layers to offload to the GPU
            context_length=2048,       # context window (e.g. 8192, 2048)
            batch_size=2048,           # prompt-processing batch size
            temperature=0.65,
            repetition_penalty=1.1,
        )

While I have the RAM for it, it just seems very strange that it's now using even more system RAM than before.

While not 100% related (I could simply be doing something wrong with the settings), I had another issue back when I offloaded to the GPU on my 2080: things were slow. The fix was to increase batch_size, and that did improve performance even on just 8 layers. Changing the batch size in this case doesn't seem to change memory usage much; only gpu_layers seems to be the issue.
As noted in #27, I don't seem to get "out of memory" errors when I increase the GPU layers - it will just OOM if I go past too many layers for my GPU's VRAM, relying on the system threads instead.

ctransformers 0.2.10
Windows 11
3090
CUDA supported
Python 3.10
32GB of RAM

| Config | RAM after load + message | System RAM before loading | Difference |
| --- | --- | --- | --- |
| CPU only (threads = 8) | 14.0 GB | 7.3 GB | ~7 GB |
| 1 thread + GPU (threads = 1, gpu_layers = 50) | 20.9 GB | 7.3 GB | ~13 GB |

With a little more testing, I see it scales up by roughly an extra 5 GB of system RAM before it caps out, increasing a little per layer between 1 and 50 layers. It almost seems like it's not releasing the "work load" that it was planning on sending to the GPU.
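
Here is a minimal sketch of how such a measurement could be reproduced (an illustration only, not part of the original report; it assumes psutil is installed and uses the same load settings as above, ideally testing each gpu_layers value in a fresh process):

    # hypothetical measurement script: record process RSS before/after loading
    import psutil
    from ctransformers import AutoModelForCausalLM

    def rss_gb():
        # resident set size of this process, in GB
        return psutil.Process().memory_info().rss / 1024**3

    before = rss_gb()
    llm = AutoModelForCausalLM.from_pretrained(
        "models/wizardLM-13B-Uncensored.ggmlv3.q4_0.bin",
        model_type="llama",
        threads=1,
        gpu_layers=50,          # vary this (e.g. 0, 8, 25, 50) across runs
        context_length=2048,
        batch_size=2048,
    )
    llm("Hello", max_new_tokens=8)   # one short message so buffers get allocated
    print(f"RSS: {before:.1f} GB -> {rss_gb():.1f} GB")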


marella commented Jul 3, 2023

Can you please try running the same config using the latest llama.cpp binary?

Also please try with a specific commit of llama.cpp:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout b24c304

# build and run
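# for example (these exact commands are an assumption, not from the original comment):
# build with cuBLAS, then run main with settings matching the ctransformers config
make LLAMA_CUBLAS=1
./main -m models/wizardLM-13B-Uncensored.ggmlv3.q4_0.bin \
    -t 1 -ngl 50 -c 2048 -b 2048 -p "Hello"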

Please let me know whether you are seeing a similar issue with the binary.


marella commented Aug 7, 2023

There have been many changes to llama.cpp in the past few weeks, so hopefully this issue is resolved now.
Please try with the latest version, and if you are still facing the issue, feel free to re-open.
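
For example, an upgrade along the lines of:

pip install --upgrade ctransformers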

marella closed this as completed on Aug 7, 2023