
Loading models directly into VRAM, norm calculation on GPUs, broadcasting for ggml_mul #1483

Merged · 35 commits · May 20, 2023

Conversation

JohannesGaessler (Collaborator) commented on May 16, 2023

Currently my implementation of GPU acceleration has inefficient memory management: weights are loaded into RAM and then copied to VRAM. My current goal is to fix this. This PR is a step towards that goal.

Ideally, parameters would be loaded directly from disk into VRAM. However, currently not all parameters can be kept in VRAM: between the weight matrices there are two norms per layer, and keeping only some of a layer's weights in VRAM would make the memory management more complicated, whereas keeping all weights of a layer in VRAM keeps the implementation simple. So this PR implements GPU acceleration for norms (or rather for ggml_mul), even though this is of rather low priority in and of itself; for full GPU acceleration we would have needed it eventually anyway.

On master the norms are first repeated and then multiplied with another tensor. To keep the CUDA code simpler I have extended ggml_mul (for both CPU and CUDA) to allow broadcasting: if src1 is smaller than src0 in some dimensions, its values are repeated during the multiplication. That way a separate repeat CUDA kernel is not needed, and the ggml graph is also a little smaller, which in theory reduces overhead.
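To illustrate the broadcasting semantics, here is a minimal standalone sketch (not the actual ggml code; real ggml tensors are 4-dimensional and strided, this 2D version only shows the indexing idea):

```c
#include <stddef.h>

// Broadcasting multiplication: dimensions of src1 that are smaller than the
// corresponding dimensions of src0 are repeated by indexing them modulo their
// size, so no separate "repeat" step is needed.
static void mul_broadcast_2d(const float * src0, const float * src1, float * dst,
                             size_t ne00, size_t ne01,   // src0/dst dimensions
                             size_t ne10, size_t ne11) { // src1 dimensions (each <= src0's)
    for (size_t i1 = 0; i1 < ne01; ++i1) {
        for (size_t i0 = 0; i0 < ne00; ++i0) {
            dst[i1*ne00 + i0] = src0[i1*ne00 + i0] * src1[(i1 % ne11)*ne10 + (i0 % ne10)];
        }
    }
}
```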

There don't seem to be significant performance differences for generating tokens:

| CPU | RAM | GPU | Model | -ngl | ms/t master | ms/t ggml_mul broadcast | ms/t ggml_mul broadcast + GPU norms |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 3700X | 3200 MHz | RTX 3090 | 7b q4_0 | 0 | 115.14 | 115.08 | 115.00 |
| 3700X | 3200 MHz | RTX 3090 | 7b q4_0 | 33 | 25.24 | 24.51 | 25.30 |
| i5-4570S | 1600 MHz | GTX 1070 | 7b q4_0 | 0 | 205.39 | 201.77 | 201.66 |
| i5-4570S | 1600 MHz | GTX 1070 | 7b q4_0 | 33 | 71.33 | 71.22 | 72.36 |

JohannesGaessler added the enhancement (New feature or request) label on May 16, 2023

slaren (Collaborator) commented on May 16, 2023

The CUDA code looks fine, but I would suggest some changes to avoid an explosion of #ifdef GGML_USE_CUBLAS everywhere:

  • Instead of changing the behavior of ggml_mul for CUDA only, add the same auto-broadcast behavior to the CPU implementation and remove the ifdefs from llama.cpp. If that's not possible, I would prefer implementing ggml_repeat in CUDA.
  • Instead of hooking the CUDA stuff in ggml_compute_forward_mul_f32, do it in a generic way in ggml_compute_forward. I am thinking of something like this at the start of the function:
    #ifdef GGML_USE_CUBLAS
    if (ggml_cuda_compute_forward(params, tensor)) {
        return;
    }
    #endif
    Then in the future, when you implement more operations in CUDA, you only need to change ggml_cuda_compute_forward without adding more changes to ggml.c (a rough sketch of such a dispatcher follows below). You could also remove the hooks in ggml_compute_forward_mul_mat_* in this way.
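A rough sketch of what such a dispatcher could look like (hypothetical; the ggml_cuda_* helpers and their exact signatures are assumptions, only the op enum values and tensor fields are ggml's):

```c
// Hypothetical sketch of ggml_cuda_compute_forward: returns true if the op was
// handled on the GPU, false to fall back to the existing CPU implementation.
// Intended as a fragment inside ggml.c / ggml-cuda.cu, not standalone code.
static bool ggml_cuda_compute_forward(struct ggml_compute_params * params, struct ggml_tensor * tensor) {
    switch (tensor->op) {
        case GGML_OP_MUL:
            ggml_cuda_mul(tensor->src0, tensor->src1, tensor);           // assumed helper
            return true;
        case GGML_OP_MUL_MAT:
            if (!ggml_cuda_can_mul_mat(tensor->src0, tensor->src1, tensor)) {
                return false;                                            // use the CPU path
            }
            ggml_cuda_mul_mat(tensor->src0, tensor->src1, tensor,
                              params->wdata, params->wsize);             // assumed helper
            return true;
        default:
            return false; // op not implemented on the GPU
    }
}
```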

@ggerganov may disagree about this, wait for his opinion.

JohannesGaessler (Collaborator, Author):

There is a CPU implementation for broadcasting of tensors in ggml_mul. It's just that I did not implement a version that utilizes hardware acceleration on CPUs, so I thought I would disable it by default.

ggerganov (Owner):

It's just that I did not implement a version that utilizes hardware acceleration on CPUs so I thought I would disable it by default.

Not sure what you mean by "hardware acceleration on CPUs" - do you mean it is not SIMD-fied? If yes, then it is not a problem.

I think the way you've implemented it is fine and it should be enabled by default on the CPU, as @slaren recommended.

Make sure this does not break the backward computation - run test-grad0 or baby-llama and see if they succeed.

Instead of hooking the CUDA stuff in ggml_compute_forward_mul_f32, do it in a generic way in ggml_compute_forward

Yes, this is better

JohannesGaessler (Collaborator, Author) commented on May 17, 2023

This is the output of baby-llama:

/home/johannesg/Projects/llama.cpp [git::9ca9b35 *] [johannesg@johannes-pc] [9:56]
> ./bin/baby-llama
init model
init_kv_cache
Example 1
error_before_opt: 4868.85
error_after_opt:  130.75
best samples after optimization:
 X      
      X 
       X
       X
       X
       X
     X  
    X   
Example 9
error_before_opt: 123.92
error_after_opt:  64.10
Example 17
error_before_opt: 111.71
error_after_opt:  68.46
Example 25
error_before_opt: 133.40
error_after_opt:  60.04
Example 33
error_before_opt: 113.35
error_after_opt:  62.68
Example 41
error_before_opt: 124.23
error_after_opt:  62.61
Example 49
error_before_opt: 99.31
error_after_opt:  56.44
Example 57
error_before_opt: 90.20
error_after_opt:  64.43
Example 65
error_before_opt: 120.12
error_after_opt:  67.72
best samples after optimization:
       X
       X
       X
     X  
    X   
   X    
 X      
 X      
Example 73
error_before_opt: 94.94
error_after_opt:  68.29
Example 81
error_before_opt: 112.22
error_after_opt:  72.07
Example 89
error_before_opt: 101.33
error_after_opt:  63.93
Example 97
error_before_opt: 105.65
error_after_opt:  66.76
Example 105
error_before_opt: 109.33
error_after_opt:  66.09
Example 113
error_before_opt: 106.19
error_after_opt:  75.02
Example 121
error_before_opt: 104.38
error_after_opt:  72.19
Example 129
error_before_opt: 98.16
error_after_opt:  60.29
best samples after optimization:
      X 
    X   
   X    
 X      
 X      
 X      
 X      
  X     
Example 137
error_before_opt: 105.82
error_after_opt:  74.33
Example 145
error_before_opt: 99.59
error_after_opt:  57.73
Example 153
error_before_opt: 112.17
error_after_opt:  70.00
Example 161
error_before_opt: 97.33
error_after_opt:  66.69
Example 169
error_before_opt: 118.72
error_after_opt:  69.92
Example 177
error_before_opt: 92.20
error_after_opt:  67.64
Example 185
error_before_opt: 90.10
error_after_opt:  66.69
Example 193
error_before_opt: 90.30
error_after_opt:  62.23
best samples after optimization:
 X      
 X      
 X      
 X      
  X     
   X    
    X   
     X  
Example 201
error_before_opt: 93.03
error_after_opt:  74.56
Example 209
error_before_opt: 122.55
error_after_opt:  71.63
Example 217
error_before_opt: 98.15
error_after_opt:  75.24
Example 225
error_before_opt: 94.61
error_after_opt:  70.88
Example 233
error_before_opt: 94.37
error_after_opt:  68.48
Example 241
error_before_opt: 100.92
error_after_opt:  85.19
Example 249
error_before_opt: 89.59
error_after_opt:  60.67
Generating 128 tokens.
X       
 X      
  X     
    X   
     X  
      X 
---
       X
       X
       X
      X 
     X  
    X   
  X     
 X      
 X      
 X      
 X      
  X     
   X    
     X  
      X 
       X
       X
       X
       X
      X 
     X  
    X   
  X     
 X      
 X      
 X      
 X      
  X     
   X    
     X  
      X 
       X
       X
       X
       X
      X 
     X  
    X   
  X     
 X      
 X      
 X      
 X      
  X     
   X    
     X  
      X 
       X
       X
       X
       X
      X 
     X  
    X   
  X     
 X      
 X      
 X      
 X      
  X     
   X    
     X  
      X 
       X
       X
       X
       X
      X 
     X  
    X   
  X     
 X      
 X      
 X      
 X      
  X     
   X    
     X  
      X 
       X
       X
       X
       X
      X 
     X  
    X   
  X     
 X      
 X      
 X      
 X      
  X     
   X    
     X  
      X 
       X
       X
       X
       X
      X 
     X  
    X   
  X     
 X      
 X      
 X      
 X      
  X     
   X    
     X  
      X 
       X
       X
       X
       X
      X 
     X  
    X   
  X     
 X      
 X      
 X      
 X      
  X     
   X    
     X  
      X 
       X
 1.16 -0.75 -0.90 -0.53 0.72 -1.50 -0.79 -0.18 -1.00 -1.34 1.47 0.87 -0.32 -2.30 -1.80 1.69 0.51 0.29 -1.09 -0.31 -1.48 1.18 0.92 -1.09 1.57 -0.85 0.10 0.57 -0.57 -1.88 -1.59 -0.40
 0.42 1.35 -0.54 1.07 -0.12 -0.96 0.93 -0.30 -1.19 0.39 -0.39 -0.39 -0.68 -1.41 1.50 -0.35 0.68 0.24 0.46 0.01 0.22 1.16 0.45 -0.41 0.71 0.51 0.63 0.05 0.83 0.94 -0.06 -3.29
 -0.40 -2.60 0.67 -0.56 0.58 -0.16 1.35 -0.00 -0.67 0.04 0.07 1.33 1.08 -0.03 0.71 0.41 -0.77 -0.45 0.96 0.91 0.31 -0.20 0.12 -0.17 -0.00 0.20 -1.55 -0.06 0.60 -1.69 -0.43 0.53
 1.04 0.41 0.34 1.05 0.65 -0.73 -0.60 0.81 0.46 1.07 0.52 -1.40 1.25 -0.69 1.63 0.44 -1.09 -1.85 0.78 -0.45 0.60 -0.90 -0.83 -1.14 0.86 -0.63 0.74 -0.15 -1.02 1.88 -0.98 -0.35
 -0.33 0.52 0.19 0.45 -0.60 1.04 1.64 0.92 0.82 1.13 -1.10 -0.74 2.15 -0.84 -0.84 1.04 -1.59 -0.19 -0.50 -0.97 0.15 1.19 -1.17 0.18 -0.36 1.92 0.11 -0.92 -1.15 -0.85 1.32 -0.05
 0.40 -0.35 0.69 1.11 0.10 0.01 1.79 -0.75 0.28 1.27 -0.18 -0.11 -1.27 0.41 -1.64 -1.24 1.61 -1.88 0.99 -0.36 -1.06 1.40 -1.52 -0.57 0.25 -0.86 -0.18 -1.09 -0.49 -0.73 -0.05 -0.19
 -0.75 0.52 -0.99 0.60 -0.53 -0.84 -1.39 -0.78 0.64 0.20 -0.93 0.15 -0.24 1.03 -2.22 0.36 1.30 -0.38 -0.50 -0.42 -0.66 -0.65 -0.42 -0.39 -0.47 1.11 0.65 -0.45 -0.66 1.64 -0.82 1.14
 0.04 0.93 1.67 -0.59 0.35 0.74 1.29 -0.67 -1.61 0.13 -0.57 -1.23 0.11 -0.26 1.02 0.52 1.76 0.98 -0.73 0.68 0.07 -1.05 0.24 0.41 1.06 0.23 0.87 0.98 -1.44 1.44 1.34 1.82
done

I think this is what it's supposed to look like?

JohannesGaessler (Collaborator, Author):

I've quickly tried moving the calls to CUDA functions into a new function ggml_cuda_compute_forward that mirrors ggml_compute_forward, and then calling that function from ggml_compute_forward, but this is causing issues with multithreading. The CUDA functions only replace the GGML_TASK_COMPUTE parts but are still reliant on the CPU implementation for GGML_TASK_INIT and GGML_TASK_FINALIZE. Overall it seems like it would be kind of tricky to implement. Can we postpone moving the calls to CUDA functions to ggml_compute_forward until we have more than two of them?
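To make the problem concrete, the hook would have to be phase-aware, roughly like this (illustrative only; the GGML_TASK_* values and params->type are ggml's, the rest is a sketch, not the actual code):

```c
// Each worker thread calls ggml_compute_forward once per task phase. A CUDA
// hook that only covers the COMPUTE phase must still let the CPU code run the
// INIT and FINALIZE phases of the same op, which is what makes this tricky.
static void ggml_compute_forward(struct ggml_compute_params * params, struct ggml_tensor * tensor) {
#ifdef GGML_USE_CUBLAS
    if (params->type == GGML_TASK_COMPUTE && ggml_cuda_compute_forward(params, tensor)) {
        return; // GPU handled the compute phase of this op
    }
    // INIT, FINALIZE, and unsupported ops fall through to the CPU code below
#endif
    // ... existing CPU dispatch on tensor->op ...
}
```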

JohannesGaessler (Collaborator, Author):

With the new broadcasting for ggml_mul building on macOS seems to fail. I don't have any Apple devices on which I could debug this; should I just disable it by default again?

Green-Sky (Collaborator):

With the new broadcasting for ggml_mul building on macOS seems to fail. I don't have any Apple devices on which I could debug this; should I just disable it by default again?

You can still check the CI build logs.

JohannesGaessler (Collaborator, Author):

I know but that won't tell me anything about whether or not my implementation produces correct results.

JohannesGaessler changed the title from "Norm calculation on GPUs, broadcasting for ggml_mul" to "Loading models directly into VRAM, norm calculation on GPUs, broadcasting for ggml_mul" on May 17, 2023

JohannesGaessler (Collaborator, Author) commented on May 17, 2023

I implemented loading models directly into VRAM and pushed it to this PR; sorry for the feature creep. I'm getting 1.7 t/s for 65b q4_0 with a 3090 and 32 GB RAM @ 3200 MHz. This implementation works with EXT4 and NTFS but not BTRFS.

Models are loaded via cuFile, which allows the GPU to directly access the disk when loading weights. Originally I wanted to have the implementation entirely in ggml-cuda.cu, but that would have required a lot of messing with the internals of llama_model_loader, so I instead opted to extend llama_model_loader and llama_file. It's intended to work like this: when the tensors are built, each one is assigned either CPU or CUDA as its backend. Then, when the actual model data is loaded, this information is used to place the data either in RAM or in VRAM.
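For reference, the core cuFile calls boil down to something like this (a minimal sketch with error handling omitted; the helper name and the way the offset/size are obtained are assumptions):

```c
#include <fcntl.h>
#include <cuda_runtime.h>
#include <cufile.h>

// Read `size` bytes starting at `file_offset` of an already-open file
// descriptor `fd` directly into the device buffer `dev_ptr`, without an
// intermediate copy through host RAM.
static void load_tensor_via_cufile(int fd, void * dev_ptr, size_t size, off_t file_offset) {
    CUfileDescr_t descr = {0};
    descr.handle.fd = fd;
    descr.type      = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t handle;
    cuFileDriverOpen();                    // in practice done once per process
    cuFileHandleRegister(&handle, &descr);

    cuFileRead(handle, dev_ptr, size, file_offset, 0 /* offset into dev_ptr */);

    cuFileHandleDeregister(handle);
}
```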

Question: when are there multiple input files? For simplicity I assumed there to be only one.

JohannesGaessler marked this pull request as draft on May 17, 2023

JohannesGaessler (Collaborator, Author):

33b models with 16 GB RAM and 8 GB VRAM still do not seem to be viable. On my headless Linux server I'm ~1 GB short.

ggerganov (Owner) commented on May 17, 2023

33b models with 16 GB RAM and 8 GB VRAM still do not seem to be viable. On my headless Linux server I'm ~1 GB short.

When #1508 is ready, it might be just enough to fit it

With the new broadcasting for ggml_mul building on macOS seems to fail. I don't have any Apple devices on which I could debug this; should I just disable it by default again?

I'll fix this over the weekend

JohannesGaessler (Collaborator, Author) commented on May 17, 2023

I'll fix this over the weekend

Thanks, I'll leave it to you. What I myself still need to do is implement a workaround for Btrfs and fix the info prints for RAM/VRAM usage. Done.

JohannesGaessler (Collaborator, Author) commented on May 17, 2023

On my Linux server with an i5-4570s, 1600 MHz RAM, and a GTX 1070 when loading 7b q4_0 from a SATA SSD with an empty file system cache:
With cuFile: 15.752 s initialization
With the workaround implementation: 16.949 s initialization.

On the second run:
With cuFile: 4.511 s
With the workaround implementation: 3.143 s

slaren (Collaborator) commented on May 17, 2023

Maintaining these changes in llama.cpp, or supporting a different backend like OpenCL, is going to be a nightmare. IMO the dependencies on CUDA and ggml-cuda should be kept to the absolute minimum, and where a dependency cannot reasonably be avoided, it should be clearly isolated from the rest of the code.

Green-Sky (Collaborator):

AFAIK, cuFile is also Linux-only?

JohannesGaessler (Collaborator, Author):

Thank you for your input. I understand my PRs to be points of discussion and am willing to implement changes (since that's relatively easy once you have a working version). To make sure there is no misunderstanding: cuFile is included in the CUDA toolkit and would not require installing additional software.

  • Regarding CUDA code in llama.cpp: I think preserving the object-oriented design is slightly more important than keeping CUDA code confined to ggml-cuda.cu, but I don't have a strong opinion on this.
  • Regarding cuFile: I think the difference in loading speed would be nice to have but not essential; ultimately I had to implement it first to get concrete numbers.
  • Regarding OSs other than Linux: cuFile can read from NTFS (Windows) partitions so I would have assumed that it works under Windows. Can someone test this?

Green-Sky (Collaborator):

CMake support for cuFile requires at least version 3.25:
https://cmake.org/cmake/help/latest/module/FindCUDAToolkit.html#cufile
That's a bit too new.

JohannesGaessler (Collaborator, Author) commented on May 17, 2023

I collected some data with my main machine (RTX 3090, Ryzen 3700X, 32 GB RAM @ 3200 MHz). The test command I used is time ./main --model /path/to/model --ignore-eos --n_predict 0 --ctx_size 2048 --batch_size 512 --seed 1337 --threads 8 --gpu_layers 99 and I simply measured the runtime. Results:

| Model | SSD type | 1st initialization? | Runtime cuFile [s] | Runtime workaround [s] |
| --- | --- | --- | --- | --- |
| 7b q4_0 | SATA | Yes | 11.251 | 12.699 |
| 7b q4_0 | SATA | No | 2.692 | 4.113 |
| 7b q4_0 | NVME | Yes | 4.471 | 5.518 |
| 7b q4_0 | NVME | No | 2.740 | 4.117 |
| 13b q4_0 | SATA | Yes | 20.280 | 23.149 |
| 13b q4_0 | SATA | No | 3.730 | 6.492 |
| 13b q4_0 | NVME | Yes | 6.118 | 9.210 |
| 13b q4_0 | NVME | No | 3.715 | 6.602 |
| 33b q4_0 | SATA | Yes | 48.123 | 54.403 |
| 33b q4_0 | SATA | No | 6.394 | 13.632 |
| 33b q4_0 | NVME | Yes | 12.878 | 20.442 |
| 33b q4_0 | NVME | No | 6.528 | 13.607 |

The difference seems to be ~1.5 s for 7b, ~3 s for 13b, and ~6 s for 33b.

JohannesGaessler (Collaborator, Author):

I force-pushed a version based on the feedback. This version keeps all cuFile-related code entirely in ggml-cuda.cu. The code for loading weights into VRAM without cuFile is still in llama.cpp, because it relies on llama_file, which handles a lot of the OS-specific problems. If that loading code were in ggml-cuda.cu instead, then either ggml-cuda.cu would need to include llama-util.h or the relevant code would need to be copy-pasted - I think both of these options are undesirable.
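For comparison, the non-cuFile path is essentially a staged copy, roughly like this (illustrative sketch; the helper name is hypothetical and llama_file is stood in for by plain stdio here, only cudaMalloc/cudaMemcpy are real CUDA calls):

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical fallback: read the tensor bytes into a temporary host buffer
// (llama_file performs this kind of read in the real code), then copy them
// into a fresh device allocation.
static void * load_tensor_via_host_copy(FILE * f, long file_offset, size_t size) {
    void * host_buf = malloc(size);
    fseek(f, file_offset, SEEK_SET);
    fread(host_buf, 1, size, f);

    void * dev_ptr = NULL;
    cudaMalloc(&dev_ptr, size);
    cudaMemcpy(dev_ptr, host_buf, size, cudaMemcpyHostToDevice);

    free(host_buf);
    return dev_ptr;
}
```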

I designed the new version to make it easy to either remove the cuFile-related code entirely or to turn it into a compilation option. Opinions regarding these decisions would be appreciated.

The original version can be found here.

Green-Sky (Collaborator):

to turn it into a compilation option.

I vote for this.

Does anyone have a direct storage system to test this? Either with "Magnum IO GPUDirect Storage" or maybe Windows DirectStorage or something. It would be interesting to see.

rankaiyx (Contributor):

Nice! Is a system with 32 GB RAM and 11 GB VRAM expected to run the 65b model? Or does it need 24 GB VRAM?

JohannesGaessler (Collaborator, Author):

Is a system with 32 GB RAM and 11 GB VRAM expected to run the 65b model?

On a headless Linux server it should be possible; on Winbloats, maybe not so much.

ggerganov and others added 14 commits on May 20, 2023:

* ggml : use F16 instead of F32 in Q4_0, Q4_1 and Q8_0
* llama : bump LLAMA_FILE_VERSION to 3
* cuda : update Q4 and Q8 dequantize kernels
* ggml : fix AVX dot products
* readme : update performance table + hot topics
* Fix name shadowing and C4146
* Fix if macros not using defined when required
* Update llama-util.h (co-authored by github-actions[bot])
* Code style (co-authored by Georgi Gerganov)
* feature: add blis support
* feature: allow all BLA_VENDOR to be assigned in cmake arguments, align with whisper.cpp PR 927
* fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake
* Fix typo in INTEGER (co-authored by Georgi Gerganov)

github-actions bot (Contributor) left a comment:

clang-tidy made some suggestions on llama.cpp

ggerganov (Owner) left a comment:

I've updated the llama inference to always use the broadcasted ggml_mul().
Cannot do it in baby-llama yet because the backward pass is not implemented (I added asserts to prevent anyone from using it for now).
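Schematically, the call pattern in the inference code changes from an explicit repeat to a broadcasted multiplication (variable names are illustrative, not the exact llama.cpp code):

```c
// before: the norm weights were explicitly repeated to the activation's shape
cur = ggml_mul(ctx0, ggml_repeat(ctx0, layer_norm_weights, cur), cur);

// after: ggml_mul broadcasts the smaller tensor, so the repeat is unnecessary
cur = ggml_mul(ctx0, cur, layer_norm_weights);
```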

I think we have to drop cuFile - it requires CMake 3.25 and it currently seems to break the Windows build for some reason.
It does not seem worth it given the latest benchmarks.

P.S. Hope I didn't mess up something during the rebase that I just did

JohannesGaessler (Collaborator, Author):

I removed cuFile and cleaned up the rebase as best as I could. Give me a little more time and I'll clean up the code a little now that I no longer need to consider cuFile.

github-actions bot (Contributor) left a comment:

clang-tidy made some suggestions

llama.cpp (outdated):

    // "output" tensor
    {
    ggml_backend backend_output;
    if (n_gpu_layers > int(n_layer)) {

warning: if with identical then and else branches [bugprone-branch-clone]

            if (n_gpu_layers > int(n_layer)) {
            ^

Additional context - llama.cpp:1024: else branch starts here

            } else {
              ^

github-actions bot (Contributor) left a comment:

clang-tidy made some suggestions

llama.cpp (outdated):

    // "output" tensor
    {
    ggml_backend backend_output;
    if (n_gpu_layers > int(n_layer)) {

warning: if with identical then and else branches [bugprone-branch-clone]

            if (n_gpu_layers > int(n_layer)) {
            ^

Additional context - llama.cpp:1022: else branch starts here

            } else {
              ^

JohannesGaessler (Collaborator, Author):

I moved the loop for loading weights from ggml-cuda.cu to llama.cpp. I think this makes the code easier to understand, and it also allowed me to fix the progress callback (though the calculations for it should maybe be deduplicated now). As far as I'm concerned this can be merged now, but I can quickly revise the progress callback code. One thing I've noticed: the regular progress indicator is only enabled for mlock, because with mmap the loading time is not measurable. With GPU offloading it is always measurable, so I set the indicator to always be enabled.

ggerganov merged commit affc76e into ggerganov:master on May 20, 2023

JohannesGaessler (Collaborator, Author):

Alright, thanks for the help everyone.
