WIP: Flash Attention implementation (forward + backward) #1

FSSRepo · 2024-01-17T22:23:35Z

Previously, the initiative to implement Flash Attention to improve inference performance in llama.cpp had already been introduced. However, it was assumed that this approach would yield the expected results on the CPU, for that reason, it was discarded, and no further follow-up was given.

Flash Attention is actually designed to enhance GPU resource utilization through the use of tensor cores and shared memory, which is 30 times faster than global memory (VRAM), reducing unnecessary readings and writings.

Implementing this algorithm is particularly challenging because it requires taking into account hardware limitations and conflicts that can degrade performance, even when everything is expected to be fine (a.k.a. shared memory banks conflicts).

Tasks to be carried out during the execution of this project:

Analyzing and understanding concisely how Flash Attention works. Even if it's just creating a C++ code that replicates the exact functionality without much complexity. From this base, we move on to parallelize and optimize.
Optimize and support different models.
Create the backward kernel for training.

Run test:

# get the code
git clone https://github.com/Pints-App/llama.cpp.git
cd llama.cpp

# build
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release

# Run test (get result)
bin/test-flash-attention

# Run test (compare backends)
bin/test-flash-attention comp

ggml-cuda.cu

JohannesGaessler · 2024-01-18T15:55:31Z

Create the backward kernel for training.

Currently I don't think llama.cpp has good support for training at all. I want to work on it at some point but implementing a backwards pass currently doesn't make sense I think.

JohannesGaessler · 2024-01-18T16:02:21Z

Optimize and support different models.

In terms of optimization it will rather need to be done for different GPU + batch size combinations. Fundamentally you want to use large SRAM tiles because that increases arithmetic intensity: you can do more computations per load from global memory. However, at the same time you want to split a computing task into as many blocks as possible since this reduces so-called tail effects where the last wave of a kernel does not have enough blocks to keep all of the streaming multiprocessors on a GPU busy. So for optimizing performance you will need to find a balance between these two problems which is affected by batch size because the number of blocks that the GPU can work on should be proportional to batch size.

What I think would be the best solution is to compile the kernel multiple times with different template parameters and to then choose the best implementation at runtime based on the number of streaming multiprocessors on the GPU and teh shapes of the matrices. I am currently working on essentially this exact problem in ggml-org#4801 (but for regular matrix multiplication).

JohannesGaessler · 2024-01-18T16:04:49Z

Since the target architectures are Ampere and Ada Lovelace consider that copying data from VRAM to SRAM is more efficient using asynchronous data copies. This is both faster and reduces register pressure. It also allows you to load data and do computations in parallel. This optimization can be added later though. I will soon push a prototype to ggml-org#4801 which (I think) is close to optimal.

JohannesGaessler · 2024-01-18T16:23:21Z

The installation instructions seem to be missing cmake .. -DLLAMA_CUBLAS=ON. In any case, I cannot compile the project. I get the following error message:


/home/johannesg/Projects/llama.cpp/build [git::flash-attn-cuda *] [johannesg@johannes-pc] [17:19]
> cmake --build . --config Release
[  5%] Built target ggml
[  6%] Built target ggml_static
[  8%] Built target llama
[ 10%] Built target build_info
[ 16%] Built target common
[ 17%] Built target test-quantize-fns
[ 19%] Built target test-quantize-perf
[ 21%] Built target test-sampling
[ 23%] Built target test-tokenizer-0-llama
[ 25%] Built target test-tokenizer-0-falcon
[ 27%] Built target test-tokenizer-1-llama
[ 29%] Built target test-tokenizer-1-bpe
[ 31%] Built target test-grammar-parser
[ 32%] Building CXX object tests/CMakeFiles/test-llama-grammar.dir/test-llama-grammar.cpp.o
[ 33%] Linking CXX executable ../bin/test-llama-grammar
[ 33%] Built target test-llama-grammar
[ 34%] Building CXX object tests/CMakeFiles/test-grad0.dir/test-grad0.cpp.o
[ 35%] Linking CXX executable ../bin/test-grad0
[ 35%] Built target test-grad0
[ 36%] Building CXX object tests/CMakeFiles/test-backend-ops.dir/test-backend-ops.cpp.o
[ 37%] Linking CXX executable ../bin/test-backend-ops
[ 37%] Built target test-backend-ops
[ 38%] Building CXX object tests/CMakeFiles/test-rope.dir/test-rope.cpp.o
[ 39%] Linking CXX executable ../bin/test-rope
[ 39%] Built target test-rope
[ 40%] Building CXX object tests/CMakeFiles/test-flash-attention.dir/test-flash-attention.cpp.o
/home/johannesg/Projects/llama.cpp/tests/test-flash-attention.cpp:5: warning: "GGML_USE_CUBLAS" redefined
    5 | #define GGML_USE_CUBLAS
      | 
<command-line>: note: this is the location of the previous definition
/home/johannesg/Projects/llama.cpp/tests/test-flash-attention.cpp: In function ‘bool isinf_or_max(float)’:
/home/johannesg/Projects/llama.cpp/tests/test-flash-attention.cpp:74:34: error: ‘FLT_MAX’ was not declared in this scope
   74 |     return std::isinf(f) || f == FLT_MAX || f == -FLT_MAX;
      |                                  ^~~~~~~
/home/johannesg/Projects/llama.cpp/tests/test-flash-attention.cpp:19:1: note: ‘FLT_MAX’ is defined in header ‘<cfloat>’; did you forget to ‘#include <cfloat>’?
   18 | #include <vector>
  +++ |+#include <cfloat>
   19 | 
/home/johannesg/Projects/llama.cpp/tests/test-flash-attention.cpp: At global scope:
/home/johannesg/Projects/llama.cpp/tests/test-flash-attention.cpp:93:6: warning: no previous declaration for ‘void ggml_tensor_set_f32(ggml_tensor*, float, int, int, int, int)’ [-Wmissing-declarations]
   93 | void ggml_tensor_set_f32(struct ggml_tensor* tensor, float value, int l, int k = 0, int j = 0, int i = 0) {
      |      ^~~~~~~~~~~~~~~~~~~
/home/johannesg/Projects/llama.cpp/tests/test-flash-attention.cpp:98:7: warning: no previous declaration for ‘float ggml_tensor_get_f32(const ggml_tensor*, int, int, int, int)’ [-Wmissing-declarations]
   98 | float ggml_tensor_get_f32(const ggml_tensor* tensor, int l, int k = 0, int j = 0, int i = 0) {
      |       ^~~~~~~~~~~~~~~~~~~
/home/johannesg/Projects/llama.cpp/tests/test-flash-attention.cpp:103:6: warning: no previous declaration for ‘void load_model(test_model&, bool)’ [-Wmissing-declarations]
  103 | void load_model(test_model & model, bool use_gpu = false) {
      |      ^~~~~~~~~~
/home/johannesg/Projects/llama.cpp/tests/test-flash-attention.cpp: In function ‘void load_model(test_model&, bool)’:
/home/johannesg/Projects/llama.cpp/tests/test-flash-attention.cpp:146:44: warning: ‘double ggml_type_sizef(ggml_type)’ is deprecated: use ggml_row_size() instead [-Wdeprecated-declarations]
  146 |         buffer_size += 30 * ggml_type_sizef(GGML_TYPE_F32); // tensor q
      |                             ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
In file included from /home/johannesg/Projects/llama.cpp/tests/test-flash-attention.cpp:1:
/home/johannesg/Projects/llama.cpp/./ggml.h:672:21: note: declared here
  672 |     GGML_API double ggml_type_sizef(enum ggml_type type), // ggml_type_size()/ggml_blck_size() as float
      |                     ^~~~~~~~~~~~~~~
/home/johannesg/Projects/llama.cpp/./ggml.h:202:41: note: in definition of macro ‘GGML_DEPRECATED’
  202 | #    define GGML_DEPRECATED(func, hint) func __attribute__((deprecated(hint)))
      |                                         ^~~~
/home/johannesg/Projects/llama.cpp/tests/test-flash-attention.cpp:147:44: warning: ‘double ggml_type_sizef(ggml_type)’ is deprecated: use ggml_row_size() instead [-Wdeprecated-declarations]
  147 |         buffer_size += 24 * ggml_type_sizef(GGML_TYPE_F32); // tensor k
      |                             ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
/home/johannesg/Projects/llama.cpp/./ggml.h:672:21: note: declared here
  672 |     GGML_API double ggml_type_sizef(enum ggml_type type), // ggml_type_size()/ggml_blck_size() as float
      |                     ^~~~~~~~~~~~~~~
/home/johannesg/Projects/llama.cpp/./ggml.h:202:41: note: in definition of macro ‘GGML_DEPRECATED’
  202 | #    define GGML_DEPRECATED(func, hint) func __attribute__((deprecated(hint)))
      |                                         ^~~~
/home/johannesg/Projects/llama.cpp/tests/test-flash-attention.cpp:148:44: warning: ‘double ggml_type_sizef(ggml_type)’ is deprecated: use ggml_row_size() instead [-Wdeprecated-declarations]
  148 |         buffer_size += 24 * ggml_type_sizef(GGML_TYPE_F32); // tensor v
      |                             ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
/home/johannesg/Projects/llama.cpp/./ggml.h:672:21: note: declared here
  672 |     GGML_API double ggml_type_sizef(enum ggml_type type), // ggml_type_size()/ggml_blck_size() as float
      |                     ^~~~~~~~~~~~~~~
/home/johannesg/Projects/llama.cpp/./ggml.h:202:41: note: in definition of macro ‘GGML_DEPRECATED’
  202 | #    define GGML_DEPRECATED(func, hint) func __attribute__((deprecated(hint)))
      |                                         ^~~~
/home/johannesg/Projects/llama.cpp/tests/test-flash-attention.cpp: At global scope:
/home/johannesg/Projects/llama.cpp/tests/test-flash-attention.cpp:203:22: warning: no previous declaration for ‘ggml_cgraph* build_graph(const test_model&, ggml_allocr*)’ [-Wmissing-declarations]
  203 | struct ggml_cgraph * build_graph(const test_model& model, struct ggml_allocr * allocr) {
      |                      ^~~~~~~~~~~
/home/johannesg/Projects/llama.cpp/tests/test-flash-attention.cpp: In function ‘ggml_cgraph* build_graph(const test_model&, ggml_allocr*)’:
/home/johannesg/Projects/llama.cpp/tests/test-flash-attention.cpp:203:80: warning: unused parameter ‘allocr’ [-Wunused-parameter]
  203 | build_graph(const test_model& model, struct ggml_allocr * allocr) {
      |                                      ~~~~~~~~~~~~~~~~~~~~~^~~~~~

/home/johannesg/Projects/llama.cpp/tests/test-flash-attention.cpp: At global scope:
/home/johannesg/Projects/llama.cpp/tests/test-flash-attention.cpp:226:21: warning: no previous declaration for ‘ggml_tensor* compute_graph(const test_model&, ggml_backend_t, ggml_allocr*, bool)’ [-Wmissing-declarations]
  226 | struct ggml_tensor* compute_graph(const test_model & model, ggml_backend_t backend_cpu, struct ggml_allocr * allocr, bool compare_backends) {
      |                     ^~~~~~~~~~~~~
make[2]: *** [tests/CMakeFiles/test-flash-attention.dir/build.make:76: tests/CMakeFiles/test-flash-attention.dir/test-flash-attention.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:1929: tests/CMakeFiles/test-flash-attention.dir/all] Error 2

JohannesGaessler · 2024-01-18T16:25:29Z

As hinted at by the error message, it can be fixed with #include <cfloat>.

…into flash-attn-cuda

FSSRepo · 2024-01-24T16:10:16Z

@JohannesGaessler I'm trying to do a port of the implementation that Georgi did, but I'm running into issues where CUDA doesn't support fp16 operators. I need to use the functions from cuda_fp16.h, but when I compile, I get that they are undefined.

C:\proyects\llama.cpp\ggml-cuda.cu(6252): error : more than one conversion function from "half" to a built-in type applies: [C:\proyects\llama.cpp\build\ggml.vcxpro 
j]
              function "__half::operator float() const" (declared at line 217 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp)     
              function "__half::operator short() const" (declared at line 235 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp)     
              function "__half::operator unsigned short() const" (declared at line 238 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16 
  .hpp)
              function "__half::operator int() const" (declared at line 241 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp)       
              function "__half::operator unsigned int() const" (declared at line 244 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.h 
  pp)
              function "__half::operator long long() const" (declared at line 247 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp) 
              function "__half::operator unsigned long long() const" (declared at line 250 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_ 
  fp16.hpp)
              function "__half::operator __nv_bool() const" (declared at line 254 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp) 
            ss[hiisg*tph + tiih] = (s4.x + s4.y + s4.z + s4.w);
                                           ^
            detected during instantiation of "void flash_attn_ext_f16<D,R>(const char *, const char *, const char *, const char *, float *, float, int, int, int, in 
  t, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int) [with D=80, R=2]" at line 10370

C:\proyects\llama.cpp\ggml-cuda.cu(6261): error : no operator "+=" matches these operands [C:\proyects\llama.cpp\build\ggml.vcxproj]
              operand types are: half += half
                    s += ss[hiisg*tph + i];
                      ^
            detected during instantiation of "void flash_attn_ext_f16<D,R>(const char *, const char *, const char *, const char *, float *, float, int, int, int, in 
  t, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int) [with D=80, R=2]" at line 10370

C:\proyects\llama.cpp\ggml-cuda.cu(6264): error : identifier "__hfma" is undefined [C:\proyects\llama.cpp\build\ggml.vcxproj]
                s = __hfma(s, __float2half(scale), mv);
                    ^
            detected during instantiation of "void flash_attn_ext_f16<D,R>(const char *, const char *, const char *, const char *, float *, float, int, int, int, in 
  t, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int) [with D=80, R=2]" at line 10370

C:\proyects\llama.cpp\ggml-cuda.cu(6270): error : more than one conversion function from "half" to a built-in type applies: [C:\proyects\llama.cpp\build\ggml.vcxpro 
j]
              function "__half::operator float() const" (declared at line 217 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp)     
              function "__half::operator short() const" (declared at line 235 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp)     
              function "__half::operator unsigned short() const" (declared at line 238 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16 
  .hpp)
              function "__half::operator int() const" (declared at line 241 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp)       
              function "__half::operator unsigned int() const" (declared at line 244 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.h 
  pp)
              function "__half::operator long long() const" (declared at line 247 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp) 
              function "__half::operator unsigned long long() const" (declared at line 250 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_ 
  fp16.hpp)
              function "__half::operator __nv_bool() const" (declared at line 254 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp) 
                half ms = hexp(m - M);
                               ^
            detected during instantiation of "void flash_attn_ext_f16<D,R>(const char *, const char *, const char *, const char *, float *, float, int, int, int, in 
  t, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int) [with D=80, R=2]" at line 10370

C:\proyects\llama.cpp\ggml-cuda.cu(6270): error : more than one conversion function from "half" to a built-in type applies: [C:\proyects\llama.cpp\build\ggml.vcxpro 
j]
              function "__half::operator float() const" (declared at line 217 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp)     
              function "__half::operator short() const" (declared at line 235 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp)     
              function "__half::operator unsigned short() const" (declared at line 238 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16 
  .hpp)
              function "__half::operator int() const" (declared at line 241 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp)       
              function "__half::operator unsigned int() const" (declared at line 244 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.h 
  pp)
              function "__half::operator long long() const" (declared at line 247 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp) 
              function "__half::operator unsigned long long() const" (declared at line 250 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_ 
  fp16.hpp)
              function "__half::operator __nv_bool() const" (declared at line 254 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp) 
                half ms = hexp(m - M);
                                   ^
            detected during instantiation of "void flash_attn_ext_f16<D,R>(const char *, const char *, const char *, const char *, float *, float, int, int, int, in 
  t, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int) [with D=80, R=2]" at line 10370

C:\proyects\llama.cpp\ggml-cuda.cu(6270): error : identifier "hexp" is undefined [C:\proyects\llama.cpp\build\ggml.vcxproj]
                half ms = hexp(m - M);
                          ^
            detected during instantiation of "void flash_attn_ext_f16<D,R>(const char *, const char *, const char *, const char *, float *, float, int, int, int, in 
  t, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int) [with D=80, R=2]" at line 10370

C:\proyects\llama.cpp\ggml-cuda.cu(6271): error : more than one conversion function from "half" to a built-in type applies: [C:\proyects\llama.cpp\build\ggml.vcxpro
j]
              function "__half::operator float() const" (declared at line 217 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp)     
              function "__half::operator short() const" (declared at line 235 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp)     
              function "__half::operator unsigned short() const" (declared at line 238 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16 
  .hpp)
              function "__half::operator int() const" (declared at line 241 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp)       
              function "__half::operator unsigned int() const" (declared at line 244 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.h 
  pp)
              function "__half::operator long long() const" (declared at line 247 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp) 
              function "__half::operator unsigned long long() const" (declared at line 250 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_ 
  fp16.hpp)
              function "__half::operator __nv_bool() const" (declared at line 254 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp) 
                half vs = hexp(s - M);
                               ^
            detected during instantiation of "void flash_attn_ext_f16<D,R>(const char *, const char *, const char *, const char *, float *, float, int, int, int, in 
  t, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int) [with D=80, R=2]" at line 10370

C:\proyects\llama.cpp\ggml-cuda.cu(6271): error : more than one conversion function from "half" to a built-in type applies: [C:\proyects\llama.cpp\build\ggml.vcxpro 
j]
              function "__half::operator float() const" (declared at line 217 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp)     
              function "__half::operator short() const" (declared at line 235 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp)     
              function "__half::operator unsigned short() const" (declared at line 238 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16 
  .hpp)
              function "__half::operator int() const" (declared at line 241 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp)       
              function "__half::operator unsigned int() const" (declared at line 244 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.h 
  pp)
              function "__half::operator long long() const" (declared at line 247 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp) 
              function "__half::operator unsigned long long() const" (declared at line 250 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_ 
  fp16.hpp)
              function "__half::operator __nv_bool() const" (declared at line 254 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp) 
                half vs = hexp(s - M);
                                   ^
            detected during instantiation of "void flash_attn_ext_f16<D,R>(const char *, const char *, const char *, const char *, float *, float, int, int, int, in 
  t, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int) [with D=80, R=2]" at line 10370

C:\proyects\llama.cpp\ggml-cuda.cu(6308): error : more than one conversion function from "half" to a built-in type applies: [C:\proyects\llama.cpp\build\ggml.vcxpro 
j]
              function "__half::operator float() const" (declared at line 217 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp)     
              function "__half::operator short() const" (declared at line 235 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp)     
              function "__half::operator unsigned short() const" (declared at line 238 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16 
  .hpp)
              function "__half::operator int() const" (declared at line 241 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp)       
              function "__half::operator unsigned int() const" (declared at line 244 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.h 
  pp)
              function "__half::operator long long() const" (declared at line 247 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp) 
              function "__half::operator unsigned long long() const" (declared at line 250 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_ 
  fp16.hpp)
              function "__half::operator __nv_bool() const" (declared at line 254 of C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\cuda_fp16.hpp) 
                half ms0 = hexp(M0 - M);
                                ^
            detected during instantiation of "void flash_attn_ext_f16<D,R>(const char *, const char *, const char *, const char *, float *, float, int, int, int, in 
  t, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int) [with D=80, R=2]" at line 10370

C:\proyects\llama.cpp\ggml-cuda.cu(6308): error : more than one conversion function from "half" to a built-in type applies: [C:\proyects\llama.cpp\build\ggml.vcxpro
j]

JohannesGaessler · 2024-01-24T18:09:24Z

The issue is that cmake by default compiles device code for compute capabilities 5.2, 6.1, and 7.0. You can fix it for now by compiling with -DLLAMA_CUDA_F16=ON.

ggml-cuda.cu

sorasoras · 2024-02-01T15:02:49Z

cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_FORCE_DMMV=TRUE -DLLAMA_CUDA_DMMV_X=32 -DLLAMA_CUDA_MMV_Y=2 -DLLAMA_CUDA_F16=TRUE -DBUILD_SHARED_LIBS=TRUE

.\test-flash-attention.exe
load_model: ggml tensor size    = 368 bytes
load_model: backend buffer size = 0.04 MB
load_model: using CUDA backend
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1, VMM: no
main: compute buffer size: 0.00 MB
computing time: 0.9200 ms

Performing test:
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000

Could I assumed this is unsupported on p40?

FSSRepo · 2024-02-01T15:18:42Z

@sorasoras To use flash attention, the library requires the Tensor Cores, which are present starting from the Volta architecture (Nvidia V100) and newer.

cuda : fix flash_attn kernel to produce same results as CPU

ggml-cuda.cu

FSSRepo · 2024-02-01T18:28:16Z

@ggerganov Have you tried conducting perplexity tests with the flash attention kernel? I've been experimenting with phi-2, and it's generating nonsensical text. Unfortunately, I don't have a llama model on my computer for testing, and I understand that the phi-2 model has precision issues.

ggerganov · 2024-02-01T18:58:25Z

phi-2 requires F32 kernel. The master implementation also uses F32 precision for the matrix multiplications:

https://github.com/ggerganov/llama.cpp/blob/32de3f57a97268e74a6e4dc1335b28ed7fee08db/llama.cpp#L4578-L4584

For now, test with LLaMA / Mistral models. When and if we get the performance, we will look into phi-2 support

ggml-cuda.cu

JohannesGaessler · 2024-02-03T09:46:24Z

On my hardware (1x RTX 3090) the FlashAttention kernel is suffering from tail effects for perplexity calculations with a batch size of 512. You can see this in NSight Compute under "PM Sampling":

The number of active warps for the last ~30% of the runtime is very low and this in turn reduces the GPU utilization. For more information you can look at the "Launch Statistics":

The number of waves per SM is 2.08. The actual runtime of a kernel will be proportional to this value rounded up to the next integer. Ideally you want this number to be just below a whole integer so that for all waves the GPU utilization is high. The worse-case scenario is something like 1.01 because then you have to do 2 waves when you could probably do a slightly modified version of the kernel in a single wave which would make it ~2x faster.

However, this is something that you should optimize at the end. The number of waves will depend strongly on implementation details, the specific GPU, and the batch size. The number of waves will be proportional to batch_size/(occupancy*num_streaming_multiprocessors). My immediate advice would be to optimize the other aspects of the kernel first but to keep tail effects in mind.

JohannesGaessler · 2024-02-03T10:28:29Z

I made a PR with some very simple performance optimizations because it took me the same time as it would have been writing Github comments: #4

JohannesGaessler

This code has a lot of if(...){...} statements. Try to avoid these if possible since they are slow on GPUs. If you cannot avoid them, try to provide the compiler with enough information via template parameters to resolve the branching at compile time so the runtime is unaffected.

JohannesGaessler · 2024-02-03T10:34:06Z

ggml-cuda.cu

+    const int warp_id = threadIdx.y;
+    const int lane_id = threadIdx.x;
+
+    const int num_warps = blockDim.y; // number of warps
+    const int iq3 = blockIdx.z;
+    const int iq2 = blockIdx.y;
+    const int iq1 = blockIdx.x * Q;


This code copies the values from threadIdx and blockIdx to registers. It should be very slightly faster to access them this way. But if register pressure is an issue it would make sense to remove these local variables and to instead use threadIdx and blockIdx directly.

If possible, make num_warps a template parameter so the compiler gets more information and you don't have to use a register for it.

FSSRepo · 2024-02-03T15:20:05Z

@JohannesGaessler

PM Sampling

I don't see the PM Sampling section in my Nsight Compute.

unroll 2 loops, int64_t -> int, 309 µs

JohannesGaessler · 2024-02-03T16:05:15Z

Under Connect -> Metrics I just ticked "full" and then I got it.

into flash-attn-cuda

FSSRepo · 2024-02-04T22:32:10Z

Under Connect -> Metrics I just ticked "full" and then I got it.

I don't see that section; it must be because my toolkit is outdated.

JohannesGaessler · 2024-02-11T08:58:44Z

At a batch size of 1 the size of the CUDA grid for the flash_attn_ext_f16 kernel is extremely small:

As a consequence both the memory and compute throughput are low. At a batch size of 512 the grid is larger but it is still very small:

There are only 1.25 waves per SM and as a consequence ~40% of the runtime is lost to tail effects:

If at all possible, the kernel should be rewritten in such a way that the work is distributed among more CUDA blocks or with more warps per CUDA block.

JohannesGaessler · 2024-02-11T09:16:15Z

ggml-cuda.cu

+    // reduce the warps sequentially
+    for (int sg = 1; sg < num_warps; ++sg) {
+        __syncthreads();
+
+        // each simdgroup stores its output to shared memory, reusing sq
+        if (warp_id == sg) {
+            for (int j = 0; j < Q16; ++j) {
+                for (int i = 0; i < D16; ++i) {
+                    nvcuda::wmma::store_matrix_sync(sq + 16*j*T + i*16, lo[j][i], T, nvcuda::wmma::mem_row_major);
+                }
+            }
+        }
+
+        __syncthreads();
+
+        // the first simdgroup accumulates the results from the other simdgroups
+        if (warp_id == 0) {
+            for (int j = lane_id; j < Q; j += NW) {
+                const half S0 = ss[j*T +         0];
+                const half S1 = ss[j*T + sg*SH + 0];
+
+                const half M0 = ss[j*T +         1];
+                const half M1 = ss[j*T + sg*SH + 1];
+
+                const half M = __hmax(M0, M1);
+
+                const half ms0 = hexp(M0 - M);
+                const half ms1 = hexp(M1 - M);
+
+                const half S = S0*ms0 + S1*ms1;
+
+                ss[j*T + 0] = S;
+                ss[j*T + 1] = M;
+
+                ss[j*T + C + j        ] = ms0;
+                ss[j*T + C + j + sg*SH] = ms1;
+            }
+
+            // O_0 = diag(ms0)*O_0 + diag(ms1)*O_1
+            for (int j = 0; j < Q16; ++j) {
+                half16x16_a ms0;
+                half16x16_a ms1;
+                half16x16_b t;
+                half16x16_acc t2;
+
+                nvcuda::wmma::load_matrix_sync(ms0, ss + 16*j*T + C + 16*j,         T);
+                nvcuda::wmma::load_matrix_sync(ms1, ss + 16*j*T + C + 16*j + sg*SH, T);
+
+                for (int i = 0; i < D16; ++i) {
+                    nvcuda::wmma::load_matrix_sync(t, sq + 16*j*T + i*16, T);
+                    nvcuda::wmma::mma_sync(t2, ms1, t, zr);
+
+                    // convert accumulator to matrix_b
+                    nvcuda::wmma::store_matrix_sync(   sq + 16*j*T + i*16, lo[j][i], T, nvcuda::wmma::mem_row_major);
+                    nvcuda::wmma::load_matrix_sync (t, sq + 16*j*T + i*16, T);
+
+                    nvcuda::wmma::mma_sync(lo[j][i], ms0, t, t2);
+                }
+            }
+        }
    }
+
+    // store result to shared memory (reuse sq)
+    if (warp_id == 0) {
+        for (int j = 0; j < Q16; ++j) {
+            for (int i = 0; i < D16; ++i) {
+                nvcuda::wmma::store_matrix_sync(sq + 16*j*T + i*16, lo[j][i], T, nvcuda::wmma::mem_row_major);
+            }
+        }
+    }
+
+    // final rescale with 1/S and store to global memory
+    if (warp_id == 0) {
+        for (int j = 0; j < Q && iq1 + j < ne01; ++j) {
+            const half S = ss[j*T + 0];
+
+            for (int i0 = 0; i0 < D; i0 += NW) {
+                const int i = i0 + lane_id;
+                if (i >= D) {
+                    break;
+                }
+
+                dst[(iq3*ne2*ne1 + iq2 + (iq1 + j)*ne1)*D + i] = __half2float(sq[j*T + i] / S);
+            }
+        }
+    }


My guess based on just static code analysis is that this is the part of the code that is causing issues when you increase the number of warps. But since I am not able to increase the number of warps (and still get correct results) this could be very wrong.

I think it works correctly for up to 8 warps - after that it runs out of shared memory

Okay, but how do I set the number of warps? If I just set e.g. nwarps = 4 on line 10940 I get NaN with

export model_name=llama_2-7b && export quantization=q4_0 ./perplexity --n-gpu-layers 99 --model models/nvme/${model_name}-${quantization}.gguf --file wikitext-2-raw/wiki.test.raw --mlock --chunks 1

after that it runs out of shared memory

It's implemented in a convoluted and tedious way but it is possible to raise the shared memory limit for CUDA kernels so that a single block can use 100% of the shared memory of an SM. For example:

https://github.com/JohannesGaessler/llama.cpp/blob/979a9bf1bef96fbeafd3a0a605c642ab098021ee/ggml-cuda.cu#L6723-L6735

I'm not sure about the state of this branch. It works on my branch:

https://github.com/ggerganov/llama.cpp/tree/gg/flash-attn

diff --git a/ggml-cuda.cu b/ggml-cuda.cu index 713a6a89..99925259 100644 --- a/ggml-cuda.cu +++ b/ggml-cuda.cu @@ -10932,7 +10932,7 @@ inline void ggml_cuda_flash_attn_ext(const ggml_tensor * Q, const ggml_tensor * const int nwarps_max = 8; // TODO: we don't want to launch too much warps. how much is too much? // TODO: produces wrong results for nwarps > 8 (RTX 2060) - not sure why - const int nwarps = Q->ne[1] <= nqpb ? std::max(2, std::min((int) K->ne[1]/ncpw, nwarps_max)) : 1; + const int nwarps = Q->ne[1] <= nqpb ? std::max(2, std::min((int) K->ne[1]/ncpw, nwarps_max)) : 4; dim3 blocks_num((Q->ne[1] + nqpb - 1) / nqpb, Q->ne[2], Q->ne[3]); dim3 block_dim(32, nwarps, 1);

LLAMA_CUBLAS=1 make -j && ./perplexity -m ./models/llama-7b-v2/ggml-model-q4_0.gguf -f ./build-cublas/wikitext-2-raw/wiki.test.raw -ngl 99

perplexity: calculating perplexity over 655 chunks, batch_size=512 perplexity: 0.36 seconds per pass - ETA 3.90 minutes [1]4.3401,[2]4.8393,[3]5.4619,[4]6.0629,[5]6.1965,[6]6.0945,[7]6.2706,^C

cuda: add flash attention + test

f7bcfb0

JohannesGaessler reviewed Jan 18, 2024

View reviewed changes

FSSRepo and others added 2 commits January 18, 2024 11:27

fix compilation

e53de28

ggml : add ggml_flash_attn_ext API

a1c004e

calvintwr mentioned this pull request Jan 19, 2024

ggml : add Flash Attention ggml-org/llama.cpp#5021

Merged

8 tasks

ggerganov and others added 17 commits January 19, 2024 20:06

ggml : fix GQA support in ggml_flash_attn_ext

fa7ebcc

Merge branch 'gg/flash-attn' of https://github.com/ggerganov/llama.cpp …

09db1a7

…into flash-attn-cuda

apply suggestions

fded2e6

Merge branch 'master' into gg/flash-attn

c3cdfff

ggml : online attention (CPU)

a9681fe

metal : initial implementation

1173f49

metal : f16 precision

528da75

metal : reduce branches

52ae085

metal : specialize for head size

b973258

wip : 8 rows per simd group

8cde449

wip : 4 rows per simd group

f31955f

wip : template for rows per warp

a4b6341

metal : parallelize across KV size

77d08f3

metal : parallel reduce across heads

17720fa

Merge branch 'gg/flash-attn' of https://github.com/ggerganov/llama.cpp …

a689b02

…into flash-attn-cuda

cuda: port metal version flash_attn_ext

6374bc5

fix equivalent fp16 math functions, compiler error 'undefined'

6416821

JohannesGaessler reviewed Jan 24, 2024

View reviewed changes

ggml-cuda.cu Outdated Show resolved Hide resolved

ggml-cuda.cu Outdated Show resolved Hide resolved

ggml-cuda.cu Outdated Show resolved Hide resolved

ggml-cuda.cu Outdated Show resolved Hide resolved

ggerganov added 3 commits February 1, 2024 10:44

cuda : fix flash_attn kernel to produce same results as CPU

71b69aa

cuda : avoid extra QxQ matrix in shared memory

2c04bee

cuda : switch to F16 scalars + tune warps for RTX 2060

9a5c2a1

cuda : increase C to 128 for better performance

ac26f27

FSSRepo and others added 3 commits February 1, 2024 10:35

Merge pull request #3 from ggerganov/flash-attn-cuda

43f7156

cuda : fix flash_attn kernel to produce same results as CPU

fix mask nullptr

9240a84

don't require LLAMA_CUDA_F16 to compile

8d7a606

ggerganov reviewed Feb 1, 2024

View reviewed changes

ggml-cuda.cu Outdated Show resolved Hide resolved

FSSRepo added 2 commits February 1, 2024 13:02

#ifdef -> #if + fix check -inf

19e0b8e

cmake: remove unused changes

cae985c

refactor flash_attn function + improve tests

53621e3

ggerganov reviewed Feb 2, 2024

View reviewed changes

ggml-cuda.cu Outdated Show resolved Hide resolved

unroll 2 loops, int64_t -> int, 309 µs

674d5ac

JohannesGaessler reviewed Feb 3, 2024

View reviewed changes

Merge pull request #4 from Pints-App/jg/flash-attn-cuda

8b51ab4

unroll 2 loops, int64_t -> int, 309 µs

FSSRepo added 3 commits February 3, 2024 11:35

bring optimizations from gg/flash-attn

a1f9ffe

Merge branch 'flash-attn-cuda' of https://github.com/Pints-App/llama.cpp

ba7699d

into flash-attn-cuda

fix merge conflicts

f659f57

JohannesGaessler reviewed Feb 11, 2024

View reviewed changes

FSSRepo closed this Mar 6, 2024

WIP: Flash Attention implementation (forward + backward) #1

WIP: Flash Attention implementation (forward + backward) #1

Uh oh!

Conversation

FSSRepo commented Jan 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Run test:

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

JohannesGaessler commented Jan 18, 2024

Uh oh!

JohannesGaessler commented Jan 18, 2024

Uh oh!

JohannesGaessler commented Jan 18, 2024

Uh oh!

JohannesGaessler commented Jan 18, 2024

Uh oh!

JohannesGaessler commented Jan 18, 2024

Uh oh!

FSSRepo commented Jan 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JohannesGaessler commented Jan 24, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sorasoras commented Feb 1, 2024

Uh oh!

FSSRepo commented Feb 1, 2024

Uh oh!

Uh oh!

FSSRepo commented Feb 1, 2024

Uh oh!

ggerganov commented Feb 1, 2024

Uh oh!

Uh oh!

JohannesGaessler commented Feb 3, 2024

Uh oh!

JohannesGaessler commented Feb 3, 2024

Uh oh!

JohannesGaessler left a comment

Choose a reason for hiding this comment

Uh oh!

JohannesGaessler Feb 3, 2024

Choose a reason for hiding this comment

Uh oh!

FSSRepo commented Feb 3, 2024

Uh oh!

JohannesGaessler commented Feb 3, 2024

Uh oh!

FSSRepo commented Feb 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JohannesGaessler commented Feb 11, 2024

Uh oh!

JohannesGaessler Feb 11, 2024

Choose a reason for hiding this comment

Uh oh!

ggerganov Feb 11, 2024

Choose a reason for hiding this comment

Uh oh!

JohannesGaessler Feb 11, 2024

Choose a reason for hiding this comment

Uh oh!

ggerganov Feb 11, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

FSSRepo commented Jan 17, 2024 •

edited

Loading

FSSRepo commented Jan 24, 2024 •

edited

Loading

FSSRepo commented Feb 4, 2024 •

edited

Loading