Conversation

ggerganov (Member) commented Nov 10, 2023

ref #1472

Move the convolution to the GPU as well. The encoder is much faster now.

| GPU | OS | Config | Model | Th | Enc. | Dec. | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | tiny | 1 | 8.85 | 1.86 | 4.31 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | tiny-q5_0 | 1 | 8.54 | 1.37 | 4.19 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | tiny-q5_1 | 1 | 8.46 | 1.33 | 4.22 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | base | 1 | 14.90 | 2.55 | 5.87 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | base-q5_0 | 1 | 15.56 | 1.82 | 6.37 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | base-q5_1 | 1 | 15.16 | 1.78 | 5.94 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | small | 1 | 40.54 | 4.77 | 12.61 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | small-q5_0 | 1 | 41.37 | 3.32 | 13.87 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | small-q5_1 | 1 | 41.32 | 3.34 | 13.31 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | medium | 1 | 105.45 | 10.40 | 28.88 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | medium-q5_0 | 1 | 107.67 | 6.46 | 30.69 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | medium-q5_1 | 1 | 108.00 | 6.89 | 30.81 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | large | 1 | 172.67 | 16.00 | 45.24 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | large-q5_0 | 1 | 177.31 | 8.93 | 49.94 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | large-q5_1 | 1 | 177.64 | 8.81 | 49.76 | 9c1ddc7 |

| CPU | OS | Config | Model | Th | Enc. | Dec. | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | MacOS 14.1 | COREML METAL | tiny | 4 | 7.74 | 1.38 | 3.40 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | tiny-q5_0 | 4 | 6.61 | 1.37 | 3.19 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | tiny-q5_1 | 4 | 7.32 | 1.39 | 3.03 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | base | 4 | 12.51 | 2.00 | 4.61 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | base-q5_0 | 4 | 11.82 | 1.91 | 4.73 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | base-q5_1 | 4 | 11.62 | 1.94 | 4.79 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | small | 4 | 32.00 | 3.92 | 12.12 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | small-q5_0 | 4 | 33.15 | 3.89 | 13.73 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | small-q5_1 | 4 | 33.28 | 3.91 | 13.64 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | medium | 4 | 93.84 | 8.26 | 30.16 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | medium-q5_0 | 4 | 96.74 | 7.99 | 33.90 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | medium-q5_1 | 4 | 96.46 | 8.12 | 33.67 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | large | 4 | 179.61 | 11.72 | 53.73 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | large-q5_0 | 4 | 185.15 | 11.77 | 62.17 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | large-q5_1 | 4 | 185.08 | 11.69 | 61.98 | 997f7cb |

| CPU | OS | Config | Model | Th | Enc. | Dec. | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | MacOS 14.1 | METAL | tiny | 4 | 12.47 | 1.37 | 3.08 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | tiny-q5_0 | 4 | 12.16 | 1.34 | 2.91 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | tiny-q5_1 | 4 | 12.46 | 1.37 | 2.93 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | tiny-q8_0 | 4 | 10.84 | 1.32 | 2.81 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | base | 4 | 17.90 | 1.93 | 4.53 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | base-q5_0 | 4 | 19.77 | 1.93 | 4.71 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | base-q5_1 | 4 | 19.73 | 1.91 | 4.69 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | base-q8_0 | 4 | 18.83 | 1.89 | 4.63 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | small | 4 | 50.79 | 3.97 | 12.13 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | small-q4_0 | 4 | 53.50 | 3.69 | 12.88 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | small-q4_1 | 4 | 53.41 | 3.66 | 12.88 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | small-q5_0 | 4 | 57.16 | 3.95 | 13.70 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | small-q5_1 | 4 | 56.82 | 3.97 | 13.62 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | small-q8_0 | 4 | 53.14 | 3.73 | 12.97 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | medium | 4 | 138.55 | 8.28 | 30.04 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | medium-q4_0 | 4 | 147.26 | 7.26 | 31.62 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | medium-q4_1 | 4 | 147.48 | 7.52 | 31.76 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | medium-q5_0 | 4 | 159.11 | 8.02 | 33.83 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | medium-q5_1 | 4 | 158.79 | 8.14 | 33.66 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | medium-q8_0 | 4 | 146.50 | 7.82 | 32.16 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | large | 4 | 247.72 | 11.71 | 53.67 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | large-q4_0 | 4 | 263.48 | 10.62 | 57.08 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | large-q4_1 | 4 | 262.32 | 10.56 | 57.09 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | large-q5_0 | 4 | 285.42 | 11.84 | 62.21 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | large-q5_1 | 4 | 284.08 | 11.65 | 62.00 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | large-q8_0 | 4 | 262.82 | 11.29 | 57.51 | 997f7cb |
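
For context on how the conv offload works: ggml_conv is lowered to an im2col step followed by a regular matrix multiplication (the approach referenced from ggml-org/ggml#564). Below is a rough CPU reference of that decomposition; the function name, argument order, and data layout are illustrative only, not the actual ggml implementation.

```c
// Rough CPU reference of lowering a 1D convolution to im2col + matmul.
// Names and layouts are illustrative only (hypothetical helper, not ggml code).
//   src: [c_in, n_in]              input channels x input length
//   w:   [c_out, c_in * kernel_w]  flattened conv weights
//   tmp: [n_out, c_in * kernel_w]  im2col scratch buffer
//   dst: [c_out, n_out]            conv output
void conv1d_im2col_ref(const float *src, const float *w, float *tmp, float *dst,
                       int c_in, int c_out, int n_in, int n_out,
                       int kernel_w, int stride, int pad) {
    const int row = c_in * kernel_w;

    // im2col: gather the receptive field of each output position into one row
    for (int o = 0; o < n_out; ++o) {
        for (int c = 0; c < c_in; ++c) {
            for (int k = 0; k < kernel_w; ++k) {
                const int i = o*stride - pad + k;
                tmp[o*row + c*kernel_w + k] = (i >= 0 && i < n_in) ? src[c*n_in + i] : 0.0f;
            }
        }
    }

    // matmul: dst[co][o] = sum_k w[co][k] * tmp[o][k],
    // which the GPU backend can run as a single GEMM
    for (int co = 0; co < c_out; ++co) {
        for (int o = 0; o < n_out; ++o) {
            float sum = 0.0f;
            for (int k = 0; k < row; ++k) {
                sum += w[co*row + k] * tmp[o*row + k];
            }
            dst[co*n_out + o] = sum;
        }
    }
}
```

With this lowering, the whole encoder graph, including the two convolutions at the start, can be scheduled on the GPU backend.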

whisper.cpp Outdated
//cur = ggml_add(ctx0, cur, model.e_conv_2_b);
cur = ggml_add(ctx0,
ggml_repeat(ctx0,
model.e_conv_2_b,
ggerganov (Member Author):

@slaren

I think I hit some weird bug here. On this branch, I offloaded everything to the GPU when using CUDA, including the convolutions, using the implementation from ggml-org/ggml#564.

Additionally, I eliminated the two ggml_repeat calls here by pre-broadcasting the e_conv_1_b and e_conv_2_b tensors upon load:

https://github.com/ggerganov/whisper.cpp/blob/000b952c2db307c499d09b9c6369ecce44034c47/whisper.cpp#L1490-L1507
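
The pre-broadcasting idea is roughly the following (a minimal sketch with a hypothetical helper, not the actual whisper.cpp loader code): instead of keeping the conv bias as a 1D tensor of n_state values and repeating it in the graph, allocate it as [n_ctx, n_state] and tile the values across the context dimension once at load time, so the graph can use a plain ggml_add.

```c
// Minimal sketch of the pre-broadcasting idea (hypothetical helper, not the
// actual whisper.cpp loader): the model file stores n_state bias values, but
// the tensor is allocated as [n_ctx, n_state] and filled once at load time,
// so the graph can add it directly without ggml_repeat.
static void broadcast_conv_bias(float *dst, const float *bias, int n_ctx, int n_state) {
    for (int s = 0; s < n_state; ++s) {
        for (int t = 0; t < n_ctx; ++t) {
            dst[s*n_ctx + t] = bias[s]; // same bias value for every position
        }
    }
}
```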

Everything works on the CPU and the GPU with the implementation that is currently on the branch. However, when I apply the following diff to remove the ggml_repeat calls, it breaks with CUDA:

diff --git a/whisper.cpp b/whisper.cpp
index 1371a6c..80ca5c9 100644
--- a/whisper.cpp
+++ b/whisper.cpp
@@ -1604,22 +1604,22 @@ static struct ggml_cgraph * whisper_build_graph_conv(
         // convolution + gelu
         {
             cur = ggml_conv_1d_ph(ctx0, model.e_conv_1_w, mel, 1, 1);
-            //cur = ggml_add(ctx0, cur, model.e_conv_1_b);
-            cur = ggml_add(ctx0,
-                    ggml_repeat(ctx0,
-                        model.e_conv_1_b,
-                        cur),
-                    cur);
+            cur = ggml_add(ctx0, cur, model.e_conv_1_b);
+            //cur = ggml_add(ctx0,
+            //        ggml_repeat(ctx0,
+            //            model.e_conv_1_b,
+            //            cur),
+            //        cur);
 
             cur = ggml_gelu(ctx0, cur);
 
             cur = ggml_conv_1d_ph(ctx0, model.e_conv_2_w, cur, 2, 1);
-            //cur = ggml_add(ctx0, cur, model.e_conv_2_b);
-            cur = ggml_add(ctx0,
-                    ggml_repeat(ctx0,
-                        model.e_conv_2_b,
-                        cur),
-                    cur);
+            cur = ggml_add(ctx0, cur, model.e_conv_2_b);
+            //cur = ggml_add(ctx0,
+            //        ggml_repeat(ctx0,
+            //            model.e_conv_2_b,
+            //            cur),
+            //        cur);
 
             cur = ggml_gelu(ctx0, cur);
         }

Without GPU offloading (-ng):

WHISPER_CUBLAS=1 make -j && ./main -m models/ggml-base.en.bin -f samples/gb0.wav -ng
I whisper.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS:  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:       cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:      g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

make: Nothing to be done for 'default'.
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1660, compute capability 7.5
whisper_model_load:      CPU buffer size =   149.41 MB
whisper_model_load: model size    =  149.32 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB
whisper_init_state: compute buffer (conv)   =   18.50 MB
whisper_init_state: compute buffer (encode) =   81.95 MB
whisper_init_state: compute buffer (cross)  =    4.49 MB
whisper_init_state: compute buffer (decode) =   24.70 MB

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/gb0.wav' (2037760 samples, 127.4 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:03.240]   Good morning. This Tuesday is Election Day.
[00:00:03.240 --> 00:00:06.000]   After months of spirited debate and vigorous campaigning,
[00:00:06.000 --> 00:00:08.640]   the time has come for Americans to make important decisions
[00:00:08.640 --> 00:00:10.120]   about our nation's future.
[00:00:10.120 --> 00:00:13.760]   I encourage all Americans to go to the polls and vote.
[00:00:13.760 --> 00:00:16.120]   Election season brings out the spirit of competition
[00:00:16.120 --> 00:00:18.080]   between our political parties.
[00:00:18.080 --> 00:00:20.260]   And that competition is an essential part
[00:00:20.260 --> 00:00:21.760]   of a healthy democracy.
[00:00:21.760 --> 00:00:23.520]   But as the campaigns come to a close,
[00:00:23.520 --> 00:00:26.000]   Republicans, Democrats, and independents
[00:00:26.000 --> 00:00:29.120]   can find common ground on at least one point.
[00:00:29.120 --> 00:00:31.560]   Our system of representative democracy
[00:00:31.560 --> 00:00:34.440]   is one of America's greatest strengths.
[00:00:34.440 --> 00:00:36.240]   The United States was founded on the belief
[00:00:36.240 --> 00:00:38.240]   that all men are created equal.
[00:00:38.240 --> 00:00:41.440]   Every election day, millions of Americans of all races,
[00:00:41.440 --> 00:00:43.440]   religions, and backgrounds step into voting
[00:00:43.440 --> 00:00:45.280]   booths throughout the nation.
[00:00:45.280 --> 00:00:47.780]   Whether they are richer, poor, old, or young,
[00:00:47.780 --> 00:00:50.680]   each of them has an equal share in choosing the path
[00:00:50.680 --> 00:00:52.440]   that our country will take.
[00:00:52.440 --> 00:00:54.920]   And every ballot they cast is a reminder
[00:00:54.920 --> 00:00:58.280]   that our founding principles are alive and well.
[00:00:58.280 --> 00:00:59.760]   Voting is one of the great privileges
[00:00:59.760 --> 00:01:01.760]   of American citizenship.
[00:01:01.760 --> 00:01:04.520]   And it has always required brave defenders.
[00:01:04.520 --> 00:01:06.040]   As you head to the polls next week,
[00:01:06.040 --> 00:01:09.280]   remember the sacrifices that have been made by generations
[00:01:09.280 --> 00:01:13.000]   of Americans in uniform to preserve our way of life.
[00:01:13.000 --> 00:01:15.480]   From Bunker Hill to Baghdad, the men and women
[00:01:15.480 --> 00:01:18.160]   of American armed forces have been devoted guardians
[00:01:18.160 --> 00:01:19.960]   of our democracy.
[00:01:19.960 --> 00:01:21.800]   All of us owe them and their families
[00:01:21.800 --> 00:01:25.240]   a special debt of gratitude on Election Day.
[00:01:25.240 --> 00:01:27.560]   Americans should also remember the important example
[00:01:27.560 --> 00:01:30.080]   that our election set throughout the world.
[00:01:30.080 --> 00:01:32.080]   Young democracies from Georgia and Ukraine
[00:01:32.080 --> 00:01:34.560]   to Afghanistan and Iraq can look to the United States
[00:01:34.560 --> 00:01:37.560]   for proof that self-government can endure.
[00:01:37.560 --> 00:01:40.400]   And nations that still live under tyranny and oppression
[00:01:40.400 --> 00:01:44.080]   can find hope and inspiration in our commitment to liberty.
[00:01:44.080 --> 00:01:45.720]   For more than two centuries, Americans
[00:01:45.720 --> 00:01:47.800]   have demonstrated the ability of free people
[00:01:47.800 --> 00:01:49.600]   to choose their own leaders.
[00:01:49.600 --> 00:01:51.880]   Our nation has flourished because of its commitment
[00:01:51.880 --> 00:01:54.640]   to trusting the wisdom of our citizenry.
[00:01:54.640 --> 00:01:57.200]   In this year's election, we will see this tradition
[00:01:57.200 --> 00:02:00.280]   continue, and we will be reminded once again
[00:02:00.280 --> 00:02:02.640]   that we are blessed to live in a free nation
[00:02:02.640 --> 00:02:05.520]   guided by the will of the people.
[00:02:05.520 --> 00:02:06.960]   Thank you for listening.


whisper_print_timings:     load time =   161.71 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    77.16 ms
whisper_print_timings:   sample time =   175.36 ms /   532 runs (    0.33 ms per run)
whisper_print_timings:   encode time =  2447.21 ms /     5 runs (  489.44 ms per run)
whisper_print_timings:   decode time =  1714.88 ms /   528 runs (    3.25 ms per run)
whisper_print_timings:   prompt time =   356.03 ms /     4 runs (   89.01 ms per run)
whisper_print_timings:    total time =  4943.60 ms

With GPU offloading:

WHISPER_CUBLAS=1 make -j && ./main -m models/ggml-base.en.bin -f samples/gb0.wav
I whisper.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS:  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:       cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:      g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

make: Nothing to be done for 'default'.
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1660, compute capability 7.5
whisper_model_load: using CUDA backend
whisper_model_load:     CUDA buffer size =   149.41 MB
whisper_model_load: model size    =  149.32 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB
whisper_init_state: compute buffer (conv)   =   14.11 MB
whisper_init_state: compute buffer (encode) =   81.95 MB
whisper_init_state: compute buffer (cross)  =    4.49 MB
whisper_init_state: compute buffer (decode) =   24.70 MB

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/gb0.wav' (2037760 samples, 127.4 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:03.240]   Good morning. This Tuesday is Election Day.
[00:00:03.240 --> 00:00:06.000]   After months of spirited debate and vigorous campaigning,
[00:00:06.000 --> 00:00:08.640]   the time has come for Americans to make important decisions
[00:00:08.640 --> 00:00:10.120]   about our nation's future.
[00:00:10.120 --> 00:00:13.760]   I encourage all Americans to go to the polls and vote.
[00:00:13.760 --> 00:00:16.120]   Election season brings out the spirit of competition
[00:00:16.120 --> 00:00:18.080]   between our political parties.
[00:00:18.080 --> 00:00:20.260]   And that competition is an essential part
[00:00:20.260 --> 00:00:21.760]   of a healthy democracy.
[00:00:21.760 --> 00:00:23.520]   But as the campaigns come to a close,
[00:00:23.520 --> 00:00:26.000]   Republicans, Democrats, and independents
[00:00:26.000 --> 00:00:29.120]   can find common ground on at least one point.
[00:-16:-18.-140 --> 00:00:59.120]  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[00:-15:-48.-140 --> 00:01:29.120]  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Since the bias tensors are already broadcast upon load, the diff should not lead to any difference in the results. Also, I can remove either one of the ggml_repeat calls and it still works. It only breaks when both of them are removed.

Any ideas?

Member:

Where can I find the gb0.wav sample? With rfk.wav it seems to work.

ggerganov (Member Author):

'make samples'

slaren (Member) commented Nov 10, 2023:

I can't reproduce this reliably, but it happens sometimes. I suspect that the cause may be that some operation depends on the contents of the destination memory being cleared.

This change clears dst before executing the op if it is not in-place. Can you test whether this fixes the issue for you?

diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index 2212144..2ab2ab8 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -8259,6 +8259,18 @@ static void ggml_backend_cuda_graph_compute(ggml_backend_t backend, ggml_cgraph
             }
         }

+        bool inplace = false;
+        for (int j = 0; j < GGML_MAX_SRC; j++) {
+            if (node->src[j] != nullptr && node->src[j]->data >= node->data && node->src[j]->data < (char *)node->data + ggml_nbytes(node)) {
+                inplace = true;
+                break;
+            }
+        }
+        if (!inplace) {
+            CUDA_CHECK(cudaMemsetAsync(node->data, 0x00, ggml_nbytes(node), g_cudaStreams[g_main_device][0])); // ok
+            //CUDA_CHECK(cudaMemsetAsync(node->data, 0xFA, ggml_nbytes(node), g_cudaStreams[g_main_device][0])); // fail
+        }
+
         bool ok = ggml_cuda_compute_forward(&params, node);
         if (!ok) {
             fprintf(stderr, "%s: error: op not supported %s (%s)\n", __func__, node->name, ggml_op_name(node->op));

ggerganov (Member Author):

Yup, it fixes the issue. I will look into which operation could be causing this. It might be something related to the new ggml_conv implementation.

ggerganov (Member Author):

Thanks for the help. I think I found a fix for the im2col kernel. Will be doing some more tests.
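
For illustration, here is a minimal CUDA sketch of the kind of im2col bug that matches the symptoms above, assuming the problem was uninitialized padded elements (which would be consistent with the cudaMemsetAsync experiment); this is not the actual ggml-cuda kernel or fix.

```cuda
// Minimal CUDA sketch of the failure mode discussed above (illustrative,
// not the actual ggml-cuda kernel): an im2col kernel has to write *every*
// destination element, including taps that fall into the input padding.
// If it skips those, the following matmul reads stale memory, which is why
// clearing dst with cudaMemsetAsync masked the problem.
__global__ void im2col_1d_f32(const float *src, float *dst,
                              int n_in, int n_out, int kernel_w,
                              int stride, int pad) {
    const int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx >= n_out*kernel_w) {
        return;
    }

    const int out_pos = idx / kernel_w;            // output position
    const int k       = idx % kernel_w;            // kernel tap
    const int in_pos  = out_pos*stride - pad + k;  // corresponding input index

    // write 0.0f for padded positions instead of leaving dst untouched
    dst[idx] = (in_pos >= 0 && in_pos < n_in) ? src[in_pos] : 0.0f;
}
```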

FSSRepo commented Nov 10, 2023:

It looks pretty good. Should I apply these changes in the ggml PR, or do you want to do it? Honestly, I hadn't thought of that way of avoiding the cudaMemset. I hope that backend v2 helps me fix the issue I'm having with stable diffusion when loading everything onto the GPU: all operations seem correct, but I'm getting NaN in the output for some reason.

ggerganov (Member Author):

Hi, I will apply the changes soon. I'm currently implementing im2col for Metal and will use this PR to test that it works.

Sorry to hear about the NaN issues; they're quite difficult to debug. I don't think v2 would be of much help, but we'll see. Does applying the fix from this PR help?

There is also this fix, which might or might not be relevant to SD:

ggml-org/ggml@439a79f#diff-d31fbcb763417dd283c99fff7473e7ac9cde20bd7f9b3d04bbedb16346f4a2d9R6517-R6519

FSSRepo commented Nov 10, 2023:

I am now 100% sure that the operations themselves (the kernels) are not causing the issue with the CUDA backend: when using the CPU backend but performing the complete computation with CUDA (fallback), the results are correct. The only differing factor is the memory handling.

Member:

That doesn't really prove that the kernels are fine; they may depend on pre-conditions that are only true in some specific cases. That is what was happening here with the im2col kernel. In any case, I will add tests and debugging tools in the next ggml-backend update that will make diagnosing these issues easier.

@ggerganov ggerganov marked this pull request as ready for review November 10, 2023 20:23
@ggerganov ggerganov changed the title from "whisper : support ggml_conv with CUDA" to "whisper : support ggml_conv with CUDA and Metal" Nov 10, 2023
@ggerganov ggerganov merged commit 933c5be into ggml-backend-no-sched Nov 10, 2023
ggerganov added a commit that referenced this pull request Nov 12, 2023
* whisper : migrate to ggml-backend

* whisper : fix logit reading

* whisper : fix tensor allocation during load

* whisper : fix beam-search with CUDA

* whisper : free backends + fix compile warning

* whisper : print when CUDA is enabled

* whisper : fix CoreML

* make : clean-up

* talk : fix compile warning

* whisper : support ggml_conv with CUDA and Metal (#1473)

* ggml : add CUDA support for ggml_conv

* whisper : remove ggml_repeat for conv bias + single backend

* cuda : fix im2col kernel

* metal : add im2col support + mul mat-vec f16 x f16

* bench-all : add q4 models

* whisper : clean-up

* quantize-all : fix

* ggml : im2col opts

* whisper : avoid whisper_model_data wrapper

* whisper : add note that ggml_mul_mat_pad does not work with CUDA

* whisper : factor out graph compute in common function

* whisper : fixes

* whisper : fix UB with measure buffers

* whisper : try to fix the parallel whisper_state functionality (#1479)

* whisper : try to fix the parallel whisper_state functionality

* whisper : fix multi-state Metal

* whisper : free backend instances in whisper_state
felrock pushed a commit to felrock/whisper.cpp that referenced this pull request Nov 18, 2023
landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023
iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request Sep 23, 2024