Conversation

ggerganov (Member) commented Nov 10, 2023

ref #1472

Move the convolution to the GPU as well. The encoder is much faster now.

| GPU | OS | Config | Model | Th | Enc. | Dec. | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | tiny | 1 | 8.85 | 1.86 | 4.31 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | tiny-q5_0 | 1 | 8.54 | 1.37 | 4.19 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | tiny-q5_1 | 1 | 8.46 | 1.33 | 4.22 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | base | 1 | 14.90 | 2.55 | 5.87 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | base-q5_0 | 1 | 15.56 | 1.82 | 6.37 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | base-q5_1 | 1 | 15.16 | 1.78 | 5.94 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | small | 1 | 40.54 | 4.77 | 12.61 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | small-q5_0 | 1 | 41.37 | 3.32 | 13.87 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | small-q5_1 | 1 | 41.32 | 3.34 | 13.31 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | medium | 1 | 105.45 | 10.40 | 28.88 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | medium-q5_0 | 1 | 107.67 | 6.46 | 30.69 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | medium-q5_1 | 1 | 108.00 | 6.89 | 30.81 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | large | 1 | 172.67 | 16.00 | 45.24 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | large-q5_0 | 1 | 177.31 | 8.93 | 49.94 | 9c1ddc7 |
| NVIDIA V100 | Ubuntu | AVX2 BLAS CUDA | large-q5_1 | 1 | 177.64 | 8.81 | 49.76 | 9c1ddc7 |

| CPU | OS | Config | Model | Th | Enc. | Dec. | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | MacOS 14.1 | COREML METAL | tiny | 4 | 7.74 | 1.38 | 3.40 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | tiny-q5_0 | 4 | 6.61 | 1.37 | 3.19 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | tiny-q5_1 | 4 | 7.32 | 1.39 | 3.03 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | base | 4 | 12.51 | 2.00 | 4.61 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | base-q5_0 | 4 | 11.82 | 1.91 | 4.73 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | base-q5_1 | 4 | 11.62 | 1.94 | 4.79 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | small | 4 | 32.00 | 3.92 | 12.12 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | small-q5_0 | 4 | 33.15 | 3.89 | 13.73 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | small-q5_1 | 4 | 33.28 | 3.91 | 13.64 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | medium | 4 | 93.84 | 8.26 | 30.16 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | medium-q5_0 | 4 | 96.74 | 7.99 | 33.90 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | medium-q5_1 | 4 | 96.46 | 8.12 | 33.67 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | large | 4 | 179.61 | 11.72 | 53.73 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | large-q5_0 | 4 | 185.15 | 11.77 | 62.17 | 997f7cb |
| M2 Ultra | MacOS 14.1 | COREML METAL | large-q5_1 | 4 | 185.08 | 11.69 | 61.98 | 997f7cb |

| CPU | OS | Config | Model | Th | Enc. | Dec. | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | MacOS 14.1 | METAL | tiny | 4 | 12.47 | 1.37 | 3.08 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | tiny-q5_0 | 4 | 12.16 | 1.34 | 2.91 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | tiny-q5_1 | 4 | 12.46 | 1.37 | 2.93 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | tiny-q8_0 | 4 | 10.84 | 1.32 | 2.81 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | base | 4 | 17.90 | 1.93 | 4.53 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | base-q5_0 | 4 | 19.77 | 1.93 | 4.71 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | base-q5_1 | 4 | 19.73 | 1.91 | 4.69 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | base-q8_0 | 4 | 18.83 | 1.89 | 4.63 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | small | 4 | 50.79 | 3.97 | 12.13 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | small-q4_0 | 4 | 53.50 | 3.69 | 12.88 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | small-q4_1 | 4 | 53.41 | 3.66 | 12.88 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | small-q5_0 | 4 | 57.16 | 3.95 | 13.70 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | small-q5_1 | 4 | 56.82 | 3.97 | 13.62 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | small-q8_0 | 4 | 53.14 | 3.73 | 12.97 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | medium | 4 | 138.55 | 8.28 | 30.04 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | medium-q4_0 | 4 | 147.26 | 7.26 | 31.62 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | medium-q4_1 | 4 | 147.48 | 7.52 | 31.76 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | medium-q5_0 | 4 | 159.11 | 8.02 | 33.83 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | medium-q5_1 | 4 | 158.79 | 8.14 | 33.66 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | medium-q8_0 | 4 | 146.50 | 7.82 | 32.16 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | large | 4 | 247.72 | 11.71 | 53.67 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | large-q4_0 | 4 | 263.48 | 10.62 | 57.08 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | large-q4_1 | 4 | 262.32 | 10.56 | 57.09 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | large-q5_0 | 4 | 285.42 | 11.84 | 62.21 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | large-q5_1 | 4 | 284.08 | 11.65 | 62.00 | 997f7cb |
| M2 Ultra | MacOS 14.1 | METAL | large-q8_0 | 4 | 262.82 | 11.29 | 57.51 | 997f7cb |
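
For context on how the conv offload works: ggml_conv is lowered to an im2col step followed by a regular matrix multiplication (the approach referenced from ggml-org/ggml#564). Below is a rough CPU reference of that decomposition; the function name, argument order, and data layout are illustrative only, not the actual ggml implementation.

```c
// Rough CPU reference of lowering a 1D convolution to im2col + matmul.
// Names and layouts are illustrative only (hypothetical helper, not ggml code).
//   src: [c_in, n_in]              input channels x input length
//   w:   [c_out, c_in * kernel_w]  flattened conv weights
//   tmp: [n_out, c_in * kernel_w]  im2col scratch buffer
//   dst: [c_out, n_out]            conv output
void conv1d_im2col_ref(const float *src, const float *w, float *tmp, float *dst,
                       int c_in, int c_out, int n_in, int n_out,
                       int kernel_w, int stride, int pad) {
    const int row = c_in * kernel_w;

    // im2col: gather the receptive field of each output position into one row
    for (int o = 0; o < n_out; ++o) {
        for (int c = 0; c < c_in; ++c) {
            for (int k = 0; k < kernel_w; ++k) {
                const int i = o*stride - pad + k;
                tmp[o*row + c*kernel_w + k] = (i >= 0 && i < n_in) ? src[c*n_in + i] : 0.0f;
            }
        }
    }

    // matmul: dst[co][o] = sum_k w[co][k] * tmp[o][k],
    // which the GPU backend can run as a single GEMM
    for (int co = 0; co < c_out; ++co) {
        for (int o = 0; o < n_out; ++o) {
            float sum = 0.0f;
            for (int k = 0; k < row; ++k) {
                sum += w[co*row + k] * tmp[o*row + k];
            }
            dst[co*n_out + o] = sum;
        }
    }
}
```

With this lowering, the whole encoder graph, including the two convolutions at the start, can be scheduled on the GPU backend.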

whisper.cpp Outdated
//cur = ggml_add(ctx0, cur, model.e_conv_2_b);
cur = ggml_add(ctx0,
ggml_repeat(ctx0,
model.e_conv_2_b,
ggerganov (Member Author):

@slaren

I think I hit some weird bug here. On this branch, I offloaded everything to the GPU when using CUDA, including the convolutions, using the implementation from ggml-org/ggml#564.

Additionally, I eliminated the two ggml_repeat calls here by pre-broadcasting the e_conv_1_b and e_conv_2_b tensors upon load:

https://github.com/ggerganov/whisper.cpp/blob/000b952c2db307c499d09b9c6369ecce44034c47/whisper.cpp#L1490-L1507
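
The pre-broadcasting idea is roughly the following (a minimal sketch with a hypothetical helper, not the actual whisper.cpp loader code): instead of keeping the conv bias as a 1D tensor of n_state values and repeating it in the graph, allocate it as [n_ctx, n_state] and tile the values across the context dimension once at load time, so the graph can use a plain ggml_add.

```c
// Minimal sketch of the pre-broadcasting idea (hypothetical helper, not the
// actual whisper.cpp loader): the model file stores n_state bias values, but
// the tensor is allocated as [n_ctx, n_state] and filled once at load time,
// so the graph can add it directly without ggml_repeat.
static void broadcast_conv_bias(float *dst, const float *bias, int n_ctx, int n_state) {
    for (int s = 0; s < n_state; ++s) {
        for (int t = 0; t < n_ctx; ++t) {
            dst[s*n_ctx + t] = bias[s]; // same bias value for every position
        }
    }
}
```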

Everything works on the CPU and the GPU with the implementation that is currently on the branch. However, when I apply the following diff to remove the ggml_repeat calls, it breaks with CUDA:

diff --git a/whisper.cpp b/whisper.cpp
index 1371a6c..80ca5c9 100644
--- a/whisper.cpp
+++ b/whisper.cpp
@@ -1604,22 +1604,22 @@ static struct ggml_cgraph * whisper_build_graph_conv(
         // convolution + gelu
         {
             cur = ggml_conv_1d_ph(ctx0, model.e_conv_1_w, mel, 1, 1);
-            //cur = ggml_add(ctx0, cur, model.e_conv_1_b);
-            cur = ggml_add(ctx0,
-                    ggml_repeat(ctx0,
-                        model.e_conv_1_b,
-                        cur),
-                    cur);
+            cur = ggml_add(ctx0, cur, model.e_conv_1_b);
+            //cur = ggml_add(ctx0,
+            //        ggml_repeat(ctx0,
+            //            model.e_conv_1_b,
+            //            cur),
+            //        cur);
 
             cur = ggml_gelu(ctx0, cur);
 
             cur = ggml_conv_1d_ph(ctx0, model.e_conv_2_w, cur, 2, 1);
-            //cur = ggml_add(ctx0, cur, model.e_conv_2_b);
-            cur = ggml_add(ctx0,
-                    ggml_repeat(ctx0,
-                        model.e_conv_2_b,
-                        cur),
-                    cur);
+            cur = ggml_add(ctx0, cur, model.e_conv_2_b);
+            //cur = ggml_add(ctx0,
+            //        ggml_repeat(ctx0,
+            //            model.e_conv_2_b,
+            //            cur),
+            //        cur);
 
             cur = ggml_gelu(ctx0, cur);
         }

Without GPU offloading (-ng):

WHISPER_CUBLAS=1 make -j && ./main -m models/ggml-base.en.bin -f samples/gb0.wav -ng
I whisper.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS:  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:       cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:      g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

make: Nothing to be done for 'default'.
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1660, compute capability 7.5
whisper_model_load:      CPU buffer size =   149.41 MB
whisper_model_load: model size    =  149.32 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB
whisper_init_state: compute buffer (conv)   =   18.50 MB
whisper_init_state: compute buffer (encode) =   81.95 MB
whisper_init_state: compute buffer (cross)  =    4.49 MB
whisper_init_state: compute buffer (decode) =   24.70 MB

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/gb0.wav' (2037760 samples, 127.4 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:03.240]   Good morning. This Tuesday is Election Day.
[00:00:03.240 --> 00:00:06.000]   After months of spirited debate and vigorous campaigning,
[00:00:06.000 --> 00:00:08.640]   the time has come for Americans to make important decisions
[00:00:08.640 --> 00:00:10.120]   about our nation's future.
[00:00:10.120 --> 00:00:13.760]   I encourage all Americans to go to the polls and vote.
[00:00:13.760 --> 00:00:16.120]   Election season brings out the spirit of competition
[00:00:16.120 --> 00:00:18.080]   between our political parties.
[00:00:18.080 --> 00:00:20.260]   And that competition is an essential part
[00:00:20.260 --> 00:00:21.760]   of a healthy democracy.
[00:00:21.760 --> 00:00:23.520]   But as the campaigns come to a close,
[00:00:23.520 --> 00:00:26.000]   Republicans, Democrats, and independents
[00:00:26.000 --> 00:00:29.120]   can find common ground on at least one point.
[00:00:29.120 --> 00:00:31.560]   Our system of representative democracy
[00:00:31.560 --> 00:00:34.440]   is one of America's greatest strengths.
[00:00:34.440 --> 00:00:36.240]   The United States was founded on the belief
[00:00:36.240 --> 00:00:38.240]   that all men are created equal.
[00:00:38.240 --> 00:00:41.440]   Every election day, millions of Americans of all races,
[00:00:41.440 --> 00:00:43.440]   religions, and backgrounds step into voting
[00:00:43.440 --> 00:00:45.280]   booths throughout the nation.
[00:00:45.280 --> 00:00:47.780]   Whether they are richer, poor, old, or young,
[00:00:47.780 --> 00:00:50.680]   each of them has an equal share in choosing the path
[00:00:50.680 --> 00:00:52.440]   that our country will take.
[00:00:52.440 --> 00:00:54.920]   And every ballot they cast is a reminder
[00:00:54.920 --> 00:00:58.280]   that our founding principles are alive and well.
[00:00:58.280 --> 00:00:59.760]   Voting is one of the great privileges
[00:00:59.760 --> 00:01:01.760]   of American citizenship.
[00:01:01.760 --> 00:01:04.520]   And it has always required brave defenders.
[00:01:04.520 --> 00:01:06.040]   As you head to the polls next week,
[00:01:06.040 --> 00:01:09.280]   remember the sacrifices that have been made by generations
[00:01:09.280 --> 00:01:13.000]   of Americans in uniform to preserve our way of life.
[00:01:13.000 --> 00:01:15.480]   From Bunker Hill to Baghdad, the men and women
[00:01:15.480 --> 00:01:18.160]   of American armed forces have been devoted guardians
[00:01:18.160 --> 00:01:19.960]   of our democracy.
[00:01:19.960 --> 00:01:21.800]   All of us owe them and their families
[00:01:21.800 --> 00:01:25.240]   a special debt of gratitude on Election Day.
[00:01:25.240 --> 00:01:27.560]   Americans should also remember the important example
[00:01:27.560 --> 00:01:30.080]   that our election set throughout the world.
[00:01:30.080 --> 00:01:32.080]   Young democracies from Georgia and Ukraine
[00:01:32.080 --> 00:01:34.560]   to Afghanistan and Iraq can look to the United States
[00:01:34.560 --> 00:01:37.560]   for proof that self-government can endure.
[00:01:37.560 --> 00:01:40.400]   And nations that still live under tyranny and oppression
[00:01:40.400 --> 00:01:44.080]   can find hope and inspiration in our commitment to liberty.
[00:01:44.080 --> 00:01:45.720]   For more than two centuries, Americans
[00:01:45.720 --> 00:01:47.800]   have demonstrated the ability of free people
[00:01:47.800 --> 00:01:49.600]   to choose their own leaders.
[00:01:49.600 --> 00:01:51.880]   Our nation has flourished because of its commitment
[00:01:51.880 --> 00:01:54.640]   to trusting the wisdom of our citizenry.
[00:01:54.640 --> 00:01:57.200]   In this year's election, we will see this tradition
[00:01:57.200 --> 00:02:00.280]   continue, and we will be reminded once again
[00:02:00.280 --> 00:02:02.640]   that we are blessed to live in a free nation
[00:02:02.640 --> 00:02:05.520]   guided by the will of the people.
[00:02:05.520 --> 00:02:06.960]   Thank you for listening.


whisper_print_timings:     load time =   161.71 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    77.16 ms
whisper_print_timings:   sample time =   175.36 ms /   532 runs (    0.33 ms per run)
whisper_print_timings:   encode time =  2447.21 ms /     5 runs (  489.44 ms per run)
whisper_print_timings:   decode time =  1714.88 ms /   528 runs (    3.25 ms per run)
whisper_print_timings:   prompt time =   356.03 ms /     4 runs (   89.01 ms per run)
whisper_print_timings:    total time =  4943.60 ms

With GPU offloading:

WHISPER_CUBLAS=1 make -j && ./main -m models/ggml-base.en.bin -f samples/gb0.wav
I whisper.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS:  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:       cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:      g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

make: Nothing to be done for 'default'.
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1660, compute capability 7.5
whisper_model_load: using CUDA backend
whisper_model_load:     CUDA buffer size =   149.41 MB
whisper_model_load: model size    =  149.32 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB
whisper_init_state: compute buffer (conv)   =   14.11 MB
whisper_init_state: compute buffer (encode) =   81.95 MB
whisper_init_state: compute buffer (cross)  =    4.49 MB
whisper_init_state: compute buffer (decode) =   24.70 MB

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/gb0.wav' (2037760 samples, 127.4 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:03.240]   Good morning. This Tuesday is Election Day.
[00:00:03.240 --> 00:00:06.000]   After months of spirited debate and vigorous campaigning,
[00:00:06.000 --> 00:00:08.640]   the time has come for Americans to make important decisions
[00:00:08.640 --> 00:00:10.120]   about our nation's future.
[00:00:10.120 --> 00:00:13.760]   I encourage all Americans to go to the polls and vote.
[00:00:13.760 --> 00:00:16.120]   Election season brings out the spirit of competition
[00:00:16.120 --> 00:00:18.080]   between our political parties.
[00:00:18.080 --> 00:00:20.260]   And that competition is an essential part
[00:00:20.260 --> 00:00:21.760]   of a healthy democracy.
[00:00:21.760 --> 00:00:23.520]   But as the campaigns come to a close,
[00:00:23.520 --> 00:00:26.000]   Republicans, Democrats, and independents
[00:00:26.000 --> 00:00:29.120]   can find common ground on at least one point.
[00:-16:-18.-140 --> 00:00:59.120]  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[00:-15:-48.-140 --> 00:01:29.120]  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Since the bias tensors are already broadcast upon load, the diff should not lead to any difference in the results. Also, I can remove either one of the ggml_repeat calls and it still works. It only breaks when both of them are removed.

Any ideas?

Member:

Where can I find the gb0.wav sample? With rfk.wav it seems to work.

ggerganov (Member Author):

'make samples'

slaren (Member) commented Nov 10, 2023:

I can't reproduce this reliably, but it happens sometimes. I suspect that the cause may be that some operation depends on the contents of the destination memory being cleared.

This change clears dst before executing the op if it is not in-place. Can you test whether this fixes the issue for you?

diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index 2212144..2ab2ab8 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -8259,6 +8259,18 @@ static void ggml_backend_cuda_graph_compute(ggml_backend_t backend, ggml_cgraph
             }
         }

+        bool inplace = false;
+        for (int j = 0; j < GGML_MAX_SRC; j++) {
+            if (node->src[j] != nullptr && node->src[j]->data >= node->data && node->src[j]->data < (char *)node->data + ggml_nbytes(node)) {
+                inplace = true;
+                break;
+            }
+        }
+        if (!inplace) {
+            CUDA_CHECK(cudaMemsetAsync(node->data, 0x00, ggml_nbytes(node), g_cudaStreams[g_main_device][0])); // ok
+            //CUDA_CHECK(cudaMemsetAsync(node->data, 0xFA, ggml_nbytes(node), g_cudaStreams[g_main_device][0])); // fail
+        }
+
         bool ok = ggml_cuda_compute_forward(&params, node);
         if (!ok) {
             fprintf(stderr, "%s: error: op not supported %s (%s)\n", __func__, node->name, ggml_op_name(node->op));

ggerganov (Member Author):

Yup, it fixes the issue. I will look into which operation could be causing this. It might be something related to the new ggml_conv implementation.

ggerganov (Member Author):

Thanks for the help. I think I found a fix for the im2col kernel. Will be doing some more tests.
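
For illustration, here is a minimal CUDA sketch of the kind of im2col bug that matches the symptoms above, assuming the problem was uninitialized padded elements (which would be consistent with the cudaMemsetAsync experiment); this is not the actual ggml-cuda kernel or fix.

```cuda
// Minimal CUDA sketch of the failure mode discussed above (illustrative,
// not the actual ggml-cuda kernel): an im2col kernel has to write *every*
// destination element, including taps that fall into the input padding.
// If it skips those, the following matmul reads stale memory, which is why
// clearing dst with cudaMemsetAsync masked the problem.
__global__ void im2col_1d_f32(const float *src, float *dst,
                              int n_in, int n_out, int kernel_w,
                              int stride, int pad) {
    const int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx >= n_out*kernel_w) {
        return;
    }

    const int out_pos = idx / kernel_w;            // output position
    const int k       = idx % kernel_w;            // kernel tap
    const int in_pos  = out_pos*stride - pad + k;  // corresponding input index

    // write 0.0f for padded positions instead of leaving dst untouched
    dst[idx] = (in_pos >= 0 && in_pos < n_in) ? src[in_pos] : 0.0f;
}
```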

FSSRepo commented Nov 10, 2023:

It looks pretty good. Should I apply these changes in the ggml PR, or do you want to do it? Honestly, I hadn't thought of that way of avoiding the cudaMemset. I hope that backend v2 helps me fix the issue I'm having with stable diffusion when loading everything onto the GPU: all operations seem correct, but I'm getting NaN in the output for some reason.

ggerganov (Member Author):

Hi, I will apply the changes soon. I'm currently implementing im2col for Metal and will use this PR to test that it works.

Sorry to hear about the NaN issues; they're quite difficult to debug. I don't think v2 would be of much help, but we'll see. Does applying the fix from this PR help?

There is also this fix, which might or might not be relevant to SD:

ggml-org/ggml@439a79f#diff-d31fbcb763417dd283c99fff7473e7ac9cde20bd7f9b3d04bbedb16346f4a2d9R6517-R6519

FSSRepo commented Nov 10, 2023:

I am now 100% sure that the operations themselves (the kernels) are not causing the issue with the CUDA backend: when using the CPU backend but performing the complete computation with CUDA (fallback), the results are correct. The only differing factor is the memory handling.

Member:

That doesn't really prove that the kernels are fine; they may depend on pre-conditions that are only true in some specific cases. That is what was happening here with the im2col kernel. In any case, I will add tests and debugging tools in the next ggml-backend update that will make diagnosing these issues easier.

@ggerganov ggerganov marked this pull request as ready for review November 10, 2023 20:23
@ggerganov ggerganov changed the title from "whisper : support ggml_conv with CUDA" to "whisper : support ggml_conv with CUDA and Metal" Nov 10, 2023
@ggerganov ggerganov merged commit 933c5be into ggml-backend-no-sched Nov 10, 2023
ggerganov added a commit that referenced this pull request Nov 12, 2023
* whisper : migrate to ggml-backend

* whisper : fix logit reading

* whisper : fix tensor allocation during load

* whisper : fix beam-search with CUDA

* whisper : free backends + fix compile warning

* whisper : print when CUDA is enabled

* whisper : fix CoreML

* make : clean-up

* talk : fix compile warning

* whisper : support ggml_conv with CUDA and Metal (#1473)

* ggml : add CUDA support for ggml_conv

* whisper : remove ggml_repeat for conv bias + single backend

* cuda : fix im2col kernel

* metal : add im2col support + mul mat-vec f16 x f16

* bench-all : add q4 models

* whisper : clean-up

* quantize-all : fix

* ggml : im2col opts

* whisper : avoid whisper_model_data wrapper

* whisper : add note that ggml_mul_mat_pad does not work with CUDA

* whisper : factor out graph compute in common function

* whisper : fixes

* whisper : fix UB with measure buffers

* whisper : try to fix the parallel whisper_state functionality (#1479)

* whisper : try to fix the parallel whisper_state functionality

* whisper : fix multi-state Metal

* whisper : free backend instances in whisper_state
felrock pushed a commit to felrock/whisper.cpp that referenced this pull request Nov 18, 2023
landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023
iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request Sep 23, 2024