whisper : support ggml_conv with CUDA and Metal #1473
Review comment on whisper.cpp (outdated):

    //cur = ggml_add(ctx0, cur, model.e_conv_2_b);
    cur = ggml_add(ctx0,
            ggml_repeat(ctx0,
                model.e_conv_2_b,
I think I hit some weird bug here. On this branch I offloaded everything to the GPU when using CUDA, including the convolutions, using the implementation from ggml-org/ggml#564.
Additionally, I eliminated the two ggml_repeat calls here by pre-broadcasting the e_conv_1_b and e_conv_2_b tensors upon load (the loader change itself is not shown here; a rough sketch follows).
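For illustration only, a minimal sketch of what such pre-broadcasting could look like at load time; the tensor names follow the surrounding code, but the shapes, the bias_1d buffer, and the loader details are assumptions rather than the actual code on the branch:

    // Hypothetical sketch, not the actual loader change from this branch.
    // Instead of keeping e_conv_1_b as a 1D tensor [n_state] and repeating it
    // at graph-build time, allocate it with the conv-1 output shape and copy
    // each bias value across the time axis once, at load time.
    const int n_state = 512;  // assumed: n_audio_state of the base model
    const int n_ctx   = 3000; // assumed: conv-1 output width (2 * 1500)

    struct ggml_tensor * conv_1_b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_ctx, n_state);

    float * dst = (float *) conv_1_b->data;
    for (int i = 0; i < n_state; ++i) {
        for (int j = 0; j < n_ctx; ++j) {
            dst[i*n_ctx + j] = bias_1d[i]; // bias_1d: the [n_state] values read from the model file (assumed)
        }
    }
    // the graph can then use: cur = ggml_add(ctx0, cur, model.e_conv_1_b);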
Everything works on the CPU and the GPU with the implementation that is currently on the branch.
However, when I apply the following diff to remove the ggml_repeat it breaks with CUDA:
diff --git a/whisper.cpp b/whisper.cpp
index 1371a6c..80ca5c9 100644
--- a/whisper.cpp
+++ b/whisper.cpp
@@ -1604,22 +1604,22 @@ static struct ggml_cgraph * whisper_build_graph_conv(
// convolution + gelu
{
cur = ggml_conv_1d_ph(ctx0, model.e_conv_1_w, mel, 1, 1);
- //cur = ggml_add(ctx0, cur, model.e_conv_1_b);
- cur = ggml_add(ctx0,
- ggml_repeat(ctx0,
- model.e_conv_1_b,
- cur),
- cur);
+ cur = ggml_add(ctx0, cur, model.e_conv_1_b);
+ //cur = ggml_add(ctx0,
+ // ggml_repeat(ctx0,
+ // model.e_conv_1_b,
+ // cur),
+ // cur);
cur = ggml_gelu(ctx0, cur);
cur = ggml_conv_1d_ph(ctx0, model.e_conv_2_w, cur, 2, 1);
- //cur = ggml_add(ctx0, cur, model.e_conv_2_b);
- cur = ggml_add(ctx0,
- ggml_repeat(ctx0,
- model.e_conv_2_b,
- cur),
- cur);
+ cur = ggml_add(ctx0, cur, model.e_conv_2_b);
+ //cur = ggml_add(ctx0,
+ // ggml_repeat(ctx0,
+ // model.e_conv_2_b,
+ // cur),
+ // cur);
cur = ggml_gelu(ctx0, cur);
            }

Without GPU offloading (-ng):

WHISPER_CUBLAS=1 make -j && ./main -m models/ggml-base.en.bin -f samples/gb0.wav -ng
I whisper.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS: -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC: cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX: g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
make: Nothing to be done for 'default'.
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1660, compute capability 7.5
whisper_model_load: CPU buffer size = 149.41 MB
whisper_model_load: model size = 149.32 MB
whisper_init_state: kv self size = 5.25 MB
whisper_init_state: kv cross size = 17.58 MB
whisper_init_state: compute buffer (conv) = 18.50 MB
whisper_init_state: compute buffer (encode) = 81.95 MB
whisper_init_state: compute buffer (cross) = 4.49 MB
whisper_init_state: compute buffer (decode) = 24.70 MB
system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 |
main: processing 'samples/gb0.wav' (2037760 samples, 127.4 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:03.240] Good morning. This Tuesday is Election Day.
[00:00:03.240 --> 00:00:06.000] After months of spirited debate and vigorous campaigning,
[00:00:06.000 --> 00:00:08.640] the time has come for Americans to make important decisions
[00:00:08.640 --> 00:00:10.120] about our nation's future.
[00:00:10.120 --> 00:00:13.760] I encourage all Americans to go to the polls and vote.
[00:00:13.760 --> 00:00:16.120] Election season brings out the spirit of competition
[00:00:16.120 --> 00:00:18.080] between our political parties.
[00:00:18.080 --> 00:00:20.260] And that competition is an essential part
[00:00:20.260 --> 00:00:21.760] of a healthy democracy.
[00:00:21.760 --> 00:00:23.520] But as the campaigns come to a close,
[00:00:23.520 --> 00:00:26.000] Republicans, Democrats, and independents
[00:00:26.000 --> 00:00:29.120] can find common ground on at least one point.
[00:00:29.120 --> 00:00:31.560] Our system of representative democracy
[00:00:31.560 --> 00:00:34.440] is one of America's greatest strengths.
[00:00:34.440 --> 00:00:36.240] The United States was founded on the belief
[00:00:36.240 --> 00:00:38.240] that all men are created equal.
[00:00:38.240 --> 00:00:41.440] Every election day, millions of Americans of all races,
[00:00:41.440 --> 00:00:43.440] religions, and backgrounds step into voting
[00:00:43.440 --> 00:00:45.280] booths throughout the nation.
[00:00:45.280 --> 00:00:47.780] Whether they are richer, poor, old, or young,
[00:00:47.780 --> 00:00:50.680] each of them has an equal share in choosing the path
[00:00:50.680 --> 00:00:52.440] that our country will take.
[00:00:52.440 --> 00:00:54.920] And every ballot they cast is a reminder
[00:00:54.920 --> 00:00:58.280] that our founding principles are alive and well.
[00:00:58.280 --> 00:00:59.760] Voting is one of the great privileges
[00:00:59.760 --> 00:01:01.760] of American citizenship.
[00:01:01.760 --> 00:01:04.520] And it has always required brave defenders.
[00:01:04.520 --> 00:01:06.040] As you head to the polls next week,
[00:01:06.040 --> 00:01:09.280] remember the sacrifices that have been made by generations
[00:01:09.280 --> 00:01:13.000] of Americans in uniform to preserve our way of life.
[00:01:13.000 --> 00:01:15.480] From Bunker Hill to Baghdad, the men and women
[00:01:15.480 --> 00:01:18.160] of American armed forces have been devoted guardians
[00:01:18.160 --> 00:01:19.960] of our democracy.
[00:01:19.960 --> 00:01:21.800] All of us owe them and their families
[00:01:21.800 --> 00:01:25.240] a special debt of gratitude on Election Day.
[00:01:25.240 --> 00:01:27.560] Americans should also remember the important example
[00:01:27.560 --> 00:01:30.080] that our election set throughout the world.
[00:01:30.080 --> 00:01:32.080] Young democracies from Georgia and Ukraine
[00:01:32.080 --> 00:01:34.560] to Afghanistan and Iraq can look to the United States
[00:01:34.560 --> 00:01:37.560] for proof that self-government can endure.
[00:01:37.560 --> 00:01:40.400] And nations that still live under tyranny and oppression
[00:01:40.400 --> 00:01:44.080] can find hope and inspiration in our commitment to liberty.
[00:01:44.080 --> 00:01:45.720] For more than two centuries, Americans
[00:01:45.720 --> 00:01:47.800] have demonstrated the ability of free people
[00:01:47.800 --> 00:01:49.600] to choose their own leaders.
[00:01:49.600 --> 00:01:51.880] Our nation has flourished because of its commitment
[00:01:51.880 --> 00:01:54.640] to trusting the wisdom of our citizenry.
[00:01:54.640 --> 00:01:57.200] In this year's election, we will see this tradition
[00:01:57.200 --> 00:02:00.280] continue, and we will be reminded once again
[00:02:00.280 --> 00:02:02.640] that we are blessed to live in a free nation
[00:02:02.640 --> 00:02:05.520] guided by the will of the people.
[00:02:05.520 --> 00:02:06.960] Thank you for listening.
whisper_print_timings: load time = 161.71 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 77.16 ms
whisper_print_timings: sample time = 175.36 ms / 532 runs ( 0.33 ms per run)
whisper_print_timings: encode time = 2447.21 ms / 5 runs ( 489.44 ms per run)
whisper_print_timings: decode time = 1714.88 ms / 528 runs ( 3.25 ms per run)
whisper_print_timings: prompt time = 356.03 ms / 4 runs ( 89.01 ms per run)
whisper_print_timings: total time = 4943.60 ms
With GPU offloading:
WHISPER_CUBLAS=1 make -j && ./main -m models/ggml-base.en.bin -f samples/gb0.wav
I whisper.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS: -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC: cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX: g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
make: Nothing to be done for 'default'.
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1660, compute capability 7.5
whisper_model_load: using CUDA backend
whisper_model_load: CUDA buffer size = 149.41 MB
whisper_model_load: model size = 149.32 MB
whisper_init_state: kv self size = 5.25 MB
whisper_init_state: kv cross size = 17.58 MB
whisper_init_state: compute buffer (conv) = 14.11 MB
whisper_init_state: compute buffer (encode) = 81.95 MB
whisper_init_state: compute buffer (cross) = 4.49 MB
whisper_init_state: compute buffer (decode) = 24.70 MB
system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 |
main: processing 'samples/gb0.wav' (2037760 samples, 127.4 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:03.240] Good morning. This Tuesday is Election Day.
[00:00:03.240 --> 00:00:06.000] After months of spirited debate and vigorous campaigning,
[00:00:06.000 --> 00:00:08.640] the time has come for Americans to make important decisions
[00:00:08.640 --> 00:00:10.120] about our nation's future.
[00:00:10.120 --> 00:00:13.760] I encourage all Americans to go to the polls and vote.
[00:00:13.760 --> 00:00:16.120] Election season brings out the spirit of competition
[00:00:16.120 --> 00:00:18.080] between our political parties.
[00:00:18.080 --> 00:00:20.260] And that competition is an essential part
[00:00:20.260 --> 00:00:21.760] of a healthy democracy.
[00:00:21.760 --> 00:00:23.520] But as the campaigns come to a close,
[00:00:23.520 --> 00:00:26.000] Republicans, Democrats, and independents
[00:00:26.000 --> 00:00:29.120] can find common ground on at least one point.
[00:-16:-18.-140 --> 00:00:59.120] !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[00:-15:-48.-140 --> 00:01:29.120] !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Since the bias tensors are already broadcast upon load, the diff should not lead to any difference in the results. Also, I can remove either one of the ggml_repeat calls and it still works. It only breaks when both of them are removed.
Any ideas?
Where can I find the gb0.wav sample? With rfk.wav it seems to work.
`make samples`
I can't reproduce this reliably, but it happens sometimes. I suspect the cause may be that some operation depends on the destination memory being zero-initialized.
This change clears dst before executing the op if it is not inplace. Can you test if this fixes the issue for you?
diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index 2212144..2ab2ab8 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -8259,6 +8259,18 @@ static void ggml_backend_cuda_graph_compute(ggml_backend_t backend, ggml_cgraph
}
}
+ bool inplace = false;
+ for (int j = 0; j < GGML_MAX_SRC; j++) {
+ if (node->src[j] != nullptr && node->src[j]->data >= node->data && node->src[j]->data < (char *)node->data + ggml_nbytes(node)) {
+ inplace = true;
+ break;
+ }
+ }
+ if (!inplace) {
+ CUDA_CHECK(cudaMemsetAsync(node->data, 0x00, ggml_nbytes(node), g_cudaStreams[g_main_device][0])); // ok
+ //CUDA_CHECK(cudaMemsetAsync(node->data, 0xFA, ggml_nbytes(node), g_cudaStreams[g_main_device][0])); // fail
+ }
+
bool ok = ggml_cuda_compute_forward(&params, node);
if (!ok) {
fprintf(stderr, "%s: error: op not supported %s (%s)\n", __func__, node->name, ggml_op_name(node->op));
Yup, it fixes the issue. I'll look into which operation could be causing this. It might be something related to the new ggml_conv implementation.
Thanks for the help. I think I found a fix for the im2col kernel. Will be doing some more tests.
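For context, a hedged sketch of the failure mode such a fix typically addresses, consistent with the memset experiment above (0x00 works, 0xFA fails): the kernel writes only in-bounds elements and leaves padded dst positions untouched, so the result silently depends on whatever the buffer already contained. The signature and indexing below are simplified and hypothetical, not the actual ggml-cuda kernel:

    #include <cuda_fp16.h>

    // Simplified, hypothetical 1D im2col kernel fragment.
    // iw is the input column this dst element reads; iw < 0 or iw >= IW
    // means the position falls in the padding region.
    __global__ void im2col_1d(const float * x, half * dst,
                              int IW, int KW, int OW, int s0, int p0) {
        const int ow = blockIdx.x;   // output column
        const int kw = threadIdx.x;  // kernel tap
        if (ow >= OW || kw >= KW) return;

        const int iw = ow*s0 + kw - p0;
        const int offset_dst = ow*KW + kw;

        if (iw < 0 || iw >= IW) {
            dst[offset_dst] = __float2half(0.0f); // the fix: explicitly write 0 for padding
            // buggy variant: no write here, so dst keeps stale memory contents
        } else {
            dst[offset_dst] = __float2half(x[iw]);
        }
    }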
It looks pretty good. Should I apply these changes in the ggml PR, or do you want to do it? Honestly, I hadn't thought of that way of avoiding the cudaMemset. I hope that backend v2 helps me fix the issue I'm having with stable diffusion when loading everything on the GPU: all the operations seem correct, but I'm getting NaN in the output for some reason.
Hi, I will apply the changes soon. I'm currently implementing im2col for Metal and will use this PR to test that it works.
Sorry to hear about the NaN issues - they are quite difficult to debug. I don't think backend v2 will be of much help there, but we'll see. Does applying the fix from this PR help?
There is also this fix, which might or might not be relevant to SD:
I am already 100% sure that the operations (the kernels) are not what is causing the issue when using the CUDA backend: when using the CPU backend but performing the complete computation with CUDA (fallback), the results are correct. The only differing factor is the memory handling.
That doesn't really prove that the kernels are fine; they may depend on pre-conditions that only hold in some specific cases. That is what was happening here with the im2col kernel. In any case, I will add tests and debugging tools in the next ggml-backend update that will make diagnosing these issues easier.
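For reference, a hedged sketch of the kind of cross-backend check such tests could perform: compute the same graph once on the CPU backend and once on CUDA, then compare an output tensor element-wise. ggml_backend_tensor_get and ggml_nelements are real ggml APIs; the surrounding setup (building and computing the two graphs) is assumed and omitted:

    #include <cmath>
    #include <cstdio>
    #include <vector>
    #include "ggml.h"
    #include "ggml-backend.h"

    // Compare the same logical tensor computed by two different backends.
    // Assumes both tensors are F32 and have already been computed.
    static bool tensor_allclose(const struct ggml_tensor * cpu_t,
                                const struct ggml_tensor * gpu_t, float tol) {
        const int64_t n = ggml_nelements(cpu_t);
        std::vector<float> va(n), vb(n);
        ggml_backend_tensor_get(cpu_t, va.data(), 0, n*sizeof(float));
        ggml_backend_tensor_get(gpu_t, vb.data(), 0, n*sizeof(float));
        for (int64_t i = 0; i < n; ++i) {
            if (std::isnan(vb[i]) || std::fabs(va[i] - vb[i]) > tol) {
                fprintf(stderr, "mismatch at %lld: cpu=%f gpu=%f\n", (long long) i, va[i], vb[i]);
                return false;
            }
        }
        return true;
    }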
Squashed commit message:

* whisper : migrate to ggml-backend
* whisper : fix logit reading
* whisper : fix tensor allocation during load
* whisper : fix beam-search with CUDA
* whisper : free backends + fix compile warning
* whisper : print when CUDA is enabled
* whisper : fix CoreML
* make : clean-up
* talk : fix compile warning
* whisper : support ggml_conv with CUDA and Metal (#1473)
* ggml : add CUDA support for ggml_conv
* whisper : remove ggml_repeat for conv bias + single backend
* cuda : fix im2col kernel
* metal : add im2col support + mul mat-vec f16 x f16
* bench-all : add q4 models
* whisper : clean-up
* quantize-all : fix
* ggml : im2col opts
* whisper : avoid whisper_model_data wrapper
* whisper : add note that ggml_mul_mat_pad does not work with CUDA
* whisper : factor out graph compute in common function
* whisper : fixes
* whisper : fix UB with measure buffers
* whisper : try to fix the parallel whisper_state functionality (#1479)
* whisper : fix multi-state Metal
* whisper : free backend instances in whisper_state
ref #1472
Move the convolution to the GPU as well. The encoder is much faster now.