Hi,
Using the large Core ML encoder provided on Hugging Face, I still get much lower performance than Vosk with Kaldi, and I don't understand why.
When I run it, I get:
whisper_init_state: Core ML model loaded
system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | COREML = 1 | OPENVINO = 0 |
So everything appears to be set up correctly, and in theory the encoder should run on the ANE (Apple Neural Engine), which should be very fast.
Transcribing 3 h of audio (16 kHz, mono) took 1 h, whereas Vosk with Kaldi, using only the CPU, gave the same quality in 11 min. How is that possible? Am I missing something?
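To put those timings in comparable terms, here is a rough sketch that converts them into a speed relative to real time. The function and variable names are only illustrative, and the inputs are the approximate durations quoted above, not precise measurements.

```python
def realtime_speed(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of wall-clock time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Approximate durations from the post, converted to seconds.
AUDIO_S = 3 * 3600    # 3 h of audio
WHISPER_S = 1 * 3600  # about 1 h with the Core ML encoder
VOSK_S = 11 * 60      # about 11 min with Vosk/Kaldi on CPU

whisper_speed = realtime_speed(AUDIO_S, WHISPER_S)
vosk_speed = realtime_speed(AUDIO_S, VOSK_S)

print(f"whisper (Core ML): {whisper_speed:.2f}x real time")
print(f"Vosk/Kaldi (CPU):  {vosk_speed:.2f}x real time")
print(f"Vosk is about {vosk_speed / whisper_speed:.2f}x faster")
```

So the gap I am asking about is roughly 3x real time versus roughly 16x real time.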
Thank you
Luca