I ran into a problem when using VAD (silero and auditok) with the large model. My application splits the transcription into parts based on pauses between words. In the following sample you can see that with VAD and the large model (but not with the smaller ones!), the timestamps for the word "A" are incorrect:
--model large --language en --accurate --vad auditok
[01:37.040 --> 01:37.240] Could
[01:37.240 --> 01:37.360] you
[01:37.360 --> 01:37.540] please
[01:37.540 --> 01:37.800] hold
[01:37.800 --> 01:37.920] up
[01:37.920 --> 01:38.060] your
[01:38.060 --> 01:38.280] ID
[01:38.280 --> 01:38.500] to
[01:38.500 --> 01:38.660] the
[01:38.660 --> 01:38.860] webcam?
*** [01:39.120 --> 01:39.700] A >>>>>>>>>>>>>> Pause between "A" and "little" is wrong
[01:45.050 --> 01:45.250] little
[01:45.250 --> 01:45.430] bit
[01:45.430 --> 01:45.690] closer,
[01:45.770 --> 01:46.070] please.
--model large --language en --accurate --vad silero:v3.1
[01:37.030 --> 01:37.230] Could
[01:37.230 --> 01:37.350] you
[01:37.350 --> 01:37.550] please
[01:37.550 --> 01:37.830] hold
[01:37.830 --> 01:37.930] up
[01:37.930 --> 01:38.070] your
[01:38.070 --> 01:38.290] ID
[01:38.290 --> 01:38.510] to
[01:38.510 --> 01:38.650] the
[01:38.650 --> 01:38.810] webcam?
*** [01:39.130 --> 01:39.790] A >>>>>>>>>>>>>> Pause between "A" and "little" is wrong
[01:45.050 --> 01:45.230] little
[01:45.230 --> 01:45.430] bit
[01:45.430 --> 01:45.690] closer,
[01:45.770 --> 01:46.050] please.
--model large --language en --accurate --vad False
[01:36.860 --> 01:37.180] Could
[01:37.180 --> 01:37.340] you
[01:37.340 --> 01:37.600] please
[01:37.600 --> 01:37.820] hold
[01:37.820 --> 01:37.940] up
[01:37.940 --> 01:38.060] your
[01:38.060 --> 01:38.280] ID
[01:38.280 --> 01:38.520] to
[01:38.520 --> 01:38.660] the
[01:38.660 --> 01:38.920] webcam?
*** [01:44.240 --> 01:45.020] A >>>>>>>>>>>>>> This is okay
[01:45.020 --> 01:45.260] little
[01:45.260 --> 01:45.420] bit
[01:45.420 --> 01:45.680] closer,
[01:45.820 --> 01:46.020] please.
--model medium --language en --accurate --vad auditok
[01:37.180 --> 01:37.360] Could
[01:37.360 --> 01:37.520] you
[01:37.520 --> 01:37.780] please
[01:37.780 --> 01:37.940] hold
[01:37.940 --> 01:38.080] up
[01:38.080 --> 01:38.260] your
[01:38.260 --> 01:38.500] ID
[01:38.500 --> 01:38.680] to
[01:38.680 --> 01:38.800] the
[01:38.800 --> 01:39.260] webcam?
*** [01:44.890 --> 01:45.270] A >>>>>>>>>>>>>> This is okay
[01:45.270 --> 01:45.410] little
[01:45.410 --> 01:45.610] bit
[01:45.610 --> 01:45.850] closer,
[01:46.070 --> 01:46.650] please.
Please find the attached sample audio in a zip archive to reproduce this.
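For anyone reproducing this, here is a minimal sketch of the kind of pause-based splitting described above. It is not the actual application code: the helper names `parse_words` and `split_on_pauses` and the 1-second `min_pause` threshold are assumptions made for illustration. It parses the word lines in the format shown above and starts a new part whenever the gap between one word's end and the next word's start reaches the threshold.

```python
import re

# Matches lines such as "[01:37.040 --> 01:37.240] Could" (MM:SS.mmm format).
WORD_LINE = re.compile(r"\[(\d+):(\d+\.\d+) --> (\d+):(\d+\.\d+)\]\s+(\S+)")


def parse_words(text):
    """Return a list of (start_seconds, end_seconds, word) tuples."""
    words = []
    for line in text.splitlines():
        m = WORD_LINE.search(line)
        if m:
            start = int(m.group(1)) * 60 + float(m.group(2))
            end = int(m.group(3)) * 60 + float(m.group(4))
            words.append((start, end, m.group(5)))
    return words


def split_on_pauses(words, min_pause=1.0):
    """Group words into parts, breaking where the silence is >= min_pause.

    min_pause is a hypothetical threshold chosen for this example.
    """
    parts, current, prev_end = [], [], None
    for start, end, word in words:
        if prev_end is not None and start - prev_end >= min_pause:
            parts.append(" ".join(current))
            current = []
        current.append(word)
        prev_end = end
    if current:
        parts.append(" ".join(current))
    return parts


# Excerpts copied from the runs above (large model, with and without VAD).
LARGE_AUDITOK = """
[01:38.500 --> 01:38.660] the
[01:38.660 --> 01:38.860] webcam?
[01:39.120 --> 01:39.700] A
[01:45.050 --> 01:45.250] little
[01:45.250 --> 01:45.430] bit
"""

LARGE_NO_VAD = """
[01:38.520 --> 01:38.660] the
[01:38.660 --> 01:38.920] webcam?
[01:44.240 --> 01:45.020] A
[01:45.020 --> 01:45.260] little
[01:45.260 --> 01:45.420] bit
"""

if __name__ == "__main__":
    print(split_on_pauses(parse_words(LARGE_AUDITOK)))
    print(split_on_pauses(parse_words(LARGE_NO_VAD)))
```

With the VAD timestamps, "A" ends up attached to the end of the previous sentence; without VAD, it starts the next part as expected.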
freddyertl changed the title from "Incorrect timestamps based with --vad with large model only" to "Incorrect timestamps when using VAD with large model only" on Apr 24, 2024
Thanks in advance
Freddy
sample.zip