Description
Hey -
I have an issue with the prediction.
I've created 3 audio files:
The first one is a 12-second recording with 5 seconds of silence in the middle [see image below].
The second one is the last 4 seconds of the first one, trimmed out.
The third one is the same as the first one, but without the long silence in the middle.
Results - 12 seconds full audio 🤔 ❌ - WITH silence in the middle:
{
"prediction": 0,
"probability": 0.012071866542100906
}
Results - 4 seconds trimmed audio ✅ :
{
"prediction": 1,
"probability": 0.9972302317619324
}
Results - 6 seconds full audio, no silence in the middle ✅ :
{
"prediction": 1,
"probability": 0.9956890940666199
}
P.S. Sending the audio to Whisper, I get:
Audio 1 - 12 seconds - "i prefer cats but please answer quickly"
Audio 2 - 4 seconds - "cat but please answer quickly"
Audio 3 - 6 seconds(No silence in the middle) - "i prefer cats but please answer quickly"
I would expect the prediction to be "1" in all three cases.
Why does the prediction fail on the full audio with the silence in the middle?
Links to the files:
Full audio - https://github.com/rootux/smart-turn-audio/blob/main/a4_complete.ogg
Cropped audio - https://github.com/rootux/smart-turn-audio/blob/main/a4_complete_edited.ogg
Cropped audio - no silence - https://github.com/rootux/smart-turn-audio/blob/main/a4_complete_no_silence_in_middle.ogg
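
For anyone who wants to reproduce the two edited variants from the full recording in code, here is a minimal sketch of the two operations described above: keeping only the last N seconds, and removing quiet stretches. It assumes the audio has already been decoded to a mono list of float samples in [-1, 1]. The 16 kHz sample rate, the function names, the 20 ms frame size and the 0.01 RMS threshold are all assumptions for illustration and are not taken from the files or from the model's code. The silence removal is also cruder than a manual edit, since it drops every quiet frame rather than only the one long pause.

```python
import math
from typing import List, Sequence

SAMPLE_RATE = 16_000  # assumption: the sample rate is not stated in this issue


def last_seconds(samples: Sequence[float], seconds: float,
                 sample_rate: int = SAMPLE_RATE) -> List[float]:
    """Keep only the final `seconds` of a mono signal (like the 4-second file)."""
    n = int(round(seconds * sample_rate))
    if n <= 0:
        return []
    # Slicing past the start simply returns the whole signal.
    return list(samples[-n:])


def drop_quiet_frames(samples: Sequence[float], frame_ms: int = 20,
                      threshold: float = 0.01,
                      sample_rate: int = SAMPLE_RATE) -> List[float]:
    """Drop fixed-size frames whose RMS energy is below `threshold`.

    Hypothetical helper: it removes every quiet frame, not just one pause.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    kept: List[float] = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold:
            kept.extend(frame)
    return kept
```

Running both helpers on the decoded full recording and sending the results to the model should show whether the drop in probability follows the pause itself or something else in the 12-second file.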