Description
Is your feature request related to a problem? Please describe.
When transcribing text with VoxInput, I want to show partial transcriptions. Looking further ahead, I also want voice commands that don't require the user to press a button; this requires constant streaming and/or voice activity detection (VAD).
Describe the solution you'd like
I could implement VAD in VoxInput and make regular requests to LocalAI using the regular transcription API.
I prototyped this without VAD and the results were pretty bad: richiejp/VoxInput#2
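To make the client-side option concrete, here is a minimal sketch of splitting captured audio into utterances before sending each one to the regular transcription API. It uses a naive energy threshold as a stand-in for a real VAD model such as Silero, and it assumes 16 kHz mono 16-bit PCM. The names `frame_rms` and `split_utterances`, the threshold, and the frame sizes are all hypothetical and are not taken from VoxInput or LocalAI.

```python
import math
import struct

SAMPLE_RATE = 16000  # assumption: 16 kHz mono, 16-bit little-endian PCM
FRAME_MS = 30        # assumption: 30 ms analysis frames
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit PCM frame."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def split_utterances(pcm: bytes, threshold: float = 500.0,
                     hangover_frames: int = 10) -> list:
    """Split PCM audio into utterances.

    An utterance starts at the first frame at or above `threshold` and is
    closed after `hangover_frames` consecutive quiet frames. Both values
    are illustrative guesses, not tuned numbers.
    """
    frame_bytes = FRAME_SAMPLES * 2
    utterances = []
    current = bytearray()
    quiet = 0
    for start in range(0, len(pcm), frame_bytes):
        frame = pcm[start:start + frame_bytes]
        if frame_rms(frame) >= threshold:
            current.extend(frame)
            quiet = 0
        elif current:
            # Keep short pauses inside the utterance.
            current.extend(frame)
            quiet += 1
            if quiet >= hangover_frames:
                utterances.append(bytes(current))
                current = bytearray()
                quiet = 0
    if current:
        utterances.append(bytes(current))
    return utterances
```

Each returned chunk would then be wrapped as a WAV file and posted as a separate transcription request. Every client would have to carry this logic, and a real VAD model on top of it.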
It would be nice to use the Silero VAD model in Whisper, and at that point we may as well implement the full streaming API:
https://platform.openai.com/docs/guides/realtime-transcription
https://platform.openai.com/docs/api-reference/realtime-client-events/input_audio_buffer
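For illustration, a client of such a streaming API would mostly be serializing audio chunks into `input_audio_buffer` events. The sketch below builds the JSON payloads only. The event types and the base64 `audio` field follow the OpenAI reference linked above; the helper names and chunk size are hypothetical, and whether LocalAI would accept exactly these messages is an assumption.

```python
import base64
import json


def append_events(pcm: bytes, chunk_bytes: int = 3200):
    """Yield one JSON `input_audio_buffer.append` event per audio chunk.

    `chunk_bytes` is an arbitrary choice (100 ms of 16 kHz 16-bit mono).
    """
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


def commit_event() -> str:
    """JSON `input_audio_buffer.commit` event, for use without server-side VAD."""
    return json.dumps({"type": "input_audio_buffer.commit"})
```

With server-side VAD enabled, the client would only need to keep sending append events over the connection and display whatever transcription events come back.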
Describe alternatives you've considered
Implement VAD in VoxInput. However, there are use cases for this beyond VoxInput, and it would be nice to keep VoxInput simple because I can't distribute it in a container very easily. LocalAI could also use Silero with a different backend.
Additional context