Realtime transcription API and VAD

**Is your feature request related to a problem? Please describe.**


When transcribing text with VoxInput I want to show partial transcriptions. Looking further ahead I also want to have voice commands that don't require the user to press a button, this requires constant streaming and/or VAD.

**Describe the solution you'd like**


I could implement VAD in VoxInput and make regular requests to to LocalAI using the regular transcription API.
I prototyped this without VAD and it is pretty bad https://github.com/richiejp/VoxInput/pull/2

It would be nice to use Silero VAD model in Whisper and at that point we may as well implement the full streaming API
https://platform.openai.com/docs/guides/realtime-transcription
https://platform.openai.com/docs/api-reference/realtime-client-events/input_audio_buffer

**Describe alternatives you've considered**


Implement VAD in VoxInput, but there are other use-cases for this than VoxInput and it would be nice to keep it simple because I can't distribute that in a container very easily. Also could use Silero with a different backend in LocalAI.

**Additional context**

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Realtime transcription API and VAD #5377

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

Realtime transcription API and VAD #5377

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions