Realtime transcription API and VAD #5377

Closed
@richiejp

Description

Is your feature request related to a problem? Please describe.

When transcribing with VoxInput, I want to show partial transcriptions. Looking further ahead, I also want voice commands that don't require the user to press a button; this requires constant streaming and/or VAD (voice activity detection).

Describe the solution you'd like

I could implement VAD in VoxInput and make regular requests to LocalAI using the existing transcription API.
I prototyped this without VAD and the results were pretty bad: richiejp/VoxInput#2
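
To make the client-side option concrete, here is a minimal sketch of how audio could be cut into utterances before each one is sent as a regular transcription request. It uses a plain energy threshold as a stand-in for a real VAD model such as Silero, and the function names, frame size, and threshold values are all assumptions for illustration, not anything VoxInput or LocalAI currently does.

```python
import math


def frame_rms(frame):
    """Root-mean-square amplitude of a list of 16-bit PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def split_on_silence(samples, sample_rate=16000, frame_ms=30,
                     threshold=500.0, min_silence_frames=10):
    """Return (start, end) sample-index pairs for stretches judged to be speech.

    A segment opens at the first frame whose RMS exceeds `threshold` and
    closes once `min_silence_frames` consecutive quiet frames follow it.
    This is a crude energy gate, not Silero; all defaults are guesses.
    """
    frame_len = sample_rate * frame_ms // 1000
    segments = []
    start = None   # sample index where the current segment began
    quiet = 0      # consecutive quiet frames seen inside a segment
    for offset in range(0, len(samples), frame_len):
        frame = samples[offset:offset + frame_len]
        if frame_rms(frame) > threshold:
            if start is None:
                start = offset
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_silence_frames:
                # Close the segment at the point where the silence began.
                segments.append((start, offset - (quiet - 1) * frame_len))
                start = None
                quiet = 0
    if start is not None:
        # Audio ended while a segment was still open.
        segments.append((start, len(samples)))
    return segments
```

Each returned range would then be encoded and posted as its own transcription request. The drawback is that nothing can be shown until a segment closes, which is part of why partial transcriptions need a streaming API.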

It would be nice to use the Silero VAD model with Whisper, and at that point we may as well implement the full streaming API:
https://platform.openai.com/docs/guides/realtime-transcription
https://platform.openai.com/docs/api-reference/realtime-client-events/input_audio_buffer
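
For reference, a rough sketch of what a client would send under the `input_audio_buffer` events linked above: audio is base64-encoded and appended in chunks, and a commit event marks the end of an utterance when the client, rather than server-side VAD, decides where the turn ends. The event and field names follow my reading of the OpenAI docs; the helper names and chunk size are made up for illustration, and whether LocalAI would accept these events is exactly what this issue is asking for.

```python
import base64
import json


def append_events(pcm_bytes, chunk_size=3200):
    """Yield one JSON-encoded `input_audio_buffer.append` event per chunk.

    `chunk_size` is an arbitrary choice: 3200 bytes is 100 ms of
    16 kHz, 16-bit mono PCM.
    """
    for offset in range(0, len(pcm_bytes), chunk_size):
        chunk = pcm_bytes[offset:offset + chunk_size]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


def commit_event():
    """JSON for `input_audio_buffer.commit`, sent when the client ends a turn."""
    return json.dumps({"type": "input_audio_buffer.commit"})
```

A client would write each of these strings to the realtime connection as it captures audio. With VAD running on the server, only the append events would be needed and the server would decide when an utterance is complete.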

Describe alternatives you've considered

Implementing VAD in VoxInput itself. However, there are use cases for this beyond VoxInput, and I would like to keep VoxInput simple because I can't distribute it in a container very easily. Silero could also be used with a different backend in LocalAI.

Additional context

Metadata

Labels

enhancement (New feature or request)
