Without buttons or wake-words, use any browser (Cordova/Electron) to passively listen for any voice, record until voice stops, POST audio to Faster-Whisper & get transcription back with language detected.
Tiny model on my GTX1070 gave results in less than half a second, so we'll call it instant... Large-V2 model took 1.5-3 sec. To configure VAD & STT libraries see their documentation. See ricky0123/vad and https://www.vad.ricky0123.com for details on the Voice Activity Detection. Implementing in React? See https://www.vad.ricky0123.com/docs/react
Silero VAD browser library by ricky0123
STT web service endpoint by ololoshka2871
... which is based on guillaumekln/faster-whisper (and OpenAI model)
- Create python virtual environment:
python3 -m venv venv
- Activate virtual environment:
source venv/bin/activate
- Install requirements:
pip install -r requirements.txt
- Run server:
python main.py
(Enable CUDA: -c cuda) (Tiny model: -m tiny) (Large model: -m large-v2) - Open browser to localhost:3157 (0.0.0.0 is not valid for mic access outside HTTPS).
- Approve microphone access and start speaking. Output in console & page.
I wanted hands-free STT in a web client using the Faster-Whisper model. But the REST server for the base Whisper model did not work with the Faster model. After spending a couple days messing around I found the code from ololoshka and it worked! Since I was unable to find anything like this all put together as a component set, here it is. This is just a starter/skeleton/sample for adding this feature to your own app/bot/etc. All credit to the devs behind all these projects, and the open source ethos.