Voice assistant example - the "command" tool #190
-
This is now fully functional: command-0.mp4
Code is in examples/command
Web version: examples/command.wasm
-
The command mode seems to work great, but the keyword detection not so much. I will give it another try.
-
No, I was wondering whether the timestamps are in any way accurate, and whether I could use them as a forced aligner to extract the keyword, or whether I should use another tool?
-
I like this, but maybe I am missing something. Should the behaviour be to activate and then never turn off, or should it be more like Alexa/Siri, with a 'wake word' followed by a command? For example, you say "Hey Whisper, what is the time?" and it outputs a response. Then, until you say another command with "Hey Whisper" at the beginning, nothing happens with normal speech.
-
My application needs to handle 1 or 2 word commands. In the future, maybe 3. There are 3 formats:
Presently everything works pretty well. I built an error dictionary for the words, to correct Whisper's errors. Numeric input works very well. The issue is that Whisper has trouble with certain words, like "fur" and "crib", and the error dictionary takes care of most of those errors. I just implemented your command example, filling allowed_commands with all the words I need for commands, and this works quite well!
The problem is that to use the code as in your command example, you need to set max_tokens=1. Unfortunately, this breaks my scheme of setting max_tokens=3 so that I can read in a three-second chunk of audio that contains the 2 words. The ugly hack I thought of is to run whisper_full twice: first with max_tokens=1 to resolve the first word, and then again with max_tokens=3 to get the second word (which can be numeric or a regular word). This would double the processing time and is not a good solution IMHO. Is there an easy solution to this?
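For readers following along, the error-dictionary step described above could look roughly like the Python sketch below. This is only an illustration: the dictionary entries, the set of allowed commands, and the function names are hypothetical, since the actual application and its word list are not shown in this thread.

```python
# Hypothetical error dictionary: maps common misrecognitions to the intended word.
# These entries are invented for illustration only.
ERROR_DICT = {
    "for": "fur",
    "fir": "fur",
    "crab": "crib",
}

# Hypothetical set of words the application accepts as commands.
ALLOWED_COMMANDS = {"fur", "crib"}


def normalize(word):
    """Strip surrounding whitespace and punctuation, then lowercase."""
    return word.strip().strip(".,!?").lower()


def correct_word(word):
    """Return the allowed command for a transcribed word, or None if unknown."""
    w = normalize(word)
    w = ERROR_DICT.get(w, w)  # apply a correction if one is known
    return w if w in ALLOWED_COMMANDS else None
```

Returning None for unknown words keeps the "fail rather than guess" behaviour, so a second pass can be triggered only when the first one finds no match.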
-
I went ahead and implemented the "two pass" method I alluded to above, except that I switched the order. First I call whisper_full with max_tokens=3. Then, if word 1 is not found after running the result through the error dictionary, I call whisper_full again with max_tokens=1 and use the command mode code to find a match.
There is a problem though. Command mode always returns a result, so I need to look at the probability value to decide whether the result is accepted. Otherwise I get "false positives" (incorrect but accepted commands), which I really do not want. Whisper in non-command mode very rarely decides on a wrong command; it just fails, which is preferable to a wrong command.
I am now struggling with the probability threshold. If it is set too high, the command pass misses many corrections. If it is set too low, I get false positives.
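One possible way to ease the threshold trade-off, offered as a sketch under assumptions rather than as something the command example does: in addition to an absolute probability threshold, require a margin between the best and the second-best candidate. The function name, the default values, and the input format (a mapping from each allowed command to its probability) are all hypothetical.

```python
def accept_command(probs, min_prob=0.6, min_margin=0.2):
    """Return the best command if it is confident enough, otherwise None.

    probs: dict mapping each allowed command to its probability (assumed input).
    min_prob: minimum probability the best candidate must reach.
    min_margin: minimum gap between the best and second-best probabilities.
    """
    if not probs:
        return None
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    best_cmd, best_p = ranked[0]
    second_p = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_p >= min_prob and (best_p - second_p) >= min_margin:
        return best_cmd
    return None  # reject instead of risking a false positive
```

The margin check rejects cases where two commands score almost equally, which may allow a somewhat lower absolute threshold. Both values would still need tuning on real recordings.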
-
Hi, can I get the output in a JSON file?
-
I have succeeded in building main.exe for Windows with VS2022.
-
Hey, I tried command mode and it's pretty great. It takes a lot fewer resources on my old Intel Mac than regular stream. I have a bunch of questions and suggestions at the same time.
Suggestions:
Questions:
Thanks! P.S. If anyone has cool examples of how they're using this feature, please share!
-
There seems to be significant interest in a voice assistant application of Whisper, similar to "Ok, Google", "Hey Siri", "Alexa", etc. The existing stream tool is not very applicable for this use case, because voice assistant commands are usually short (e.g. `play some music`, `turn on the TV`, `kill all humans`, `feed the baby`, etc), while `stream` expects a continuous stream of speech.
Therefore, implement a basic command-line tool called `command` that does the following:
`[key phrase][command]`, so by knowing the key phrase we can extract only the `[command]`
This should work on the Web and on Raspberry Pi and, thanks to the VAD, it will be energy efficient.
Should be a good starting example for creating a voice assistant.
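The idea of extracting the command by knowing the key phrase can be sketched in a few lines of Python. This is a minimal illustration, not the code in examples/command: the function names are hypothetical, and it assumes the transcript is already available as plain text and that the key phrase is matched exactly, word by word.

```python
import re


def _words(text):
    """Lowercase the text and split it into words, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def extract_command(transcript, key_phrase):
    """Return the command that follows key_phrase, or None if it is absent."""
    words = _words(transcript)
    key = _words(key_phrase)
    if not key or words[:len(key)] != key:
        return None  # no key phrase at the start: ignore normal speech
    return " ".join(words[len(key):])
```

For example, `extract_command("Hey Whisper, play some music.", "hey whisper")` yields `play some music`, while speech that does not start with the key phrase yields None. Real transcriptions of the key phrase may vary, so a fuzzy comparison would likely be needed in practice.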