Train wake words for microWakeWord from the command line
Most of the other wake word training processes are implemented in Jupyter notebooks. They're OK but I'm a command line guy who finds the notebooks a bit confusing and hard to use, especially if you're training more than one wake word or need to adjust training parameters. With these scripts and Dockerfile, you can train new wake words for Home Assistant, or any other device based on microWakeWord, from the command line.
The guts of these scripts actually came from TaterTotterson's microWakeWord-Trainer-Nvidia-Docker notebook. I just rearranged and refactored them.
There's a lot of information in this README but it shouldn't scare you. There are only 3 scripts you'll need to run:
- setup_python_venv: Sets up the python environment.
- setup_training_datasets: Downloads and converts the training audio reference datasets.
- train_wake_word: Does the training.
The rest of the scripts are run by those three. There's some environment setup first though so...
Please don't skip ahead. Read the entire document before doing anything!
Using a Docker container is the best and most predictable way to train,
but if for some reason you can't run Docker or don't want to, you may
be able to train directly on your x86_64 Linux host system IF you have python3.12
available. In this context, "host" could mean Linux running
on bare metal or in a virtual machine.
If python3.12 --version doesn't work, you will need to use the Docker
container, even if python3 --version or python --version reports
version 3.12. There MUST be a python3.12 executable available.
See the details below for why Python 3.12 is required.
Additional packages needed on your host include python3.12-dev (or -devel), python3.12-venv (or python3-virtualenv), git, wget, curl and unzip if they're not already installed.
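For example, on a Debian/Ubuntu or Fedora host, something like the following should pull in those prerequisites (exact package names vary by distribution, so treat this as a sketch):

# Debian/Ubuntu
sudo apt install python3.12 python3.12-dev python3.12-venv git wget curl unzip
# Fedora
sudo dnf install python3.12 python3.12-devel git wget curl unzip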
NOTE: Because there are so many Linux distributions and versions available, it's not possible for me to provide support for training wake words on your host. If you run into issues, I may ask that you reproduce the issue using the Docker container before I investigate.
Having an Nvidia GPU available can cut the training time by up to half. The
open-source nouveau driver shipped with Linux kernels doesn't support CUDA,
however, so if you have an Nvidia GPU and want to use it for training, you'll
need to install the official Nvidia driver from
https://www.nvidia.com/en-in/drivers/unix/.
Make sure you install the version of the driver that includes cuda.
You do NOT need to have the "CUDA Toolkit" installed.
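If you want to confirm the driver is installed and the GPU is visible, nvidia-smi (which ships with the Nvidia driver) is a quick test:

# Prints the GPU model, driver version and the CUDA version the driver supports.
nvidia-smi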
This directory will contain the Python virtual environment plus all of the downloaded and generated data needed for training and the final trained models. You'll need a minimum of 120gb of free space available but this could increase based on the options you choose below.
If you're using the Docker container, your <host_data_dir> will be mounted
inside the container as /data; however, from now on the directory will
be referred to as <data_dir> whether you're working directly on the host or
in the container, unless it's important to distinguish them.
When you've decided where to put the directory, create it and download this repo.
mkdir <host_data_dir>
cd <host_data_dir>
mkdir tools
git clone https://github.com/gtjoseph/microwakeword-cli-trainer ./tools/mww-cli
You can skip this step if you're training on your host.
You can use either Docker or Podman as your container management tool.
docker is used in the examples but if you have podman, just substitute
the command.
# You should still be in <host_data_dir>
cd ./tools/mww-cli
docker build -t microwakeword-cli-trainer:latest .
This should be fairly quick and result in an image that's about 320mb in size, as it's basically a standard Ubuntu 24.04 image with a few added tools. If it takes more than a minute, I'd be surprised.
So why isn't a pre-built image available for download? Because it'll probably take longer to download a pre-built image than for you to create it locally. GitHub's container registry is notoriously erratic when it comes to download throughput.
Again, you can skip this step if you're training on your host.
The training container will start a Bash shell so if you have Bash
aliases or Bashy things you like, create a .bashrc file in your
<host_data_dir> and put them in there. It'll automatically be included
any time you enter the container.
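For example, a minimal <host_data_dir>/.bashrc might look like this (the aliases are just illustrations, nothing in the trainer requires them):

# Example .bashrc dropped into <host_data_dir>; picked up when you enter the container.
alias ll='ls -alF'
alias work-size='du -sh /data/work /data/output 2>/dev/null'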
There are lots of options that control container creation. The simplest example
will create the container and give you an interactive shell. When you exit the
shell, the container will be stopped and removed leaving your <host_data_dir>
intact.
# You should still be in <host_data_dir>/tools/mww-cli
docker run -it --rm --gpus=all \
  -v <host_data_dir>:/data microwakeword-cli-trainer:latest
Options:
- Remove the --gpus=all option if you don't have an Nvidia GPU or don't want to use it.
- Remove the --rm option and add a --name=mww-cli option to keep the container around and give it a name. You can stop and remove it when you're ready.
- Add a -d option to start the container in the background and use docker attach mww-cli or docker exec -it mww-cli /bin/bash to connect to it.
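If you want the container to stick around, a variant like the following (options combined from the list above) starts it in the background with a name you can attach to later:

docker run -d -it --gpus=all --name=mww-cli \
  -v <host_data_dir>:/data microwakeword-cli-trainer:latest
# Attach to the container's shell (detach again with Ctrl-p Ctrl-q)...
docker attach mww-cli
# ...or open an additional shell inside it.
docker exec -it mww-cli /bin/bash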
When the container starts, you'll see a warning about the python virtual environment and you'll be left at a bash shell prompt. Don't worry about the warning. The next step creates it.
All of the following steps apply whether you're training in a container or on the host.
Reminder: Going forward, the term <data_dir> means the /data directory if
you're training in a container, or <host_data_dir> if you're training directly on your host.
The Python virtual environment will contain all the software needed to train.
It gets created as <data_dir>/.venv and will take up about 11gb of disk space.
cd <data_dir>
./tools/mww-cli/setup_python_venv --data-dir="${PWD}"
With a 1gb/sec Internet connection, this could take about 5 minutes. When the installation is finished, a test of the major components will be run.
Once the process is done, activate the virtual environment:
# You should still be in <data_dir>
source .venv/bin/activate
The virtual environment automatically puts the rest of the training commands in your PATH so you won't need to type full paths to them going forward.
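As an optional sanity check, you can confirm the environment is active and the training commands resolve:

# These should all resolve now that the virtual environment is active.
command -v python
command -v setup_training_datasets
command -v train_wake_word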
The training process itself relies on a significant amount of audio reference data that creates a simulated "audio environment" that your wake word will be trained in. These "training datasets" include things like varying amounts of reverberation, background music, background conversations, background noise, etc. All said and done, it amounts to about 32gb of audio but with the work space needed for the downloaded archives and extracted intermediate files, you'll need about 55gb of free space. Thankfully, you only need to download the files once no matter how many wake words you want to train and it will survive container and/or system restarts.
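Before kicking off the download, it's worth confirming that the filesystem holding <data_dir> has that much room, for example:

# Check free space on the filesystem holding <data_dir>
df -h <data_dir>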
Data sources:
- Reverb and Impulse Noise (mit_rir): https://mcdermottlab.mit.edu/Reverb/IRMAudio/Audio.zip
- Background Music (fma): https://huggingface.co/datasets/mchl914/fma_xsmall/resolve/main/fma_xs.zip
- Background Speech/Noise (negative): https://huggingface.co/datasets/kahrendt/microwakeword
- Audio Samples from YouTube (audioset): https://huggingface.co/datasets/agkphysics/AudioSet
This is a three step process...
- Download zipfiles or tarballs.
- Extract them into intermediate directories.
- Convert the audio files into the final form needed.
To run the process:
# You should still be in <data_dir>
setup_training_datasets --cleanup-archives --cleanup-intermediate-files
The --cleanup options cause the script to clean up the downloaded archives
and intermediate files as it goes along. If you want to keep them, leave
those options out, but you'll need an additional 32gb of free disk space.
On a 1gb/sec Internet connection, this will take about 25 minutes. It also depends on your location of course.
The script detects whether the datasets have already been downloaded, extracted and/or converted and skips those steps as appropriate, so if you've run the script without the cleanup options, you can just run it again with those options to clean them up.
Now you're ready to train a wake word. Almost.
Training is done in 3 steps.
- Generate thousands of samples of the wake word with various voices, pitches, speeds, inflections, etc.
- Augment the samples with the training datasets to add background noise, etc.
- Run the Tensorflow training.
Before you start the full process, you're going to want to generate a single wake word sample and play it back to ensure it sounds right. The wake word should be spelled phonetically to give the sample generator the best chance of success.
# You should still be in <data_dir>
wake_word_sample_generator --samples=1 "hey buster"
The --samples=1 option is important for this example. Don't change it.
The sample wav file will be in <data_dir>/work/test_sample. You should
play that file from your host. The reason I used "hey buster" as the wake
word is to demonstrate why it's important to generate and listen to a sample.
If you try that exact input and play it back, you'll notice that the
generator didn't capture the "er" at the end very well. To get it to do so, I
had to add a period on the end as a "spacer". "hey buster." worked much better.
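For example, from the host you could play it back with aplay or ffplay; the exact file name inside test_sample isn't fixed, so the wildcard below is an assumption:

# Play the generated test sample from the host (aplay comes from alsa-utils).
aplay <host_data_dir>/work/test_sample/*.wav
# or, with ffmpeg installed:
ffplay -autoexit <host_data_dir>/work/test_sample/*.wav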
When you're happy with the sample, you can run the full process.
Before you proceed, make sure you have enough free disk space! The process of generating and augmenting samples needs about 1gb for every 1000 samples. The default number of samples is 20,000 so you're going to need about 20gb of additional free space to continue.
So, start training...
# You should still be in <data_dir>
train_wake_word "wake_word" "Wake Word"Options:
Required:
<wake_word>
The word to train spelled phonetically and with any punctuation
or spaces needed to make the sample sound correct.
Optional:
<wake_word_title>
An optional pretty name to save to the json metadata file.
It's how the wake word will appear in the Home Assistant
settings for the device. If you've had to use extra characters
in the wake word to make it sound right, you'll probably want
to specify this.
Default: The wake word with individual words capitalized
and punctuation removed.
--samples=<samples>
The number of samples to generate for the wake word.
Default: 20000
--batch-size=<size>
How many samples should be generated at a time. The more
samples, the more memory is needed.
Default: 100
--training-steps=<steps>
Number of training steps. More training steps means better
detection and false positive rates but also more time to train.
Default: 25000
--cleanup-work-dir
Delete the <data_dir>/work directory after successful training.
This would clean up about 20gb for 20,000 samples.
Default: false
By default, the training process creates 20,000 samples of your wake word and runs 25,000 training steps. See Tensorboard Results in the Extra Credit section below for why these are the defaults. Depending on resources available, this could take between 30 and 60 minutes.
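As an illustration only (the values here aren't recommendations), a run that bumps both samples and training steps and cleans up after itself would look like this:

# You should still be in <data_dir>
train_wake_word --samples=30000 --training-steps=30000 --cleanup-work-dir \
  "hey buster." "Hey Buster"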
The resulting tflite model files and logs will be placed in the
<data_dir>/output/<timestamp>-<wake_word>-<samples>-<training-steps> directory.
File names will have non-filename-friendly characters in your
wake word changed to underscores to make things easier. Installing them usually
means copying both files next to your ESPHome configuration yaml file for your device.
Here's a sample for the Home Assistant Voice PE:
substitutions:
  name: home-assistant-voice-db4dad
  friendly_name: Home Assistant Voice db4dad
packages:
  Nabu Casa.Home Assistant Voice PE: github://esphome/home-assistant-voice-pe/home-assistant-voice.yaml
esphome:
  name: ${name}
  name_add_mac_suffix: false
  friendly_name: ${friendly_name}
api:
  encryption:
    key: skdsdfhkjyeoisearrkjjnfukierikufhsadfgkasyy=
wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password
micro_wake_word:
  models:
    - model: Hey_Buster.json
      id: Hey_Buster
    - model: Alexa.json
      id: Alexa
The only real measure of success is how well the resulting model works on a real device. If you encounter too many missed or false activations, increasing the number of samples would probably improve the results more than increasing the number of training steps. See Tensorboard Results in the Extra Credit section below.
The log output from the last step is filtered somewhat by the script but is still quite
verbose. The full log will be available in the output directory as
training.log if you're interested. Interpreting the log is beyond the scope
of this project however.
You can train additional wake words or change the number of samples and
training steps by simply running train_wake_word again. No need to repeat
any of the earlier setup steps. If you change the wake word or the number of
wake word samples, the work directory will be deleted and all 3 steps re-run.
If you only change the number of training steps, the data from the first two
steps is still valid and only the 3rd step is run.
All of the intermediate data is stored in the <data_dir>/work directory which will
grow to about 20gb with 20,000 wake word samples. Once the tflite model is
successfully generated and you're happy with the results, you can delete the
<data_dir>/work directory.
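For example:

# Reclaim the ~20gb of intermediate data once you're happy with the model.
rm -rf <data_dir>/work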
You can easily train multiple wake words. Just create a bash associative array with the phonetically spelled words as the keys and the pretty names as the values.
declare -A words=( ['alexa!']="Alexa" ['hey_buster.']="Hey Buster" ['hey_jenkins']="Hey Jenkins" )
for k in "${!words[@]}" ; do train_wake_word --cleanup-work-dir "$k" "${words[$k]}" ; done
Training times depend on lots of things; however, during testing there was NO difference in training times between training directly on a host and training in a Docker container.
These are examples only. Your Mileage May Vary!!!
================================================================================
Training Summary
CPU: Intel(R) Core(TM) i7-6950X CPU @ 3.00GHz (20 cores) Memory: 64195 mb
GPU: N/A
Generate 20000 samples, 100/batch Elapsed time: 0:10:38
Augment 20000 samples Elapsed time: 0:07:04
25000 training steps Elapsed time: 0:25:21
======================================================
Total Elapsed time: 0:43:03
================================================================================
================================================================================
Training Summary
CPU: Intel(R) Core(TM) i7-6950X CPU @ 3.00GHz (20 cores) Memory: 64195 mb
GPU: NVIDIA GeForce RTX 3060 (3584 cores) Memory: 11909 mb
Generate 20000 samples, 100/batch Elapsed time: 0:00:53
Augment 20000 samples Elapsed time: 0:07:05
25000 training steps Elapsed time: 0:19:13
======================================================
Total Elapsed time: 0:27:11
================================================================================
================================================================================
Training Summary
CPU: Intel(R) Core(TM) i7-6950X CPU @ 3.00GHz (20 cores) Memory: 64195 mb
GPU: N/A
Generate 50000 samples, 100/batch Elapsed time: 0:30:47
Augment 50000 samples Elapsed time: 0:20:22
40000 training steps Elapsed time: 1:01:51
==================================================
Total Elapsed time: 1:53:00
================================================================================
================================================================================
Training Summary
CPU: Intel(R) Core(TM) i7-6950X CPU @ 3.00GHz (20 cores) Memory: 64195 mb
GPU: NVIDIA GeForce RTX 3060 (3584 cores) Memory: 11909 mb
Generate 50000 samples, 100/batch Elapsed time: 0:02:08
Augment 50000 samples Elapsed time: 0:19:13
40000 training steps Elapsed time: 0:42:23
======================================================
Total Elapsed time: 1:03:44
================================================================================
If you plan on training multiple wake words, you can set your own default
training parameters by creating a <data_dir>/.defaults.env file with the
following contents:
# Variable names follow the command line parameters converted to upper case
# and with the dashes ('-') converted to underscores ('_').
export SAMPLES=30000
export TRAINING_STEPS=35000
# Uncomment the following to NOT use the GPU for any operations.
#export CUDA_VISIBLE_DEVICES=-1
Tensorboard is a web-based graphical model viewer. You can use it to get an
idea of how many training steps are needed before accuracy results stop
improving. To use it, you'll have to expose port 6006 by adding -p 6006:6006 to your docker run command line. If you didn't, don't worry.
Remember, the /data directory is mapped to a directory on your host so you
can simply stop and delete the current container and recreate it with the new
docker run command. No need to re-run any of the setup or training steps.
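For example, the earlier interactive run command with the port published would look like this:

docker run -it --rm --gpus=all -p 6006:6006 \
  -v <host_data_dir>:/data microwakeword-cli-trainer:latest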
To start Tensorboard, run:
(.venv) <data_dir> $ tensorboard --bind_all --logdir ./output
Now on your host, point your browser at http://localhost:6006/,
click "SCALARS" at the top and take a look at the various charts. You'll see
a "train" and "validation" item for each training run you've performed. It's
the "train" items you're interested in.
You'd have to be a Tensorflow expert to decipher most of the charts but the "Accuracy" chart for this particular wake word and 50,000 samples would seem to indicate that there's very little improvement after about 20,000 training steps.
In contrast, with only 5,000 wake word samples, there's still improvement to be had after 20,000 training steps.
Given that it's faster to generate wake word samples than it is to train, 20,000 samples and 25,000 training steps seems like a good compromise. This chart has a bit less smoothing to show a bit more detail and includes the 50,000 sample run as well. This run took only 27 minutes as opposed to the 63 minutes it took for the 50,000 sample run. Now you know why 20,000 and 25,000 are the defaults for these scripts.
Training is all Python based and many of the tools needed, including Tensorflow, Torch, onnxruntime, piper-sample-generator and micro-wake-word, plus their dependencies, are dependent on the version of Python they're going to be run with. Even though Python 3.13 and 3.14 are now generally available, many of the packages aren't compatible with them yet. If you want to use an Nvidia GPU to speed up the training, things get even more complicated because the CUDA packages needed to access the GPU may not support newer GPUs when used with Python versions before 3.10. So where's this going? YOU NEED PYTHON VERSION 3.12.
If your Linux host system has Python 3.12 available, you should be able to
train on your host without the Docker container. Try running
python3.12 --version. If it works, you're good to go. If not, you'll have
to figure out how to get python3.12 on your host yourself or use the Docker
container.
The Docker container is based on Ubuntu 24.04, which uses Python 3.12 by default,
so I know that works. Fedora 40 was the last version that shipped with
Python 3.12, but later Fedora versions allow you to easily install other
Python versions alongside the system-wide version. Fedora 43, for instance,
ships with python3.14 as the default but you can install 3.12 right alongside it
by simply running dnf install python3.12 python3.12-devel. I run Fedora 43
so I know that works as well.


