Train wake words for microWakeWord from the command line
Most of the other wake word training processes are implemented in Jupyter notebooks. They're OK but I'm a command line guy who finds the notebooks a bit confusing and hard to use, especially if you're training more than one wake word or need to adjust training parameters. With these scripts and Dockerfile, you can train new wake words for Home Assistant, or any other device based on microWakeWord, from the command line.
The guts of these scripts actually came from TaterTotterson's microWakeWord-Trainer-Nvidia-Docker notebook. I just rearranged and refactored them.
There's a lot of information in this README but it shouldn't scare you. There are only 3 scripts you'll need to run:
- setup_python_venv: Sets up the python environment.
- setup_training_datasets: Downloads and converts the training audio reference datasets.
- train_wake_word: Does the training.
The rest of the scripts are run by those three. There's some environment setup first though so...
Please don't skip ahead. Read the entire document before doing anything!
Using a Docker container is the best and most predictable way to train,
but if for some reason you can't run Docker or don't want to, you may
be able to train directly on your x86_64 Linux host system IF you have python3.12
available. In this context, "host" could mean Linux running
on bare metal or in a virtual machine.
If python3.12 --version doesn't work, you will need to use the Docker
container, even if python3 --version or python --version reports
version 3.12. There MUST be a python3.12 executable available.
See the details below for why Python 3.12 is required.
Additional packages needed on your host include python3.12-dev (or -devel), python3.12-venv (or python3-virtualenv), git, wget, curl and unzip if they're not already installed.
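For example, on a Debian/Ubuntu or Fedora host, something like the following should pull in those prerequisites (exact package names vary by distribution, so treat this as a sketch):

# Debian/Ubuntu
sudo apt install python3.12 python3.12-dev python3.12-venv git wget curl unzip
# Fedora
sudo dnf install python3.12 python3.12-devel git wget curl unzip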
NOTE: Because there are so many Linux distributions and versions available, it's not possible for me to provide support for training wake words on your host. If you run into issues, I may ask that you reproduce the issue using the Docker container before I investigate.
Having an Nvidia GPU available can cut the training time by up to half. The
open-source nouveau driver shipped with Linux kernels doesn't support CUDA,
however, so if you have an Nvidia GPU and want to use it for training, you'll
need to install the official Nvidia driver from
https://www.nvidia.com/en-in/drivers/unix/.
Make sure you install the version of the driver that includes cuda.
You do NOT need to have the "CUDA Toolkit" installed.
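If you want to confirm the driver is installed and the GPU is visible, nvidia-smi (which ships with the Nvidia driver) is a quick test:

# Prints the GPU model, driver version and the CUDA version the driver supports.
nvidia-smi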
This directory will contain the Python virtual environment plus all of the downloaded and generated data needed for training and the final trained models. You'll need a minimum of 120gb of free space available but this could increase based on the options you choose below.
If you're using the Docker container, your <host_data_dir> will be mounted
inside the container as /data; however, from now on the directory will
be referred to as <data_dir> whether you're working directly on the host or
in the container, unless it's important to distinguish them.
When you've decided where to put the directory, create it and download this repo.
mkdir <host_data_dir>
cd <host_data_dir>
mkdir tools
git clone https://github.com/gtjoseph/microwakeword-cli-trainer ./tools/mww-cli
You can skip this step if you're training on your host.
You can use either Docker or Podman as your container management tool.
docker is used in the examples but if you have podman, just substitute
the command.
# You should still be in <host_data_dir>
cd ./tools/mww-cli
docker build -t microwakeword-cli-trainer:latest .
This should be fairly quick and result in an image that's about 320mb in size, as it's basically a standard Ubuntu 24.04 image with a few added tools. If it takes more than a minute, I'd be surprised.
So why isn't a pre-built image available for download? Because it'll probably take longer to download a pre-built image than for you to create it locally. GitHub's container registry is notoriously erratic when it comes to download throughput.
Again, you can skip this step if you're training on your host.
The training container will start a Bash shell so if you have Bash
aliases or Bashy things you like, create a .bashrc file in your
<host_data_dir> and put them in there. It'll automatically be included
any time you enter the container.
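For example, a minimal <host_data_dir>/.bashrc might look like this (the aliases are just illustrations, nothing in the trainer requires them):

# Example .bashrc dropped into <host_data_dir>; picked up when you enter the container.
alias ll='ls -alF'
alias work-size='du -sh /data/work /data/output 2>/dev/null'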
There are lots of options that control container creation. The simplest example
will create the container and give you an interactive shell. When you exit the
shell, the container will be stopped and removed leaving your <host_data_dir>
intact.
# You should still be in <host_data_dir>/tools/mww-cli
docker run -it --rm --gpus=all \
  -v <host_data_dir>:/data microwakeword-cli-trainer:latest
Options:
- Remove the --gpus=all option if you don't have an Nvidia GPU or don't want to use it.
- Remove the --rm option and add a --name=mww-cli option to keep the container around and give it a name. You can stop and remove it when you're ready.
- Add a -d option to start the container in the background and use docker attach mww-cli or docker exec -it mww-cli /bin/bash to connect to it.
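If you want the container to stick around, a variant like the following (options combined from the list above) starts it in the background with a name you can attach to later:

docker run -d -it --gpus=all --name=mww-cli \
  -v <host_data_dir>:/data microwakeword-cli-trainer:latest
# Attach to the container's shell (detach again with Ctrl-p Ctrl-q)...
docker attach mww-cli
# ...or open an additional shell inside it.
docker exec -it mww-cli /bin/bash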
When the container starts, you'll see a warning about the python virtual environment and you'll be left at a bash shell prompt. Don't worry about the warning. The next step creates it.
All of the following steps apply whether you're training in a container or on the host.
Reminder: Going forward, the term <data_dir> means the /data directory if
you're training in a container, or <host_data_dir> if you're training directly on your host.
The Python virtual environment will contain all the software needed to train.
It gets created as <data_dir>/.venv and will take up about 11gb of disk space.
cd <data_dir>
./tools/mww-cli/setup_python_venv --data-dir="${PWD}"
With a 1gb/sec Internet connection, this could take about 5 minutes. When the installation is finished, a test of the major components will be run.
Once the process is done, activate the virtual environment:
# You should still be in <data_dir>
source .venv/bin/activate
The virtual environment automatically puts the rest of the training commands in your PATH so you won't need to type full paths to them going forward.
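As an optional sanity check, you can confirm the environment is active and the training commands resolve:

# These should all resolve now that the virtual environment is active.
command -v python
command -v setup_training_datasets
command -v train_wake_word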
The training process itself relies on a significant amount of audio reference data that creates a simulated "audio environment" that your wake word will be trained in. These "training datasets" include things like varying amounts of reverberation, background music, background conversations, background noise, etc. All said and done, it amounts to about 32gb of audio but with the work space needed for the downloaded archives and extracted intermediate files, you'll need about 55gb of free space. Thankfully, you only need to download the files once no matter how many wake words you want to train and it will survive container and/or system restarts.
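Before kicking off the download, it's worth confirming that the filesystem holding <data_dir> has that much room, for example:

# Check free space on the filesystem holding <data_dir>
df -h <data_dir>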
Data sources:
- Reverb and Impulse Noise (mit_rir): https://mcdermottlab.mit.edu/Reverb/IRMAudio/Audio.zip
- Background Music (fma): https://huggingface.co/datasets/mchl914/fma_xsmall/resolve/main/fma_xs.zip
- Background Speech/Noise (negative): https://huggingface.co/datasets/kahrendt/microwakeword
- Audio Samples from YouTube (audioset): https://huggingface.co/datasets/agkphysics/AudioSet
This is a three step process...
- Download zipfiles or tarballs.
- Extract them into intermediate directories.
- Convert the audio files into the final form needed.
To run the process:
# You should still be in <data_dir>
setup_training_datasets --cleanup-archives --cleanup-intermediate-files
The --cleanup options cause the script to clean up the downloaded archives
and intermediate files as it goes along. If you want to keep them, leave
those options out, but you'll need an additional 32gb of free disk space.
On a 1gb/sec Internet connection, this will take about 25 minutes. It also depends on your location of course.
The script detects whether the datasets have already been downloaded, extracted and/or converted and skips those steps as appropriate, so if you've run the script without the cleanup options, you can just run it again with those options to clean them up.
Now you're ready to train a wake word. Almost.
Training is done in 3 steps.
- Generate thousands of samples of the wake word with various voices, pitches, speeds, inflections, etc.
- Augment the samples with the training datasets to add background noise, etc.
- Run the Tensorflow training.
Before you start the full process, you're going to want to generate a single wake word sample and play it back to ensure it sounds right. The wake word should be spelled phonetically to give the sample generator the best chance of success.
# You should still be in <data_dir>
wake_word_sample_generator --samples=1 "hey buster"
The --samples=1 option is important for this example. Don't change it.
The sample wav file will be in <data_dir>/work/test_sample. You should
play that file from your host. The reason I used "hey buster" as the wake
word is to demonstrate why it's important to generate and listen to a sample.
If you try that exact input and play it back, you'll notice that the
generator didn't capture the "er" at the end very well. To get it to do so, I
had to add a period on the end as a "spacer". "hey buster." worked much better.
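For example, from the host you could play it back with aplay or ffplay; the exact file name inside test_sample isn't fixed, so the wildcard below is an assumption:

# Play the generated test sample from the host (aplay comes from alsa-utils).
aplay <host_data_dir>/work/test_sample/*.wav
# or, with ffmpeg installed:
ffplay -autoexit <host_data_dir>/work/test_sample/*.wav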
When you're happy with the sample, you can run the full process.
Before you proceed, make sure you have enough free disk space! The process of generating and augmenting samples needs about 1gb for every 1000 samples. The default number of samples is 20,000 so you're going to need about 20gb of additional free space to continue.
So, start training...
# You should still be in <data_dir>
train_wake_word "wake_word" "Wake Word"Options:
Required:
<wake_word>
The word to train spelled phonetically and with any punctuation
or spaces needed to make the sample sound correct.
Optional:
<wake_word_title>
An optional pretty name to save to the json metadata file.
It's how the wake word will appear in the Home Assistant
settings for the device. If you've had to use extra characters
in the wake word to make it sound right, you'll probably want
to specify this.
Default: The wake word with individual words capitalized
and punctuation removed.
--samples=<samples>
The number of samples to generate for the wake word.
Default: 20000
--batch-size=<size>
How many samples should be generated at a time. The more
samples, the more memory is needed.
Default: 100
--training-steps=<steps>
Number of training steps. More training steps means better
detection and false positive rates but also more time to train.
Default: 25000
--cleanup-work-dir
Delete the <data_dir>/work directory after successful training.
This would clean up about 20gb for 20,000 samples.
Default: false
By default, the training process creates 20,000 samples of your wake word and runs 25,000 training steps. See Tensorboard Results in the Extra Credit section below for why these are the defaults. Depending on resources available, this could take between 30 and 60 minutes.
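As an illustration only (the values here aren't recommendations), a run that bumps both samples and training steps and cleans up after itself would look like this:

# You should still be in <data_dir>
train_wake_word --samples=30000 --training-steps=30000 --cleanup-work-dir \
  "hey buster." "Hey Buster"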
The resulting tflite model files and logs will be placed in the
<data_dir>/output/<timestamp>-<wake_word>-<samples>-<training-steps> directory.
File names will have non-filename-friendly characters in your
wake word changed to underscores to make things easier. Installing them usually
means copying both files next to your ESPHome configuration yaml file for your device.
Here's a sample for the Home Assistant Voice PE:
substitutions:
  name: home-assistant-voice-db4dad
  friendly_name: Home Assistant Voice db4dad
packages:
  Nabu Casa.Home Assistant Voice PE: github://esphome/home-assistant-voice-pe/home-assistant-voice.yaml
esphome:
  name: ${name}
  name_add_mac_suffix: false
  friendly_name: ${friendly_name}
api:
  encryption:
    key: skdsdfhkjyeoisearrkjjnfukierikufhsadfgkasyy=
wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password
micro_wake_word:
  models:
    - model: Hey_Buster.json
      id: Hey_Buster
    - model: Alexa.json
      id: Alexa
The only real measure of success is how well the resulting model works on a real device. If you encounter too many missed or false activations, increasing the number of samples would probably improve the results more than increasing the number of training steps. See Tensorboard Results in the Extra Credit section below.
The log output from the last step is filtered somewhat by the script but is still quite
verbose. The full log will be available in the output directory as
training.log if you're interested. Interpreting the log is beyond the scope
of this project however.
You can train additional wake words or change the number of samples and
training steps by simply running train_wake_word again. No need to repeat
any of the earlier setup steps. If you change the wake word or the number of
wake word samples, the work directory will be deleted and all 3 steps re-run.
If you only change the number of training steps, the data from the first two
steps is still valid and only the 3rd step is run.
All of the intermediate data is stored in the <data_dir>/work directory which will
grow to about 20gb with 20,000 wake word samples. Once the tflite model is
successfully generated and you're happy with the results, you can delete the
<data_dir>/work directory.
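For example:

# Reclaim the ~20gb of intermediate data once you're happy with the model.
rm -rf <data_dir>/work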
You can easily train multiple wake words. Just create a bash associative array with the phonetically spelled words as the keys and the pretty names as the values.
declare -A words=( ['alexa!']="Alexa" ['hey_buster.']="Hey Buster" ['hey_jenkins']="Hey Jenkins" )
for k in "${!words[@]}" ; do train_wake_word --cleanup-work-dir "$k" "${words[$k]}" ; done
Training times depend on lots of things; however, during testing there was NO difference in training times between training directly on a host and training in a Docker container.
These are examples only. Your Mileage May Vary!!!
================================================================================
Training Summary
CPU: Intel(R) Core(TM) i7-6950X CPU @ 3.00GHz (20 cores) Memory: 64195 mb
GPU: N/A
Generate 20000 samples, 100/batch Elapsed time: 0:10:38
Augment 20000 samples Elapsed time: 0:07:04
25000 training steps Elapsed time: 0:25:21
======================================================
Total Elapsed time: 0:43:03
================================================================================
================================================================================
Training Summary
CPU: Intel(R) Core(TM) i7-6950X CPU @ 3.00GHz (20 cores) Memory: 64195 mb
GPU: NVIDIA GeForce RTX 3060 (3584 cores) Memory: 11909 mb
Generate 20000 samples, 100/batch Elapsed time: 0:00:53
Augment 20000 samples Elapsed time: 0:07:05
25000 training steps Elapsed time: 0:19:13
======================================================
Total Elapsed time: 0:27:11
================================================================================
================================================================================
Training Summary
CPU: Intel(R) Core(TM) i7-6950X CPU @ 3.00GHz (20 cores) Memory: 64195 mb
GPU: N/A
Generate 50000 samples, 100/batch Elapsed time: 0:30:47
Augment 50000 samples Elapsed time: 0:20:22
40000 training steps Elapsed time: 1:01:51
==================================================
Total Elapsed time: 1:53:00
================================================================================
================================================================================
Training Summary
CPU: Intel(R) Core(TM) i7-6950X CPU @ 3.00GHz (20 cores) Memory: 64195 mb
GPU: NVIDIA GeForce RTX 3060 (3584 cores) Memory: 11909 mb
Generate 50000 samples, 100/batch Elapsed time: 0:02:08
Augment 50000 samples Elapsed time: 0:19:13
40000 training steps Elapsed time: 0:42:23
======================================================
Total Elapsed time: 1:03:44
================================================================================
If you plan on training multiple wake words, you can set your own default
training parameters by creating a <data_dir>/.defaults.env file with the
following contents:
# Variable names follow the command line parameters converted to upper case
# and with the dashes ('-') converted to underscores ('_').
export SAMPLES=30000
export TRAINING_STEPS=35000
# Uncomment the following to NOT use the GPU for any operations.
#export CUDA_VISIBLE_DEVICES=-1
Tensorboard is a web-based graphical model viewer. You can use it to get an
idea of how many training steps are needed before accuracy results stop
improving. To use it, you'll have to expose port 6006 by adding -p 6006:6006 to your docker run command line. If you didn't, don't worry.
Remember, the /data directory is mapped to a directory on your host so you
can simply stop and delete the current container and recreate it with the new
docker run command. No need to re-run any of the setup or training steps.
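For example, the earlier interactive run command with the port published would look like this:

docker run -it --rm --gpus=all -p 6006:6006 \
  -v <host_data_dir>:/data microwakeword-cli-trainer:latest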
To start Tensorboard, run:
(.venv) <data_dir> $ tensorboard --bind_all --logdir ./output
Now on your host, point your browser at http://localhost:6006/,
click "SCALARS" at the top and take a look at the various charts. You'll see
a "train" and "validation" item for each training run you've performed. It's
the "train" items you're interested in.
You'd have to be a Tensorflow expert to decipher most of the charts but the "Accuracy" chart for this particular wake word and 50,000 samples would seem to indicate that there's very little improvement after about 20,000 training steps.
In contrast, with only 5,000 wake word samples, there's still improvement to be had after 20,000 training steps.
Given that it's faster to generate wake word samples than it is to train, 20,000 samples and 25,000 training steps seems like a good compromise. This chart has a bit less smoothing to show a bit more detail and includes the 50,000 sample run as well. This run took only 27 minutes as opposed to the 63 minutes it took for the 50,000 sample run. Now you know why 20,000 and 25,000 are the defaults for these scripts.
Training is all Python based and many of the tools needed, including Tensorflow, Torch, onnxruntime, piper-sample-generator and micro-wake-word, plus their dependencies, are dependent on the version of Python they're going to be run with. Even though Python 3.13 and 3.14 are now generally available, many of the packages aren't compatible with them yet. If you want to use an Nvidia GPU to speed up the training, things get even more complicated because the CUDA packages needed to access the GPU may not support newer GPUs when used with Python versions before 3.10. So where's this going? YOU NEED PYTHON VERSION 3.12.
If your Linux host system has Python 3.12 available, you should be able to
train on your host without the Docker container. Try running
python3.12 --version. If it works, you're good to go. If not, you'll have
to figure out how to get python3.12 on your host yourself or use the Docker
container.
The Docker container is based on Ubuntu 24.04, which uses Python 3.12 by default,
so I know that works. Fedora 40 was the last version that shipped with
Python 3.12, but later Fedora versions allow you to easily install other
Python versions alongside the system-wide version. Fedora 43, for instance,
ships with python3.14 as the default but you can install 3.12 right alongside it
by simply running dnf install python3.12 python3.12-devel. I run Fedora 43
so I know that works as well.


