Model predictions vary significantly depending on position of wakeword in audio #237

@dscripka

Description

Describe the bug
When using the Python bindings for Precise, I've noticed that the model predictions can vary substantially depending on where in the input audio the wake word is located. For example, the plot below shows the default "hey mycroft" model score for two repetitions of the same audio clip, where the only difference is that the second clip has one additional frame (1024 samples) of zero-padding compared to the first clip:

[Plot: "hey mycroft" model score vs. chunk index for the two clips, showing divergent scores for the same audio]

I'm currently evaluating Precise against other wakeword solutions, and this behavior is making it difficult to assess performance accurately: the length and padding of the test clips alone can cause significant differences in false-positive and false-negative metrics.

Is this behavior expected? If so, is there a recommended way to evaluate the model to minimize such effects?
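One mitigation I've been considering for evaluation (a sketch only; `score_with_paddings` and the toy scorer below are hypothetical names, not part of Precise or precise_runner): score each clip at several leading zero-padding offsets and aggregate the results, so that no single unlucky alignment dominates the metrics.

```python
import numpy as np

def score_with_paddings(score_fn, clip, chunk_size, n_offsets=8):
    """Score `clip` at several leading zero-padding offsets and take the
    median, to average out position-dependent model behavior.

    score_fn: callable mapping an int16 array to a single wakeword score
    (hypothetical interface; in practice it would wrap a PreciseEngine run
    and return the max per-chunk score for the clip)."""
    scores = []
    for k in range(n_offsets):
        pad = np.zeros(k * chunk_size, dtype=clip.dtype)
        scores.append(score_fn(np.concatenate((pad, clip))))
    return float(np.median(scores))

# Toy scorer whose output depends only on the clip's length/alignment,
# mimicking the padding sensitivity described above
def toy_score(clip):
    return (len(clip) % 3) / 2.0

clip = np.zeros(4096, dtype=np.int16)
print(score_with_paddings(toy_score, clip, chunk_size=1024))  # 0.5
```

This doesn't explain the underlying behavior, of course, but if the effect is expected it would at least make clip-level metrics less sensitive to arbitrary padding choices.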

To Reproduce
The following code should reproduce the plot above, using the attached audio file and the model versions referenced in the code:

test_clip.zip

import scipy.io.wavfile
import numpy as np
import matplotlib.pyplot as plt
from precise_runner import PreciseEngine

# Set chunk size
chunk_size = 1024

# Load clip
sr, dat = scipy.io.wavfile.read("path/to/attached/wav/file")

# Create versions of clip
version1 = np.concatenate((
        np.zeros(chunk_size*50, dtype=np.int16),
        dat,
    )
)

version2 = np.concatenate((
        np.zeros(chunk_size*51, dtype=np.int16), # this one simply has one more chunk of zeros compared to version1
        dat,
    )
)
     
ps = []
for clip in [version1, version2]:
    # Load Precise model for each clip
    P = PreciseEngine(
        './precise-engine_0.3.0_x86_64/precise-engine/precise-engine',
        "models/hey-mycroft_C1_E6000_B5000_D0.2_R20_S0.8.pb",
        chunk_size=chunk_size*2 # chunk_size is in bytes, not samples (int16 = 2 bytes/sample)
    )
    P.start()
    
    for i in range(0, clip.shape[0]-chunk_size, chunk_size):
        if i < chunk_size*5: # don't store first few predictions to avoid model initialization behavior
            P.get_prediction(clip[i:i+chunk_size].tobytes())
            continue
        else:
            ps.append(P.get_prediction(clip[i:i+chunk_size].tobytes()))
        
    P.stop()
    
plt.plot(ps)
plt.xlabel("Chunk Index")
plt.ylabel("Model Score")

Expected behavior
Precise should have very similar scores for otherwise identical audio that just occurs at a different position in the audio stream.
