Description
Describe the bug
When using the Python bindings for Precise, I've noticed that the model predictions can vary substantially depending on where in the input audio the wake word is located. For example, the plot below shows the default "hey mycroft" model score for two repetitions of the same audio clip, where the only difference is that the second clip has one additional frame (1024 samples) of zero-padding compared to the first clip:
I'm currently evaluating Precise against other wake word solutions, and this behavior makes it difficult to assess performance accurately: the length and padding of the test clips alone can cause significant differences in the false-positive and false-negative metrics.
Is this behavior expected? If so, is there a recommended way to evaluate the model to minimize such effects?
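To make the question concrete, here is a minimal sketch of one way an evaluation could account for the padding sensitivity: score each clip under several amounts of leading silence and report the spread of the peak score. This is only an illustration, not a recommendation from the Precise maintainers. The names `padded_variants`, `peak_score_spread`, and `score_fn` are hypothetical; `score_fn` stands in for whatever code feeds a clip through the engine and returns its per-chunk scores.

```python
import numpy as np


def padded_variants(clip, chunk_size=1024, base_chunks=50, n_variants=4):
    """Yield (extra_chunks, padded_clip) pairs with increasing leading silence.

    Each variant has `base_chunks + extra_chunks` chunks of zeros in front
    of the clip, mirroring the version1/version2 construction in this report.
    """
    for extra in range(n_variants):
        pad = np.zeros(chunk_size * (base_chunks + extra), dtype=clip.dtype)
        yield extra, np.concatenate((pad, clip))


def peak_score_spread(clip, score_fn, **kwargs):
    """Return (min, max, mean) of the peak score across the padded variants.

    `score_fn` is a hypothetical callable: it takes a 1-D sample array and
    returns a non-empty sequence of per-chunk model scores.
    """
    peaks = [max(score_fn(variant)) for _, variant in padded_variants(clip, **kwargs)]
    return min(peaks), max(peaks), float(np.mean(peaks))
```

A large gap between the minimum and maximum peak for the same clip would indicate that a single-padding evaluation is not representative for that clip.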
To Reproduce
The following code should reproduce the plot above, using the attached audio file below and the model versions referenced in the code:
import numpy as np
import matplotlib.pyplot as plt
import scipy.io.wavfile
from precise_runner import PreciseEngine

# Chunk size in samples (the audio is int16, so 2 bytes per sample)
chunk_size = 1024

# Load clip
sr, dat = scipy.io.wavfile.read("path/to/attached/wav/file")

# Create two versions of the clip that differ only in leading silence
version1 = np.concatenate((np.zeros(chunk_size * 50, dtype=np.int16), dat))
# version2 simply has one more chunk of zeros than version1
version2 = np.concatenate((np.zeros(chunk_size * 51, dtype=np.int16), dat))

ps = []  # scores for both clips, stored back to back
for clip in [version1, version2]:
    # Load Precise model for each clip
    P = PreciseEngine(
        "./precise-engine_0.3.0_x86_64/precise-engine/precise-engine",
        "models/hey-mycroft_C1_E6000_B5000_D0.2_R20_S0.8.pb",
        chunk_size=chunk_size * 2,  # in bytes, not samples
    )
    P.start()
    for i in range(0, clip.shape[0] - chunk_size, chunk_size):
        score = P.get_prediction(clip[i:i + chunk_size].tobytes())
        # Don't store the first few predictions to avoid model initialization behavior
        if i >= chunk_size * 5:
            ps.append(score)
    P.stop()

plt.plot(ps)
plt.xlabel("Chunk Index")
plt.ylabel("Model Score")
plt.show()
Expected behavior
Precise should have very similar scores for otherwise identical audio that just occurs at a different position in the audio stream.
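To put a number on "very similar", the two score traces can be aligned by the known padding difference and compared chunk by chunk. The sketch below assumes the scores for each clip are kept in separate lists (the reproduction code stores them back to back in one list); `max_aligned_difference` is a hypothetical helper name, not part of Precise.

```python
import numpy as np


def max_aligned_difference(scores_a, scores_b, shift_chunks=1):
    """Largest absolute per-chunk score difference after alignment.

    `scores_b` is assumed to come from a clip with `shift_chunks` more
    leading chunks of silence than `scores_a`, so its first `shift_chunks`
    scores are dropped before comparing the overlapping part.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)[shift_chunks:]
    n = min(len(a), len(b))
    if n == 0:
        raise ValueError("no overlapping scores to compare")
    return float(np.max(np.abs(a[:n] - b[:n])))
```

If the model were insensitive to the position of the audio in the stream, this value would be close to zero for the two clips above.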