Recognition Techniques Discussion

Ben Miller edited this page Jan 10, 2022 · 11 revisions

For details on the results of the recognitions, see Recognition Results Explanation.

There are four different recognizers implemented. FingerprintRecognizer uses the fingerprinting technique. CorrelationRecognizer uses cross-correlation with the audio arrays. CorrelationSpectrogramRecognizer uses cross-correlation with the audio spectrograms. VisualRecognizer uses either ssim or mse with the audio spectrograms.

All files are converted to one channel, 16 bit, and normalized when read. All recognizers except CorrelationRecognizer default to 44100 hertz; CorrelationRecognizer defaults to 8000 hertz. These defaults gave the best results across applications. The sample rate can be changed with recognizer.config.sample_rate.
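
That preprocessing step can be sketched in numpy (a rough illustration, not audalign's actual implementation):

```python
import numpy as np

def preprocess(stereo_int16: np.ndarray) -> np.ndarray:
    """Convert stereo 16-bit samples to one normalized mono channel."""
    mono = stereo_int16.astype(np.float64).mean(axis=1)  # average the channels
    peak = np.abs(mono).max()
    if peak > 0:
        mono = mono / peak * 32767  # rescale to the full 16-bit range
    return mono.astype(np.int16)
```

Resampling to the configured sample rate would happen alongside this step.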

Recognizers have a corresponding Config object that handles all configuration.

For common configuration adjustments, look at the base config class.

Recognitions should be done through audalign.recognize.

FingerprintRecognizer

FingerprintRecognizer is the most robust, but it isn't perfect for all applications.

Each file can be fingerprinted first; recognize then uses those fingerprints. If the files to be recognized are not already fingerprinted, audalign.recognize will fingerprint them prior to the recognition. Fingerprints are shared between alignments and recognitions, with the exception of fine_align, which always fingerprints each file anew.

A spectrogram is first calculated from the audio files, then peaks are calculated from the spectrogram. A hash is then calculated from the peaks. The structure of the hash is defined by the config's hash_style property (set by recognizer.config.set_hash_style("...")). The four hash_styles are as follows:

  • base hash style consists of two peaks: two frequencies and a time difference. Creates many matches but is sensitive to noise.
  • panako hash style consists of three peaks: two frequency differences, two frequency bands, and one time difference ratio. Creates few matches and is very resistant to noise.
  • panako_mod hash style consists of three peaks: two frequency differences and one time difference ratio. Creates fewer matches than base, more than panako, and is moderately resistant to noise.
  • base_three hash style consists of three peaks: three frequencies and two time differences.

The panako hash style is based on the Panako project: github.com/JorenSix/Panako
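
As a sketch of the base hash style above (a hypothetical helper, not audalign's internals), two peaks reduce to two frequencies plus a time difference, which are then hashed; because only the time difference is stored, the hash is invariant to where the pair occurs in the file:

```python
import hashlib

def base_hash(peak1, peak2):
    """Hash two spectrogram peaks, each a (time_bin, freq_bin) pair."""
    t1, f1 = peak1
    t2, f2 = peak2
    # two frequencies and the time difference between the peaks
    key = f"{f1}|{f2}|{t2 - t1}"
    return hashlib.sha1(key.encode()).hexdigest()[:20]
```

Exact matches between files are then found by comparing these hash strings.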

Recognize then uses the fingerprints and finds exact matches between files.

Pros

  • insensitive to noise
  • parameters can be adjusted for better accuracy and more fingerprints
  • preset parameters with audalign.set_accuracy()
  • different hashes have slightly different results
  • uses audio "features"
  • fairly fast

Cons

  • higher accuracy is memory intensive
  • requires sound events longer than 1/4 second ( bad with bursts )
  • harder to tweak parameters
  • less accurate time-wise ( within roughly 0.04 seconds )

Should be configured with recognizer.config.set_accuracy() with values between 1 and 4 or by changing the hash_style.
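
For example (assuming the audalign package is installed; set_accuracy and set_hash_style are the config methods mentioned above):

```python
import audalign as ad

recognizer = ad.FingerprintRecognizer()
recognizer.config.set_accuracy(3)  # presets from 1 to 4
recognizer.config.set_hash_style("panako_mod")
# the recognizer is then passed to audalign.recognize
```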

CorrelationRecognizer

CorrelationRecognizer uses cross-correlation through scipy.signal.correlate, so it is purely amplitude based. It can be much more accurate time-wise, though it is much more sensitive to noise. Correlation always returns a result, even if there isn't enough confidence to guarantee it. If there is a known alignment and the fingerprinting technique doesn't return any results, there is a good chance this method returns a correct alignment.

All values passed into recognizer.config.passthrough_args are sent to scipy.signal.find_peaks.
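
For example, with illustrative values (find_peaks and its keywords are real scipy API; the specific numbers are not audalign defaults):

```python
import numpy as np
from scipy.signal import find_peaks

signal = np.sin(np.linspace(0, 6 * np.pi, 300))  # three maxima
# e.g. recognizer.config.passthrough_args = {"height": 0.5, "distance": 50}
peaks, properties = find_peaks(signal, height=0.5, distance=50)
```

height filters out low peaks and distance enforces a minimum spacing in samples; here find_peaks locates the three sine maxima.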

Reducing the sample rate can lead to huge speedups with locality without sacrificing accuracy. Reducing the sample rate also gets rid of higher frequency noise, which can be beneficial. The default sample rate is 8000 hertz, which is much lower than the default for the other methods.

The scaling factor is calculated by max(correlation) / len(correlation) / 65536, where 65536 is the range of 16-bit audio. Dividing by the length also normalizes for longer audio files.
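
That computation can be sketched with scipy (an illustration following the formula above, not audalign's exact code):

```python
import numpy as np
from scipy.signal import correlate

def correlation_confidence(a: np.ndarray, b: np.ndarray) -> float:
    """max(correlation) / len(correlation) / 65536, per the formula above."""
    corr = correlate(a, b, mode="full")
    return corr.max() / len(corr) / 65536
```

Identical signals produce a large positive value; an all-zero signal correlates to zero.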

Pros

  • very accurate time-wise ( within roughly 0.001 seconds or better )
  • very fast without locality
  • always returns a result
  • not much to configure
  • not memory intensive

Cons

  • sensitive to noise
  • locality takes longer the smaller the locality window

Values of note to configure for this technique are sample_rate and locality.

CorrelationSpectrogramRecognizer

CorrelationSpectrogramRecognizer uses cross-correlation through scipy.signal.correlate on the spectrograms of the audio files. It has the same accuracy time-wise as fingerprinting. Like plain correlation, it always returns a result, even if there isn't enough confidence to guarantee it. This method is more feature based and better able to pull a feature out of the signal than regular correlation, but faces the same problems with spectrograms as fingerprinting.

All values passed into recognizer.config.passthrough_args are sent to scipy.signal.find_peaks.

Reducing the sample rate mainly gets rid of higher frequency noise, which can be beneficial, and it doesn't affect speed much if at all. The default sample rate is 44100 hertz.

The scaling factor is calculated by max(correlation) / len(correlation) / recognizer.config.fft_window_size / 100. This normalizes for audio file length, and the division by 100 gives a more understandable range of values.

Pros

  • better feature extraction than correlation
  • very fast without locality
  • always returns a result
  • not much to configure
  • not memory intensive

Cons

  • more sensitive to noise than fingerprinting
  • less accurate time-wise than correlation ( within roughly 0.04 seconds )
  • locality takes longer the smaller the locality window

Values of note to configure for this technique are sample_rate, fft_window_size, and locality.

VisualRecognizer

VisualRecognizer operates on the spectrograms of the audio files. It has the same accuracy time-wise as fingerprinting. It can be difficult to get results, as small changes to volume_threshold result in huge differences in computation time. This method is better able to pull some features out of the audio than fingerprinting, but is very susceptible to noise. It relies on longer audio files to cancel out noise interference, as a single noise event might have a high ssim value but only one match with that offset.

The spectrograms of the audio files are calculated, then windows with max values greater than volume_threshold are compared. Structural similarity is the main confidence measure; mean squared error can also be calculated, though it is not nearly as good as ssim.
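
A toy sketch of that windowed comparison (using mean squared error for brevity since ssim requires an extra dependency; the column-based window width here is a stand-in for the configured img_width):

```python
import numpy as np

def compare_windows(spec_a, spec_b, volume_threshold=215.0, window_cols=32):
    """Score same-position spectrogram windows whose max exceeds the threshold."""
    scores = []
    for start in range(0, spec_a.shape[1] - window_cols + 1, window_cols):
        win_a = spec_a[:, start:start + window_cols]
        win_b = spec_b[:, start:start + window_cols]
        if win_a.max() > volume_threshold and win_b.max() > volume_threshold:
            scores.append(float(np.mean((win_a - win_b) ** 2)))  # mse: lower is better
    return scores
```

Quiet windows below the threshold are skipped entirely, which is why volume_threshold dominates computation time.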

volume_floor applies an np.clip to bring the lowest values up to the given value.
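
In numpy terms:

```python
import numpy as np

spectrogram = np.array([[0.0, 5.0, 40.0], [12.0, 3.0, 80.0]])
volume_floor = 10.0
# raise everything below the floor up to the floor; leave the rest untouched
floored = np.clip(spectrogram, volume_floor, None)
```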

img_width allows you to compare larger or smaller windows.

cutoff_top (default 200) cuts off the top x rows returned from mlab.specgram in the fingerprinter. Those high frequencies are very noisy and disrupt recognitions.
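
Assuming rows are frequency bins ordered low to high, as mlab.specgram returns them, trimming looks like:

```python
import numpy as np

spec = np.random.rand(1025, 50)  # (freq_bins, time_bins) from a specgram
cutoff_top = 200
trimmed = spec[:-cutoff_top, :]  # drop the highest-frequency rows
```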

Scaling the spectrogram seemed like an exciting course of action, but it ended up only hurting results. I invite you to mess with it. Hopefully it proves more useful to you than it did to me.

In order to get better results with this method than with the others, I usually had to lower volume_threshold enough that a calculation took around 30 minutes. The computation time was only sometimes worth it, as it seemed to be accurate about 40% of the time. However, those were cases where correlation and fingerprinting were unable to get any correct results at all, so in those cases it is significantly better.

Pros

  • can extract features better than other methods
  • not very memory intensive compared to fingerprinting

Cons

  • requires longer audio files
  • requires lots of parameter tuning
  • most sensitive to noise
  • less accurate time-wise than correlation ( within roughly 0.04 seconds )
  • good accuracy requires long computations ( minutes up to an hour )

Values of note to configure for this technique are sample_rate, fft_window_size, and locality, along with the following (shown with their defaults):

  • img_width: float = 1.0
  • volume_threshold: float = 215.0
  • volume_floor: float = 10.0
  • vert_scaling: float = 1.0
  • horiz_scaling: float = 1.0
  • calc_mse: bool = False
  • cutoff_top: int = 200
  • freq_threshold: int = 100
  • sample_rate = 44100
  • fft_window_size = 4096