# Identification model

## 1. Audio signal processing


First, we process the wav file to obtain a mel spectrogram that is fed into the model.

You can find very good explanations of audio signal processing for machine learning on this YouTube channel: Audio Signal Processing for Machine Learning

The Spectrogram helper class is instantiated with parameters describing the audio signal:

```python
class Spectrogram(nn.Module):
    def __init__(self, n_fft=2048, hop_length=None, win_length=None,
                 window='hann', center=True, pad_mode='reflect', power=2.0,
                 freeze_parameters=True):
        ...
```

- `n_fft`: window size (in samples) applied to each frame of the sound signal
- `hop_length`: number of samples the window slides along the signal before the window function is applied again
- `window`: window function applied ('hann')

This class calls a Short-Time Fourier Transform (STFT) function in order to return the spectrogram.
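As an illustration, here is a minimal sketch of how the waveform could be loaded and turned into a spectrogram, assuming the Spectrogram class comes from torchlibrosa (whose interface matches the signature above); the file path, sample rate and hop length are placeholders, not the project's actual values:

```python
import librosa
import torch
from torchlibrosa.stft import Spectrogram  # assumption: torchlibrosa implementation

# Load the wav file as a mono waveform (path and sample rate are placeholders)
waveform, sr = librosa.load("recording.wav", sr=32000)
waveform = torch.from_numpy(waveform).float().unsqueeze(0)  # (batch_size, samples)

# STFT-based power spectrogram with the parameters listed above
spectrogram_extractor = Spectrogram(n_fft=2048, hop_length=512, win_length=2048,
                                    window='hann', center=True, pad_mode='reflect',
                                    power=2.0, freeze_parameters=True)

spec = spectrogram_extractor(waveform)  # (batch_size, 1, time_steps, n_fft // 2 + 1)
```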

Figure 1: Example of a feature extraction process based on the short-time Fourier transform (STFT)

Power spectral density P_s(m, k) of each signal frame s(m, k) is computed using the STFT. The feature vector x_m is constructed by extracting six components from the lower subband (15.1375–20.508 Hz) and six components from the upper subband (38.574–84.961 Hz). The discrete frequencies are obtained with a sampling frequency of 250 Hz and a frame length of 512 samples.

Source: FPGA Implementation of Blue Whale Calls Classifier Using High-Level Programming Tool - Scientific Figure on ResearchGate

Spectrogram

The spectrogram is then passed to LogmelFilterBank to obtain a log-mel spectrogram:

```python
class LogmelFilterBank(nn.Module):
    def __init__(self, sr, n_fft, n_mels, fmin, fmax):
        # Mel filter bank matrix, stored as a module parameter
        self.melW = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels,
                                        fmin=fmin, fmax=fmax).T
        self.melW = nn.Parameter(torch.Tensor(self.melW))

    def power_to_db(self, input):  # ref=1.0, amin=1e-10, top_db=80.0
        ...
```

power_to_db converts a power spectrogram (amplitude squared) to decibel (dB) units. This computes the scaling 10 * log10(S / ref) in a numerically stable way.
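As a usage sketch continuing the example above, assuming the same torchlibrosa-style implementation (the mel parameters shown are illustrative placeholders, not necessarily the project's values):

```python
from torchlibrosa.stft import LogmelFilterBank  # assumption: torchlibrosa implementation

# Placeholder mel parameters: n_mels, fmin and fmax are assumptions
logmel_extractor = LogmelFilterBank(sr=32000, n_fft=2048, n_mels=64,
                                    fmin=50, fmax=14000,
                                    ref=1.0, amin=1e-10, top_db=None,
                                    freeze_parameters=True)

# 'spec' is the power spectrogram from the previous sketch
logmel = logmel_extractor(spec)  # (batch_size, 1, time_steps, mel_bins)
```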

Logmel spectrogram

Batch Normalization 2D

SpecAugmentation can optionally be applied (not used in our case):

```python
SpecAugmentation(
    time_drop_width=64,
    time_stripes_num=2,
    freq_drop_width=8,
    freq_stripes_num=2)
```

Reference paper: SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

SpecAugmentation configures a time dropper and a frequency dropper; the actual masking is then performed by the DropStripes module.
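A short usage sketch, assuming the torchlibrosa implementation of SpecAugmentation; note that stripes are only dropped while the module is in training mode:

```python
from torchlibrosa.augmentation import SpecAugmentation  # assumption: torchlibrosa implementation

spec_augmenter = SpecAugmentation(time_drop_width=64, time_stripes_num=2,
                                  freq_drop_width=8, freq_stripes_num=2)

spec_augmenter.train()        # DropStripes is a no-op in eval mode
x = spec_augmenter(logmel)    # x: (batch_size, 1, time_steps, mel_bins)
```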

At this point we have x, of shape (batch_size, 1, time_steps, mel_bins), and frames_num.

The second dimension of x is expanded from 1 to 3 channels, and x is fed to DenseNet121, which is used as a feature extractor.
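A minimal sketch of this step, assuming torchvision's densenet121 (recent torchvision API) is used as the backbone; the input shape and the choice of pretrained weights are illustrative assumptions:

```python
import torch
import torchvision

# Keep only DenseNet121's convolutional feature extractor
backbone = torchvision.models.densenet121(weights="DEFAULT").features

x = torch.randn(4, 1, 1000, 64)   # (batch_size, 1, time_steps, mel_bins)
x = x.expand(-1, 3, -1, -1)       # replicate the single channel to get 3 input channels
x = backbone(x)                   # (batch_size, 1024, reduced_time_steps, reduced_mel_bins)
```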

## 2. CNN feature extractor: DenseNet121


DenseNet121 reference paper: Densely Connected Convolutional Networks

Figure 2: DenseNet architecture

After passing through DenseNet, the frequency axis is aggregated by taking the mean:

```python
# Aggregate in frequency axis
x = torch.mean(x, dim=3)
```

Max and average 1D pooling layers are applied along the time axis and their outputs are added (a sketch of the full head follows the list below):

```python
x1 = F.max_pool1d(x, kernel_size=3, stride=1, padding=1)
x2 = F.avg_pool1d(x, kernel_size=3, stride=1, padding=1)
x = x1 + x2
```

- Dropout (p = 0.5)
- FC1
- ReLU
- Dropout (p = 0.5)
- Attention block: classification for each segment
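Putting the steps above together, here is a sketch of this classification head; the FC width (2048), the 1024 input channels (DenseNet121's feature size) and the dummy time dimension are assumptions used for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

fc1 = nn.Linear(1024, 2048)                # placeholder sizes

x = torch.randn(4, 1024, 312)              # (batch_size, channels, time_steps) after the frequency mean
x1 = F.max_pool1d(x, kernel_size=3, stride=1, padding=1)
x2 = F.avg_pool1d(x, kernel_size=3, stride=1, padding=1)
x = x1 + x2

x = F.dropout(x, p=0.5, training=True)
x = x.transpose(1, 2)                      # (batch_size, time_steps, channels)
x = F.relu_(fc1(x))                        # FC1 + ReLU applied to every time step
x = x.transpose(1, 2)                      # back to (batch_size, channels, time_steps)
x = F.dropout(x, p=0.5, training=True)
# x is now fed to the attention block described in the next section
```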

## 3. Classification for each segment with a self-attention mechanism


```python
(clipwise_output, norm_att, segmentwise_output) = self.att_block(x)
```

Figure 3: Example of a self-attention mechanism in a CNN
(source: Non-local Neural Networks)

Weights are initialized with Xavier initialization. The attention block uses 1D convolution layers with a kernel size of 1 (att() and cla()). It then outputs the self-attention map and the classification:

```python
def forward(self, x):
    # x: (n_samples, n_in, n_time)

    # Calculate the self-attention map along the time axis
    norm_att = torch.softmax(torch.tanh(self.att(x)), dim=-1)
    # Calculate the segment-wise classification result
    cla = self.nonlinear_transform(self.cla(x))
    # Attention aggregation gives the clip-wise prediction
    x = torch.sum(norm_att * cla, dim=2)

    return x, norm_att, cla
```
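For context, here is a sketch of how such an attention block could be declared, in the spirit of the PANNs AttBlock; the class name, argument names and the sigmoid non-linearity are assumptions:

```python
import torch
import torch.nn as nn

class AttBlock(nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        # Kernel-size-1 1D convolutions act as per-segment linear layers
        self.att = nn.Conv1d(n_in, n_out, kernel_size=1, bias=True)
        self.cla = nn.Conv1d(n_in, n_out, kernel_size=1, bias=True)
        self.nonlinear_transform = torch.sigmoid   # assumed non-linearity

        # Xavier initialization of the convolution weights
        nn.init.xavier_uniform_(self.att.weight)
        nn.init.xavier_uniform_(self.cla.weight)

    # forward() is the method shown above
```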

Each segment prediction stands for several frame predictions, so interpolate() and pad_framewise_output() help to recover frame-wise outputs with the right dimensions (not used in our case).
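For reference, a sketch of how these two helpers are commonly implemented in PANNs-style code (an assumption, not necessarily this repository's exact implementation):

```python
import torch

def interpolate(x, ratio):
    """Upsample segment-wise predictions to frame level.
    x: (batch_size, time_steps, classes_num)"""
    batch_size, time_steps, classes_num = x.shape
    upsampled = x[:, :, None, :].repeat(1, 1, ratio, 1)
    return upsampled.reshape(batch_size, time_steps * ratio, classes_num)

def pad_framewise_output(framewise_output, frames_num):
    """Pad the upsampled output to the original number of frames by repeating the last frame."""
    pad = framewise_output[:, -1:, :].repeat(1, frames_num - framewise_output.shape[1], 1)
    return torch.cat((framewise_output, pad), dim=1)
```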

The clip-wise prediction is saved in the output dictionary.