# Identification model

## 1. Audio signal processing


First, we process the wav file to obtain a mel spectrogram that is fed into the model.

You can find very good explanations of audio signal processing for machine learning on this YouTube channel: Audio Signal Processing for Machine Learning

The Spectrogram helper class is instantiated with parameters describing the audio signal:

```python
class Spectrogram(nn.Module):
    def __init__(self, n_fft=2048, hop_length=None, win_length=None,
                 window='hann', center=True, pad_mode='reflect', power=2.0,
                 freeze_parameters=True):
        ...
```

- `n_fft`: window size (in samples) applied to each frame of the sound signal
- `hop_length`: number of samples the window slides along the signal before the window function is applied again
- `window`: window function applied ('hann')

This class calls a Short-Time Fourier Transform (STFT) function in order to return the spectrogram.
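As an illustration, here is a minimal sketch of how the waveform could be loaded and turned into a spectrogram, assuming the Spectrogram class comes from torchlibrosa (whose interface matches the signature above); the file path, sample rate and hop length are placeholders, not the project's actual values:

```python
import librosa
import torch
from torchlibrosa.stft import Spectrogram  # assumption: torchlibrosa implementation

# Load the wav file as a mono waveform (path and sample rate are placeholders)
waveform, sr = librosa.load("recording.wav", sr=32000)
waveform = torch.from_numpy(waveform).float().unsqueeze(0)  # (batch_size, samples)

# STFT-based power spectrogram with the parameters listed above
spectrogram_extractor = Spectrogram(n_fft=2048, hop_length=512, win_length=2048,
                                    window='hann', center=True, pad_mode='reflect',
                                    power=2.0, freeze_parameters=True)

spec = spectrogram_extractor(waveform)  # (batch_size, 1, time_steps, n_fft // 2 + 1)
```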

Figure 1: Example of a feature extraction process based on the short-time Fourier transform (STFT)

Power spectral density P_s(m, k) of each signal frame s(m, k) is computed using the STFT. The feature vector x_m is constructed by extracting six components from the lower subband (15.1375–20.508 Hz) and six components from the upper subband (38.574–84.961 Hz). The discrete frequencies are obtained with a sampling frequency of 250 Hz and a frame length of 512 samples.

Source: FPGA Implementation of Blue Whale Calls Classifier Using High-Level Programming Tool - Scientific Figure on ResearchGate

Spectrogram

The spectrogram is then passed to LogmelFilterBank to obtain a log-mel spectrogram:

```python
class LogmelFilterBank(nn.Module):
    def __init__(self, sr, n_fft, n_mels, fmin, fmax):
        # Mel filter bank matrix, stored as a module parameter
        self.melW = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels,
                                        fmin=fmin, fmax=fmax).T
        self.melW = nn.Parameter(torch.Tensor(self.melW))

    def power_to_db(self, input):  # ref=1.0, amin=1e-10, top_db=80.0
        ...
```

power_to_db converts a power spectrogram (amplitude squared) to decibel (dB) units. This computes the scaling 10 * log10(S / ref) in a numerically stable way.
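As a usage sketch continuing the example above, assuming the same torchlibrosa-style implementation (the mel parameters shown are illustrative placeholders, not necessarily the project's values):

```python
from torchlibrosa.stft import LogmelFilterBank  # assumption: torchlibrosa implementation

# Placeholder mel parameters: n_mels, fmin and fmax are assumptions
logmel_extractor = LogmelFilterBank(sr=32000, n_fft=2048, n_mels=64,
                                    fmin=50, fmax=14000,
                                    ref=1.0, amin=1e-10, top_db=None,
                                    freeze_parameters=True)

# 'spec' is the power spectrogram from the previous sketch
logmel = logmel_extractor(spec)  # (batch_size, 1, time_steps, mel_bins)
```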

Logmel spectrogram

Batch Normalization 2D

SpecAugmentation can optionally be applied (not used in our case):

```python
SpecAugmentation(
    time_drop_width=64,
    time_stripes_num=2,
    freq_drop_width=8,
    freq_stripes_num=2)
```

Reference paper: SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

SpecAugmentation configures a time dropper and a frequency dropper; the actual masking is then performed by the DropStripes module.
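A short usage sketch, assuming the torchlibrosa implementation of SpecAugmentation; note that stripes are only dropped while the module is in training mode:

```python
from torchlibrosa.augmentation import SpecAugmentation  # assumption: torchlibrosa implementation

spec_augmenter = SpecAugmentation(time_drop_width=64, time_stripes_num=2,
                                  freq_drop_width=8, freq_stripes_num=2)

spec_augmenter.train()        # DropStripes is a no-op in eval mode
x = spec_augmenter(logmel)    # x: (batch_size, 1, time_steps, mel_bins)
```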

At this point we have x, of shape (batch_size, 1, time_steps, mel_bins), and frames_num.

The second dimension of x is expanded from 1 to 3 channels, and x is fed to DenseNet121, which is used as a feature extractor.
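A minimal sketch of this step, assuming torchvision's densenet121 (recent torchvision API) is used as the backbone; the input shape and the choice of pretrained weights are illustrative assumptions:

```python
import torch
import torchvision

# Keep only DenseNet121's convolutional feature extractor
backbone = torchvision.models.densenet121(weights="DEFAULT").features

x = torch.randn(4, 1, 1000, 64)   # (batch_size, 1, time_steps, mel_bins)
x = x.expand(-1, 3, -1, -1)       # replicate the single channel to get 3 input channels
x = backbone(x)                   # (batch_size, 1024, reduced_time_steps, reduced_mel_bins)
```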

## 2. CNN feature extractor: DenseNet121


DenseNet121 reference paper: Densely Connected Convolutional Networks

Figure 2: DenseNet architecture

After passing through DenseNet, the frequency axis is aggregated by taking the mean:

```python
# Aggregate in frequency axis
x = torch.mean(x, dim=3)
```

Max and average 1D pooling layers are applied along the time axis and their outputs are added (a sketch of the full head follows the list below):

```python
x1 = F.max_pool1d(x, kernel_size=3, stride=1, padding=1)
x2 = F.avg_pool1d(x, kernel_size=3, stride=1, padding=1)
x = x1 + x2
```

- Dropout (p = 0.5)
- FC1
- ReLU
- Dropout (p = 0.5)
- Attention block: classification for each segment
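Putting the steps above together, here is a sketch of this classification head; the FC width (2048), the 1024 input channels (DenseNet121's feature size) and the dummy time dimension are assumptions used for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

fc1 = nn.Linear(1024, 2048)                # placeholder sizes

x = torch.randn(4, 1024, 312)              # (batch_size, channels, time_steps) after the frequency mean
x1 = F.max_pool1d(x, kernel_size=3, stride=1, padding=1)
x2 = F.avg_pool1d(x, kernel_size=3, stride=1, padding=1)
x = x1 + x2

x = F.dropout(x, p=0.5, training=True)
x = x.transpose(1, 2)                      # (batch_size, time_steps, channels)
x = F.relu_(fc1(x))                        # FC1 + ReLU applied to every time step
x = x.transpose(1, 2)                      # back to (batch_size, channels, time_steps)
x = F.dropout(x, p=0.5, training=True)
# x is now fed to the attention block described in the next section
```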

## 3. Classification for each segment with a self-attention mechanism


```python
(clipwise_output, norm_att, segmentwise_output) = self.att_block(x)
```

Figure 3: Example of a self-attention mechanism in a CNN
(source: Non-local Neural Networks)

Weights are initialized with Xavier initialization. The attention block uses 1D convolution layers with a kernel size of 1 (att() and cla()). It then outputs the self-attention map and the classification:

```python
def forward(self, x):
    # x: (n_samples, n_in, n_time)

    # Calculate the self-attention map along the time axis
    norm_att = torch.softmax(torch.tanh(self.att(x)), dim=-1)
    # Calculate the segment-wise classification result
    cla = self.nonlinear_transform(self.cla(x))
    # Attention aggregation gives the clip-wise prediction
    x = torch.sum(norm_att * cla, dim=2)

    return x, norm_att, cla
```
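For context, here is a sketch of how such an attention block could be declared, in the spirit of the PANNs AttBlock; the class name, argument names and the sigmoid non-linearity are assumptions:

```python
import torch
import torch.nn as nn

class AttBlock(nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        # Kernel-size-1 1D convolutions act as per-segment linear layers
        self.att = nn.Conv1d(n_in, n_out, kernel_size=1, bias=True)
        self.cla = nn.Conv1d(n_in, n_out, kernel_size=1, bias=True)
        self.nonlinear_transform = torch.sigmoid   # assumed non-linearity

        # Xavier initialization of the convolution weights
        nn.init.xavier_uniform_(self.att.weight)
        nn.init.xavier_uniform_(self.cla.weight)

    # forward() is the method shown above
```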

Each segment prediction stands for several frame predictions, so interpolate() and pad_framewise_output() help to recover frame-wise outputs with the right dimensions (not used in our case).
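For reference, a sketch of how these two helpers are commonly implemented in PANNs-style code (an assumption, not necessarily this repository's exact implementation):

```python
import torch

def interpolate(x, ratio):
    """Upsample segment-wise predictions to frame level.
    x: (batch_size, time_steps, classes_num)"""
    batch_size, time_steps, classes_num = x.shape
    upsampled = x[:, :, None, :].repeat(1, 1, ratio, 1)
    return upsampled.reshape(batch_size, time_steps * ratio, classes_num)

def pad_framewise_output(framewise_output, frames_num):
    """Pad the upsampled output to the original number of frames by repeating the last frame."""
    pad = framewise_output[:, -1:, :].repeat(1, frames_num - framewise_output.shape[1], 1)
    return torch.cat((framewise_output, pad), dim=1)
```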

The clip-wise prediction is saved in the output dictionary.