Vietnamese students can read Nhận diện covid qua tiếng ho.pdf and Covid19_slide.pdf, which are written in Vietnamese. For foreign readers, I wrote all the important information in coughvid-19-crnn-attention.ipynb, Mechanism of cough.pptx, and my Git repo COVID-19-Cough-Classification-phase-1.
The Kaggle notebook: https://www.kaggle.com/bomaich/coughvid-19-crnn-attention
In COVID-19-Cough-Classification-phase-1 you can find the advantages of a COVID classification system based on cough sound, background techniques such as scaling and K-fold cross validation, and the way I implemented them. However, that repo has some serious flaws, caused by my lack of experience with cough analysis and by having to depend on a pre-processed dataset (feature tables rather than sound files). I therefore recommend reading this repo in conjunction with coughvid-19-crnn-attention.ipynb
- 1. Cough mechanism
- 2. Primary features
- 3. Model
- 4. Data augmentation
- 5. K-fold cross validation
- 6. Result
I won't talk much about this part because you can find detailed information in Mechanism of cough.pptx and in the paper Cough sound analysis and objective correlation with spirometry and clinical diagnosis. To sum up, different diseases create different cough sounds, and those sounds differ in:
- The energy of the cough sequence
- The energy distribution between cough bouts
- The sound of the breath
- The duration of the cough or breath, and so on
I discovered that the authors of the dataset used in my last repo [COVID-19-Cough-Classification-phase-1] calculated the mean value of every feature over the entire time series, which was a significant error. For example, we end up with just one mean ZCR value for the whole sound. That makes little sense, because ZCR is usually used together with energy, frame by frame, to determine when there is sound and when there is silence:
"The results suggest that zero crossing rates are low for voiced part and high for unvoiced part where as the energy is high for voiced part and low for unvoiced part."
Separation of Voiced and Unvoiced using Zero crossing rate and Energy of the Speech Signal
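Below is a minimal sketch (my own illustration, not the dataset authors' code) of computing ZCR and energy per frame with librosa instead of one mean over the whole recording; the file name and the simple thresholds are assumptions:

```python
import librosa
import numpy as np

y, sr = librosa.load("cough.wav", sr=22050)  # hypothetical input file

frame_length, hop_length = 2048, 512
# Frame-wise zero crossing rate and RMS energy (one value per frame, not per clip)
zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length,
                                         hop_length=hop_length)[0]
rms = librosa.feature.rms(y=y, frame_length=frame_length,
                          hop_length=hop_length)[0]

# Rough voiced/unvoiced split: voiced frames tend to have high energy and low ZCR
voiced = (rms > rms.mean()) & (zcr < zcr.mean())
print(f"{voiced.sum()} of {len(voiced)} frames flagged as voiced")
```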
Based on this background knowledge, I decided to use a CRNN (because it performs well on sequential data) and to preprocess the features in a different way.
I will extract 2D features like:
- Mel-frequency Spectrogram
- Chroma
and then combine them into an image to feed into the model. I may also add 1D features, kept as per-frame values through time (not averaged over the whole audio like in phase 1):
- MFCCs (I think of them as a lower-resolution, noise-reduced version of the Mel spectrogram)
- Spectral Centroid
- Spectral Bandwidth
- Spectral Roll-off
- ZCR + energy
Finally, I concatenated all of those features into one image, as shown below (a code sketch follows the figure):
Figure 1. Features data
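As an illustration of this feature layout, here is a hedged sketch using librosa; the file name, hop length, and other parameters are assumptions and may differ from the notebook:

```python
import librosa
import numpy as np

y, sr = librosa.load("cough.wav", sr=22050)  # hypothetical input file
hop = 512

# 2D features (feature bins x time frames)
mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop))
chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop)

# 1D features kept per frame (NOT averaged over the whole clip as in phase 1)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=hop)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop)
bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr, hop_length=hop)
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, hop_length=hop)
zcr = librosa.feature.zero_crossing_rate(y, hop_length=hop)
rms = librosa.feature.rms(y=y, hop_length=hop)

# Stack everything along the feature axis into a single "image"
features = np.concatenate(
    [mel, chroma, mfcc, centroid, bandwidth, rolloff, zcr, rms], axis=0)
print(features.shape)  # (n_feature_rows, n_time_frames)
```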
I'd like to go through the Mel-frequency Spectrogram and MFCCs in further detail. A sound is just a collection of waveforms with varying amplitude and frequency. Ah, that sounds familiar, doesn't it? Yes, the Fourier Transform. We use the Fourier transform to encode a sound wave into a picture: a Mel-frequency Spectrogram, with frequency on the vertical axis, time on the horizontal axis, and magnitude as the value at each time-frequency coordinate. From that image we can reconstruct (approximately, since phase information is lost) the original sound.
The reason I call MFCCs a lower-resolution, noise-reduced version of the Mel spectrogram is that they are computed from the Mel spectrogram by keeping only a small number of coefficients: we discard the fine spectral detail and retain the components that seem to carry the important information, which also tends to remove noise.
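To make that relationship concrete, here is a minimal sketch (assuming librosa's implementation, where MFCCs are a DCT of the log-mel spectrogram) of deriving MFCCs from the Mel spectrogram; the file name is an assumption:

```python
import librosa

y, sr = librosa.load("cough.wav", sr=22050)  # hypothetical input file

# STFT -> mel filterbank -> log scale: the mel spectrogram "image"
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel)

# MFCCs are a DCT of the log-mel spectrogram; keeping only the first ~20
# coefficients throws away fine spectral detail, which is why they act like a
# lower-resolution, smoother summary of the mel spectrogram
mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=20)

print(log_mel.shape, mfcc.shape)  # e.g. (128, T) vs (20, T)
```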
Figure 2. Model Structure
The feature images are then normalized, resized, padded, and transposed before being fed into the model (a rough sketch of this step is shown below).
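This is only a sketch of that preprocessing step; the target size and the min-max normalization are assumptions, not the notebook's exact values:

```python
import numpy as np
import cv2

def preprocess(features, target_height=128, target_width=256):
    # Min-max normalize to [0, 1]
    f = (features - features.min()) / (features.max() - features.min() + 1e-8)
    f = f.astype(np.float32)
    # Resize the feature axis, shrink the time axis if it is too long
    f = cv2.resize(f, (min(f.shape[1], target_width), target_height))
    # Pad the time axis with zeros up to a fixed width
    if f.shape[1] < target_width:
        f = np.pad(f, ((0, 0), (0, target_width - f.shape[1])))
    # Transpose so the time axis comes first for the recurrent part of the model
    return f.T  # shape: (time, features)
```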
The four main stages of the model (see the sketch after this list):
- CNN: extracts image features (I use a VGG16-style CNN with 7 interleaved blocks of Convolution, Max pooling, and Batch normalization).
- Bi-LSTM: works well with sequence data and overcomes the main RNN drawbacks (vanishing or exploding gradients). The prefix "Bi" stands for "Bidirectional," meaning the model can update the current state using both past and future context.
- Attention: even an LSTM struggles with information that is far away in the sequence; attention is there to help.
- Fully connected layer: classifies the LSTM output into 2 classes (COVID and non-COVID).
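Here is a hedged Keras sketch of this CNN + Bi-LSTM + attention + fully connected structure; the layer counts, sizes, and attention pooling are placeholders, not the exact architecture from the notebook:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_crnn(input_shape=(256, 128, 1), num_classes=2):
    inp = layers.Input(shape=input_shape)  # (time, features, channels)

    # CNN part: VGG-style blocks of Conv -> BatchNorm -> MaxPool
    x = inp
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(pool_size=(1, 2))(x)  # pool only the feature axis

    # Collapse the feature/channel axes so the time axis becomes a sequence
    x = layers.Reshape((input_shape[0], -1))(x)

    # Bi-LSTM over the time axis
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)

    # Simple attention pooling over time
    scores = layers.Dense(1, activation="tanh")(x)
    weights = layers.Softmax(axis=1)(scores)
    context = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([x, weights])

    # Fully connected classifier: COVID vs non-COVID
    out = layers.Dense(num_classes, activation="softmax")(context)
    return models.Model(inp, out)

model = build_crnn()
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```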
As I said in the Cough mechanism section, there are several elements that can influence the outcome. As a result, using MixUp (cutting the cough sequences of several people and randomly stitching them together) or SMOTE, as in phase 1, to deal with the class imbalance problem is not a good idea: they create artificial data that may never occur in real life, which doesn't help at all.
Methods that I used instead (a small sketch follows the list):
- Time Shift
- Adding background noise
- Stretching the sound (just a little bit)
- Changing Gain
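A small sketch of these waveform-level augmentations with numpy and librosa; the shift, noise, stretch, and gain ranges are illustrative assumptions, not the values used in the notebook:

```python
import numpy as np
import librosa

def augment(y, sr):
    # Time shift: roll the waveform by a random offset (up to half a second)
    y = np.roll(y, np.random.randint(-sr // 2, sr // 2))
    # Background noise: add low-amplitude Gaussian noise
    y = y + 0.005 * np.random.randn(len(y))
    # Stretch the sound just a little bit (e.g. 0.95x to 1.05x)
    y = librosa.effects.time_stretch(y, rate=np.random.uniform(0.95, 1.05))
    # Change the gain: scale the amplitude
    y = y * np.random.uniform(0.8, 1.2)
    return y
```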
Note: apply data augmentation only to the train and validation sets, not to the test set.
AUC increased from 0.67 to 0.69 after using data augmentation.
Figure 3. AUC on original data
Figure 4. AUC on augmented data