Accepted at Interspeech 2025
This project contains the dataset and baseline multimodal classification models described in the paper. Refer to the Annotation Guidelines for details of the annotation scheme.
project/
└── data/
    └── Voices-AWS/
        ├── interview/
        │   ├── video/              # Place MP4 files here
        │   ├── total_dataset.csv   # Annotation data
        │   ├── exclusions.csv      # Optional: segments to exclude
        │   └── raw.csv             # Raw data
        └── reading/
            ├── video/              # Place MP4 files here
            ├── total_dataset.csv   # Annotation data
            ├── exclusions.csv      # Optional: segments to exclude
            └── raw.csv             # Raw data
git clone https://github.com/mbzuai-nlp/CASA.git
cd CASA
conda create -n casa python=3.12
conda activate casa
pip install -r requirements.txt
- Download Media Files: follow the instructions Here.
- Verify Input Data Structure:
data/Voices-AWS/interview/
├── video/
│   ├── participant1.mp4
│   ├── participant2.mp4
│   └── ...
├── total_dataset_final.csv
└── exclusions.csv
Required Files:
- total_dataset.csv: contains the stuttering annotations with the following columns:
  - media_file: filename without extension
  - item: the group ID in the form [media_id-(group_start_time, group_end_time)], obtained after grouping annotations by region
  - start: start time in milliseconds
  - end: end time in milliseconds
  - annotator: A1, A2, A3, Gold, plus the annotator aggregation methods (BAU, MAS, SAD)
  - SR, ISR, MUR, P, B, V, FG, HM, ME, T: stuttering type indicators (0/1); refer to the Annotation Guidelines for details
- exclusions.csv: contains the unannotated regions (the interviewer's part of the interview section)
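For orientation, here is a minimal sketch of loading these files with pandas and selecting one annotator's labels. The column names follow the description above; the section path is a placeholder.

```python
import pandas as pd

# Placeholder section directory; the same layout applies to reading/.
root = "data/Voices-AWS/interview"
annotations = pd.read_csv(f"{root}/total_dataset.csv")
exclusions = pd.read_csv(f"{root}/exclusions.csv")

# Keep only the rows from a single annotator or aggregation method.
gold = annotations[annotations["annotator"] == "Gold"]

# Stuttering-type indicator columns (0/1), as listed above.
label_cols = ["SR", "ISR", "MUR", "P", "B", "V", "FG", "HM", "ME", "T"]
print(gold[["media_file", "start", "end"] + label_cols].head())
```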
To prepare the data for training, run the following command (note: this takes roughly 30 minutes on a 24-core CPU and requires more than 130 GB of memory):
python prepare.py \
    --root_dir "/path/to/root/dir" \
    --input_dir "/path/to/output/dir" \
    --clip_duration 5 \
    --overlap 2 \
    --max_workers 16

- --clip_duration: duration of each clip in seconds
- --overlap: the overlap window in seconds
- --max_workers: adjust to the number of available CPU cores
The script generates:
- 5-second audio and video features, preprocessed with the Wav2Vec2 and ViViT processors, respectively
- Labels for each annotator
- Labels for the aggregation methods (BAU, MAS, SAD, MAJ)
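For intuition on how the clipping works, here is a rough sketch of cutting 5-second windows with a 2-second overlap and running them through the Hugging Face Wav2Vec2 and ViViT processors. The checkpoints, frame count, and variable names are assumptions for illustration; the actual preprocessing is implemented in prepare.py.

```python
import numpy as np
from transformers import Wav2Vec2Processor, VivitImageProcessor

# Assumed checkpoints for illustration; prepare.py may use different ones.
audio_proc = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
video_proc = VivitImageProcessor.from_pretrained("google/vivit-b-16x2-kinetics400")

def make_windows(duration_s, clip_s=5.0, overlap_s=2.0):
    """Yield (start, end) clip boundaries in seconds with the given overlap."""
    step = clip_s - overlap_s
    start = 0.0
    while start + clip_s <= duration_s:
        yield start, start + clip_s
        start += step

sr = 16_000
audio = np.zeros(60 * sr, dtype=np.float32)  # placeholder: 60 s of silent audio
for start, end in make_windows(60.0):
    clip_audio = audio[int(start * sr):int(end * sr)]
    audio_inputs = audio_proc(clip_audio, sampling_rate=sr, return_tensors="pt")

    # Placeholder frames standing in for the corresponding video clip.
    frames = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(32)]
    video_inputs = video_proc(frames, return_tensors="pt")
```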
To train the models, use the following command:
python train.py \
    --modality audio \
    --dataset_root "/path/to/dataset/dir" \
    --dataset_annotator "bau" \
    --output_dir "/path/to/output"

- --modality: audio, video, or multimodal
- --dataset_annotator: the annotator (or aggregation method) whose labels are used for training, e.g. bau
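To launch several configurations in one go, a small driver along these lines might help. The flag names mirror the command above; the lowercase annotator identifiers and output layout are assumptions, so adjust them to whatever train.py actually expects.

```python
import subprocess
from pathlib import Path

dataset_root = "/path/to/dataset/dir"   # placeholder
out_base = Path("/path/to/output")      # placeholder

# Assumed annotator identifiers; replace with the ones your setup uses.
for modality in ["audio", "video", "multimodal"]:
    for annotator in ["bau", "mas", "sad"]:
        out_dir = out_base / f"{modality}_{annotator}"
        out_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["python", "train.py",
             "--modality", modality,
             "--dataset_root", dataset_root,
             "--dataset_annotator", annotator,
             "--output_dir", str(out_dir)],
            check=True,
        )
```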
- Video files should be in MP4 format
- File names in the CSV should match the media files (without extension)
- Start/end times in the CSV should be in milliseconds
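A quick sanity check along the following lines can confirm that every media_file in the CSV has a matching MP4 and that start/end values behave like millisecond timestamps; the section path is a placeholder.

```python
from pathlib import Path
import pandas as pd

section = Path("data/Voices-AWS/interview")   # placeholder; repeat for reading/
annotations = pd.read_csv(section / "total_dataset.csv")

# Every media_file (listed without extension) should map to an MP4 under video/.
missing = sorted(
    name for name in annotations["media_file"].unique()
    if not (section / "video" / f"{name}.mp4").exists()
)
print("missing videos:", missing or "none")

# Start/end are millisecond timestamps, so each segment must have end > start.
assert (annotations["end"] > annotations["start"]).all(), "segment with end <= start"
```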
If you find this dataset and its annotations helpful, please cite our paper: