### Description
I am currently working on a speaker diarization task for classroom discussions without labeled data. To assess the pipeline's performance, I rely on two methods: listening manually and using intrinsic measures such as clustering metrics. The pipeline, when used out of the box, doesn't perform well. Some segments contain background noise, while others are very short (e.g., 0.1 seconds).
I want to improve the diarization pipeline's performance by tweaking hyperparameters. I know about the hyperparameters for segmentation (threshold, min_duration_off, and min_duration_on) and clustering (method, min_cluster_size, and threshold). However, I'm having trouble instantiating hyperparameters for the segmentation model.
Here's my attempt:

```python
from pyannote.audio import Pipeline

pretrained_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=True
)
default_hyperparameters = pretrained_pipeline.parameters(instantiated=True)
for param, value in default_hyperparameters.items():
    print(f"{param}: {value}")
```
The output only shows one tunable hyperparameter for segmentation (min_duration_off). After some investigation, I discovered that when the pipeline uses pyannote/segmentation-3.0 as its segmentation model, only min_duration_off is exposed. However, when using the pyannote.audio.pipelines.SpeakerDiarization pipeline with the default segmentation model pyannote/segmentation@2022.07, the activation threshold parameter is also available.
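For reference, this is the nested shape I would expect to pass back via `pipeline.instantiate(...)`. The numeric values below are placeholders I made up for illustration, not the actual defaults:

```python
# Hypothetical hyperparameter dict for pyannote/speaker-diarization-3.1.
# The nesting mirrors what parameters(instantiated=True) returns; the
# values are placeholders, not the real defaults.
params = {
    "segmentation": {
        "min_duration_off": 0.0,  # the only segmentation knob I see exposed
    },
    "clustering": {
        "method": "centroid",
        "min_cluster_size": 12,
        "threshold": 0.7,
    },
}

# Applied to the real pipeline with:
# pretrained_pipeline.instantiate(params)
```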
I'm curious about the difference between these segmentation models. Additionally, I noticed that the VAD pipeline has min_duration_on, but the speaker diarization pipeline does not (which I would like in order to remove those very short speech segments). Initially, I performed each task separately (VAD -> embedding of speech segments -> clustering) instead of using the pipeline. My understanding is that the VAD pipeline doesn't account for speaker change detection and only detects regions of speech, which is why I switched back to the pipeline for easy inference and testing.
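In the meantime, my workaround for the missing min_duration_on is to filter short segments out of the diarization output myself. A minimal sketch on plain (start, end, label) tuples; the helper name and the 0.5 s threshold are my own choices, not part of pyannote (an Annotation can be converted to such tuples via itertracks(yield_label=True)):

```python
def drop_short_segments(segments, min_duration_on=0.5):
    """Drop speech segments shorter than min_duration_on seconds.

    `segments` is a list of (start, end, label) tuples.
    """
    return [
        (start, end, label)
        for start, end, label in segments
        if end - start >= min_duration_on
    ]

segments = [
    (0.0, 0.1, "SPEAKER_00"),  # 0.1 s blip -> removed
    (1.0, 3.2, "SPEAKER_01"),  # kept
    (4.0, 4.3, "SPEAKER_00"),  # 0.3 s -> removed
    (5.0, 7.5, "SPEAKER_01"),  # kept
]
print(drop_short_segments(segments))
# [(1.0, 3.2, 'SPEAKER_01'), (5.0, 7.5, 'SPEAKER_01')]
```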
### Questions
- What is the difference between the segmentation models, and why do they expose different settable hyperparameters?
- Is it possible to set `min_duration_on` for the speaker diarization pipeline?
- What is the difference between the VAD pipeline and how the segmentation model works within the speaker diarization pipeline? My understanding is that the VAD pipeline takes the maximum over the speaker axis for each frame, while the segmentation model performs speaker change detection by taking the absolute value of the first derivative along the time axis and then the maximum over the speaker axis.
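To make my understanding in the last question concrete, here is a toy NumPy sketch of both reductions over a fake (num_frames, num_speakers) activation matrix. This is just my reading of the two operations, not pyannote's actual implementation:

```python
import numpy as np

# Fake per-frame speaker activation scores: (num_frames, num_speakers).
activations = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.2, 0.9],  # speaker change happens around here
    [0.1, 0.9],
])

# VAD-style reduction: a frame counts as speech if any speaker is active.
vad_score = activations.max(axis=1)

# My understanding of speaker change detection: magnitude of the first
# derivative along the time axis, then max over the speaker axis.
change_score = np.abs(np.diff(activations, axis=0)).max(axis=1)

print(vad_score)     # [0.9 0.8 0.9 0.9]
print(change_score)  # [0.1 0.7 0.1] -> peaks at the speaker change
```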
Can you provide insights into these issues and advise me on how to proceed with tuning the hyperparameters to improve performance?
Warm regards,
Zack