### Description
I am currently working on a speaker diarization task for classroom discussions without labeled data. To assess the pipeline's performance, I rely on two methods: listening manually and using intrinsic measures such as clustering metrics. The pipeline, when used out of the box, doesn't perform well. Some segments contain background noise, while others are very short (e.g., 0.1 seconds).
I want to improve the diarization pipeline's performance by tweaking hyperparameters. I know about the hyperparameters for segmentation (threshold, min_duration_off, and min_duration_on) and clustering (method, min_cluster_size, and threshold). However, I'm having trouble instantiating hyperparameters for the segmentation model.
Here's my attempt:

```python
from pyannote.audio import Pipeline

pretrained_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=True
)
default_hyperparameters = pretrained_pipeline.parameters(instantiated=True)
for param, value in default_hyperparameters.items():
    print(f"{param}: {value}")
```
The output only shows one tunable hyperparameter for segmentation (min_duration_off). After some investigation, I discovered that when the pipeline uses pyannote/segmentation-3.0 as its segmentation model, only min_duration_off is exposed. However, when using the pyannote.audio.pipelines.SpeakerDiarization pipeline with the default segmentation model pyannote/segmentation@2022.07, the activation threshold parameter is also available.
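For reference, this is the nested shape I would expect to pass back via `pipeline.instantiate(...)`. The numeric values below are placeholders I made up for illustration, not the actual defaults:

```python
# Hypothetical hyperparameter dict for pyannote/speaker-diarization-3.1.
# The nesting mirrors what parameters(instantiated=True) returns; the
# values are placeholders, not the real defaults.
params = {
    "segmentation": {
        "min_duration_off": 0.0,  # the only segmentation knob I see exposed
    },
    "clustering": {
        "method": "centroid",
        "min_cluster_size": 12,
        "threshold": 0.7,
    },
}

# Applied to the real pipeline with:
# pretrained_pipeline.instantiate(params)
```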
I'm curious about the difference between these segmentation models. Additionally, I noticed that the VAD pipeline has min_duration_on, but the speaker diarization pipeline does not (which I would like in order to remove those very short speech segments). Initially, I performed each task separately (VAD -> embedding of speech segments -> clustering) instead of using the pipeline. My understanding is that the VAD pipeline doesn't account for speaker change detection and only detects regions of speech, which is why I switched back to the pipeline for easy inference and testing.
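In the meantime, my workaround for the missing min_duration_on is to filter short segments out of the diarization output myself. A minimal sketch on plain (start, end, label) tuples; the helper name and the 0.5 s threshold are my own choices, not part of pyannote (an Annotation can be converted to such tuples via itertracks(yield_label=True)):

```python
def drop_short_segments(segments, min_duration_on=0.5):
    """Drop speech segments shorter than min_duration_on seconds.

    `segments` is a list of (start, end, label) tuples.
    """
    return [
        (start, end, label)
        for start, end, label in segments
        if end - start >= min_duration_on
    ]

segments = [
    (0.0, 0.1, "SPEAKER_00"),  # 0.1 s blip -> removed
    (1.0, 3.2, "SPEAKER_01"),  # kept
    (4.0, 4.3, "SPEAKER_00"),  # 0.3 s -> removed
    (5.0, 7.5, "SPEAKER_01"),  # kept
]
print(drop_short_segments(segments))
# [(1.0, 3.2, 'SPEAKER_01'), (5.0, 7.5, 'SPEAKER_01')]
```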
### Questions
- What is the difference between the segmentation models, and why do they expose different settable hyperparameters?
- Is it possible to set `min_duration_on` for the speaker diarization pipeline?
- What is the difference between the VAD pipeline and how the segmentation model works within the speaker diarization pipeline? My understanding is that the VAD pipeline takes the maximum over the speaker axis for each frame, while the segmentation model performs speaker change detection by taking the absolute value of the first derivative along the time axis and then the maximum over the speaker axis.
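To make my understanding in the last question concrete, here is a toy NumPy sketch of both reductions over a fake (num_frames, num_speakers) activation matrix. This is just my reading of the two operations, not pyannote's actual implementation:

```python
import numpy as np

# Fake per-frame speaker activation scores: (num_frames, num_speakers).
activations = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.2, 0.9],  # speaker change happens around here
    [0.1, 0.9],
])

# VAD-style reduction: a frame counts as speech if any speaker is active.
vad_score = activations.max(axis=1)

# My understanding of speaker change detection: magnitude of the first
# derivative along the time axis, then max over the speaker axis.
change_score = np.abs(np.diff(activations, axis=0)).max(axis=1)

print(vad_score)     # [0.9 0.8 0.9 0.9]
print(change_score)  # [0.1 0.7 0.1] -> peaks at the speaker change
```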
Can you provide insights into these issues and advise me on how to proceed with tuning the hyperparameters to improve performance?
Warm regards,
Zack