
How to set hyperparameters for speaker diarization pipeline? #1579

Closed
@sunraymoonbeam

Description

I am currently working on a speaker diarization task for classroom discussions without labeled data. To assess the pipeline's performance, I rely on two methods: listening manually and computing intrinsic measures such as clustering metrics (see the sketch below). Out of the box, the pipeline doesn't perform well: some output segments contain background noise, while others are very short (e.g., 0.1 seconds).
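
By intrinsic measures I mean something like the silhouette score computed over the segment embeddings and their cluster labels. A minimal sketch with toy stand-in data (the real inputs are the per-segment speaker embeddings and the pipeline's cluster assignments):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Toy stand-ins for the real data: one embedding and one cluster id per
# speech segment produced by the diarization pipeline.
embeddings = np.random.rand(50, 512)       # (n_segments, embedding_dim)
labels = np.random.randint(0, 4, size=50)  # (n_segments,)

# Higher is better; cosine matches how speaker embeddings are usually compared.
print(silhouette_score(embeddings, labels, metric="cosine"))
```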

I want to improve the diarization pipeline's performance by tweaking hyperparameters. I know about the hyperparameters for segmentation (threshold, min_duration_off, and min_duration_on) and clustering (method, min_cluster_size, and threshold). However, I'm having trouble instantiating the hyperparameters for the segmentation model.
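
Concretely, what I would like to end up with is something along these lines (the nested keys mirror the output of pipeline.parameters(instantiated=True); the values are illustrative placeholders, not tuned):

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=True
)

# Override the tunable hyperparameters before running inference.
pipeline.instantiate({
    "segmentation": {
        "min_duration_off": 0.5,  # fill intra-speaker gaps shorter than 0.5 s
    },
    "clustering": {
        "method": "centroid",
        "min_cluster_size": 15,
        "threshold": 0.7,
    },
})
diarization = pipeline("audio.wav")
```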

Here's my attempt at inspecting the current defaults:

```python
from pyannote.audio import Pipeline

pretrained_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=True
)
default_hyperparameters = pretrained_pipeline.parameters(instantiated=True)
for param, value in default_hyperparameters.items():
    print(f"{param}: {value}")
```

[Screenshot: printed default hyperparameters for pyannote/speaker-diarization-3.1]

The output only shows one tunable segmentation hyperparameter (min_duration_off). After some investigation, I discovered that this is tied to the segmentation model: with pyannote/segmentation-3.0, only min_duration_off is exposed, but when building the pyannote.audio.pipelines.SpeakerDiarization pipeline with the older default segmentation model pyannote/segmentation@2022.07, a segmentation activation threshold is exposed as well.
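
This is how I reproduced the comparison (a sketch; I'm assuming the pretrained 2.x pipeline pyannote/speaker-diarization, which is built on pyannote/segmentation@2022.07):

```python
from pyannote.audio import Pipeline

# The 2.x pretrained pipeline uses pyannote/segmentation@2022.07.
legacy_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization", use_auth_token=True
)
for param, value in legacy_pipeline.parameters(instantiated=True).items():
    print(f"{param}: {value}")  # segmentation now also exposes a threshold
```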

[Screenshot: printed hyperparameters with pyannote/segmentation@2022.07, including the segmentation threshold]

I'm curious about the difference between these segmentation models. Additionally, I noticed that the VAD pipeline has min_duration_on, but the speaker diarization pipeline does not, and min_duration_on is exactly what I would use to remove those very short speech segments. Initially, I performed each task separately (VAD -> embedding of speech segments -> clustering) instead of using the pipeline; a sketch of that workflow is below. My understanding is that the VAD pipeline doesn't account for speaker change detection and only detects regions of speech, which is why I switched back to the pipeline for easier inference and testing.
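
The separate workflow looked roughly like this (a sketch, not my exact code; the thresholds are placeholders, and I'm using pyannote/segmentation for VAD because it exposes onset/offset):

```python
import numpy as np
from pyannote.audio import Inference, Model
from pyannote.audio.pipelines import VoiceActivityDetection
from sklearn.cluster import AgglomerativeClustering

# 1. VAD: detect speech regions (no notion of speaker identity or change).
vad = VoiceActivityDetection(segmentation="pyannote/segmentation", use_auth_token=True)
vad.instantiate({
    "onset": 0.5, "offset": 0.5,  # activation thresholds
    "min_duration_on": 0.25,      # drop speech regions shorter than 0.25 s
    "min_duration_off": 0.1,      # fill non-speech gaps shorter than 0.1 s
})
speech = vad("audio.wav")  # pyannote.core.Annotation

# 2. Embedding: one speaker embedding per detected speech segment.
embedding_model = Model.from_pretrained("pyannote/embedding", use_auth_token=True)
inference = Inference(embedding_model, window="whole")
segments = list(speech.itersegments())
embeddings = np.vstack([inference.crop("audio.wav", seg) for seg in segments])

# 3. Clustering: group segments by speaker (sklearn >= 1.2 uses `metric`;
#    older versions call this argument `affinity`).
clustering = AgglomerativeClustering(
    n_clusters=None, metric="cosine", linkage="average", distance_threshold=0.7
)
labels = clustering.fit_predict(embeddings)
```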

### Questions

  1. What is the difference between these segmentation models, and why do they expose different tunable hyperparameters?
  2. Is it possible to set "min_duration_on" for the speaker diarization pipeline?
  3. How does the VAD pipeline differ from the way the segmentation model is used in the speaker diarization pipeline? My understanding is that the VAD pipeline takes the maximum over the speaker axis for each frame, whereas the segmentation model performs speaker change detection by taking the absolute value of the first derivative over the time axis and then the maximum over the speaker axis (see the sketch after this list).
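
To make question 3 concrete, here is my understanding in numpy terms (toy activations; the shapes are my assumption about the model output):

```python
import numpy as np

# Toy segmentation output: per-frame, per-speaker activations in [0, 1].
activations = np.random.rand(1000, 3)  # (num_frames, num_speakers)

# VAD, as I understand it: max over the speaker axis per frame,
# then binarized into speech / non-speech with onset/offset thresholds.
vad_scores = activations.max(axis=1)

# Speaker change detection, as I understand it: |first derivative| along
# the time axis per speaker, then max over the speaker axis.
change_scores = np.abs(np.diff(activations, axis=0)).max(axis=1)
```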

Could you provide some insight into these questions and advise me on how to proceed with tuning the hyperparameters to improve performance?

Warm regards,
Zack
