- batch_diarize_audio(input_audios, model_name="medium.en", stemming=False): This function takes a list of input audio files, processes them, and generates speaker-aware transcripts and SRT files for each input audio file. It maintains consistent speaker numbering across all files in the batch and labels the most-spoken speaker as the 'instructor'.
- diarize_audio(input_audio, model_name="medium.en", stemming=False): This function takes a single input audio file and processes it to extract speaker-wise word mappings and sentence mappings. It generates speaker-aware transcripts and SRT files for the input audio file.
- Helper functions: The code also includes several helper functions for processing the audio files, extracting speaker information, and generating output files.
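To make the 'instructor' labelling more concrete, here is a minimal sketch of how the most-spoken speaker could be picked from speaker-wise word mappings. It is an illustration only, not the repository's implementation: the function name `find_most_spoken_speaker`, the entry layout (`word`, `start_time`, `end_time`, `speaker`), and the choice to measure "most-spoken" by total speaking time rather than word count are all assumptions.

```python
from collections import defaultdict


def find_most_spoken_speaker(word_speaker_mapping):
    """Return the speaker label with the largest total speaking time.

    Hypothetical helper: assumes each entry is a dict with the keys
    "word", "start_time", "end_time", and "speaker".
    """
    totals = defaultdict(float)
    for entry in word_speaker_mapping:
        # Accumulate the duration of every word under its speaker label.
        totals[entry["speaker"]] += entry["end_time"] - entry["start_time"]
    if not totals:
        return None  # No words, so there is no speaker to label.
    return max(totals, key=totals.get)


# Made-up sample data (times in milliseconds), for illustration only.
sample_wsm = [
    {"word": "Welcome", "start_time": 0, "end_time": 600, "speaker": 0},
    {"word": "everyone", "start_time": 600, "end_time": 1300, "speaker": 0},
    {"word": "Thanks", "start_time": 1500, "end_time": 1900, "speaker": 1},
]
instructor = find_most_spoken_speaker(sample_wsm)
print(f"Instructor speaker number: {instructor}")
```

Counting words instead of durations would be an equally plausible reading of "most-spoken"; the two can disagree when one speaker talks slowly.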
- Import the necessary libraries and functions:
from batch_diarize import batch_diarize_audio
- Prepare a list of input audio files:
input_audios = ["audio1.wav", "audio2.wav", "audio3.wav"]
- Call the batch_diarize_audio function with the list of input audio files:
results = batch_diarize_audio(input_audios)
- The results variable contains a list of tuples, one per input audio file. Each tuple holds the following items, in order:
- input_audio: The input audio file name
- wsm: Speaker-wise word mappings
- ssm: Speaker-wise sentence mappings
- instructor_speaker_number: The speaker number assigned to the instructor
- instructor_embeddings: The embeddings of the instructor
- The code also generates output files with speaker-aware transcripts (in TXT format) and subtitles (in SRT format) for each input audio file. The output files have the same name as the input audio files, with the corresponding file extensions (.txt and .srt).
from batch_diarize import batch_diarize_audio
input_audios = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = batch_diarize_audio(input_audios)
for result in results:
    print(f"Input audio: {result[0]}")
    print(f"Instructor speaker number: {result[3]}")
    print(f"Instructor embeddings: {result[4]}")
    print("Speaker-wise sentence mappings:")
    for ssm in result[2]:
        print(ssm)
This example demonstrates how to use the batch_diarize_audio function to process a list of input audio files and generate speaker-aware transcripts and SRT files. It also prints the instructor speaker number, instructor embeddings, and speaker-wise sentence mappings for each input audio file.
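The output naming rule (same name as the input, with .txt and .srt extensions) can be sketched with pathlib. This is a hedged illustration: `expected_output_paths` is a hypothetical helper, not part of the repository, and it assumes the files are written next to the input audio, which is not stated here.

```python
from pathlib import Path


def expected_output_paths(input_audio):
    """Return the (transcript, subtitle) paths expected for one input file.

    Hypothetical helper: assumes outputs sit alongside the input audio.
    """
    audio_path = Path(input_audio)
    # Swap the audio extension for the transcript and subtitle extensions.
    return audio_path.with_suffix(".txt"), audio_path.with_suffix(".srt")


for name in ["audio1.wav", "audio2.wav", "audio3.wav"]:
    transcript_path, subtitle_path = expected_output_paths(name)
    print(f"{name}: {transcript_path}, {subtitle_path}")
```

A check like this is handy after a batch run to confirm that every input produced both files, for example by testing each returned path with `Path.exists()`.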
Forked From
Speaker Diarization pipeline based on OpenAI Whisper
I'd like to thank @m-bain for Wav2Vec2 forced alignment and @mu4farooqi for the punctuation realignment algorithm.
This work is based on OpenAI's Whisper, Nvidia NeMo, and Facebook's Demucs.
Please star the project on GitHub (see top-right corner) if you appreciate my contribution to the community!
This repository combines Whisper ASR capabilities with Voice Activity Detection (VAD) and Speaker Embedding to identify the speaker for each sentence in the transcription generated by Whisper. First, the vocals are extracted from the audio to increase speaker embedding accuracy. The transcription is then generated using Whisper, and its timestamps are corrected and aligned using WhisperX to help minimize diarization error due to time shift. Next, the audio is passed into MarbleNet for VAD and segmentation to exclude silences, and TitaNet extracts speaker embeddings to identify the speaker for each segment. The result is associated with the timestamps generated by WhisperX to detect the speaker for each word, and the word-level labels are finally realigned using punctuation models to compensate for minor time shifts.
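The step that associates diarization output with word timestamps can be pictured with a simplified sketch. This is not the code in diarize.py or helpers.py: the function name, the tuple layouts, and the rule of anchoring each word on its start time are assumptions chosen for illustration.

```python
def assign_speakers_to_words(words, speaker_turns):
    """Label each word with a speaker based on the word's start time.

    words: list of (text, start, end) tuples sorted by start time (seconds).
    speaker_turns: non-empty list of (start, end, speaker) tuples sorted by
        start time (seconds).
    """
    if not speaker_turns:
        raise ValueError("speaker_turns must not be empty")

    labeled = []
    turn_idx = 0
    for text, start, end in words:
        # Skip turns that ended at or before this word's start. A word that
        # falls in a gap between turns is given to the next turn, and a word
        # after the final turn stays with the final turn.
        while (turn_idx < len(speaker_turns) - 1
               and start >= speaker_turns[turn_idx][1]):
            turn_idx += 1
        labeled.append((text, start, end, speaker_turns[turn_idx][2]))
    return labeled


# Made-up sample data, for illustration only.
sample_words = [("hello", 0.0, 0.4), ("there", 0.5, 0.9), ("hi", 1.2, 1.5)]
sample_turns = [(0.0, 1.0, "Speaker 0"), (1.0, 2.0, "Speaker 1")]
for text, start, end, speaker in assign_speakers_to_words(sample_words, sample_turns):
    print(f"{speaker}: {text} [{start:.1f}s to {end:.1f}s]")
```

Because word boundaries and speaker turns rarely line up exactly, a word near a turn change can land on the wrong speaker under a rule like this one, which is the kind of minor time shift the punctuation-based realignment is meant to compensate for.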
Whisper, WhisperX, and NeMo parameters are hard-coded in diarize.py and helpers.py. I will add CLI arguments to change them later.
python diarize.py -a AUDIO_FILE_NAME
- Only tested on English, but several other languages are supported
- Overlapping speakers are yet to be addressed. A possible approach would be to separate the audio file and isolate only one speaker, then feed it into the pipeline, but this would need much more computation
- There might be some errors. Please raise an issue if you encounter any.
- Implement a maximum length per sentence for SRT
- Use Whisper word-level timestamps for languages that are not in WhisperX
- Improve performance using Faster Whisper or Batched Inference
Special thanks to @adamjonas for supporting this project