- Run model
- Cache models
- Save output to file
- Diarization with preset speakers via audio samples (i.e. match segments against enrolled speaker clips; see the TitaNet sketch after this list)
  - https://github.com/nvidia-riva/tutorials/blob/main/asr-speaker-diarization.ipynb
  - https://docs.nvidia.com/nim/riva/asr/latest/support-matrix.html#speech-recognition-with-vad-and-speaker-diarization
  - https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/speaker_diarization/intro.html
  - https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/speaker_recognition/intro.html
  - https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/speaker_recognition/models.html
  - https://www.alphaneural.io/assets/nvidia/speakerverification_en_titanet_large
  - https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/speaker_tasks/recognition/speaker_reco.py
  - https://github.com/NVIDIA-NeMo/NeMo/tree/main/examples/speaker_tasks/recognition
  - https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/titanet_large?version=v1
  - https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/ecapa_tdnn?version=1.16.0
  - https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1#%F0%9F%94%AC-for-more-detailed-evaluations-der
  - Figure out what this is and whether it will help: https://docs.nvidia.com/nim/riva/asr/latest/pipeline-configuration.html
  - https://github.com/NVIDIA-NeMo/NeMo/blob/stable/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb
- Batch process audio files - https://docs.nvidia.com/nim/riva/asr/latest/performance.html - https://docs.nvidia.com/nim/riva/asr/latest/deploy-helm.html (for more than 2 GPUs)
- Keep model loaded to save time
- Better approach? https://edemiraydin.medium.com/unlocking-the-power-of-speech-ai-a-step-by-step-guide-to-integrating-nvidia-riva-nims-with-llm-rag-95bd92fe06a7
- OR this? - https://huggingface.co/nvidia/multitalker-parakeet-streaming-0.6b-v1 - https://huggingface.co/nvidia/multitalker-parakeet-streaming-0.6b-v1#method-2-use-nemo-example-file-in-nvidianemo
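For the preset-speaker item above, a minimal sketch of enrolling known speakers and matching an unknown segment against them with NeMo's TitaNet embeddings. The model name comes from the links above; the file paths and the 0.7 threshold are placeholders to tune, not official values, and speaker_model.verify_speakers(a, b) is a built-in alternative for simple pairwise checks.

# Match unknown speech against preset speaker samples via TitaNet embeddings (sketch).
import torch.nn.functional as F
from nemo.collections.asr.models import EncDecSpeakerLabelModel

speaker_model = EncDecSpeakerLabelModel.from_pretrained("nvidia/speakerverification_en_titanet_large")

# Enrollment: one clean sample per known speaker (average several for robustness).
enrolled = {
    "alice": speaker_model.get_embedding("samples/alice.wav").squeeze(),
    "bob": speaker_model.get_embedding("samples/bob.wav").squeeze(),
}

# Compare an unknown segment against each enrolled embedding via cosine similarity.
unknown = speaker_model.get_embedding("segments/segment_001.wav").squeeze()
scores = {name: F.cosine_similarity(unknown, emb, dim=0).item() for name, emb in enrolled.items()}
best, score = max(scores.items(), key=lambda kv: kv[1])
print(best if score > 0.7 else "unknown", scores)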
# UV example
uv venv --python 3.12
source .venv/bin/activate
uv pip install -r requirements.txt
uv run basic.py
# Python example
python3 -m venv myenv
source myenv/bin/activate
pip install -r requirements.txt
python3 basic.py
export NGC_API_KEY=nvapi-????
# If using bash
echo "export NGC_API_KEY=<value>" >> ~/.bashrc
# If using zsh
echo "export NGC_API_KEY=<value>" >> ~/.zshrc
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p $LOCAL_NIM_CACHE
chmod 777 $LOCAL_NIM_CACHE
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
docker run -it --rm \
--runtime=nvidia \
--gpus '"device=0"' \
--shm-size=8GB \
-e NGC_API_KEY \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-p 9000:9000 \
-p 50051:50051 \
-e NIM_TAGS_SELECTOR=name=parakeet-0-6b-ctc-en-us,mode=ofl,bs=1 \
-v ~/.cache/nim:/opt/nim/.cache \
nvcr.io/nim/nvidia/parakeet-0-6b-ctc-en-us:latest
# check api is working
curl -X 'GET' 'http://localhost:9000/v1/health/ready'
# test a file out
curl -s http://0.0.0.0:9000/v1/audio/transcriptions -F language=en \
-F file="@en-US_sample.wav"
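Python equivalent of the curl test, which also covers the "Save output to file" TODO. A sketch only: it assumes the NIM HTTP endpoint started above and that requests is installed; check the response schema against your NIM version.

# POST the sample file to the NIM transcription endpoint and write the transcript out.
import requests

with open("en-US_sample.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:9000/v1/audio/transcriptions",
        data={"language": "en"},
        files={"file": f},
    )
resp.raise_for_status()

# The endpoint follows the OpenAI-style transcription schema, so the transcript
# should be in the "text" field (verify for your NIM version).
with open("en-US_sample.txt", "w") as out:
    out.write(resp.json().get("text", ""))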
conda create -n cuda128_py312_para python=3.12
conda activate cuda128_py312_para
conda install cuda=12.8 -c nvidia/label/cuda-12.8.1
# Check the installed CUDA version
nvcc --version
pip install nvidia-riva-client IPython
python3 test.py
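A minimal offline-recognition script along these lines would work as test.py (a sketch of the riva.client gRPC path, not necessarily the actual file; the audio path and config values are placeholders, and you may need to set sample rate / encoding explicitly if the server can't read them from the WAV header).

# Offline transcription over gRPC against the container's port 50051 (sketch).
import riva.client

auth = riva.client.Auth(uri="localhost:50051", use_ssl=False)
asr = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,
)

with open("en-US_sample.wav", "rb") as f:
    audio_bytes = f.read()

response = asr.offline_recognize(audio_bytes, config)
for result in response.results:
    print(result.alternatives[0].transcript)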
docker run -it --rm \
--runtime=nvidia \
--gpus '"device=0"' \
--shm-size=14GB \
-e NGC_API_KEY \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-p 9000:9000 \
-p 50051:50051 \
-e NIM_TAGS_SELECTOR=name=parakeet-0-6b-ctc-en-us,mode=ofl,diarizer=sortformer,vad=silero \
-v ~/.cache/nim:/opt/nim/.cache \
nvcr.io/nim/nvidia/parakeet-0-6b-ctc-en-us:latest
- https://github.com/nvidia-riva
- https://github.com/nvidia-riva/tutorials
- https://huggingface.co/collections/nvidia/parakeet
- https://github.com/nvidia-riva/python-clients/tree/main
- https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-performance.html
- https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-pipeline-configuration.html
- https://docs.nvidia.com/nim/riva/asr/latest/configuration.html
- https://docs.nvidia.com/nim/riva/asr/latest/getting-started-wsl.html
- https://docs.nvidia.com/nim/riva/asr/latest/getting-started.html
- Response object: https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/protos/protos.html#_CPPv426StreamingRecognitionResult (this response object is painful to work with)
- WordInfo proto: https://docs.nvidia.com/deeplearning/riva/archives/160-b/user-guide/docs/protobuf-api/jarvis_asr.proto.html#_CPPv48WordInfo
- Only useful way of parsing the response: https://www.google.com/search?q=process+nvidia+riva+RecognizeResponse
- Audio chunk iterator: https://github.com/nvidia-riva/python-clients/blob/main/riva/client/asr.py#L49
- Currently, Sortformer speaker diarization is supported only with the Parakeet-CTC and Conformer-CTC ASR models in streaming mode. (https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-overview.html#streaming-recognition-with-speaker-diarization)
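Putting the diarization container launch and the response-parsing notes together, a hedged sketch of enabling speaker diarization on an offline request and walking the word-level speaker tags. The add_speaker_diarization_to_config helper and its argument names are my reading of riva/client/asr.py, and speaker_tag/start_time/end_time come from the WordInfo proto linked above; if your deployment only supports Sortformer diarization in streaming mode (per the note above), the same config additions apply to StreamingRecognitionConfig.

# Offline request with speaker diarization enabled, printing per-word speaker tags (sketch).
import riva.client

auth = riva.client.Auth(uri="localhost:50051", use_ssl=False)
asr = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    language_code="en-US",
    max_alternatives=1,
    enable_word_time_offsets=True,  # word timings make the speaker tags usable
)
riva.client.add_speaker_diarization_to_config(config, diarization_enable=True, diarization_max_speakers=4)

with open("meeting.wav", "rb") as f:
    response = asr.offline_recognize(f.read(), config)

for result in response.results:
    for word in result.alternatives[0].words:
        print(f"[speaker {word.speaker_tag}] {word.start_time}-{word.end_time} ms {word.word}")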