A Swift library for Voice Activity Detection using NVIDIA NeMo's MarbleNet model. Designed for iOS/macOS applications.
- Frame-level voice activity detection using MarbleNet
- Automatic speech segment extraction with configurable thresholds
- Support for variable-length audio input (1s, 5s, 20s chunks)
- High-performance CoreML inference
- Built on NeMoFeatureExtractor for accurate mel-spectrogram computation

Add to your Package.swift:

```swift
dependencies: [
    .package(url: "https://github.com/Otosaku/NeMoVAD-iOS.git", from: "1.0.0")
]
```

Or in Xcode: File -> Add Package Dependencies -> enter the repository URL.
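
If you consume the package from another Swift package rather than an Xcode project, the product also has to be listed in the target's dependencies. A minimal sketch of such a manifest, assuming the product is named NeMoVAD (matching the module imported below) and a hypothetical consuming target MyApp:

```swift
// swift-tools-version:5.9
// Sketch of a consuming Package.swift. The target name "MyApp" and the
// product name "NeMoVAD" are assumptions, not taken from this repository.
import PackageDescription

let package = Package(
    name: "MyApp",
    platforms: [.iOS(.v16), .macOS(.v13)],
    dependencies: [
        .package(url: "https://github.com/Otosaku/NeMoVAD-iOS.git", from: "1.0.0")
    ],
    targets: [
        .target(
            name: "MyApp",
            dependencies: [.product(name: "NeMoVAD", package: "NeMoVAD-iOS")]
        )
    ]
)
```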

Basic usage:

```swift
import NeMoVAD

// Create VAD instance
let vad = try NeMoVAD()

// Process audio samples (Float32, mono, 16kHz)
let audioSamples: [Float] = loadAudio() // Your audio loading code
let result = try vad.process(samples: audioSamples)

// Get speech segments
for segment in result.segments {
    print("Speech from \(segment.start)s to \(segment.end)s")
}

// Get frame-level probabilities
for frame in result.frames {
    print("Time: \(frame.time)s, Speech probability: \(frame.speechProbability)")
}
```
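
The loadAudio() call above is a placeholder. One way to produce the expected Float32, mono, 16 kHz samples is to decode and resample with AVFoundation; the sketch below is illustrative and not part of this library (function name, parameters, and error handling are assumptions):

```swift
import AVFoundation

// Sketch: decode an audio file and convert it to 16 kHz mono Float32 samples.
func loadAudio(url: URL) throws -> [Float] {
    let file = try AVAudioFile(forReading: url)
    let target = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                               sampleRate: 16_000,
                               channels: 1,
                               interleaved: false)!
    let converter = AVAudioConverter(from: file.processingFormat, to: target)!

    // Read the whole file in its native format.
    let input = AVAudioPCMBuffer(pcmFormat: file.processingFormat,
                                 frameCapacity: AVAudioFrameCount(file.length))!
    try file.read(into: input)

    // Allocate an output buffer large enough for the resampled audio.
    let ratio = target.sampleRate / file.processingFormat.sampleRate
    let capacity = AVAudioFrameCount(Double(input.frameLength) * ratio) + 1
    let output = AVAudioPCMBuffer(pcmFormat: target, frameCapacity: capacity)!

    // Feed the single input buffer once, then signal end of stream.
    var delivered = false
    var conversionError: NSError?
    let status = converter.convert(to: output, error: &conversionError) { _, inputStatus in
        if delivered {
            inputStatus.pointee = .endOfStream
            return nil
        }
        delivered = true
        inputStatus.pointee = .haveData
        return input
    }
    if status == .error { throw conversionError ?? NSError(domain: "loadAudio", code: -1) }

    let samples = output.floatChannelData![0]
    return Array(UnsafeBufferPointer(start: samples, count: Int(output.frameLength)))
}
```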

Convenience methods:

```swift
// Check if audio contains speech
let hasSpeech = try vad.containsSpeech(samples: audioSamples)

// Get only speech segments
let segments = try vad.detectSpeechSegments(samples: audioSamples)
```

Configuration:

```swift
// Sensitive configuration (catches more speech)
let sensitiveVAD = try NeMoVAD(config: .sensitive)
// Aggressive configuration (stricter detection)
let aggressiveVAD = try NeMoVAD(config: .aggressive)
// Custom configuration
let customConfig = VADConfig(
    speechThreshold: 0.5,     // Probability threshold for speech
    minSpeechDuration: 0.1,   // Minimum speech segment duration (seconds)
    minSilenceDuration: 0.1,  // Minimum silence to split segments (seconds)
    speechPad: 0.0            // Padding around speech segments (seconds)
)
let customVAD = try NeMoVAD(config: customConfig)
```

| Preset | Threshold | Min Speech | Min Silence | Use Case |
|---|---|---|---|---|
| .default | 0.5 | 0.1s | 0.1s | Balanced detection |
| .sensitive | 0.35 | 0.05s | 0.15s | Catch more speech, may include noise |
| .aggressive | 0.65 | 0.2s | 0.05s | Stricter, cleaner segments |
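
The presets can also be spelled out as an explicit VADConfig. The sketch below assumes the table columns map one-to-one onto the initializer parameters shown above; speechPad is not listed in the table, so it is left at 0.0:

```swift
// Hedged sketch: a configuration using the .sensitive values from the table.
let sensitiveEquivalent = VADConfig(
    speechThreshold: 0.35,    // Threshold column
    minSpeechDuration: 0.05,  // Min Speech column
    minSilenceDuration: 0.15, // Min Silence column
    speechPad: 0.0            // Not specified by the table
)
let vadLikeSensitive = try NeMoVAD(config: sensitiveEquivalent)
```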

Result types:

```swift
struct VADResult {
    let segments: [VADSegment]   // Speech segments
    let frames: [VADFrame]       // Frame-level probabilities
    let audioDuration: Double    // Total audio duration
    var speechDuration: Double   // Total speech duration
    var speechRatio: Double      // Speech ratio (0.0 - 1.0)
}

struct VADSegment {
    let start: Double            // Start time in seconds
    let end: Double              // End time in seconds
    var duration: Double         // Segment duration
}

struct VADFrame {
    let time: Double             // Time offset in seconds
    let speechProbability: Float // Speech probability (0.0 - 1.0)
    var isSpeech: Bool           // true if probability > 0.5
}
```
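
A small example of how these types compose, assuming the 16 kHz sample rate documented below when mapping segment times back to sample indices (the 0.05 ratio cutoff is an arbitrary illustrative value):

```swift
// Hedged sketch: gate further processing on speechRatio and cut the speech
// portions out of the original 16 kHz sample buffer.
let sampleRate = 16_000.0

if result.speechRatio < 0.05 {
    print("Mostly silence: \(result.speechDuration)s of speech in \(result.audioDuration)s")
} else {
    var speechSamples: [Float] = []
    for segment in result.segments {
        let startIndex = max(0, Int(segment.start * sampleRate))
        let endIndex = min(audioSamples.count, Int(segment.end * sampleRate))
        speechSamples.append(contentsOf: audioSamples[startIndex..<endIndex])
    }
    print("Kept \(speechSamples.count) speech samples from \(result.segments.count) segments")
}
```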

Model details:

- Model: MarbleNet (NVIDIA NeMo)
- Sample Rate: 16kHz mono
- Frame Rate: ~50 fps (20ms per frame)
- Supported Input Lengths:
  - 1 second (102 mel frames)
  - 5 seconds (502 mel frames)
  - 20 seconds (2002 mel frames)
  - Longer audio: automatically chunked with 50% overlap

Requirements:

- iOS 16.0+ / macOS 13.0+
- Swift 5.9+

Dependencies:

- NeMoFeatureExtractor-iOS - Mel-spectrogram feature extraction

MIT License

Acknowledgments:

- NVIDIA NeMo - Original MarbleNet model
- Apple CoreML - Optimized inference