Swift library for Voice Activity Detection (VAD) using NVIDIA NeMo MarbleNet model converted to CoreML. Detect speech segments in real-time on iOS/macOS with high accuracy and low latency.

NeMoVAD

A Swift library for Voice Activity Detection using NVIDIA NeMo's MarbleNet model. Designed for iOS/macOS applications.

Features

  • Frame-level voice activity detection using MarbleNet
  • Automatic speech segment extraction with configurable thresholds
  • Support for variable-length audio input (1s, 5s, 20s chunks)
  • High-performance CoreML inference
  • Built on NeMoFeatureExtractor for accurate mel-spectrogram computation

Installation

Swift Package Manager

Add to your Package.swift:

dependencies: [
    .package(url: "https://github.com/Otosaku/NeMoVAD-iOS.git", from: "1.0.0")
]

Or in Xcode: File -> Add Package Dependencies -> Enter repository URL.
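You also need to add the library product to your target. A sketch of the target entry (the product name "NeMoVAD" is an assumption here; check the package manifest for the exact name):

import PackageDescription

// In Package.swift, alongside the dependencies array above:
.target(
    name: "MyApp",
    dependencies: [
        // Product name assumed to match the library name.
        .product(name: "NeMoVAD", package: "NeMoVAD-iOS")
    ]
)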

Usage

Basic Usage

import NeMoVAD

// Create VAD instance
let vad = try NeMoVAD()

// Process audio samples (Float32, mono, 16kHz)
let audioSamples: [Float] = loadAudio() // Your audio loading code
let result = try vad.process(samples: audioSamples)

// Get speech segments
for segment in result.segments {
    print("Speech from \(segment.start)s to \(segment.end)s")
}

// Get frame-level probabilities
for frame in result.frames {
    print("Time: \(frame.time)s, Speech probability: \(frame.speechProbability)")
}

Quick Speech Detection

// Check if audio contains speech
let hasSpeech = try vad.containsSpeech(samples: audioSamples)

// Get only speech segments
let segments = try vad.detectSpeechSegments(samples: audioSamples)

Custom Configuration

// Sensitive configuration (catches more speech)
let sensitiveVAD = try NeMoVAD(config: .sensitive)

// Aggressive configuration (stricter detection)
let aggressiveVAD = try NeMoVAD(config: .aggressive)

// Custom configuration
let customConfig = VADConfig(
    speechThreshold: 0.5,      // Probability threshold for speech
    minSpeechDuration: 0.1,    // Minimum speech segment duration (seconds)
    minSilenceDuration: 0.1,   // Minimum silence to split segments (seconds)
    speechPad: 0.0             // Padding around speech segments (seconds)
)
let customVAD = try NeMoVAD(config: customConfig)

Configuration Presets

Preset        Threshold   Min Speech   Min Silence   Use Case
.default      0.5         0.1s         0.1s          Balanced detection
.sensitive    0.35        0.05s        0.15s         Catch more speech, may include noise
.aggressive   0.65        0.2s         0.05s         Stricter, cleaner segments

Result Types

VADResult

struct VADResult {
    let segments: [VADSegment]  // Speech segments
    let frames: [VADFrame]      // Frame-level probabilities
    let audioDuration: Double   // Total audio duration
    var speechDuration: Double  // Total speech duration
    var speechRatio: Double     // Speech ratio (0.0 - 1.0)
}
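The derived fields are handy for gating downstream work, for example skipping mostly-silent recordings. A small sketch using the types above:

let result = try vad.process(samples: audioSamples)

// Skip mostly-silent recordings before running heavier processing.
if result.speechRatio < 0.1 {
    print("Only \(result.speechDuration)s of speech in " +
          "\(result.audioDuration)s of audio; skipping transcription")
} else {
    for segment in result.segments where segment.duration >= 0.3 {
        // Hand longer segments to a recognizer, logger, etc.
    }
}

The 0.1 ratio and 0.3 s duration cutoffs are illustrative, not library defaults.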

VADSegment

struct VADSegment {
    let start: Double    // Start time in seconds
    let end: Double      // End time in seconds
    var duration: Double // Segment duration
}

VADFrame

struct VADFrame {
    let time: Double              // Time offset in seconds
    let speechProbability: Float  // Speech probability (0.0 - 1.0)
    var isSpeech: Bool            // true if probability > 0.5
}
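Frame-level output gives a coarse timeline at roughly one frame per 20 ms. For instance:

let result = try vad.process(samples: audioSamples)

// Count frames classified as speech (probability > 0.5).
let speechFrames = result.frames.filter { $0.isSpeech }
print("\(speechFrames.count) of \(result.frames.count) frames are speech")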

Technical Details

  • Model: MarbleNet (NVIDIA NeMo)
  • Sample Rate: 16kHz mono
  • Frame Rate: ~50 fps (20ms per frame)
  • Supported Input Lengths:
    • 1 second (102 mel frames)
    • 5 seconds (502 mel frames)
    • 20 seconds (2002 mel frames)
  • Longer audio: Automatically chunked with 50% overlap
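The chunking behavior for long inputs can be pictured as follows. This is an illustrative sketch of 50% overlap, not the library's internal code, and it assumes 16 kHz input:

// Split audio into 20 s windows with 50% overlap, i.e. a 10 s hop.
// The library performs this chunking automatically for long inputs.
let sampleRate = 16_000
let chunkLength = 20 * sampleRate   // 20 s window in samples
let hop = chunkLength / 2           // 50% overlap

var start = 0
while start < audioSamples.count {
    let end = min(start + chunkLength, audioSamples.count)
    let chunk = Array(audioSamples[start..<end])
    // Run VAD on `chunk`; offset its timestamps by
    // Double(start) / Double(sampleRate) seconds.
    start += hop
}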

Requirements

  • iOS 16.0+ / macOS 13.0+
  • Swift 5.9+

Dependencies

  • NeMoFeatureExtractor - mel-spectrogram feature extraction

License

MIT License

Acknowledgments

  • NVIDIA NeMo - Original MarbleNet model
  • Apple CoreML - Optimized on-device inference
