FluidAudio - Swift Speaker Diarization & ASR on CoreML


Fluid Audio is a Swift framework for fully local, real-time audio processing on Apple devices. It provides state-of-the-art speaker diarization, ASR, and voice activity detection through open-source models (MIT/Apache 2.0 licensed) that we've converted to Core ML.

Our models are optimized for background processing on the CPU, avoiding GPU/MPS shaders to ensure reliable performance. We tested CPU- and GPU-based alternatives, but they proved too slow or resource-intensive for our real-time requirements.

For custom use cases and feedback, reach out on Discord.

Features

  • Automatic Speech Recognition (ASR): Parakeet TDT-0.6b model with Token-and-Duration Transducer support for real-time transcription
  • State-of-the-Art Diarization: Research-competitive speaker separation with optimal speaker mapping
  • Voice Activity Detection (VAD): Production-ready VAD with 98% accuracy using CoreML models and adaptive thresholding
  • Speaker Embedding Extraction: Generate speaker embeddings for voice comparison and clustering; also usable for speaker identification
  • CoreML Models: Native Apple CoreML backend with custom-converted models optimized for Apple Silicon
  • Open-Source Models: All models are publicly available on HuggingFace under permissive licenses, converted and optimized by our team
  • Real-time Processing: Designed for real-time workloads but also works for offline processing
  • Cross-platform: Full support for macOS 14.0+ and iOS 17.0+ on any Apple Silicon device
  • Apple Neural Engine Optimized: Models run efficiently on Apple's ANE for maximum performance with minimal power consumption

Installation

Add FluidAudio to your project using Swift Package Manager:

dependencies: [
    .package(url: "https://github.com/FluidInference/FluidAudio.git", from: "0.0.3"),
],

Important: When adding FluidAudio as a package dependency, only add the library to your target (not the executable). Select "FluidAudio" library in the package products dialog and add it to your app target.
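
If you declare the dependency in Package.swift instead, the same rule applies: depend on the FluidAudio library product from your own target. A minimal sketch (the MyApp target name is illustrative):

// Package.swift (excerpt); the MyApp target name is illustrative
dependencies: [
    .package(url: "https://github.com/FluidInference/FluidAudio.git", from: "0.0.3"),
],
targets: [
    .target(
        name: "MyApp",
        dependencies: [
            .product(name: "FluidAudio", package: "FluidAudio")
        ]
    )
]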

Documentation

See the public DeepWiki docs: https://deepwiki.com/FluidInference/FluidAudio

MCP

The repo is indexed by DeepWiki; the MCP server gives your coding tool access to the docs.

For most clients:

{
  "mcpServers": {
    "deepwiki": {
      "url": "https://mcp.deepwiki.com/mcp"
    }
  }
}

For Claude Code:

claude mcp add -s user -t http deepwiki https://mcp.deepwiki.com/mcp

🚀 Roadmap

Coming Soon:

  • System Audio Access: Tap into system audio via Core Audio on macOS, with no need for ScreenCaptureKit or BlackHole

🎯 Performance

AMI Benchmark Results (Single Distant Microphone), measured on a subset of the files:

  • DER: 17.7% - Competitive with Powerset BCE 2023 (18.5%)

  • JER: 28.0% - Slightly better than x-vector clustering (28.7%); EEND 2019 reports 25.3%

  • RTF: 0.02x - Real-time processing with 50x speedup

  • Efficient Computing: Runs on Apple Neural Engine with zero performance trade-offs

  RTF = Processing Time / Audio Duration

  With RTF = 0.02x:
  - 1 minute of audio takes 0.02 × 60 = 1.2 seconds to process
  - 10 minutes of audio takes 0.02 × 600 = 12 seconds to process

  For real-time speech-to-text:
  - Latency: ~1.2 seconds per minute of audio
  - Throughput: Can process 50x faster than real-time
  - Pipeline impact: Minimal - diarization won't be the bottleneck
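
The same arithmetic as a trivial helper (illustrative only, not part of the FluidAudio API):

/// Estimate processing time for a clip from the measured real-time factor.
func estimatedProcessingTime(audioSeconds: Double, rtf: Double = 0.02) -> Double {
    // RTF = processing time / audio duration, so processing time = RTF × duration
    rtf * audioSeconds
}

estimatedProcessingTime(audioSeconds: 60)   // 1.2 s for one minute of audio
estimatedProcessingTime(audioSeconds: 600)  // 12 s for ten minutes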

🎙️ Voice Activity Detection (VAD)

  • 98% Accuracy on MUSAN dataset at optimal threshold (0.445)
  • CoreML Pipeline: STFT → Encoder → RNN → Enhanced Fallback architecture
  • Noise Robustness: SNR filtering (6.0 dB threshold), spectral analysis, temporal smoothing


Technical Achievements:

  • Model Conversion: Solved PyTorch → CoreML limitations with a custom fallback algorithm
  • Performance: Real-time processing with minimal latency overhead
  • Integration: Ready for embedding into diarization pipeline

🗣️ Automatic Speech Recognition (ASR)

  • Model: Parakeet TDT-0.6b v2 - Token-and-Duration Transducer architecture
  • Real-time Factor: Optimized for real-time transcription with chunking support
  • LibriSpeech Benchmark: Competitive WER (Word Error Rate) performance
  • Streaming Support: Process audio in chunks for live transcription


Technical Features:

  • Chunked Processing: Support for real-time audio streaming with configurable chunk sizes
  • Dual Audio Sources: Separate decoder states for microphone and system audio
  • Text Normalization: Post-processing for improved accuracy
  • ANE Optimization: Fully optimized for Apple Neural Engine execution

๐Ÿข Real-World Usage

FluidAudio powers production applications including:

  • Slipbox: Privacy-first meeting assistant for real-time conversation intelligence
  • Whisper Mate: Transcribes movies and audio to text locally, with real-time recording and transcription from speaker or system apps. 🔒 All processing runs locally with the Whisper AI model.

Make a PR if you want to add your app!

Quick Start

import FluidAudio

// Initialize and process audio
Task {
    let diarizer = DiarizerManager()
    // Download the CoreML models if they are not already present, then load them
    diarizer.initialize(models: try await .downloadIfNeeded())

    let audioSamples: [Float] = // your 16kHz audio data
    let result = try diarizer.performCompleteDiarization(audioSamples, sampleRate: 16000)

    for segment in result.segments {
        print("\(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
    }
}
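
The same manager exposes compareSpeakers(audio1:audio2:) (see the API Reference below), which you can use for speaker identification. A minimal sketch, assuming the call throws and returns a numeric similarity score:

// Inside the same Task as above.
// Assumption: compareSpeakers returns a similarity score, where higher
// values mean the two samples are more likely the same speaker.
let sampleA: [Float] = // 16kHz audio of the first voice
let sampleB: [Float] = // 16kHz audio of the second voice
let similarity = try diarizer.compareSpeakers(audio1: sampleA, audio2: sampleB)
print("Speaker similarity: \(similarity)")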

Voice Activity Detection Usage

VAD Library API:

import FluidAudio

// Initialize VAD with optimal configuration
let vadConfig = VADConfig(
    threshold: 0.445,             // Optimized for 98% accuracy
    chunkSize: 512,              // Audio chunk size for processing
    sampleRate: 16000,           // 16kHz audio processing
    adaptiveThreshold: true,     // Enable dynamic thresholding
    minThreshold: 0.1,           // Minimum threshold value
    maxThreshold: 0.7,           // Maximum threshold value
    enableSNRFiltering: true,    // SNR-based noise rejection
    minSNRThreshold: 6.0,        // Aggressive noise filtering
    useGPU: true                 // Metal Performance Shaders
)

// Process audio for voice activity detection
Task {
    let vadManager = VadManager(config: vadConfig)
    try await vadManager.initialize()
    
    let audioSamples: [Float] = // your 16kHz audio data
    let vadResult = try await vadManager.detectVoiceActivity(audioSamples)
    
    print("Voice activity detected: \(vadResult.hasVoice)")
    print("Confidence score: \(vadResult.confidence)")
}
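
For longer recordings you can walk the buffer in 512-sample chunks (the chunkSize configured above) to locate where speech starts. A sketch, assuming detectVoiceActivity can be called once per chunk:

// Scan the buffer chunk by chunk; chunkSize matches the config above.
let chunkSize = 512
for start in stride(from: 0, to: audioSamples.count, by: chunkSize) {
    let end = min(start + chunkSize, audioSamples.count)
    let chunk = Array(audioSamples[start..<end])
    let result = try await vadManager.detectVoiceActivity(chunk)
    if result.hasVoice {
        print("Voice at \(Double(start) / 16000.0)s, confidence \(result.confidence)")
    }
}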

Automatic Speech Recognition Usage

import FluidAudio

// Initialize ASR with configuration
let asrConfig = ASRConfig(
    maxSymbolsPerFrame: 3,
    realtimeMode: true,
    chunkSizeMs: 1500,          // Process in 1.5 second chunks
    tdtConfig: TdtConfig(
        durations: [0, 1, 2, 3, 4],
        maxSymbolsPerStep: 3
    )
)

// Transcribe audio
Task {
    let asrManager = AsrManager(config: asrConfig)
    
    // Load models (automatic download if needed)
    let models = try await AsrModels.downloadAndLoad()
    try await asrManager.initialize(models: models)
    
    let audioSamples: [Float] = // your 16kHz audio data
    let result = try await asrManager.transcribe(audioSamples)
    
    print("Transcription: \(result.text)")
    print("Processing time: \(result.processingTime)s")
    
    // For streaming/chunked transcription
    let chunkResult = try await asrManager.transcribeChunk(
        audioChunk,
        source: .microphone  // or .system for system audio
    )
}
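
To drive the streaming path from a capture loop, slice the incoming audio to match chunkSizeMs. A sketch, assuming the chunk result exposes the same text field as transcribe's result:

// 1500 ms at 16 kHz = 24,000 samples per chunk, matching chunkSizeMs above.
let samplesPerChunk = 16_000 * 1500 / 1000
for start in stride(from: 0, to: audioSamples.count, by: samplesPerChunk) {
    let end = min(start + samplesPerChunk, audioSamples.count)
    let chunk = Array(audioSamples[start..<end])
    let partial = try await asrManager.transcribeChunk(chunk, source: .microphone)
    print("Partial: \(partial.text)")  // assumes a .text field, as on transcribe's result
}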

Configuration

Customize behavior with DiarizerConfig:

let config = DiarizerConfig(
    clusteringThreshold: 0.7,      // Speaker similarity (0.0-1.0, higher = stricter)
    minActivityThreshold: 10.0,    // Minimum activity frames for speaker detection
    minDurationOn: 1.0,           // Minimum speech duration (seconds)
    minDurationOff: 0.5,          // Minimum silence between speakers (seconds)
    numClusters: -1,              // Number of speakers (-1 = auto-detect)
    debugMode: false
)
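
The config is applied when the manager is constructed. A sketch, assuming DiarizerManager accepts the config at init the way VadManager and AsrManager do:

Task {
    // Assumption: DiarizerManager(config:) mirrors VadManager(config:) above.
    let diarizer = DiarizerManager(config: config)
    diarizer.initialize(models: try await .downloadIfNeeded())
}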

CLI Usage

FluidAudio includes a powerful command-line interface for benchmarking and audio processing:

Note: The CLI is available on macOS only. For iOS applications, use the FluidAudio library programmatically as shown in the usage examples above.

Diarization Benchmark

# Run AMI benchmark with automatic dataset download
swift run fluidaudio benchmark --auto-download

# Test with specific parameters
swift run fluidaudio benchmark --threshold 0.7 --min-duration-on 1.0 --output results.json

# Test a single file for quick parameter tuning  
swift run fluidaudio benchmark --single-file ES2004a --threshold 0.8

ASR Benchmark

# Run LibriSpeech ASR benchmark
swift run fluidaudio asr-benchmark --subset test-clean --num-files 50

# Benchmark with specific configuration
swift run fluidaudio asr-benchmark --subset test-other --chunk-size 2000 --output asr_results.json

# Test with automatic download
swift run fluidaudio asr-benchmark --auto-download --subset test-clean

Process Individual Files

# Process a single audio file for diarization
swift run fluidaudio process meeting.wav

# Save results to JSON
swift run fluidaudio process meeting.wav --output results.json --threshold 0.6

Download Datasets

# Download AMI dataset for diarization benchmarking
swift run fluidaudio download --dataset ami-sdm

# Download LibriSpeech for ASR benchmarking
swift run fluidaudio download --dataset librispeech-test-clean
swift run fluidaudio download --dataset librispeech-test-other

API Reference

Diarization:

  • DiarizerManager: Main diarization class
  • performCompleteDiarization(_:sampleRate:): Process audio and return speaker segments
  • compareSpeakers(audio1:audio2:): Compare similarity between two audio samples
  • validateAudio(_:): Validate audio quality and characteristics

Voice Activity Detection:

  • VadManager: Voice activity detection with CoreML models
  • VADConfig: Configuration for VAD processing with adaptive thresholding
  • detectVoiceActivity(_:): Process audio and detect voice activity
  • VADAudioProcessor: Advanced audio processing with SNR filtering

Automatic Speech Recognition:

  • AsrManager: Main ASR class with TDT decoding
  • AsrModels: Model loading and management
  • ASRConfig: Configuration for ASR processing
  • transcribe(_:): Process complete audio and return transcription
  • transcribeChunk(_:source:): Process audio chunks for streaming
  • AudioSource: Enum for microphone vs system audio separation

License

Apache 2.0 - see LICENSE for details.

Acknowledgments

This project builds upon the excellent work of the sherpa-onnx project for speaker diarization algorithms and techniques. We extend our gratitude to the sherpa-onnx contributors for their foundational work in on-device speech processing.

Pyannote: https://github.com/pyannote/pyannote-audio

WeSpeaker: https://github.com/wenet-e2e/wespeaker