feat(vad): FSMN-VAD backend (CoreML)#653
CoreML FSMN-VAD from FluidInference/fsmn-vad-coreml: 2-stage model (fbank80+LFR preprocessor, fp32/CPU -> FSMN scorer, fp16/ANE, enumerated [512..3072] -> [1,T,248] scores) + a host decision (port of FunASR FsmnVADStreaming: speech if silence_prob<=0.2, 20-frame window hysteresis at 15, max_end_silence 800ms, lookback/lookahead, max_single_segment 60s) -> [start_ms,end_ms]. Long audio is chunked at ~30s; silence probs are concatenated and the decision runs once.
- ModelNames: fsmnVad Repo + FsmnVad registry
- VAD/Fsmn/: FsmnVadModels, FsmnVadManager (+ FsmnVadSegment)
- CLI: fsmn-vad-segment
Verified vs FunASR on a 20s clip: [120,19960] vs [70,19980] (~50ms). Alternative to silero-vad.
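The chunking step described above can be sketched as follows. This is a minimal illustration, not the FluidAudio implementation: the bucket list is an assumption (the source only gives the range [512..3072]), and `planChunks` is a hypothetical helper name.

```swift
/// Hypothetical bucket sizes in frames; only the range [512..3072] comes from the PR.
let assumedBuckets = [512, 1024, 1536, 2048, 2560, 3072]

/// Splits `totalFrames` into chunks no longer than the largest bucket and
/// pairs each chunk with the smallest bucket that can hold it.
func planChunks(totalFrames: Int, buckets: [Int]) -> [(range: Range<Int>, bucket: Int)] {
    guard totalFrames > 0, let maxBucket = buckets.max() else { return [] }
    let sorted = buckets.sorted()
    var plan: [(range: Range<Int>, bucket: Int)] = []
    var start = 0
    while start < totalFrames {
        let end = min(start + maxBucket, totalFrames)
        let length = end - start
        // A fitting bucket always exists because length <= maxBucket.
        let bucket = sorted.first { $0 >= length } ?? maxBucket
        plan.append((range: start..<end, bucket: bucket))
        start = end
    }
    return plan
}

// 7000 frames -> two full 3072-frame chunks plus an 856-frame tail padded to 1024.
let plan = planChunks(totalFrames: 7000, buckets: assumedBuckets)
```

Each chunk would be padded up to its bucket before scoring, and only the first `range.count` scores kept, so that the concatenated silence probabilities line up with the original frames.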
Parakeet EOU Benchmark Results ✅
Status: Benchmark passed
Performance Metrics
Streaming Metrics
Test runtime: 1m12s • 06/01/2026, 11:12 AM EST
RTFx = Real-Time Factor (higher is better) • Processing includes: model inference, audio preprocessing, state management, and file I/O
PocketTTS Smoke Test ✅
Runtime: 0m26s
Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU; audio quality and performance may differ from Apple Silicon.
VAD Benchmark Results
Performance Comparison
Dataset Details
✅: Average F1-Score above 70%
Speaker Diarization Benchmark Results
Speaker Diarization Performance
Evaluating "who spoke when" detection accuracy
Diarization Pipeline Timing Breakdown
Time spent in each stage of speaker diarization
Speaker Diarization Research Comparison
Research baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 36.9s diarization time • Test runtime: 2m 30s • 06/01/2026, 11:12 AM EST
Sortformer High-Latency Benchmark Results
ES2004a Performance (30.4s latency config)
Sortformer High-Latency • ES2004a • Runtime: 4m 0s • 2026-06-01T15:13:11.849Z
ASR Benchmark Results ✅
Status: All benchmarks passed
Parakeet v3 (multilingual)
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming
25 files per dataset • Test runtime: 6m26s • 06/01/2026, 11:09 AM EST
RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Expected RTFx Performance on Physical M1 Hardware:
• M1 Mac: ~28x (clean), ~25x (other)
Testing methodology follows HuggingFace Open ASR Leaderboard
Add --backend fsmn to vad-benchmark (same labeled dataset + per-clip metric as silero). On mini50: FSMN-VAD F1 98.0% (P 96.2/R 100) vs silero 84.7% (P 73.5/R 100), RTFx 640x. Update Benchmarks.md with the apples-to-apples comparison.
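The F1 figures quoted above follow from the reported precision and recall through the usual harmonic mean, F1 = 2PR / (P + R). A small sketch for checking the arithmetic; `f1Score` is a hypothetical helper, not part of the benchmark code.

```swift
/// Harmonic mean of precision and recall, both given as fractions in 0...1.
func f1Score(precision: Double, recall: Double) -> Double {
    let sum = precision + recall
    return sum > 0 ? 2 * precision * recall / sum : 0
}

// Precision/recall pairs reported for mini50 in the commit message.
let fsmnF1 = f1Score(precision: 0.962, recall: 1.0)    // about 0.981
let sileroF1 = f1Score(precision: 0.735, recall: 1.0)  // about 0.847
```

With recall at 100% for both backends, the whole F1 gap comes from precision, that is, from how often each backend flags non-speech clips as speech.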
Qwen3-ASR int8 Smoke Test ✅
Performance Metrics
Runtime: 4m23s
Note: CI VM lacks physical GPU; CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.
Offline VBx Pipeline Results
Speaker Diarization Performance (VBx Batch Mode)
Optimal clustering with Hungarian algorithm for maximum accuracy
Offline VBx Pipeline Timing Breakdown
Time spent in each stage of batch diarization
Speaker Diarization Research Comparison
Offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 174.0s processing • Test runtime: 4m 34s • 06/01/2026, 11:24 AM EST
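The RTFx definition used in these reports (total audio duration ÷ total processing time) can be applied to the figures on the line above. A minimal sketch, assuming the 174.0 s figure is the total processing time; `rtfx` is a hypothetical helper.

```swift
import Foundation

/// RTFx = total audio duration / total processing time (higher is better).
func rtfx(audioSeconds: Double, processingSeconds: Double) -> Double {
    precondition(processingSeconds > 0, "processing time must be positive")
    return audioSeconds / processingSeconds
}

// 1049.0 s of meeting audio processed in 174.0 s.
let offlineVbxRtfx = rtfx(audioSeconds: 1049.0, processingSeconds: 174.0)
print(String(format: "%.1fx", offlineVbxRtfx))  // prints 6.0x
```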
… 69.8% specificity) Full FluidInference/musan noise set (774 clips): FSMN-VAD rejects 81.9% of noise as non-speech (18.1% FP) vs silero 69.8% (30.2% FP) — 12pp fewer false positives. Complements the balanced mini50 F1 (98.0% vs 84.7%).
The Lab41/VOiCES-subset repo now ships audio inside VOiCES_90_*.tar archives (deeply nested) rather than loose clean/ + noisy/ wavs, so the downloader silently produced 0 files. Extract every tar and classify each wav by the noise tag in its filename (-none- = clean, else noisy); error if a clone yields no wavs (layout changed again).

loadVoicesSubset ignored --all-files: count == -1 was hard-coded to 25 speech samples (12 clean + 12 noisy), so --all-files ran only ~49 files. Now -1 loads every VOiCES clip (908) and balances the MUSAN negatives to the speech count (subject to locally available noise).

Verified: download yields 227 clean + 681 noisy; --all-files runs 933 files (908 speech + 25 noise), F1 99.9%, ~1334x RTFx on M5 Pro.
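The filename rule and the -1 handling described above can be sketched as below. This is illustrative only: `VoicesCondition`, `classify`, and `select` are hypothetical names, and the example file names are made up to show the tag position.

```swift
import Foundation

enum VoicesCondition { case clean, noisy }

/// Classifies a VOiCES wav by the noise tag in its file name:
/// "-none-" means clean, anything else is treated as noisy.
func classify(fileName: String) -> VoicesCondition {
    fileName.contains("-none-") ? .clean : .noisy
}

/// Mirrors the fixed count handling: -1 means "every file",
/// any other value caps the selection at that many items.
func select<T>(_ items: [T], count: Int) -> [T] {
    count == -1 ? items : Array(items.prefix(max(0, count)))
}

// Hypothetical file names for illustration.
let names = ["rm1-none-sp0032.wav", "rm1-babb-sp0032.wav", "rm2-musi-sp0101.wav"]
let cleanNames = names.filter { classify(fileName: $0) == .clean }
let everything = select(names, count: -1)
```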
# Conflicts: # Sources/FluidAudio/ModelNames.swift
- FSMN backend leaked memory on long files (autoreleased MLMultiArrays + AVAudio buffers accumulated across chunks/files -> ~8GB RSS, OOM on full MUSAN). Wrap per-chunk scoring and per-file resampling in autoreleasepool; RSS now ~240MB.
- Mark FSMN-VAD as beta/experimental in docs and CLI: on a balanced full-MUSAN set it has high recall but over-triggers on music (low precision), so silero-vad stays the recommended default. Drop the non-representative head-to-head tables.
- Persist FSMN benchmark metrics to fsmn_vad_results.json (release logs info to os_log only).
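The memory fix in the first bullet follows a common pattern: drain autoreleased temporaries once per chunk instead of letting them pile up until the whole file is done. A minimal sketch of that pattern, not the actual FsmnVadManager code; `scoreChunk` is a hypothetical stand-in for the CoreML scorer, and the fallback shim exists only so the sketch compiles off Apple platforms.

```swift
import Foundation

#if !canImport(ObjectiveC)
// autoreleasepool only exists on Apple platforms; this stand-in just runs the closure.
func autoreleasepool<Result>(_ body: () throws -> Result) rethrows -> Result { try body() }
#endif

/// Hypothetical stand-in for the scorer: one silence probability per input frame.
func scoreChunk(_ chunk: ArraySlice<Float>) -> [Float] {
    chunk.map { _ in 0.5 }
}

/// Scores `frames` chunk by chunk and concatenates the per-frame results.
func silenceProbs(frames: [Float], chunkSize: Int) -> [Float] {
    precondition(chunkSize > 0, "chunkSize must be positive")
    var all: [Float] = []
    all.reserveCapacity(frames.count)
    var start = 0
    while start < frames.count {
        let end = min(start + chunkSize, frames.count)
        // Objects autoreleased while scoring this chunk are released when the
        // closure returns, so they do not accumulate across chunks.
        let scores = autoreleasepool { scoreChunk(frames[start..<end]) }
        all.append(contentsOf: scores)
        start = end
    }
    return all
}
```

Only the plain Swift array of scores escapes the pool, which is why peak memory stays roughly proportional to one chunk rather than to the whole file.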
Keep the code changes from 572178b (FSMN memory fix, CLI beta warning, fsmn_vad_results.json output); drop the Benchmarks.md edits per request.
Thanks for adding the FSMN-VAD backend. I’m curious whether there are any existing benchmark comparisons between Ten-VAD and FSMN-VAD for this use case, especially around VAD accuracy and streaming performance/latency. If you’ve already looked at Ten-VAD, I’d be interested to know how it compares as a CoreML VAD candidate.
@LemonCANDY42 I have not had the time to benchmark. I am not too sure how it performs compared to Ten-VAD, but I think silero-vad is still the best.
Got it, thanks.
Summary
Adds FSMN-VAD (FunASR, ~5.2M) as a CoreML voice-activity-detection backend. Model: FluidInference/fsmn-vad-coreml.
Pipeline
The decision ports FunASR's FsmnVADStreaming: per-frame speech if silence_prob ≤ 0.2, a 20-frame sliding-window hysteresis (sil→speech / speech→sil at 15), silence→endpoint after max_end_silence (800 ms), lookback/lookahead, and max_single_segment (60 s). Audio > ~30 s is processed in chunks; per-frame silence probs are concatenated and the decision runs once.
(RangeDim breaks the FSMN's dilated conv on the BNNS path, so the scorer uses fixed enumerated buckets and the host chunks long audio.)
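To make the decision rules above concrete, here is a deliberately simplified sketch. It is not the FunASR or FluidAudio implementation: it assumes a 10 ms frame shift, omits lookback/lookahead and max_single_segment, and uses hypothetical names (`Segment`, `detectSegments`).

```swift
struct Segment: Equatable {
    let startMs: Int
    let endMs: Int
}

/// Simplified endpointing over per-frame silence probabilities.
func detectSegments(
    silenceProbs: [Float],
    frameMs: Int = 10,             // assumed frame shift
    speechThreshold: Float = 0.2,  // frame counts as speech if silence_prob <= 0.2
    windowSize: Int = 20,
    switchCount: Int = 15,
    maxEndSilenceMs: Int = 800
) -> [Segment] {
    var segments: [Segment] = []
    var window: [Bool] = []        // speech flags of the most recent frames
    var inSpeech = false           // hysteresis state
    var segmentStart: Int? = nil   // frame where the open segment began
    var silenceStart: Int? = nil   // frame where trailing silence was detected

    for (i, prob) in silenceProbs.enumerated() {
        window.append(prob <= speechThreshold)
        if window.count > windowSize { window.removeFirst() }
        let speechCount = window.filter { $0 }.count
        let silenceCount = window.count - speechCount

        if !inSpeech && speechCount >= switchCount {
            inSpeech = true        // sil -> speech; also cancels a pending endpoint
            silenceStart = nil
            if segmentStart == nil { segmentStart = i }
        } else if inSpeech && silenceCount >= switchCount {
            inSpeech = false       // speech -> sil; endpoint timer starts here
            silenceStart = i
        }

        // Close the segment once the trailing silence has lasted long enough.
        if let start = segmentStart, let silence = silenceStart,
           (i - silence + 1) * frameMs >= maxEndSilenceMs {
            segments.append(Segment(startMs: start * frameMs, endMs: silence * frameMs))
            segmentStart = nil
            silenceStart = nil
        }
    }
    // Flush a segment that is still open when the audio ends.
    if let start = segmentStart {
        let end = silenceStart ?? silenceProbs.count
        segments.append(Segment(startMs: start * frameMs, endMs: end * frameMs))
    }
    return segments
}

// 0.5 s silence, 2 s speech, 1.5 s silence.
let probs = [Float](repeating: 0.9, count: 50)
    + [Float](repeating: 0.05, count: 200)
    + [Float](repeating: 0.9, count: 150)
let found = detectSegments(silenceProbs: probs)
```

In this sketch the detected boundaries trail the true ones by roughly the time the window needs to fill (15 frames), which is presumably the kind of offset the lookback/lookahead terms in the real port adjust for.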
Changes
- ModelNames: fsmnVad Repo + FsmnVad registry
- Sources/FluidAudio/VAD/Fsmn/: FsmnVadModels, FsmnVadManager (+ FsmnVadSegment)
- CLI: fsmn-vad-segment
Verification
On a 20 s clip: [120, 19960] vs FunASR [70, 19980], boundaries within ~50 ms. 0.04 s, 0.056 GB peak.
Notes
silero-vad.