feat(asr/canary): Canary-1B-v2 AED engine + CTC-spotter custom vocab #709

Open
Alex-Wengg wants to merge 2 commits into main from feat/canary-asr
Alex-Wengg (Member) commented Jun 17, 2026

Summary

Adds NVIDIA Canary-1B-v2 (attention encoder-decoder, AED) as a selectable on-device ASR engine, converted to CoreML and running int4 on the Neural Engine. Canary is a more accurate transcriber than the existing TDT/CTC paths (esp. on hard domains), and this PR also gives it custom-vocabulary support by reusing the existing CTC keyword spotter.

Users pick the engine that fits: canary (best WER) or the existing ctc custom-vocab path (fastest, top keyword recall).

What's included

  • CanaryManager — actor; pipeline: fp32/CPU mel preprocessor → FastConformer encoder → autoregressive transformer decoder + 1024→16384 projection → greedy decode to EOS. Reads the decoder sequence length from the model so a shorter export is picked up automatically.
  • CanaryModels — download/load from FluidInference/canary-1b-v2-coreml (int4 ANE default / fp16 parity / int8 CPU); CanaryPrecision.
  • CanaryKeywordBooster — custom-vocab support for canary by reusing CtcKeywordSpotter: fuzzy-replace mis-transcribed terms, plus timestamp-guided insertion of keywords canary missed entirely. Precision-protected via a score floor.
  • CLI: canary-transcribe (file + LibriSpeech benchmark) and canary-earnings-benchmark (Earnings22-keywords, OpenBench-comparable WER + keyword P/R/F1).
  • ModelNames: Repo.canary1bV2, ModelNames.Canary, wired into getRequiredModelNames.
  • Tests: CanaryConfigTests (registration / precision / config contract).
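
The fuzzy-replace half of the keyword booster can be sketched roughly as follows. This is a minimal illustration, not the CanaryKeywordBooster code: `KeywordHit`, `fuzzyReplace`, `scoreFloor`, and `maxDistance` are hypothetical names, the spotter score scale is assumed, and the timestamp-guided insertion path is left out.

```swift
// Hypothetical types and names for illustration; not the FluidAudio API.
struct KeywordHit {
    let keyword: String
    let score: Double  // spotter confidence; higher is better (assumed scale)
}

/// Levenshtein edit distance over characters.
func editDistance(_ a: String, _ b: String) -> Int {
    let s = Array(a), t = Array(b)
    if s.isEmpty { return t.count }
    if t.isEmpty { return s.count }
    var prev = Array(0...t.count)
    for i in 1...s.count {
        var curr = [i] + Array(repeating: 0, count: t.count)
        for j in 1...t.count {
            let cost = s[i - 1] == t[j - 1] ? 0 : 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        }
        prev = curr
    }
    return prev[t.count]
}

/// Replace the closest transcript word with a spotted keyword, but only when
/// the spotter score clears the floor and the word is within `maxDistance` edits.
func fuzzyReplace(words: [String], hits: [KeywordHit],
                  scoreFloor: Double, maxDistance: Int) -> [String] {
    var out = words
    for hit in hits where hit.score >= scoreFloor {
        let target = hit.keyword.lowercased()
        // Keyword already transcribed correctly: nothing to fix.
        if out.contains(where: { $0.lowercased() == target }) { continue }
        var best: (index: Int, distance: Int)? = nil
        for (i, word) in out.enumerated() {
            let d = editDistance(word.lowercased(), target)
            if d <= maxDistance && (best == nil || d < best!.distance) {
                best = (index: i, distance: d)
            }
        }
        if let best = best { out[best.index] = hit.keyword }
    }
    return out
}
```

The score floor is what protects precision in this sketch: a low-confidence spotter hit never rewrites the transcript, even if a similar-looking word is present.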

Benchmarks (int4, Apple Silicon ANE)

LibriSpeech test-clean (≤15s): WER ~1.7%, RTFx ~10.8x.

Earnings22-keywords (full 772 chunks), scored by the same whole-word keyword metric as Argmax's OpenBench:

| Engine | WER | Keyword F1 | RTFx |
|---|---|---|---|
| CTC custom-vocab (existing) | 22.5% | 0.97 | 35.8x |
| Canary + vocab + injection | 16.5% | 0.95 | 10.8x |

Both beat Argmax's published parakeet-v2/v3 keyword F1 (0.91 / 0.89). Canary also beats the CTC path on WER by 6 points (16.5% vs. 22.5%).
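
For readers unfamiliar with keyword P/R/F1, here is a minimal sketch of a whole-word keyword metric. It assumes single-word keywords and count-based matching per utterance; `keywordScore` and `tokenizeWords` are hypothetical names, and the actual OpenBench scorer may differ in normalization and matching rules.

```swift
// Illustrative only: assumes single-word keywords and count-based matching.
struct KeywordScore {
    let precision: Double
    let recall: Double
    let f1: Double
}

/// Lowercase and split on anything that is not a letter or digit.
func tokenizeWords(_ text: String) -> [String] {
    text.lowercased()
        .split(whereSeparator: { !($0.isLetter || $0.isNumber) })
        .map { String($0) }
}

func keywordScore(reference: String, hypothesis: String, keywords: [String]) -> KeywordScore {
    let ref = tokenizeWords(reference)
    let hyp = tokenizeWords(hypothesis)
    var tp = 0, fp = 0, fn = 0
    for keyword in keywords {
        let k = keyword.lowercased()
        let inRef = ref.filter { $0 == k }.count
        let inHyp = hyp.filter { $0 == k }.count
        tp += min(inRef, inHyp)      // keyword occurrences the hypothesis got right
        fp += max(0, inHyp - inRef)  // extra occurrences the hypothesis invented
        fn += max(0, inRef - inHyp)  // occurrences the hypothesis missed
    }
    let precision = tp + fp == 0 ? 0 : Double(tp) / Double(tp + fp)
    let recall = tp + fn == 0 ? 0 : Double(tp) / Double(tp + fn)
    let f1 = precision + recall == 0 ? 0 : 2 * precision * recall / (precision + recall)
    return KeywordScore(precision: precision, recall: recall, f1: f1)
}
```

Under this scheme, a transcript that renders one of two "nvidia" mentions as "in video" keeps precision at 1.0 but loses recall, which is the failure mode keyword insertion is meant to address.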

Model conversion

CoreML conversion pipeline lives in the mobius repo (models/stt/canary-1b-v2/coreml/): NeMo→CoreML export, int4/int8 quantization, projection model, validation (byte-exact vs PyTorch greedy decode), HF staging. Models hosted at FluidInference/canary-1b-v2-coreml.

Notes / follow-ups

  • int4 requires iOS 18 / macOS 15 (int4 weight payloads); fp16 is the iOS 17 fallback.
  • Decoder has no KV cache yet — re-runs the sequence each step, so canary is ~3x slower than the CTC path. A cache-external decoder export is the planned follow-up to close the gap.
  • 15s window per decode; audio >15s is now chunked into overlapping 15s windows (3s overlap) and stitched at the seam via token-level longest-common-substring. Audio ≤15s is unchanged.
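
To make the KV-cache note concrete, here is a rough sketch of a cache-free greedy decode loop. The `step` closure is a hypothetical stand-in for the decoder plus projection models (full token prefix in, next-token logits out); it is not the CanaryManager API.

```swift
/// Greedy decode without a KV cache. `step` is a hypothetical stand-in for
/// decoder + projection: it receives the full prefix and returns logits
/// over the vocabulary for the next token.
func greedyDecode(prompt: [Int], eos: Int, maxLength: Int,
                  step: ([Int]) -> [Float]) -> [Int] {
    var tokens = prompt
    while tokens.count < maxLength {
        // No cache: the whole prefix is re-processed on every step,
        // so total work grows roughly quadratically with output length.
        let logits = step(tokens)
        guard let next = logits.indices.max(by: { logits[$0] < logits[$1] }) else { break }
        if next == eos { break }
        tokens.append(next)
    }
    // Return only the newly generated tokens, without the prompt.
    return Array(tokens.dropFirst(prompt.count))
}
```

A cache-external export would change `step` to take only the newest token plus cached keys and values, which is where the planned speedup would come from.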

🤖 Generated with Claude Code

Add NVIDIA Canary-1B-v2 (attention encoder-decoder) as a selectable ASR
engine, converted to CoreML (int4 on ANE, iOS18). Pipeline: fp32/CPU mel
preprocessor -> FastConformer encoder -> autoregressive transformer decoder
+ 1024->16384 projection, greedy decode to EOS.

- CanaryManager: actor, 15s window, reads decoder seq length from the model
- CanaryModels: download/load from FluidInference/canary-1b-v2-coreml (int4/fp16/int8)
- CanaryKeywordBooster: reuses the CTC keyword spotter to add custom-vocabulary
  support to canary (fuzzy replace + timestamp-guided insertion)
- CLI: canary-transcribe, canary-earnings-benchmark (OpenBench-comparable P/R/F1)
- ModelNames: Repo.canary1bV2 + ModelNames.Canary + CanaryPrecision
- Tests: CanaryConfigTests

Earnings22-keywords (full 772, same scorer as OpenBench):
  canary+vocab WER 16.5%, keyword F1 0.95 (beats Argmax parakeet-v3 0.89)

github-actions Bot commented Jun 17, 2026

PocketTTS Smoke Test ✅

| Check | Result |
|---|---|
| Build | |
| Model download | |
| Model load | |
| Synthesis pipeline | |
| Output WAV | ✅ (153.8 KB) |

Runtime: 1m30s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality and performance may differ from Apple Silicon.

github-actions Bot commented Jun 17, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

| Metric | Value | Description |
|---|---|---|
| WER (Avg) | 7.03% | Average Word Error Rate |
| WER (Med) | 4.17% | Median Word Error Rate |
| RTFx | 12.34x | Real-time factor (higher = faster) |
| Total Audio | 470.6s | Total audio duration processed |
| Total Time | 39.3s | Total processing time |

Streaming Metrics

| Metric | Value | Description |
|---|---|---|
| Avg Chunk Time | 0.039s | Average chunk processing time |
| Max Chunk Time | 0.079s | Maximum chunk processing time |
| EOU Detections | 0 | Total End-of-Utterance detections |

Test runtime: 1m13s • 06/17/2026, 10:20 PM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

github-actions Bot commented Jun 17, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
|---|---|---|---|---|
| DER | 10.4% | <20% | | Diarization Error Rate (lower is better) |
| RTFx | 13.49x | >1.0x | | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | 15.186 | 19.5 | Fetching diarization models |
| Model Compile | 6.508 | 8.4 | CoreML compilation |
| Audio Load | 0.048 | 0.1 | Loading audio file |
| Segmentation | 21.556 | 27.7 | VAD + speech detection |
| Embedding | 77.585 | 99.7 | Speaker embedding extraction |
| Clustering (VBx) | 0.099 | 0.1 | Hungarian algorithm + VBx clustering |
| Total | 77.809 | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
|---|---|---|---|
| FluidAudio (Offline) | 10.4% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 99.2s processing • Test runtime: 1m 48s • 06/17/2026, 10:01 PM EST

github-actions Bot commented Jun 17, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---|---|---|---|---|
| test-clean | 0.57% | 0.00% | 5.54x | |
| test-other | 1.59% | 0.00% | 3.37x | |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---|---|---|---|---|
| test-clean | 0.80% | 0.00% | 5.59x | |
| test-other | 1.00% | 0.00% | 3.61x | |

Streaming (v3)

| Metric | Value | Description |
|---|---|---|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.55x | Streaming real-time factor |
| Avg Chunk Time | 1.610s | Average time to process each chunk |
| Max Chunk Time | 2.201s | Maximum chunk processing time |
| First Token | 1.842s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
|---|---|---|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.63x | Streaming real-time factor |
| Avg Chunk Time | 1.406s | Average time to process each chunk |
| Max Chunk Time | 1.562s | Maximum chunk processing time |
| First Token | 1.430s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 10m39s • 06/17/2026, 10:28 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)
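
As a small worked instance of the RTFx definition above (a sketch only; `rtfx` is a hypothetical helper, not part of the benchmark tooling):

```swift
/// Real-time factor: seconds of audio processed per second of processing time.
func rtfx(audioSeconds: Double, processingSeconds: Double) -> Double {
    precondition(processingSeconds > 0, "processing time must be positive")
    return audioSeconds / processingSeconds
}

// The case from the example above: 10 seconds of audio processed in 5 seconds.
let exampleRtfx = rtfx(audioSeconds: 10, processingSeconds: 5)
print("RTFx: \(exampleRtfx)x")
```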

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

github-actions Bot commented Jun 17, 2026

Supertonic3 Smoke Test ✅

| Check | Result |
|---|---|
| Build | |
| Model download (incl. VectorEstimatorVariants/ int4 buckets) | |
| Model load | |
| Synthesis pipeline (--ve-variant int4) | |
| Output WAV | ✅ (364.7 KB) |

Runtime: 0m27s

Note: CI VMs lack a physical Neural Engine; the ANE-bucketed VectorEstimator falls back to CPU here. This validates download + variant resolution + synthesis, not ANE residency/perf.

github-actions Bot commented Jun 17, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target | Status |
|---|---|---|---|
| DER | 30.3% | <35% | |
| Miss Rate | 28.2% | - | - |
| False Alarm | 0.9% | - | - |
| Speaker Error | 1.2% | - | - |
| RTFx | 13.2x | >1.0x | |
| Speakers | 4/4 | - | - |

Sortformer High-Latency • ES2004a • Runtime: 2m 45s • 2026-06-18T02:09:28.329Z

github-actions Bot commented Jun 17, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Status | Description |
|---|---|---|---|---|
| DER | 15.1% | <30% | | Diarization Error Rate (lower is better) |
| JER | 24.9% | <25% | | Jaccard Error Rate |
| RTFx | 18.07x | >1.0x | | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | 12.777 | 22.0 | Fetching diarization models |
| Model Compile | 5.476 | 9.4 | CoreML compilation |
| Audio Load | 0.085 | 0.1 | Loading audio file |
| Segmentation | 17.418 | 30.0 | Detecting speech regions |
| Embedding | 29.030 | 50.0 | Extracting speaker voices |
| Clustering | 11.612 | 20.0 | Grouping same speakers |
| Total | 58.083 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
|---|---|---|
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): Runs at 150x real-time (RTFx)
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 58.1s diarization time • Test runtime: 3m 26s • 06/17/2026, 10:04 PM EST

github-actions Bot commented Jun 17, 2026

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
|---|---|---|---|---|---|---|
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 721.4x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 754.6x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

Split audio longer than the 15s window into overlapping 15s windows
(hop = 15s - 3s overlap), decode each independently, and stitch adjacent
windows at the seam via token-level longest-common-substring
(mergeTokenStreams, mirroring CoherePipeline). Audio <=15s is unchanged
(single-window). No model change - each window still sees the fixed 15s
contract and the decoder is reset per window.

Unblocks >15s datasets (e.g. FDA) that the fixed-window decoder previously
truncated. Adds CanaryChunkMergeTests for the seam stitcher.
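
The windowing and seam stitching described in this commit can be sketched roughly as below. This is an illustration under stated assumptions, not the shipped code: `windowRanges` and `stitchTokens` are hypothetical names, and the real mergeTokenStreams may restrict the match to the overlap region or break ties differently.

```swift
/// Split `totalSamples` into overlapping fixed-size windows.
/// Audio that fits in one window yields a single range (unchanged path).
func windowRanges(totalSamples: Int, windowSamples: Int, overlapSamples: Int) -> [Range<Int>] {
    precondition(overlapSamples >= 0 && windowSamples > overlapSamples)
    if totalSamples <= windowSamples { return [0..<totalSamples] }
    let hop = windowSamples - overlapSamples  // e.g. 15s - 3s = 12s
    var ranges: [Range<Int>] = []
    var start = 0
    while true {
        let end = min(start + windowSamples, totalSamples)
        ranges.append(start..<end)
        if end == totalSamples { break }
        start += hop
    }
    return ranges
}

/// Stitch two adjacent token streams at their longest common substring.
/// Tokens after the match in `left` and before it in `right` are treated as
/// seam noise and dropped. Searching the whole streams is a simplification;
/// a production version would likely only search near the seam.
func stitchTokens(_ left: [Int], _ right: [Int]) -> [Int] {
    var bestLength = 0, endLeft = 0, endRight = 0
    var prev = [Int](repeating: 0, count: right.count + 1)
    for i in 0..<left.count {
        var curr = [Int](repeating: 0, count: right.count + 1)
        for j in 0..<right.count where left[i] == right[j] {
            curr[j + 1] = prev[j] + 1  // extend the common run ending at (i, j)
            if curr[j + 1] > bestLength {
                bestLength = curr[j + 1]
                endLeft = i + 1
                endRight = j + 1
            }
        }
        prev = curr
    }
    // No shared tokens: fall back to plain concatenation.
    if bestLength == 0 { return left + right }
    return Array(left[..<endLeft]) + Array(right[endRight...])
}
```

With a 15s window and 3s overlap, 30s of audio yields three windows starting at 0s, 12s, and 24s; each adjacent pair of decoded streams would then be folded together with `stitchTokens`.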