bestASR

The benchmark-driven local ASR router for Apple Silicon.

bestASR measures how speech-recognition backends and models actually perform on your machine — then recommends and runs the best setup, and explains why.

Instead of guessing between WhisperKit, whisper.cpp, model sizes, and quantization levels, you benchmark them once against your own audio and a ground-truth .srt, and every later recommend / transcribe is backed by real numbers:

Not: Using medium model. But: whisperkit large-v3-turbo — measured on this machine: CER 5.0%, 12.0x realtime.

Why bestASR?

Local speech-to-text on a Mac has real choices with real trade-offs: WhisperKit rides CoreML and the Neural Engine; whisper.cpp brings flexible GGML quantization. Which is best depends on your machine, your audio, your language, and whether you care more about accuracy or speed.

"Best" should not be a guess. bestASR's moat is three things: measured recommendations, stable fallback, and clear explanations.

Requirements

Apple Silicon Mac (arm64) — Intel Macs and Rosetta are not supported
macOS 14 (Sonoma) or later

Install

git clone https://github.com/PsychQuant/bestASR.git
cd bestASR
swift build -c release
cp .build/release/bestasr /usr/local/bin/  # or anywhere on PATH

Backends:

WhisperKit is built in — models download on demand at first use.
whisper.cpp is optional: brew install whisper-cpp, then place GGML model files under ~/.bestasr/models/whisper-cpp/ (the error message tells you the exact file name and download URL when one is missing). Quantization variants differ per model on HuggingFace — bestasr list-models shows the hosted set (e.g. tiny/base/small ship q5_1, not q5_0).
The model grid additionally carries a reference catalog of 15 MLX-native STT families (Parakeet, Qwen3-ASR, Moonshine, Canary, MMS, Voxtral, …) with verified HuggingFace repos and pinned revisions — visible in bestasr list-models for lookup. No engine is bundled for them (the mlx-audio backend was evaluated and removed (#20); git history has the full implementation if it's ever wanted back).

Quick start

bestasr diagnose                            # what is this machine, and what would it recommend?
bestasr transcribe input.mp3                # best known setup, chosen for you
bestasr transcribe input.mp3 --profile max  # most accurate, time is no object

With no flags, bestASR decides for you — that's the point. The default profile is auto: it reads your hardware (chip, unified memory, Neural Engine), your measured benchmark store, and the machine's current condition — under thermal pressure or Low Power Mode, auto downshifts to a faster tier rather than grinding a hot machine through a huge model, and --explain tells you it did.

Effort profiles

--profile is an ordinal effort ladder (modeled on Claude Code's effort levels): pick how hard the router should chase accuracy, and it maps that to concrete models using your machine's measured numbers.

Profile	Accuracy : speed weighting	What it means
`auto` (default)	—	`medium` normally; `low` when the machine reports pressure (thermal serious/critical, or Low Power Mode). Never applied on top of an explicit choice.
`low`	0.267 : 0.733	Fastest acceptable — drafts, long batch queues
`medium`	0.5 : 0.5	The balanced default
`high`	0.8 : 0.2	Accuracy-leaning, still speed-aware
`xhigh`	0.9 : 0.1	Near-max accuracy
`max`	1.0 : 0	Most accurate regardless of time. A pure argmax over measured error rate; equal-accuracy ties break to the faster candidate

"Most accurate" means measured on your machine whenever benchmark data exists — not a hardcoded model name. Without measurements the top tiers fall back to the same biggest-that-fits cold-start prior (ordinals can only differ once there is data to weigh — run the benchmark).

Because auto reads live machine state, recommend / transcribe with no --profile can resolve differently on a throttled machine (it says so in --explain). If you need a byte-stable result for automation, pass an explicit ordinal — an explicit choice is never touched by machine state.

Migrating from ≤0.7.x: fast → low, balanced → medium, accurate → high (or max when you truly don't care about time). The old names now fail with exactly that hint.

The benchmark workflow (where "best" gets real)

# 1. Measure every available backend/model/quantization against ground truth
bestasr benchmark clip.wav --reference clip.srt --language zh

# 2. From now on, recommendations cite your machine's measured numbers
bestasr recommend clip2.wav --language zh
# → "data_source": "measured", CER + x-realtime from YOUR benchmark

# 3. Transcribe with the winner — and see why it won
bestasr transcribe clip2.wav --language zh --explain

Register your ground truth once (bestasr corpus add talk.wav talk.srt --language zh; scripts/fetch-corpora.sh fetches the three-language standard set — English, Traditional Chinese (Common Voice zh-TW, CC-0), and Japanese (FLEURS), ~20-30 utterances per language in 3-5 medium corpora, every byte digest-pinned. "Chinese" in this project means Traditional Chinese — Taiwanese Mandarin; Simplified is not part of the corpus set) — results land in the BCNF store at ~/.bestasr/store/ (four JSONL tables; measurements are append-only, routing reads the latest per model × corpus × machine). The ground truth is a standard .srt subtitle file. Accuracy is scored as CER for languages without word spacing (zh / ja / ko) and WER otherwise; speed as measured times-realtime (model download/load excluded — WhisperKit pipelines load once per model and are reused, so its timed pass measures pure decode speed; whisper.cpp runs as a subprocess and its timed pass includes a small GGML load); results persist in ~/.bestasr/benchmarks.json per machine.

The regression gate (accuracy never regresses)

scripts/regression-gate.sh   # exit 0 = no corpus regressed; exit 1 names the culprit

benchmarks/baseline.json pins a golden CER/WER per standard corpus for one fixed reference model (whisperkit large-v3-turbo). The gate re-benchmarks every corpus and fails loudly when any accuracy drifts past its tolerance. Two design points worth knowing:

Accuracy only, deterministic decode. CER/WER is a text comparison, so the committed baseline carries no timing numbers; speed is machine-dependent and is never gated. The gate benchmarks with --decode-deterministic (temperature fallback disabled) — Whisper's fallback re-decodes low-quality segments at temperature > 0, which is stochastic sampling and was observed live to flip a corpus CER between runs; the canary pins greedy decoding so goldens are reproducible, while normal transcription keeps the fallback rescue. CoreML inference is still not guaranteed bit-identical across chip generations or OS versions — the per-corpus tolerance absorbs small drift, and the seeding provenance (machine, model-repo revision, decode config) is recorded in benchmarks/baseline-meta.json. Same-machine repeat runs reproduce goldens to ±0.0000; running in CI needs an Apple-silicon runner with the ~1.5 GB reference model.
Traditional Chinese is scored fairly. Whisper-family models emit Simplified for Mandarin; the references here are Traditional. Chinese CER (any zh tag — zh, zh-TW, zh-Hant, …) folds both sides Traditional→Simplified inside the metric (system ICU, the unambiguous direction) so the score measures recognition, not output script. Delivered transcripts are untouched; Japanese kanji are never folded, and --language auto never folds (it cannot tell Chinese from Japanese text — pass an explicit zh tag for folded scoring).

A gate failure has three possible causes — triage before blaming code: a code regression, a corpus change, or upstream model-artifact drift.

Context calibration (make domain terms and names come out right)

Put your documents into a context folder and bestASR biases the decoder toward your vocabulary — and an agent can proofread the result:

# 1. Distill documents (pdf/docx/…) into context.json — agent skill
claude plugin marketplace add PsychQuant/bestASR
#    then ask Claude to run the context-ingest skill on your docs folder

# 2. Transcribe with context (auto-resolves --context-dir >
#    ./bestasr-context/ > ~/.bestasr/context/) and see what got injected
bestasr transcribe input.mp3 --explain

# 3. Prove the biasing works on YOUR audio (± context delta columns)
bestasr benchmark clip.wav --reference clip.srt --context-dir ./bestasr-context

# 4. Agent-side proofreading (three-axis: speaker / timestamp / text,
#    timecodes immutable) — srt-proofread skill

context.json v1 carries terms, names (with aliases + roles — the speaker axis), and phrases; plain .txt/.md term lists work too. Unsupported formats are loudly ignored with guidance. An empty folder changes nothing.

Speaker diarization (who spoke when)

bestasr transcribe meeting.m4a --format srt --diarize

1
00:00:00,000 --> 00:00:09,300
[SPEAKER_1] 先講一下上週的進度……

2
00:00:10,300 --> 00:00:18,000
[SPEAKER_2] 我這邊模型已經跑完了。

Each cue is labeled with the acoustic speaker that overlaps it most. Speakers are numbered in order of first appearance; CoreML diarization models download on first use. Works with every output format (srt / vtt / txt / json).

Speaker identification (who is SPEAKER_1, actually)

Drop a short voice sample per person into a voices/ folder inside your context directory — the filename becomes the label:

bestasr-context/
  context.json
  voices/
    Alice.wav      # a few seconds of Alice speaking, alone
    Bob.m4a

bestasr transcribe meeting.m4a --format srt --diarize
# → [Alice] …, [Bob] …, and any un-enrolled voice stays [SPEAKER_1]

Identification is a post-hoc embedding match against each recording's diarized speakers (cosine distance, 0.65 threshold): enrolled voices are labeled by name, strangers keep their ordinals, and a corrupt sample is skipped with a warning instead of failing the transcription.

Voice prints are sensitive biometric data — bestASR ships no code that transmits them. Concretely: voices/ is in the repo's .gitignore, the context-ingest skill's rules exclude it, and the only reader is the local --diarize run. (These are the enforced mechanisms; bestASR cannot govern what other tools on your machine do with the files.)

Transcribe any source (agent skill)

bestasr transcribe takes a local audio file. The transcript agent skill (in this repo's Claude plugin) wraps it so you can point at any source — a YouTube URL, any yt-dlp-supported site, a local audio/video file, or an existing subtitle — and get an SRT back. It's a conversational skill: you ask Claude in plain language and it drives yt-dlp / ffmpeg / bestasr for you.

"Transcribe https://www.youtube.com/watch?v=xxxx for me" → downloads the audio, ASR-transcribes it, writes an SRT. "Make subtitles for ~/Movies/lecture.mp4, most accurate" → extracts audio, transcribes with --profile max. "Transcribe ~/rec/meeting.m4a with my term list and speaker labels" → --context-dir ./bestasr-context --diarize. "Make a transcript" (no source) → the skill asks what to transcribe.

The skill treats every input as a "source" and branches by type: URLs and video get their audio extracted (yt-dlp / ffmpeg) then ASR-transcribed; audio files go straight to bestasr transcribe; an existing .srt/.vtt is normalized rather than re-transcribed. Downloaded audio lives in a temp dir that's always cleaned up, and the skill treats a pasted URL/path as untrusted input (validated before it reaches any shell command). This is ASR transcription — distinct from grabbing a platform's existing captions (use yt-subtitle-downloader when you want the platform's own subtitles).

See why (--explain)

Every selection is explainable — including what auto decided:

Selected whisperkit large-v3-turbo [measured] because:
  - auto profile resolved to medium (no machine pressure)
  - measured on this machine: CER 5.0%, 12.0x realtime
  - whisperkit preferred on Apple Silicon (CoreML path)

Under load the first line becomes auto profile downshifted to low (thermal state: serious) — no silent behavior changes.

Commands

Command	What it does
`bestasr diagnose`	Hardware profile (chip / unified memory / ANE / macOS) + recommendation
`bestasr benchmark <audio> --reference <gt.srt>`	Measure candidates, print ranked table, persist results (`--json` for machines)
`bestasr recommend <audio>`	JSON recommendation only — measured when data exists, cold-start prior otherwise
`bestasr transcribe <audio>`	Transcribe; `--format txt\|json\|srt\|vtt`, `--output`, `--context-dir`, `--diarize`, `--explain`
`bestasr list-backends`	Backend availability on this machine
`bestasr list-models`	The model grid: whisper sizes + the 15-family mlx-audio catalog with priority tiers
`bestasr corpus add <audio> <ref.srt> --language <l>`	Register ground truth (zh/ja: bring your own material)
`bestasr corpus list`	Registered corpora

How it works

CLI → Detect (chip/memory/ANE, AVFoundation audio probing,
              dynamic machine state: thermal + Low Power Mode)
    → Route  (tier 1: rank measured benchmark records for this chip;
              tier 2: cold-start prior + memory downgrade — and it tells you
              to benchmark)
    → Engine (WhisperKit · whisper.cpp, one normalized interface)
    → Output (txt / json / srt / vtt, optional speaker labels)

Every recommendation carries a reason list. Cold-start recommendations say so honestly and point you at bestasr benchmark.

Development

swift test          # 200+ tests, no real models needed (engines are mocked)
swift build         # debug build

Specs live in openspec/specs/ (Spectra spec-driven development). The original cross-platform Python implementation is preserved under archive/python/ for reference.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
.agents/skills		.agents/skills
.claude-plugin		.claude-plugin
.claude		.claude
Sources		Sources
Tests/BestASRKitTests		Tests/BestASRKitTests
archive/python		archive/python
benchmarks		benchmarks
docs		docs
openspec		openspec
plugins/bestasr		plugins/bestasr
scripts		scripts
.gitignore		.gitignore
.spectra.yaml		.spectra.yaml
AGENTS.md		AGENTS.md
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
Package.resolved		Package.resolved
Package.swift		Package.swift
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

bestASR

Why bestASR?

Requirements

Install

Quick start

Effort profiles

The benchmark workflow (where "best" gets real)

The regression gate (accuracy never regresses)

Context calibration (make domain terms and names come out right)

Speaker diarization (who spoke when)

Speaker identification (who is SPEAKER_1, actually)

Transcribe any source (agent skill)

See why (--explain)

Commands

How it works

Development

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

bestASR

Why bestASR?

Requirements

Install

Quick start

Effort profiles

The benchmark workflow (where "best" gets real)

The regression gate (accuracy never regresses)

Context calibration (make domain terms and names come out right)

Speaker diarization (who spoke when)

Speaker identification (who is SPEAKER_1, actually)

Transcribe any source (agent skill)

See why (--explain)

Commands

How it works

Development

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages