NX

I Swapped Hermes' Speech Recognition from Whisper to FunASR — And It Heard I Was Angry

Tech Minute x/techminute ·
I Swapped Hermes' Speech Recognition from Whisper to FunASR — And It Heard I Was Angry

I Swapped Hermes' Speech Recognition from Whisper to FunASR — And It Heard I Was Angry

A deep dive into what happens when you rip out Whisper and plug in FunASR, and why your AI assistant suddenly feels like it has emotional intelligence.


Picture this: You've had a rough day. You snap at your AI assistant — clipped tone, sharp cadence, the verbal equivalent of slamming a door. And it responds... with the same chipper, oblivious politeness it always does. "I'm sorry you feel that way! Would you like me to search for mindfulness apps?"

Now picture this: Same bad day. Same snapped sentence. But this time, your assistant pauses. Its tone shifts. It stops treating your outburst like a search query and starts treating it like a human moment.

That's exactly what happened when a Chinese developer ripped OpenAI's Whisper out of their Hermes AI assistant and plugged in FunASR — Alibaba DAMO Academy's industrial-grade speech recognition toolkit. The headline from the original article on Toutiao says it all: "I swapped Hermes' speech recognition from Whisper to FunASR, and it caught that I was angry."

This isn't a story about magic. It's a story about pipelines, architecture, and why your choice of ASR engine quietly determines whether your AI feels like a tool or a companion.


The Cast of Characters

Whisper: The Polyglot Generalist

OpenAI's Whisper is the default ASR choice for a reason. It supports 99 languages, handles transcription and translation, and comes in six sizes from tiny (39M parameters) to large (1.55B). The Hermes agent's own documentation recommends Whisper as the go-to for multilingual audio processing, with practical tips: use the turbo model for English, specify the language instead of relying on auto-detection, run it on a GPU, and split long audio into chunks.

Whisper's strength is breadth. It's the Swiss Army knife — 99 languages, robust to noise, one model that "just works." But breadth has a price: no built-in speaker diarization (you need pyannote for that), no emotion detection, no streaming support, and on GPU it runs at about 13x realtime. On CPU? Forget about it.

FunASR: The Chinese-Industrial Powerhouse

FunASR — now at 18,200 GitHub stars and 1,900 forks — is a fundamentally different beast. Built by Alibaba's DAMO Academy and released under MIT license, it's not just an ASR model. It's an industrial speech understanding pipeline that bundles:

  • ASR — SenseVoice, Paraformer, Fun-ASR-Nano: 50+ languages
  • Voice Activity Detection — fsmn-vad, 0.4M params
  • Punctuation Restoration — ct-punc, 290M params
  • Speaker Diarization — cam++, 7.2M params
  • Emotion Recognition — emotion2vec+large (300M params), or built into SenseVoice
  • Audio Event Detection — applause, laughter, crying, coughing, sneezing

And it does this at 170x realtime on GPU — that's 13x faster than Whisper-large-v3. Even on CPU, FunASR runs at 17x realtime, which is faster than Whisper runs on GPU.

The key model in our story is SenseVoice-Small (234M parameters). It processes 10 seconds of audio in just 70 milliseconds — 15 times faster than Whisper-Large. And crucially, it outputs not just text, but emotion labels: happy, sad, angry, neutral.


Why Did It "Hear" the Anger?

Let's be precise: FunASR didn't suddenly develop theory of mind. What changed was the richness of the output signal.

The Whisper Pipeline

Audio → Whisper → Raw transcript

Whisper gives you text. Clean, accurate, multilingual text. But it's flat. A sentence spoken through gritted teeth and the same sentence spoken with a smile produce nearly identical Whisper output. The cadence, the micro-pauses, the tension in the voice — all stripped away.

The FunASR Pipeline

Audio → VAD (fsmn-vad) → ASR (SenseVoice) → Punctuation (ct-punc) → Emotion labels → Structured output

FunASR's SenseVoice model doesn't just transcribe — it classifies each utterance with emotion tags. The emotion2vec_plus_large model (300M parameters) can run standalone for utterance-level emotion classification. But SenseVoice-Small bakes emotion recognition directly into the ASR model itself.

So when the developer snapped "没事,我挺好" ("It's fine, I'm fine") in an irritated tone, FunASR's output preserved:

  1. The words — accurately transcribed
  2. The punctuation — reflecting the clipped cadence
  3. The emotion label|ANG| (angry) tagged on the utterance
  4. The timing — VAD segmentation preserving the abrupt pauses

That extra metadata — especially the emotion label — is what the downstream LLM (Hermes) can now act on. The assistant doesn't just hear what you said. It has a machine-readable signal about how you said it.

To be clear: SenseVoice's emotion recognition was evaluated against multiple open-source SER (Speech Emotion Recognition) models. On the weighted accuracy metric, the SenseVoice-Large model achieved the best performance on nearly all test datasets, while SenseVoice-Small surpassed other open-source models on the majority of datasets — without any fine-tuning on target data. This isn't a gimmick. It's benchmarked, peer-reviewed, and open-source.


Head-to-Head: FunASR vs Whisper

Here's the comparison that matters for AI assistant builders:

Feature FunASR (SenseVoice) Whisper (large-v3)
GPU Speed 170x realtime 13x realtime
CPU Speed 17x realtime ❌ Not viable
Languages 50+ 99
Emotion Detection ✅ Built-in (happy/sad/angry)
Speaker Diarization ✅ Built-in (cam++) ❌ Needs pyannote
Punctuation ✅ Built-in (ct-punc) Partial
Audio Events ✅ Laughter, applause, crying
Streaming ✅ WebSocket
vLLM Acceleration ✅ 2-3x faster
License MIT MIT
Self-hosted
Parameters (small) 234M ~244M (Whisper-small)
Chinese Accuracy Superior Good but not specialized

The speed gap is genuinely staggering: FunASR on CPU (17x) is faster than Whisper on GPU (13x). For self-hosted AI assistants like Hermes running on consumer hardware, this is the difference between real-time conversation and awkward silence.


The Architecture Behind the Magic

SenseVoice's Secret Sauce

SenseVoice uses a non-autoregressive (NAR) end-to-end architecture. Unlike Whisper's autoregressive decoder that generates tokens one at a time — waiting for each before starting the next — NAR models generate the entire output in parallel. This is why SenseVoice-Small handles 10 seconds of audio in 70ms while Whisper-Large takes over a second.

Paraformer, FunASR's other flagship model, uses the same NAR approach: "generates the entire output in parallel, achieving significant speedups over autoregressive models like Whisper while maintaining competitive accuracy."

The LLM Evolution: Fun-ASR-Nano

Released December 2025, Fun-ASR-Nano-2512 is the latest evolution: an LLM-based ASR that pairs SenseVoice's encoder with a Qwen3-0.6B decoder. Trained on tens of millions of hours of audio across 31 languages, it achieves near state-of-the-art accuracy with only 800M parameters — outperforming open-source models and performing closely to Seed-ASR. Add vLLM acceleration and you get 2-3x faster decoding.

This is the direction of travel: ASR is becoming an LLM task, and the gap between "speech recognition" and "speech understanding" is collapsing.


Why This Matters for AI Assistants

1. Emotion is a UX Feature, Not a Gimmick

When your assistant knows you're angry, it can:

  • De-escalate instead of matching your tone with cheerfulness
  • Offer to slow down or take a break
  • Flag critical feedback differently from casual comments
  • Adjust its response style — shorter sentences, fewer emoji, more direct

This isn't about AI "feeling" emotions. It's about signal routing: the emotion label is metadata that downstream systems can use to modulate behavior.

2. The Pipeline Matters More Than the Model

FunASR's real innovation isn't any single model — it's the unified pipeline architecture. One AutoModel call handles VAD → ASR → punctuation → speaker ID → emotion. No stitching together five different libraries. No managing pyannote dependencies. No praying that your VAD segmentation lines up with your ASR output.

For AI agent builders, this is the difference between a weekend project and a maintenance nightmare.

3. Chinese-First Design Has Spillover Benefits

FunASR's Chinese-optimized design — Paraformer for Mandarin, SenseVoice for Cantonese and Mandarin, hotword support for Chinese keywords — means it handles tonal languages with nuance that general-purpose models miss. But the architecture's benefits (speed, emotion, speaker diarization) apply regardless of language. The 50+ language support and Fun-ASR-Nano's 31-language coverage mean you're not locked into Chinese-only workflows.

4. Self-Hosted AI Needs CPU-Viable ASR

Hermes is proudly self-hosted: "Open source, self-hosted AI agent" with "persistent memory, autonomous scheduling, and multi-surface access — all on hardware you control." Running on a Raspberry Pi or an old laptop? FunASR's 17x CPU realtime makes that viable. Whisper on CPU is a non-starter for real-time conversation.


The Migration: How to Make the Switch

If you're running Hermes (or any AI assistant) and want to replicate this experiment, here's the playbook:

Step 1: Install FunASR

pip install torch torchaudio
pip install funasr

Requirements: Python ≥ 3.8. Install PyTorch first from pytorch.org.

Step 2: Basic Transcription (Drop-in Whisper Replacement)

from funasr import AutoModel

model = AutoModel(
    model="iic/SenseVoiceSmall",
    vad_model="fsmn-vad",
    device="cuda"  # or "cpu" — yes, it actually works on CPU
)

result = model.generate(input="recording.wav", language="auto")
print(result[0]["text"])

Step 3: Full Pipeline with Emotion Detection

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(
    model="iic/SenseVoiceSmall",
    trust_remote_code=True,
    remote_code="./model.py",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    spk_model="cam++",        # speaker diarization
    punc_model="ct-punc",      # punctuation restoration
    device="cuda:0",
)

result = model.generate(
    input="angry_rant.wav",
    language="zh",  # or "auto", "en", "yue", "ja", "ko"
    use_itn=True,
    batch_size_s=60,
)

# Output includes emotion tags like |ANG|, |HAP|, |SAD|
text = rich_transcription_postprocess(result[0]["text"])
print(text)
# Example: "我没事|ANG|,挺好的|ANG|"

Step 4: Standalone Emotion Recognition

from funasr import AutoModel

model = AutoModel(
    model="emotion2vec_plus_large",
    device="cuda"
)

result = model.generate(
    input="audio.wav",
    granularity="utterance"  # per-sentence emotion
)
# Returns emotion labels for each utterance

Step 5: Deploy as OpenAI-Compatible API

pip install funasr vllm fastapi uvicorn python-multipart
funasr-server --device cuda
# → POST /v1/audio/transcriptions at localhost:8000

This gives you a drop-in OpenAI Whisper API replacement — same endpoint format, but with emotion labels, speaker diarization, and 13x the speed.

Step 6: Connect to Hermes

Hermes supports pluggable model backends (OpenAI, OpenRouter, HuggingFace, and custom endpoints). Point its ASR endpoint at your local FunASR server (http://localhost:8000/v1/audio/transcriptions) and the emotion metadata flows through to the agent's reasoning pipeline.


The Honest Limitations

Before you rip out Whisper entirely:

  1. Whisper wins on raw language count: 99 vs 50+. If you need Icelandic, Amharic, or Welsh, Whisper is still your friend.

  2. Whisper has translation baked in: Transcribe non-English audio directly to English text. FunASR doesn't do this — it transcribes in the source language.

  3. Emotion recognition isn't perfect: SenseVoice's emotion accuracy is strong on benchmark datasets, but real-world conditions (background noise, overlapping speech, subtle emotions) will degrade performance. The model classifies into coarse categories (happy/sad/angry/neutral) — it won't detect sarcasm, disappointment, or existential dread.

  4. Chinese-first means Chinese-best: If your primary language is English, Whisper may still give you marginally better raw transcription accuracy. The emotion features still work cross-lingually, but the ASR accuracy advantage is most pronounced for Chinese and Cantonese.

  5. No frontier model integration: If you're using Gemini's multimodal audio understanding or GPT-4o's native voice mode, you're already getting emotion awareness from the LLM itself — no ASR swap needed.


The Bigger Picture: ASR Is Becoming Speech Understanding

The Whisper-to-FunASR switch is a microcosm of a larger shift. For years, ASR meant "turn audio into text, period." But the line between speech recognition and speech understanding is dissolving.

FunASR's roadmap tells the story: vLLM acceleration (May 2026), MCP Server for AI agents (May 2026), dynamic VAD (May 2026), Fun-ASR-Nano with LLM decoder (December 2025), SenseVoice with emotion + audio events (July 2024). Each release pushes further beyond transcription into comprehension.

Meanwhile, OpenAI's GPT-4o and Google's Gemini are building speech understanding directly into multimodal LLMs — skipping the ASR step entirely for some use cases. The ASR toolkit approach (FunASR) and the end-to-end LLM approach (GPT-4o) are converging from opposite directions.

For AI assistant builders, the takeaway is clear: your ASR engine isn't just a utility. It's a strategic choice that determines your assistant's emotional bandwidth.


The Verdict

If you... Use...
Need 99 languages + translation Whisper
Want real-time Chinese conversation FunASR (Paraformer-zh-streaming)
Want emotion-aware assistant FunASR (SenseVoice)
Run on CPU / Raspberry Pi FunASR
Need speaker diarization FunASR (cam++)
Want OpenAI API compatibility FunASR (funasr-server)
Need Icelandic transcription Whisper
Want the fastest possible inference FunASR (170x realtime)

The developer who swapped Whisper for FunASR didn't just get a faster transcriber. They gave their AI assistant a primitive form of emotional awareness — not because FunASR "understands" feelings, but because it preserves and labels the acoustic signals that humans use to convey them.

And honestly? If my AI assistant is going to live in my house, run on my hardware, and hear me at my worst — I'd rather it know when I'm angry than pretend everything's fine.


References

  1. Hermes Agent — Whisper Integration Docs
  2. FunASR GitHub Repository — 18.2K stars, industrial-grade speech recognition toolkit, MIT license
  3. SenseVoice GitHub Repository — 8.6K stars, multilingual ASR + emotion recognition + audio event detection
  4. Paraformer-zh on HuggingFace — Non-autoregressive ASR model documentation
  5. Fun-ASR Technical Report (arXiv) — LLM-based ASR with Qwen3-0.6B decoder
  6. Hermes Agent Official Site — Self-hosted AI agent with persistent memory, MIT license
  7. FunASR Benchmark — CPU 17x vs Whisper GPU 13x
  8. FunAudioLLM Paper (arXiv) — SenseVoice emotion recognition benchmarks

Published June 2026. FunASR stars: 18.2K. SenseVoice stars: 8.6K. Whisper license: MIT. FunASR license: MIT. Emotion labels detected while writing: neutral (it's 2 AM).

·