NetEase Youdao Just Dropped Confucius4-TTS: 3-Second Voice Cloning Across 14 Languages, Apache 2.0, No Strings Attached

Three seconds. That's all it takes. Record a three-second clip of someone speaking in any of the 14 supported languages, and Confucius4-TTS will clone that voice and have it speak fluently in all the others. No reference transcript needed. No fine-tuning. And the entire thing is open-source under Apache 2.0, which permits commercial use as long as you keep the license and attribution notices.

NetEase Youdao released Confucius4-TTS on June 23, 2026 under their "Ziyue 4.0" (子曰4.0) initiative, and the open-source TTS landscape just got a serious shake-up.

[Image: sound wave visualization with 14 languages flowing together]

What Makes Confucius4-TTS Different?

Let's cut through the hype. The TTS space is crowded — Fish-Speech, GPT-SoVITS, CosyVoice, ElevenLabs, you name it. So what does Confucius4-TTS actually bring to the table that we haven't seen before?

1. No Reference Text Required (This Is the Big One)

Most competing voice cloning models, including CosyVoice, Fish-Speech, OmniVoice, and VoxCPM2, require a transcript of what the reference speaker is saying. Confucius4-TTS doesn't. You feed it a 3-second WAV file and it figures out the rest. The project calls this "Unconstrained Voice Cloning," and it's not just marketing: the benchmark tables on GitHub back it up. Every competitor marked with a "†" requires reference text; Confucius4-TTS sits alone without that dagger across multiple benchmarks.

2. Cross-Lingual Without the Accent

If you've ever used a multilingual TTS system, you know the pain: a voice cloned from a Chinese speaker suddenly carries a heavy Chinese accent into Japanese, or an English voice drags its accent into Korean. Confucius4-TTS explicitly tackles this. Across the 14 supported languages (Chinese, English, Japanese, Korean, German, French, Spanish, Indonesian, Italian, Thai, Portuguese, Russian, Malay, and Vietnamese), the model is designed to keep the voice character consistent without the typical "accent bleed." More languages are promised soon.

3. Emotion Transfer Built In

This isn't just about tone matching. The model extracts emotional features from the reference audio — intonation, prosody, rhythm — and carries them across languages. A happy-sounding Chinese speaker will sound happy in German too. An angry English clip produces angry-sounding Korean. This is a subtle but powerful feature that most open-source TTS models either ignore or handle poorly.

[Image: developer setup with voice cloning code]

Architecture Deep Dive

Under the hood, Confucius4-TTS packs a 1.3B parameter model using a "speech encoder + LLM" architecture. Here's the pipeline:

Speech Encoder (Wav2Vec2-BERT): Extracts speaker identity and semantic features from the 3-second reference audio
LLM Backbone (GPT-style): Generates semantic token sequences from the target text, conditioned on the extracted speaker embedding
Flow Matching (Semantic2Acoustic): Converts semantic tokens into mel spectrograms using a flow-matching generator
BigVGAN Vocoder: Final waveform synthesis from mel spectrograms
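
The four stages chain together as a plain data flow. The sketch below is a structural illustration only: the class, the type aliases, and the toy stage functions are invented here and are not the project's classes or APIs. The only things taken from the description above are the order of the stages and what each one consumes and produces.

```python
from dataclasses import dataclass
from typing import Callable, List

Audio = List[float]        # waveform samples
Embedding = List[float]    # speaker features extracted from the reference clip
Tokens = List[int]         # semantic token sequence
Mel = List[List[float]]    # mel-spectrogram frames


@dataclass
class TTSPipeline:
    """Hypothetical wiring of the four stages; every callable is a stand-in."""
    speech_encoder: Callable[[Audio], Embedding]
    llm: Callable[[str, Embedding], Tokens]
    flow_matching: Callable[[Tokens], Mel]
    vocoder: Callable[[Mel], Audio]

    def synthesize(self, reference: Audio, text: str) -> Audio:
        speaker = self.speech_encoder(reference)  # 1. speaker features from the clip
        tokens = self.llm(text, speaker)          # 2. semantic tokens, speaker-conditioned
        mel = self.flow_matching(tokens)          # 3. semantic tokens -> mel spectrogram
        return self.vocoder(mel)                  # 4. mel spectrogram -> waveform


# Toy stages so the sketch runs end to end; real models would replace each one.
toy = TTSPipeline(
    speech_encoder=lambda audio: [sum(audio) / max(len(audio), 1)],
    llm=lambda text, spk: [ord(ch) % 256 for ch in text],
    flow_matching=lambda tokens: [[t / 255.0] * 4 for t in tokens],
    vocoder=lambda mel: [frame[0] for frame in mel],
)

waveform = toy.synthesize(reference=[0.0, 0.1, -0.1], text="hello")
print(len(waveform))  # one toy sample per input character
```

Keeping each stage behind a simple callable mirrors the way the description separates concerns: the reference clip is only touched by the encoder, and the text only enters at the LLM stage.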

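The split matters. The GPT-style LLM only has to produce compact semantic tokens, while the acoustic detail is filled in by flow matching, a technique also used in image generators such as Stable Diffusion 3. Because a flow-matching sampler integrates an ODE over a configurable number of steps, it offers a direct trade-off between generation quality and speed.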
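
To make the flow-matching step less abstract, here is a toy sampler. It assumes the common straight-path formulation, in which a sample travels from noise at t=0 to data at t=1 by following a velocity field, integrated here with Euler steps. In a real model a neural network predicts that velocity from the semantic tokens; `toy_velocity` below is a hand-written stand-in that already knows the target, so this illustrates the sampling loop only, not Confucius4-TTS's actual Semantic2Acoustic code.

```python
import random
from typing import List


def toy_velocity(x: List[float], t: float, target: List[float]) -> List[float]:
    """Stand-in for the learned network: straight-path velocity toward `target`.

    For the path x_t = (1 - t) * noise + t * target, the velocity that carries
    the current point x_t to the target is (target - x_t) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(target: List[float], steps: int = 8, seed: int = 0) -> List[float]:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise at t=0
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt                              # t stays strictly below 1
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


mel_frame = [0.5, -1.0, 2.0]                    # pretend this is one mel frame
sample = euler_sample(mel_frame)
print(max(abs(a - b) for a, b in zip(sample, mel_frame)))  # distance to the target
```

With a learned velocity field the endpoint is not known in advance, and the number of steps becomes a knob: fewer steps run faster, more steps follow the trajectory more faithfully.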

The 54GB complete resource package includes everything: T2S model weights, S2A model, tokenizer, speaker encoder checkpoints, and configuration files. You can run it locally, offline, forever — no cloud API calls needed once deployed.

Requirements: Python 3.10, CUDA 12.6. A GPU with decent VRAM is recommended (the model is 1.3B parameters after all).
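
A back-of-the-envelope figure helps put "decent VRAM" into numbers. The arithmetic below covers the weights only, assuming the quoted 1.3B parameter count and the standard bytes-per-parameter for each numeric format. Activations, the LLM's cache, and any stage not counted in the 1.3B figure come on top, and which precision the released checkpoints use is not stated here.

```python
PARAMS = 1.3e9  # parameter count quoted for the model

# Standard storage cost per parameter for common numeric formats.
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1}


def weight_gb(params: float, bytes_per_param: int) -> float:
    """Size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


for fmt, nbytes in BYTES_PER_PARAM.items():
    print(f"{fmt:>9}: {weight_gb(PARAMS, nbytes):.1f} GB")
```

That works out to about 5.2 GB at 32-bit and 2.6 GB at 16-bit, far less than the 54GB package, so most of that download is presumably something other than a single copy of the 1.3B weights.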

Benchmark Performance: Strong but Not Untouchable

The GitHub README contains extensive benchmark results across four evaluation suites. Here's the honest picture:

Benchmark	Confucius4-TTS result (WER, lower is better)	Closest competitor
CV3-eval (en→zh)	WER 6.71; beats F5-TTS, Spark-TTS, CosyVoice2	CosyVoice3+DiffRO edges ahead (5.16)
X-Voice (de→zh)	WER 2.86; ahead of X-Voice (3.07)	OmniVoice wins on SIM (0.691 vs 0.569)
Seed-TTS-eval (English)	WER 1.49; competitive	Qwen3-TTS slightly better (1.24)
MiniMax (German)	WER 0.47; beats ElevenLabs (0.57)	FishAudio S2 close at 0.55

The pattern is clear: Confucius4-TTS is highly competitive but not dominant on word error rate (WER, where lower is better). Its real edges are the no-reference-text workflow and cross-lingual consistency, not raw accuracy numbers. On the MiniMax multilingual test, the model absolutely crushes ElevenLabs on Thai (WER 1.56 vs 73.94) and Vietnamese (1.61 vs 73.42), showing where its true multilingual strength lies.
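
For readers new to the two metrics in the table, here is a minimal sketch of how they are usually defined. WER is the word-level edit distance between a reference transcript and the transcript an ASR system produces from the synthesized audio, divided by the reference length; benchmarks usually report it as a percentage. SIM is typically the cosine similarity between speaker embeddings of the reference and the generated audio. The function names are illustrative, and which ASR and embedding models each benchmark uses is not covered here.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion (reference word missing)
                curr[j - 1] + 1,         # insertion (extra hypothesis word)
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            )
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER = {100 * wer:.2f}%")
```

Read this way, a WER of 73.94 means the recognized transcript differs from the target text in roughly three out of every four words, which is why the Thai and Vietnamese gaps stand out.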

[Image: benchmark visualization for TTS models]

The Competitive Landscape

Here's where Confucius4-TTS fits in the 2026 open-source TTS ecosystem:

Fish-Speech: 50+ languages, 10M+ hours training data, dual autoregressive architecture — wider coverage but requires reference text
GPT-SoVITS: 45K+ GitHub stars, 5-second zero-shot cloning, strong community — but fewer languages
CosyVoice 3 (Alibaba): 1.5B parameters with DiffRO, excellent benchmarks — but needs 10-20 seconds of audio and reference text
Chatterbox: 23 languages, 5-10 second reference — solid but unremarkable
Confucius4-TTS: 14 languages, 3-second reference, no transcript needed, Apache 2.0 — the "friction-free" option

The 3-second requirement is genuinely industry-leading. Most competitors need 5-20 seconds of clean audio. And the fact that you don't need to provide what the speaker is actually saying? That's a workflow game-changer.

The Catch(es)

Let's be real — nothing is perfect:

54GB is heavy. For a 1.3B model, that's a lot of disk. This isn't running on your Raspberry Pi. You'll want a proper GPU setup.

85% similarity is self-reported. The "85% similarity, 97% accuracy" numbers come from NetEase's internal testing. Third-party independent evaluation is still thin. Early community testers on X/Twitter report "natural and fluent" results but note that "100% reproduction of nuanced timbre isn't achievable yet."

14 languages isn't 50. Fish-Speech covers 50+ languages; Confucius4-TTS is at 14 with "more coming soon." If you need niche language support today, you may need to look elsewhere.

448 GitHub stars as of late June 2026. For context, GPT-SoVITS has 45K+. The community is still nascent. That means fewer third-party tutorials, forks, and integrations — for now.

Quick Start

Want to try it right now? Here's the short path, not counting the 54GB resource download:

# 1. Clone
git clone https://github.com/netease-youdao/Confucius4-TTS.git
cd Confucius4-TTS

# 2. Environment
conda create -n confuciustts python=3.10 -y
conda activate confuciustts
pip install -r requirements.txt

# 3. Clone a voice (3-second WAV → synthesized speech)
python example.py \
  --prompt_wav path/to/reference.wav \
  --text "Hello, this is a test of zero-shot voice cloning." \
  --lang en \
  --out output.wav \
  --config config/inference_config.yaml

Or try the online demo without installing anything.
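
If you go the local route and want the same voice in several languages, the command-line call from the quick start can be wrapped in a small script. The sketch below reuses only the flags shown in that command. Everything else is an assumption made for illustration: the helper names, the `output_<lang>.wav` naming, treating 3 seconds as a minimum clip length, and any `--lang` value other than `en`.

```python
import wave
from typing import List, Sequence, Tuple


def wav_duration_seconds(path: str) -> float:
    """Length of a PCM WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def build_command(prompt_wav: str, text: str, lang: str, out: str,
                  config: str = "config/inference_config.yaml") -> List[str]:
    """Argument list mirroring the quick-start call to example.py."""
    return [
        "python", "example.py",
        "--prompt_wav", prompt_wav,
        "--text", text,
        "--lang", lang,
        "--out", out,
        "--config", config,
    ]


def plan_dubbing_jobs(prompt_wav: str,
                      lines: Sequence[Tuple[str, str]],
                      min_seconds: float = 3.0) -> List[List[str]]:
    """Check the reference clip, then build one command per (lang, text) pair.

    min_seconds=3.0 is an assumption based on the advertised 3-second clip.
    """
    duration = wav_duration_seconds(prompt_wav)
    if duration < min_seconds:
        raise ValueError(f"reference clip is {duration:.2f}s, expected >= {min_seconds}s")
    # Output naming is a hypothetical convention: output_<lang>.wav
    return [build_command(prompt_wav, text, lang, f"output_{lang}.wav")
            for lang, text in lines]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, for example after calling `plan_dubbing_jobs("path/to/reference.wav", [("en", "Hello, this is a test.")])`. Check the repository for the language codes it actually accepts before adding other targets.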

The Bigger Picture

Confucius4-TTS matters beyond the feature checklist. It's another data point in the accelerating trend of Chinese tech companies going all-in on open-source AI. Following DeepSeek, Alibaba's Qwen, and ByteDance's various releases, NetEase Youdao is betting that Apache 2.0 + commercial freedom + zero-friction cloning is the right formula to win developer mindshare.

For content creators, this means one thing: the barrier to multilingual content creation just dropped again. Dubbing short dramas for international audiences, creating multilingual educational content, building digital humans that speak 14 languages natively — all of this just got cheaper and simpler.

The 3-second, no-transcript workflow is the killer feature. Everything else is table stakes. Whether Confucius4-TTS builds the community it deserves will depend on how well the model generalizes beyond the benchmarks and how aggressively NetEase iterates. For now, it's absolutely worth a weekend project.