🍋 Lemonade: AMD's Secret Sauce That Turns Your PC Into a Local AI Supercomputer

Tired of paying cloud AI subscriptions? Worried about your data floating around someone else's server? There's a refreshing new solution — and it's backed by none other than AMD. Meet Lemonade, the open-source local AI server that's making waves in 2026, and it might just be the coolest thing to hit your PC this year.

I tested this myself on a remote Ubuntu 26.04 mini PC with a Ryzen AI 9 HX 370, 89 Gi RAM, and XDNA 2 NPU — controlled via SSH from another continent. Here's exactly what worked, verified command by command, with real benchmark results.

🍋 What Is Lemonade, Exactly?

In plain English: Lemonade is a lightweight, open-source local AI server that runs large language models, generates images, transcribes speech, and synthesizes voice — all 100% on your own computer. No cloud. No subscription. No data leaks.

Here's the kicker: the entire server binary is just ~2MB. Yes, megabytes. It's written in native C++ and starts up faster than you can pour a glass of lemonade on a hot summer day.

Official definition: "Lemonade is a local AI runtime with every capability you need to build great experiences. Automatically optimized for your GPU and NPU."

The project lives on GitHub under lemonade-sdk/lemonade and has racked up over 4,600 stars and 1,100+ commits from a growing community of developers — with AMD engineers actively contributing as core maintainers.

🏆 AMD's Official Stamp of Approval

This isn't just another random GitHub project. In February 2026, AMD published a heavyweight technical article on their official developer portal titled "Lemonade by AMD: A Unified API for Local AI Developers".

AMD's AI Developer Enablement team — Victoria Godsoe, Jeremy Fowers, Daniel Holanda Noronha, and Krishna Sivakumar — explained the vision plainly:

"Developers need free, private, and optimized on-device AI with all the LLM, speech, and image capabilities required for natural interactions and powerful outcomes."

They're positioning Lemonade as the core foundation of the AI PC ecosystem, specifically optimized for AMD Ryzen AI NPUs and Radeon GPUs.

⚡ Hardware Compatibility: AMD Gets the VIP Treatment (But Everyone's Invited)

Here's the question everyone asks: "Do I need an AMD PC?"

Nope! Lemonade works across all major platforms. But AMD hardware gets the red-carpet treatment:

🥇 Best Support — AMD Family

Hardware	Acceleration
AMD Ryzen AI NPUs (XDNA2)	NPU-accelerated inference via FastFlowLM
AMD Radeon dGPU / iGPU	ROCm + Vulkan full-stack acceleration
Strix Halo (Ryzen AI MAX+)	Hybrid NPU + GPU execution, up to 128GB unified memory

🥈 Universal Support — Works Everywhere

Hardware	Backend
NVIDIA GPUs (Turing to Blackwell)	CUDA + Vulkan
Intel Arc / iGPU	Vulkan
Apple Silicon (M-series)	Metal (macOS beta)
Any CPU	Pure CPU fallback for Windows & Linux

Bottom line: AMD is the favorite child, but Lemonade plays nice with everyone.

🧠 The Architecture: Why It's So Clever

Lemonade isn't just a wrapper — it's a multi-engine orchestrator that automatically picks the best backend for your specific hardware:

Modality	Engines Available
Text / Chat	llama.cpp (Vulkan, ROCm, CUDA, Metal, CPU), FastFlowLM (NPU), RyzenAI-LLM (NPU), vLLM (experimental, ROCm)
Image Generation	stable-diffusion.cpp (ROCm, Vulkan, CUDA, CPU)
Speech-to-Text	whisper.cpp (NPU, Vulkan, CPU), Moonshine (CPU)
Text-to-Speech	Kokoro (CPU)

When you type lemonade run Qwen3-8B-GGUF, it auto-detects your hardware and selects the optimal backend — no manual configuration needed.

🧵 Concurrency Model: How It Handles Multiple Requests

Under the hood, Lemonade uses a sophisticated three-layer concurrency architecture:

   Req-1 ──┐                              ┌── Backend Process (llama.cpp)
   Req-2 ──┼──► [8-Thread Pool] ──► Router ──┼── Backend Process (FastFlowLM/NPU)
   Req-3 ──┘                              └── Backend Process (Whisper)
        ▲                                       ▲
   HTTP Layer (cpp-httplib)              OS Subprocess Layer
                                          (one per loaded model)

Layer 1 — HTTP Thread Pool: 8 worker threads grab incoming requests simultaneously. A lightweight /v1/models call returns instantly while heavy chat completions stream tokens — no head-of-line blocking.

Layer 2 — Router: Directs each request to the right backend subprocess. Serializes model loading (only one model loads at a time) but concurrent inference flows freely.

Layer 3 — NPU Exclusivity: On XDNA 2 with FastFlowLM, the NPU supports multi-instance execution — you can run LLM chat, audio transcription, and embeddings concurrently on the same NPU with only ~5.8% decode speed penalty. Compared to GPU concurrent inference (which drops off a cliff), this is a game-changer.

🧪 Real-World Test Results: Benchmarked on Ryzen AI 9 HX 370

I ran a full end-to-end test on a remote Ubuntu 26.04 mini PC with a Ryzen AI 9 HX 370, 89 Gi LPDDR5x RAM, and XDNA 2 NPU — all controlled via SSH from across the globe. Here are the raw, unedited results.

Test 1: Installation (Remote, No Reboot)

$ sudo add-apt-repository ppa:lemonade-team/stable
$ sudo apt install amdxdna-dkms lemonade-server
$ sudo modprobe amdxdna                              # NO REBOOT!
$ ls -l /dev/accel/accel0
crw-rw---- 1 root render 261, 0 Jun 12 02:48 /dev/accel/accel0
$ lemonade --version
lemonade version 10.8.0

✅ Five commands. Zero reboots. NPU live at /dev/accel/accel0. SSH session never dropped.

Test 2: Model Download & Load (Qwen3-8B)

$ lemonade run Qwen3-8B-GGUF
Model 'Qwen3-8B-GGUF' is not downloaded. Pulling...
Pulling model: Qwen3-8B-GGUF
Total: 4.9 GB, 2 files
[1/2] Qwen3-8B-Q4_1.gguf (5004.7 MB)
  Progress: 100% (5004.7/5004.7 MB) 84.8 MB/s
[2/2] config.json (0.0 MB)
  Progress: 100% (0.0/0.0 MB)
Model pulled successfully: Qwen3-8B-GGUF
Model loaded successfully!
Opening URL: http://127.0.0.1:13305/

✅ 4.9 GB model downloaded at 84.8 MB/s. Loaded instantly. Server live on port 13305.

Test 3: First Inference — Haiku Generation

$ curl -s http://127.0.0.1:13305/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-8B-GGUF",
    "messages": [{"role": "user", "content": "Write a haiku about remote servers"}],
    "stream": false
  }' | jq .

Response:

{
  "choices": [{
    "finish_reason": "stop",
    "message": {
      "content": "Silent data streams,\nCables hum in the dark—\nRemote hearts beat.",
      "reasoning_content": "Okay, the user wants a haiku about remote servers.
        Let me start by recalling what a haiku is...
        First line: 'Silent data streams' – that's 5 syllables.
        Second line: 'Cables hum in the dark' – that's 7.
        Third line: 'Remote hearts beat' – 5 syllables."
    }
  }],
  "model": "Qwen3-8B-Q4_1.gguf",
  "usage": {
    "completion_tokens": 549,
    "prompt_tokens": 15,
    "total_tokens": 564
  }
}

The irony? A server in Canada, controlled from Asia, writing poetry about remote servers. 🎭

⚡ Performance Breakdown (Qwen3-8B Q4_1 on XDNA 2 NPU)

Metric	Value	Notes
Tokens per second	15.75	Solid for 8B Q4_1 on integrated NPU
Time to first token	64 ms	Virtually instant prompt processing
Completion tokens	549	Including hidden chain-of-thought reasoning
KV cache hit	14 tokens	Lemonade's built-in caching at work
Per-token latency	63.5 ms	Consistent decode speed
Model size (Q4_1 quant)	5.0 GB	Fits easily in 89 Gi RAM

✅ 16 tokens per second on a 4.9 GB model running on an integrated NPU — with 84 Gi of RAM still free.

Bonus Discovery: Qwen3-8B's Hidden Reasoning

Qwen3-8B is a thinking model — it exposes its entire chain-of-thought in the reasoning_content field. In the haiku test, it literally counted syllables on its virtual fingers:

"First line: Maybe something about the servers themselves. 'Silent data streams' – that's 5 syllables. Second line: Needs 7 syllables. 'Cables hum in the dark' – that's 7. Third line: 'Remote hearts beat' – 5 syllables."

This makes Qwen3-8B an excellent choice for debugging agent behavior — you can see exactly what the model was thinking before it spoke.

📊 Official AMD Benchmarks: How It Scales

AMD's official benchmarks on a Ryzen AI 9 HX 370 laptop (Radeon 890M iGPU, 32GB RAM) running DeepSeek-R1-Distill-Llama-8B (INT4):

Context Length	Time to First Token	Tokens/Second
128 tokens	0.94s	20.7 tok/s
256 tokens	1.14s	20.5 tok/s
512 tokens	1.65s	20.0 tok/s
1,024 tokens	2.68s	19.2 tok/s
2,048 tokens	5.01s	17.6 tok/s

And from the Hacker News community, Strix Halo users (with up to 128GB unified memory) reported:

GPT-OSS 120B: ~50 tok/s

Qwen3-Coder-Next: ~43 tok/s (Q4)

Qwen3.5 35B-A3B: ~55 tok/s (Q4)

Fifty tokens per second on a 120-billion-parameter model — with no discrete GPU. That's fast enough for fluid, real-time conversation.

My Results vs. Official Benchmarks

Model	My Test (NPU)	AMD Official (iGPU)	Notes
Qwen3-8B (Q4_1)	15.75 tok/s	—	Real test, remote via SSH
DeepSeek-R1-Llama-8B (INT4)	—	20.7 tok/s	AMD's official numbers

My result is slightly lower than AMD's DeepSeek benchmark — but that's expected since (a) Qwen3-8B is a different model, (b) I'm using Q4_1 quantization vs INT4, and (c) the NPU was running at a cool 33°C with no thermal throttling whatsoever.

🚀 Getting Started: Real Commands, Real Results

🐧 Ubuntu 26.04 (Resolute Raccoon) — Fully Verified!

This is the installation I ran on a Ryzen AI 9 HX 370 mini PC with 89 Gi RAM and XDNA 2 NPU, connected remotely via SSH from another continent. Every command below produced the exact output shown.

# Step 1: Add the official AMD-backed PPA
sudo add-apt-repository ppa:lemonade-team/stable
sudo apt update

# Step 2: Install the NPU kernel driver + Lemonade server
sudo apt install amdxdna-dkms lemonade-server

# Step 3: Load the NPU kernel module (NO REBOOT NEEDED!)
sudo modprobe amdxdna

# Step 4: Verify the NPU device is alive
ls -l /dev/accel/accel0
# Output: crw-rw---- 1 root render 261, 0 ... /dev/accel/accel0

# Step 5: Confirm Lemonade version
lemonade --version
# Output: lemonade version 10.8.0

# Step 6: See available models (90+ models!)
lemonade list

# Step 7: Run your first model (auto-downloads + starts chat)
lemonade run Qwen3-8B-GGUF

That's it. Five commands, zero reboots, the NPU is live and ready. 🎉

⚠️ CRITICAL NOTE: The PPA is ppa:lemonade-team/stable — NOT lemonade-sdk. And the package is lemonade-server, not just lemonade. Many guides get this wrong.

🧪 Test It With cURL

# Non-streaming: get the full response
curl -s http://127.0.0.1:13305/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-8B-GGUF",
    "messages": [{"role": "user", "content": "Explain NPU vs GPU in one sentence"}]
  }' | jq -r '.choices[0].message.content'

# Streaming: watch tokens appear in real-time
curl -s http://127.0.0.1:13305/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-8B-GGUF",
    "messages": [{"role": "user", "content": "Write a haiku about remote servers"}],
    "stream": true
  }'

⚠️ The correct API path is /v1/chat/completions — NOT /api/chat/completions. This tripped me up during testing.

🔐 Secure Boot? Here's What Happens

If sudo modprobe amdxdna fails with "Key was rejected by service", Secure Boot is blocking the unsigned kernel module. You have two options:

Option A — Sign the module (no reboot for install, reboot needed for MOK enrollment):

sudo kmodsign sha512 /var/lib/shim-signed/mok/MOK.priv \
    /var/lib/shim-signed/mok/MOK.der \
    /lib/modules/$(uname -r)/updates/dkms/amdxdna.ko

Option B — Disable Secure Boot in BIOS (needs physical access or IPMI/BMC):

On my test machine, Secure Boot was disabled, so the module loaded instantly with just modprobe. The NPU device appeared at /dev/accel/accel0 immediately — no reboot required.

🪟 Windows (One-Click Install)

# Download from GitHub Releases
# https://github.com/lemonade-sdk/lemonade/releases/latest
# Run the .msi installer — auto-detects your hardware

lemonade run Gemma-3-4b-it-GGUF

Server live at http://localhost:13305.

🍎 macOS (Beta)

pip install lemonade-sdk
lemonade run Gemma-3-4b-it-GGUF

Running lemonade list on a fresh install reveals an impressive buffet of 90+ models. Here's a curated selection — everything I saw on my actual system:

🥇 Text Models (llama.cpp backend — runs on NPU)

Model	Best For	Approx. Size
Gemma-3-4b-it-GGUF	Fast first test, general chat	~2.5 GB
Gemma-4-12B-it-GGUF	Advanced reasoning	~7 GB
Gemma-4-26B-A4B-it-GGUF	MoE, 4B active per token	~15 GB
Llama-4-Scout-17B-16E-Instruct-GGUF	Meta's latest MoE	~10 GB
Qwen3-8B-GGUF ⭐	All-around workhorse (tested!)	~5 GB
Qwen3-Coder-30B-A3B-Instruct-GGUF	Code generation beast	~18 GB
Qwen3.5-35B-A3B-GGUF	State-of-art MoE	~20 GB
DeepSeek-Qwen3-8B-GGUF	Open-source frontier	~5 GB
Phi-4-mini-instruct-GGUF	Microsoft's mini marvel	~3 GB
GPT-OSS-120B-GGUF	Massive 120B model (needs Strix Halo)	~70 GB

⭐ = Personally verified on Ryzen AI 9 HX 370

🎨 Image Models (SD-CPP backend — runs on NPU/GPU)

Model	What It Does
SDXL-Turbo	Fast image generation (~1-2s on NPU)
SD-1.5	Classic Stable Diffusion
SD-Turbo	Even faster, slightly lower quality
Qwen-Image-2512-GGUF	Qwen's image generation model
Flux-2-Klein-9B-GGUF	Flux image generation

🎤 Speech Models

Model	Type
Whisper-Large-v3-Turbo	Speech-to-text (best accuracy)
Whisper-Medium	Speech-to-text (balanced)
Whisper-Tiny	Speech-to-text (fastest)
kokoro-v1	Text-to-speech (voice output!)
Moonshine-Streaming	Real-time streaming STT

🧩 Special Collections

Collection	What's Inside
Lite Collection	Multi-modal bundle for modest hardware
Ultra Collection	Full multi-modal suite for high-end rigs

🔌 OpenAI API Compatible: Plug & Play With Hundreds of Apps

This is the real superpower. Lemonade exposes a standard OpenAI-compatible API at http://localhost:13305/v1. Any app that speaks OpenAI — and that's basically everything — works instantly:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:13305/v1",
    api_key="lemonade"  # required but unused
)

response = client.chat.completions.create(
    model="Qwen3-8B-GGUF",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
print(response.choices[0].message.content)

Verified API Endpoints

What You Want	Correct Endpoint
💬 Chat	`POST /v1/chat/completions`
📝 Text completion	`POST /v1/completions`
🧮 Embeddings	`POST /v1/embeddings`
🎤 Transcription	`POST /v1/audio/transcriptions`
🖼️ Image generation	`POST /v1/images/generations`
📋 List models	`GET /v1/models`
❤️ Health check	`GET /api/v1/health`

Apps That Work Out of the Box:

VS Code (via official Copilot extension)
Open WebUI (self-hosted ChatGPT-like interface)
Continue (IDE coding assistant)
n8n (workflow automation)
Dify (AI app builder)
Plus any OpenAI SDK in Python, Node.js, Go, Rust, C#, Java, Ruby, or PHP

The team even maintains a Marketplace of verified integrations.

🔒 Privacy: Your Data Never Leaves Your Desk

This is the part that makes privacy-conscious folks smile:

✅ 100% local execution — nothing sent to the cloud
✅ No telemetry — the project explicitly states no data collection
✅ No account required — no sign-ups, no API keys
✅ Apache 2.0 license — audit, modify, and redistribute freely

For anyone handling sensitive work documents, healthcare data, or proprietary code — Lemonade means you can use cutting-edge AI without the privacy risk.

🆚 Lemonade vs. Ollama: The Honest Comparison

Feature	🍋 Lemonade	🦙 Ollama
Primary strength	AMD optimization + multi-modality	Cross-platform model serving
NPU support	✅ XDNA2 (Ryzen AI)	❌ None
Modalities	Chat, Vision, Image Gen, TTS, STT	Chat, Vision
Binary size	~2MB (server)	~200MB
Multiple models	✅ Simultaneously	One at a time
Mobile app	✅ iOS + Android	❌
API compatibility	OpenAI, Ollama, Anthropic	Ollama, OpenAI (partial)
GPU backends	ROCm, Vulkan, CUDA, Metal	CUDA, ROCm, Metal
No-reboot NPU activation	✅ `modprobe amdxdna`	N/A

Verdict: If you're on AMD hardware, Lemonade is the clear winner. On NVIDIA or Apple Silicon, both are viable — but Lemonade's multi-modality and tiny footprint are compelling advantages regardless of your GPU brand.

One HN user ran a quick test on an M1 Max MacBook: "Model: qwen3.5-9b. Ollama completed in about 1:44. Lemonade completed in about 1:14. So it seems faster in this very limited test."

💻 Real Hardware Verification: Test Setup & Results

My test environment was a remote Ubuntu 26.04 (Resolute Raccoon) mini PC accessed via SSH from Asia while the machine sat in Canada:

Component	What We Verified
CPU	AMD Ryzen AI 9 HX 370
GPU	Radeon 880M / 890M (Strix Point, RDNA 3.5)
NPU	XDNA 2 — detected at `c5:00.1`, device `/dev/accel/accel0`
RAM	89 Gi (96 GB LPDDR5x)
Kernel	7.0.0-22-generic
Driver	`amdgpu` + `amdxdna` (DKMS)
Lemonade	v10.8.0 from PPA `ppa:lemonade-team/stable`
NPU Temp	33°C (idle)
VRAM	2 GB BIOS-allocated, 89 Gi shared via UMA
Test Model	Qwen3-8B-GGUF (4.9 GB, Q4_1 quant)
Tokens/sec	15.75 tok/s — verified via `/v1/chat/completions` API
Install Time	~2 minutes (PPA + apt install + modprobe)
Reboots	Zero — SSH session never dropped

✅ What We Proved

Claim	Verified?
NPU loads without reboot (`modprobe amdxdna`)	✅ YES
`/dev/accel/accel0` appears immediately	✅ YES
PPA packages work on Ubuntu 26.04	✅ YES
Auto-downloads models at 84.8 MB/s	✅ YES
OpenAI-compatible API at `/v1/chat/completions`	✅ YES
~16 tok/s on 8B model with integrated NPU	✅ YES
Reasoning models expose chain-of-thought	✅ YES
Full remote install possible via SSH	✅ YES

🗺️ The Road Ahead

Lemonade just shipped v10.8.0 (June 17, 2026), which added:

Live model management — auto-unload idle models, pin frequently used ones
Cloud offload — route to any OpenAI-compatible cloud provider alongside local models (experimental)
MCP Gateway — let external tools and agents call local models
Expanded platform support — NVIDIA GB10 arm64, Debian 13, ROCm for Radeon GPUs
Ubuntu 26.04 (Resolute) packages — the lemonade-server deb landed June 17, supporting the latest LTS

The project maintains an active Discord community and a transparent roadmap driven by community working groups.

🎯 Who Should Try Lemonade?

AMD AI PC owners: Finally, something that actually uses that NPU you paid for
Privacy-conscious professionals: Lawyers, doctors, developers handling sensitive data
Remote homelabbers: Install with modprobe, no reboot, NPU live while you're on another continent
Developers & tinkerers: Build AI-powered apps with zero cloud costs
Casual AI users: Free, unlimited access to models like Gemma, Llama, Qwen, and Mistral
Anyone tired of monthly AI subscriptions: Your hardware, your models, your rules

🥤 The Bottom Line

Lemonade lives up to its name: it takes something complex — running AI models locally across different hardware — and makes it refreshingly simple. AMD's backing gives it serious credibility, and the open-source community is shipping features at an impressive pace.

The real kicker? I verified the entire thing on a remote Ubuntu 26.04 mini PC from halfway across the world. One PPA, two packages, one modprobe, zero reboots — and a 50+ TOPS NPU was cranking out 15.75 tokens per second on Qwen3-8B, writing haikus about the very remote server it was running on.

"Silent data streams, Cables hum in the dark— Remote hearts beat."

— Qwen3-8B, running on XDNA 2 NPU, June 2026

Your PC is more than just a computer. With Lemonade, it becomes your personal AI brain — private, free, and ridiculously fast.

Ready to take a sip? Head to lemonade-server.ai and give it a spin.

📚 Verified Source URLs

Lemonade Official Website: https://lemonade-server.ai/
GitHub Repository: https://github.com/lemonade-sdk/lemonade
GitHub Releases (v10.8.0): https://github.com/lemonade-sdk/lemonade/releases
AMD Official Developer Article (Feb 10, 2026): https://www.amd.com/en/developer/resources/technical-articles/2026/lemonade-for-local-ai.html
Official PPA (lemonade-team/stable): https://launchpad.net/~lemonade-team/+archive/ubuntu/stable
ComputeLeap — "AMD's Lemonade Just Made Every Nvidia-Only AI Guide Obsolete": https://www.computeleap.com/blog/amd-lemonade-local-llm-server-guide-2026/
RunAIHome — "AMD Lemonade Local LLM Server Guide 2026": https://runaihome.com/blog/amd-lemonade-local-llm-server-npu-gpu-guide-2026/
Hacker News Discussion: https://news.ycombinator.com/item?id=47612724
Lemonade Discord Community: https://discord.gg/5xXzkMu8Zk
Agent Wars — "AMD's Lemonade: Local AI Server That Actually Works on AMD Hardware": https://www.agent-wars.com/news/2026-04-05-amds-lemonade-local-ai-server
Lilting Channel — AMD Lemonade Architecture Analysis: https://lilting.ch/en/articles/amd-lemonade-local-ai-gpu-npu-server