Tired of paying cloud AI subscriptions? Worried about your data floating around someone else's server? There's a refreshing new solution, and it's backed by none other than AMD. Meet Lemonade, the open-source local AI server that's making waves in 2026, and it might just be the coolest thing to hit your PC this year.
I tested this myself on a remote Ubuntu 26.04 mini PC with a Ryzen AI 9 HX 370, 89 Gi RAM, and an XDNA 2 NPU, all controlled via SSH from another continent. Here's exactly what worked, verified command by command, with real benchmark results.
In plain English: Lemonade is a lightweight, open-source local AI server that runs large language models, generates images, transcribes speech, and synthesizes voice, all 100% on your own computer. No cloud. No subscription. No data leaks.
Here's the kicker: the entire server binary is just ~2MB. Yes, megabytes. It's written in native C++ and starts up faster than you can pour a glass of lemonade on a hot summer day.
Official definition: "Lemonade is a local AI runtime with every capability you need to build great experiences. Automatically optimized for your GPU and NPU."
The project lives on GitHub under lemonade-sdk/lemonade and has racked up over 4,600 stars and 1,100+ commits from a growing community of developers, with AMD engineers actively contributing as core maintainers.
This isn't just another random GitHub project. In February 2026, AMD published a heavyweight technical article on their official developer portal titled "Lemonade by AMD: A Unified API for Local AI Developers".
AMD's AI Developer Enablement team (Victoria Godsoe, Jeremy Fowers, Daniel Holanda Noronha, and Krishna Sivakumar) explained the vision plainly:
"Developers need free, private, and optimized on-device AI with all the LLM, speech, and image capabilities required for natural interactions and powerful outcomes."
They're positioning Lemonade as the core foundation of the AI PC ecosystem, specifically optimized for AMD Ryzen AI NPUs and Radeon GPUs.
Here's the question everyone asks: "Do I need an AMD PC?"
Nope! Lemonade works across all major platforms. But AMD hardware gets the red-carpet treatment:
| Hardware | Acceleration |
|---|---|
| AMD Ryzen AI NPUs (XDNA2) | NPU-accelerated inference via FastFlowLM |
| AMD Radeon dGPU / iGPU | ROCm + Vulkan full-stack acceleration |
| Strix Halo (Ryzen AI MAX+) | Hybrid NPU + GPU execution, up to 128GB unified memory |

| Hardware | Backend |
|---|---|
| NVIDIA GPUs (Turing to Blackwell) | CUDA + Vulkan |
| Intel Arc / iGPU | Vulkan |
| Apple Silicon (M-series) | Metal (macOS beta) |
| Any CPU | Pure CPU fallback for Windows & Linux |
Bottom line: AMD is the favorite child, but Lemonade plays nice with everyone.
Lemonade isn't just a wrapper. It's a multi-engine orchestrator that automatically picks the best backend for your specific hardware:
| Modality | Engines Available |
|---|---|
| Text / Chat | llama.cpp (Vulkan, ROCm, CUDA, Metal, CPU), FastFlowLM (NPU), RyzenAI-LLM (NPU), vLLM (experimental, ROCm) |
| Image Generation | stable-diffusion.cpp (ROCm, Vulkan, CUDA, CPU) |
| Speech-to-Text | whisper.cpp (NPU, Vulkan, CPU), Moonshine (CPU) |
| Text-to-Speech | Kokoro (CPU) |
When you type `lemonade run Qwen3-8B-GGUF`, Lemonade auto-detects your hardware and selects the optimal backend, with no manual configuration needed.
Under the hood, Lemonade uses a sophisticated three-layer concurrency architecture:
Req-1 ──┐                                 ┌── Backend Process (llama.cpp)
Req-2 ──┼──► [8-Thread Pool] ──► Router ──┼── Backend Process (FastFlowLM/NPU)
Req-3 ──┘                                 └── Backend Process (Whisper)
                    ▲                     ▲
      HTTP Layer (cpp-httplib)   OS Subprocess Layer
                                (one per loaded model)
Layer 1 (HTTP Thread Pool): 8 worker threads pick up incoming requests concurrently. A lightweight `/v1/models` call returns instantly while heavy chat completions stream tokens, so there is no head-of-line blocking.
Layer 2 (Router): Directs each request to the right backend subprocess. Model loading is serialized (only one model loads at a time), but concurrent inference flows freely.
Layer 3 (NPU Exclusivity): On XDNA 2 with FastFlowLM, the NPU supports multi-instance execution. You can run LLM chat, audio transcription, and embeddings concurrently on the same NPU with only a ~5.8% decode speed penalty. Compared to GPU concurrent inference (which drops off a cliff), this is a game-changer.
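To make the first two layers concrete, here is a toy Python model of the pattern described above: a fixed pool of 8 workers, a router that holds a lock while a model loads, and inference that runs outside that lock. It is only an illustration under simplifying assumptions. Lemonade itself is native C++ and launches real OS subprocesses, whereas the `Router` class, the `serve` helper, and the sleep-based "work" here are all made up for the sketch.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Router:
    """Toy router: serializes model loads, lets inference overlap."""

    def __init__(self):
        self._load_lock = threading.Lock()  # only one model loads at a time
        self._loaded = {}                   # model name -> stand-in backend
        self.load_count = 0

    def _ensure_loaded(self, model):
        with self._load_lock:
            if model not in self._loaded:
                time.sleep(0.05)            # pretend that loading is slow
                self._loaded[model] = f"backend:{model}"
                self.load_count += 1
            return self._loaded[model]

    def handle(self, model, prompt):
        backend = self._ensure_loaded(model)
        time.sleep(0.01)                    # "inference", outside the load lock
        return f"{backend} <- {prompt}"


def serve(requests, workers=8):
    """Push (model, prompt) pairs through a fixed-size thread pool."""
    router = Router()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda req: router.handle(*req), requests))
    return router, results


if __name__ == "__main__":
    router, results = serve([("llama", "hi"), ("whisper", "clip"), ("llama", "bye")])
    print(router.load_count, results)
```

Three requests against two model names trigger exactly two loads, while the second `llama` request reuses the backend that is already up.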
I ran a full end-to-end test on a remote Ubuntu 26.04 mini PC with a Ryzen AI 9 HX 370, 89 Gi LPDDR5x RAM, and an XDNA 2 NPU, all controlled via SSH from across the globe. Here are the raw, unedited results.
$ sudo add-apt-repository ppa:lemonade-team/stable
$ sudo apt install amdxdna-dkms lemonade-server
$ sudo modprobe amdxdna # NO REBOOT!
$ ls -l /dev/accel/accel0
crw-rw---- 1 root render 261, 0 Jun 12 02:48 /dev/accel/accel0
$ lemonade --version
lemonade version 10.8.0
✅ Five commands. Zero reboots. NPU live at `/dev/accel/accel0`. SSH session never dropped.
$ lemonade run Qwen3-8B-GGUF
Model 'Qwen3-8B-GGUF' is not downloaded. Pulling...
Pulling model: Qwen3-8B-GGUF
Total: 4.9 GB, 2 files
[1/2] Qwen3-8B-Q4_1.gguf (5004.7 MB)
Progress: 100% (5004.7/5004.7 MB) 84.8 MB/s
[2/2] config.json (0.0 MB)
Progress: 100% (0.0/0.0 MB)
Model pulled successfully: Qwen3-8B-GGUF
Model loaded successfully!
Opening URL: http://127.0.0.1:13305/
✅ 4.9 GB model downloaded at 84.8 MB/s. Loaded instantly. Server live on port 13305.
$ curl -s http://127.0.0.1:13305/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-8B-GGUF",
"messages": [{"role": "user", "content": "Write a haiku about remote servers"}],
"stream": false
}' | jq .
Response:
{
  "choices": [{
    "finish_reason": "stop",
    "message": {
      "content": "Silent data streams,\nCables hum in the dark -\nRemote hearts beat.",
      "reasoning_content": "Okay, the user wants a haiku about remote servers.
        Let me start by recalling what a haiku is...
        First line: 'Silent data streams' - that's 5 syllables.
        Second line: 'Cables hum in the dark' - that's 7.
        Third line: 'Remote hearts beat' - 5 syllables."
    }
  }],
  "model": "Qwen3-8B-Q4_1.gguf",
  "usage": {
    "completion_tokens": 549,
    "prompt_tokens": 15,
    "total_tokens": 564
  }
}
The irony? A server in Canada, controlled from Asia, writing poetry about remote servers.
| Metric | Value | Notes |
|---|---|---|
| Tokens per second | 15.75 | Solid for 8B Q4_1 on integrated NPU |
| Time to first token | 64 ms | Virtually instant prompt processing |
| Completion tokens | 549 | Including hidden chain-of-thought reasoning |
| KV cache hit | 14 tokens | Lemonade's built-in caching at work |
| Per-token latency | 63.5 ms | Consistent decode speed |
| Model size (Q4_1 quant) | 5.0 GB | Fits easily in 89 Gi RAM |
✅ Roughly 16 tokens per second on a 4.9 GB model running on an integrated NPU, with 84 Gi of RAM still free.
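The numbers in the table hang together, and a few lines of Python show how. This is just arithmetic on the figures above (15.75 tok/s, 64 ms to first token, 549 completion tokens). The function names are mine, and the total-time estimate assumes a constant decode rate.

```python
def per_token_latency_ms(tokens_per_second):
    """Average time per generated token, in milliseconds."""
    return 1000.0 / tokens_per_second


def estimated_total_seconds(ttft_ms, completion_tokens, tokens_per_second):
    """Time to first token plus decode time at a steady rate."""
    return ttft_ms / 1000.0 + completion_tokens / tokens_per_second


if __name__ == "__main__":
    print(round(per_token_latency_ms(15.75), 1))              # per-token latency
    print(round(estimated_total_seconds(64, 549, 15.75), 1))  # whole response
```

So the 63.5 ms per-token latency is simply the inverse of 15.75 tok/s, and the full 549-token haiku response (reasoning included) works out to roughly 35 seconds end to end.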
Qwen3-8B is a thinking model: it exposes its entire chain-of-thought in the `reasoning_content` field. In the haiku test, it literally counted syllables on its virtual fingers:
"First line: Maybe something about the servers themselves. 'Silent data streams' - that's 5 syllables. Second line: Needs 7 syllables. 'Cables hum in the dark' - that's 7. Third line: 'Remote hearts beat' - 5 syllables."
This makes Qwen3-8B an excellent choice for debugging agent behavior, because you can see exactly what the model was thinking before it spoke.
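If you want to log the reasoning separately from the answer, a small helper does the job. The sketch below assumes the response shape shown in my test output (`choices[0].message` with `content` and `reasoning_content`). The helper name `split_reasoning` and the trimmed sample are mine, and other models may omit the reasoning field entirely, which is why it falls back to an empty string.

```python
import json


def split_reasoning(response):
    """Return (reasoning, answer) from a chat completion response dict."""
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return reasoning, answer


# Trimmed-down sample in the same shape as the haiku response.
SAMPLE = json.loads("""
{
  "choices": [{
    "finish_reason": "stop",
    "message": {
      "content": "Silent data streams",
      "reasoning_content": "Okay, the user wants a haiku about remote servers."
    }
  }]
}
""")

if __name__ == "__main__":
    reasoning, answer = split_reasoning(SAMPLE)
    print("THINKING:", reasoning)
    print("ANSWER:  ", answer)
```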
AMD's official benchmarks on a Ryzen AI 9 HX 370 laptop (Radeon 890M iGPU, 32GB RAM) running DeepSeek-R1-Distill-Llama-8B (INT4):
| Context Length | Time to First Token | Tokens/Second |
|---|---|---|
| 128 tokens | 0.94s | 20.7 tok/s |
| 256 tokens | 1.14s | 20.5 tok/s |
| 512 tokens | 1.65s | 20.0 tok/s |
| 1,024 tokens | 2.68s | 19.2 tok/s |
| 2,048 tokens | 5.01s | 17.6 tok/s |
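Two things are worth pulling out of AMD's table: how much decode speed drops as the context grows, and how fast the prompt is being processed. The snippet below is plain arithmetic on the table's first and last rows. Treat the prefill figure as a rough proxy only, since time to first token also includes fixed overhead that I am not separating out.

```python
def decode_slowdown_pct(short_ctx_tps, long_ctx_tps):
    """Percentage drop in decode speed between two context lengths."""
    return (short_ctx_tps - long_ctx_tps) / short_ctx_tps * 100.0


def prefill_rate(prompt_tokens, ttft_seconds):
    """Rough prompt-processing speed: prompt tokens per second of TTFT."""
    return prompt_tokens / ttft_seconds


if __name__ == "__main__":
    print(round(decode_slowdown_pct(20.7, 17.6), 1))  # 128 -> 2,048 tokens
    print(round(prefill_rate(128, 0.94)))             # shortest prompt
    print(round(prefill_rate(2048, 5.01)))            # longest prompt
```

Going from 128 to 2,048 tokens of context costs about 15% of decode speed, while the prompt itself is chewed through at a few hundred tokens per second.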
And from the Hacker News community, Strix Halo users (with up to 128GB unified memory) reported:
- GPT-OSS 120B: ~50 tok/s
- Qwen3-Coder-Next: ~43 tok/s (Q4)
- Qwen3.5 35B-A3B: ~55 tok/s (Q4)
Fifty tokens per second on a 120-billion-parameter model, with no discrete GPU. That's fast enough for fluid, real-time conversation.
| Model | My Test (NPU) | AMD Official (iGPU) | Notes |
|---|---|---|---|
| Qwen3-8B (Q4_1) | 15.75 tok/s | n/a | Real test, remote via SSH |
| DeepSeek-R1-Llama-8B (INT4) | n/a | 20.7 tok/s | AMD's official numbers |
My result is about 24% lower than AMD's best DeepSeek figure, but the two runs are not directly comparable: (a) Qwen3-8B is a different model, (b) I'm using Q4_1 quantization vs INT4, and (c) the columns above reflect different hardware paths (NPU vs iGPU). Thermal throttling was not a factor: the NPU was running at a cool 33°C.
This is the installation I ran on a Ryzen AI 9 HX 370 mini PC with 89 Gi RAM and XDNA 2 NPU, connected remotely via SSH from another continent. Every command below produced the exact output shown.
# Step 1: Add the official AMD-backed PPA
sudo add-apt-repository ppa:lemonade-team/stable
sudo apt update
# Step 2: Install the NPU kernel driver + Lemonade server
sudo apt install amdxdna-dkms lemonade-server
# Step 3: Load the NPU kernel module (NO REBOOT NEEDED!)
sudo modprobe amdxdna
# Step 4: Verify the NPU device is alive
ls -l /dev/accel/accel0
# Output: crw-rw---- 1 root render 261, 0 ... /dev/accel/accel0
# Step 5: Confirm Lemonade version
lemonade --version
# Output: lemonade version 10.8.0
# Step 6: See available models (90+ models!)
lemonade list
# Step 7: Run your first model (auto-downloads + starts chat)
lemonade run Qwen3-8B-GGUF
That's it. Five steps to install and verify, zero reboots, and the NPU is live and ready.
⚠️ CRITICAL NOTE: The PPA is `ppa:lemonade-team/stable`, NOT `lemonade-sdk`. And the package is `lemonade-server`, not just `lemonade`. Many guides get this wrong.
# Non-streaming: get the full response
curl -s http://127.0.0.1:13305/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-8B-GGUF",
"messages": [{"role": "user", "content": "Explain NPU vs GPU in one sentence"}]
}' | jq -r '.choices[0].message.content'
# Streaming: watch tokens appear in real-time
curl -s http://127.0.0.1:13305/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-8B-GGUF",
"messages": [{"role": "user", "content": "Write a haiku about remote servers"}],
"stream": true
}'
⚠️ The correct API path is `/v1/chat/completions`, NOT `/api/chat/completions`. This tripped me up during testing.
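With `"stream": true`, the response arrives as a sequence of server-sent events rather than one JSON document. The parser below is a sketch that assumes Lemonade follows the standard OpenAI streaming format (lines prefixed with `data:`, a `choices[0].delta.content` field per chunk, and a final `data: [DONE]`). I did not capture the raw stream in my test, so treat the sample lines as illustrative rather than real server output.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style streaming lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                      # skip blank keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break                         # end-of-stream marker
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


# Illustrative input, hand-written in the OpenAI chunk format.
SAMPLE_LINES = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Silent "}}]}',
    'data: {"choices": [{"delta": {"content": "data streams"}}]}',
    "data: [DONE]",
]

if __name__ == "__main__":
    print("".join(iter_stream_text(SAMPLE_LINES)))
```

In practice you would feed it the decoded lines of the HTTP response body instead of `SAMPLE_LINES`.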
If sudo modprobe amdxdna fails with "Key was rejected by service", Secure Boot is blocking the unsigned kernel module. You have two options:
Option A: Sign the module (no reboot for install, reboot needed for MOK enrollment):
sudo kmodsign sha512 /var/lib/shim-signed/mok/MOK.priv \
/var/lib/shim-signed/mok/MOK.der \
/lib/modules/$(uname -r)/updates/dkms/amdxdna.ko
Option B: Disable Secure Boot in BIOS (needs physical access or IPMI/BMC).
On my test machine, Secure Boot was disabled, so the module loaded instantly with just `modprobe`. The NPU device appeared at `/dev/accel/accel0` immediately, with no reboot required.
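Since I was doing all of this over SSH, I found it handy to have a scripted version of the `ls -l /dev/accel/accel0` check. The sketch below only inspects the device node, and the status strings and function name are my own invention. Given the `crw-rw---- root render` permissions shown earlier, a "no-permission" result most likely means your user is not in the `render` group, though that is an inference from the listing, not something the script verifies.

```python
import os
import stat


def npu_device_status(path="/dev/accel/accel0"):
    """Classify the state of the NPU device node at `path`."""
    try:
        info = os.stat(path)
    except FileNotFoundError:
        return "missing"             # module not loaded, or no NPU present
    except PermissionError:
        return "no-permission"
    if not stat.S_ISCHR(info.st_mode):
        return "not-a-char-device"   # something unexpected sits at that path
    if os.access(path, os.R_OK | os.W_OK):
        return "ready"
    return "no-permission"           # node exists but this user cannot open it


if __name__ == "__main__":
    print(npu_device_status())
```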
# Download from GitHub Releases
# https://github.com/lemonade-sdk/lemonade/releases/latest
# Run the .msi installer, which auto-detects your hardware
lemonade run Gemma-3-4b-it-GGUF
Server live at http://localhost:13305.
pip install lemonade-sdk
lemonade run Gemma-3-4b-it-GGUF
Running `lemonade list` on a fresh install reveals an impressive buffet of 90+ models. Here's a curated selection of what I saw on my actual system:
| Model | Best For | Approx. Size |
|---|---|---|
| Gemma-3-4b-it-GGUF | Fast first test, general chat | ~2.5 GB |
| Gemma-4-12B-it-GGUF | Advanced reasoning | ~7 GB |
| Gemma-4-26B-A4B-it-GGUF | MoE, 4B active per token | ~15 GB |
| Llama-4-Scout-17B-16E-Instruct-GGUF | Meta's latest MoE | ~10 GB |
| Qwen3-8B-GGUF ⭐ | All-around workhorse (tested!) | ~5 GB |
| Qwen3-Coder-30B-A3B-Instruct-GGUF | Code generation beast | ~18 GB |
| Qwen3.5-35B-A3B-GGUF | State-of-art MoE | ~20 GB |
| DeepSeek-Qwen3-8B-GGUF | Open-source frontier | ~5 GB |
| Phi-4-mini-instruct-GGUF | Microsoft's mini marvel | ~3 GB |
| GPT-OSS-120B-GGUF | Massive 120B model (needs Strix Halo) | ~70 GB |
⭐ = Personally verified on Ryzen AI 9 HX 370
| Model | What It Does |
|---|---|
| SDXL-Turbo | Fast image generation (~1-2s on NPU) |
| SD-1.5 | Classic Stable Diffusion |
| SD-Turbo | Even faster, slightly lower quality |
| Qwen-Image-2512-GGUF | Qwen's image generation model |
| Flux-2-Klein-9B-GGUF | Flux image generation |
| Model | Type |
|---|---|
| Whisper-Large-v3-Turbo | Speech-to-text (best accuracy) |
| Whisper-Medium | Speech-to-text (balanced) |
| Whisper-Tiny | Speech-to-text (fastest) |
| kokoro-v1 | Text-to-speech (voice output!) |
| Moonshine-Streaming | Real-time streaming STT |
| Collection | What's Inside |
|---|---|
| Lite Collection | Multi-modal bundle for modest hardware |
| Ultra Collection | Full multi-modal suite for high-end rigs |
This is the real superpower. Lemonade exposes a standard OpenAI-compatible API at http://localhost:13305/v1. Any app that speaks OpenAI (and that's basically everything) works instantly:
from openai import OpenAI

# Point the standard OpenAI client at the local Lemonade server.
client = OpenAI(
    base_url="http://localhost:13305/v1",
    api_key="lemonade",  # required but unused
)
response = client.chat.completions.create(
    model="Qwen3-8B-GGUF",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
print(response.choices[0].message.content)
| What You Want | Correct Endpoint |
|---|---|
| Chat | POST /v1/chat/completions |
| Text completion | POST /v1/completions |
| Embeddings | POST /v1/embeddings |
| Transcription | POST /v1/audio/transcriptions |
| Image generation | POST /v1/images/generations |
| List models | GET /v1/models |
| Health check | GET /api/v1/health |
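If you would rather not pull in the `openai` package, the endpoints in the table are easy to reach with the standard library alone. Here is a small sketch: the `ENDPOINTS` mapping and helper names are mine, the paths come straight from the table, and I left out transcription because the OpenAI-style version of that endpoint takes a multipart file upload rather than a JSON body.

```python
import json
import urllib.request

BASE_URL = "http://localhost:13305"

# name -> (HTTP method, path), copied from the endpoint table
ENDPOINTS = {
    "chat": ("POST", "/v1/chat/completions"),
    "completions": ("POST", "/v1/completions"),
    "embeddings": ("POST", "/v1/embeddings"),
    "images": ("POST", "/v1/images/generations"),
    "models": ("GET", "/v1/models"),
    "health": ("GET", "/api/v1/health"),
}


def build_request(name, payload=None, base_url=BASE_URL):
    """Create a urllib Request for one of the named endpoints."""
    method, path = ENDPOINTS[name]
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(
        base_url + path, data=data, headers=headers, method=method
    )


def call(name, payload=None, timeout=60):
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(build_request(name, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `call("models")` lists what is available and `call("chat", {"model": "Qwen3-8B-GGUF", "messages": [...]})` mirrors the curl examples.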
The team even maintains a Marketplace of verified integrations.
This is the part that makes privacy-conscious folks smile:
For anyone handling sensitive work documents, healthcare data, or proprietary code, Lemonade means you can use cutting-edge AI without the privacy risk.
| Feature | Lemonade | Ollama |
|---|---|---|
| Primary strength | AMD optimization + multi-modality | Cross-platform model serving |
| NPU support | ✅ XDNA2 (Ryzen AI) | ❌ None |
| Modalities | Chat, Vision, Image Gen, TTS, STT | Chat, Vision |
| Binary size | ~2MB (server) | ~200MB |
| Multiple models | ✅ Simultaneously | One at a time |
| Mobile app | ✅ iOS + Android | ❌ |
| API compatibility | OpenAI, Ollama, Anthropic | Ollama, OpenAI (partial) |
| GPU backends | ROCm, Vulkan, CUDA, Metal | CUDA, ROCm, Metal |
| No-reboot NPU activation | ✅ `modprobe amdxdna` | N/A |
Verdict: If you're on AMD hardware, Lemonade is the clear winner. On NVIDIA or Apple Silicon, both are viable, but Lemonade's multi-modality and tiny footprint are compelling advantages regardless of your GPU brand.
One HN user ran a quick test on an M1 Max MacBook: "Model: qwen3.5-9b. Ollama completed in about 1:44. Lemonade completed in about 1:14. So it seems faster in this very limited test."
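To put that quote into numbers, here is the arithmetic on the two wall-clock times. It is a single informal run reported by one user, so the result is an anecdote and not a benchmark. The helper names are mine.

```python
def to_seconds(clock):
    """Convert an 'M:SS' string to seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + int(seconds)


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate finished."""
    return baseline_s / candidate_s


if __name__ == "__main__":
    ollama = to_seconds("1:44")
    lemonade = to_seconds("1:14")
    print(round(speedup(ollama, lemonade), 2))             # relative speed
    print(round((ollama - lemonade) / ollama * 100.0, 1))  # % less time
```

That is 104 seconds versus 74 seconds: about 1.4x faster, or roughly 29% less wall-clock time in that one test.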
My test environment was a remote Ubuntu 26.04 (Resolute Raccoon) mini PC accessed via SSH from Asia while the machine sat in Canada:
| Component | What We Verified |
|---|---|
| CPU | AMD Ryzen AI 9 HX 370 |
| GPU | Radeon 880M / 890M (Strix Point, RDNA 3.5) |
| NPU | XDNA 2, detected at c5:00.1, device /dev/accel/accel0 |
| RAM | 89 Gi (96 GB LPDDR5x) |
| Kernel | 7.0.0-22-generic |
| Driver | amdgpu + amdxdna (DKMS) |
| Lemonade | v10.8.0 from PPA ppa:lemonade-team/stable |
| NPU Temp | 33°C (idle) |
| VRAM | 2 GB BIOS-allocated, 89 Gi shared via UMA |
| Test Model | Qwen3-8B-GGUF (4.9 GB, Q4_1 quant) |
| Tokens/sec | 15.75 tok/s, verified via /v1/chat/completions API |
| Install Time | ~2 minutes (PPA + apt install + modprobe) |
| Reboots | Zero. SSH session never dropped |
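In case the "89 Gi (96 GB)" row looks like a typo: it is the same amount of memory in two different units. The sketch below assumes "Gi" means gibibytes (2^30 bytes), which is how tools such as `free -h` label their output, and that the 96 GB marketing figure is decimal gigabytes (10^9 bytes).

```python
def gb_to_gib(gigabytes):
    """Convert decimal gigabytes (10^9 bytes) to gibibytes (2^30 bytes)."""
    return gigabytes * 10**9 / 2**30


if __name__ == "__main__":
    print(round(gb_to_gib(96), 1))  # installed RAM expressed in GiB
```

96 GB comes out at about 89.4 GiB, which the OS rounds down to the 89 Gi you see throughout this post.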
| Claim | Verified? |
|---|---|
| NPU loads without reboot (`modprobe amdxdna`) | ✅ YES |
| `/dev/accel/accel0` appears immediately | ✅ YES |
| PPA packages work on Ubuntu 26.04 | ✅ YES |
| Auto-downloads models at 84.8 MB/s | ✅ YES |
| OpenAI-compatible API at `/v1/chat/completions` | ✅ YES |
| ~16 tok/s on 8B model with integrated NPU | ✅ YES |
| Reasoning models expose chain-of-thought | ✅ YES |
| Full remote install possible via SSH | ✅ YES |
Lemonade just shipped v10.8.0 (June 17, 2026), which added:
- `lemonade-server` deb landed June 17, supporting the latest LTS
The project maintains an active Discord community and a transparent roadmap driven by community working groups.
- `modprobe`, no reboot, NPU live while you're on another continent
Lemonade lives up to its name: it takes something complex (running AI models locally across different hardware) and makes it refreshingly simple. AMD's backing gives it serious credibility, and the open-source community is shipping features at an impressive pace.
The real kicker? I verified the entire thing on a remote Ubuntu 26.04 mini PC from halfway across the world. One PPA, two packages, one modprobe, zero reboots, and a 50+ TOPS NPU was cranking out 15.75 tokens per second on Qwen3-8B, writing haikus about the very remote server it was running on.
"Silent data streams, Cables hum in the dark - Remote hearts beat."
(Qwen3-8B, running on XDNA 2 NPU, June 2026)
Your PC is more than just a computer. With Lemonade, it becomes your personal AI brain: private, free, and ridiculously fast.
Ready to take a sip? Head to lemonade-server.ai and give it a spin.