Chasing a Whisper-Quiet Private Voice Brain on My Ryzen Rig (No GPU, No Excuses)
By Finn Harlow • March 2026
The Spark
One too many late-night doomscrolls about AI outfits hoovering up voice data for “improvement,” and I hit my limit. My Ryzen 5 5600G box—16GB RAM, integrated graphics, Windows 10 purring quietly—sits right here, yet every time I want to talk to my computer it’s still shipping my half-mumbled thoughts straight to the cloud. I wanted something different: a local voice assistant I could throw quick questions at—math, reminders, “explain this regex disaster”—without feeding Big Tech my sleepy ramblings. Zero subscriptions, zero phoning home.
Good enough on this hardware meant: STT that actually hears me without endless pauses, LLM replies in under ten seconds end-to-end, TTS that doesn’t sound like a 90s GPS, and all of it CPU-only so power stays sane (under 100W peak) and the fans don’t impersonate a hair dryer on high. Full offline privacy. If it felt like muttering to a chill friend across the table, I’d take the win.
Path Hunting
Scoured repos and threads for stacks that wouldn’t melt my rig. Ollama + Whisper + Piper looked approachable—clean scripts, decent Windows support. Pros: straightforward. Cons: stock Whisper loves RAM; Ollama OOMs fast on 16GB without brutal quantization.
llama.cpp + whisper.cpp felt scrappier and meaner—both pure C++ speed demons built for CPU. Pros: quantization baked in, whisper.cpp streaming for lower latency, and the late-2025/early-2026 Silero VAD updates (v6.2.0 especially) cleaned up silence detection without piling on dependencies. Cons: Windows builds mean CMake battles, but the payoff looked worth it.
Home Assistant Assist dangled polish—wake words, intents, nice UI. Pros: scales nicely if I ever hook it to lights. Cons: VM overhead would pin the CPU and send the cooler into full tornado mode. Overkill for what I wanted.
LocalAI or faster-whisper variants crossed my mind. Flexible, but heavier—more OOM risk. I kept circling back to whisper.cpp + llama.cpp + Piper: minimal, efficient, and that Silero v6.2.0 VAD upgrade felt like free performance.
The Build Saga
Kicked off easy with Ollama’s one-liner install. Grabbed Gemma 3 12B Q4_K_M—squeezed in at ~10GB. Typed a test prompt; joke landed in 5–8 seconds, ~10 tok/s. Not fast, but alive on CPU.
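That first smoke test was just a prompt over Ollama's local REST API. A minimal sketch, stdlib only — the `/api/generate` endpoint and port 11434 are Ollama's documented defaults, but the `gemma3:12b` model tag is my guess at what `ollama pull` named it on your machine:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model: str, prompt: str) -> dict:
    """Assemble a non-streaming generate request for Ollama's REST API."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # wait for the whole reply instead of chunked tokens
    }

def ask(model: str, prompt: str, timeout: float = 60.0) -> str:
    """POST the prompt to a running Ollama and return the reply text."""
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask("gemma3:12b", "Tell me a joke")` with the Ollama service up is the whole test; timing that call is how I got the 5–8 second numbers.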
STT next. Python Whisper? Ate RAM for breakfast. Built whisper.cpp instead—CMake fought me because I forgot Visual Studio build tools the first time. Downloaded base.en Q5_0, tested on a WAV of me mumbling; transcribed in under 2 seconds. Added streaming flags, piped mic input through a quick ffmpeg batch script.
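The mic-to-text plumbing boils down to two command lines. A sketch of how I build them — the DirectShow device name is whatever your Windows mic is actually called (mine differs), and I'm assuming the whisper.cpp binary is named `whisper-cli` and sits on PATH; the `-m`/`-f` flags are from its README:

```python
def ffmpeg_capture_cmd(device: str, out_wav: str, seconds: int = 5) -> list[str]:
    """Build an ffmpeg command that grabs `seconds` of mic audio via Windows
    DirectShow and resamples it to the 16 kHz mono WAV whisper.cpp expects."""
    return [
        "ffmpeg", "-y",
        "-f", "dshow",              # Windows DirectShow capture
        "-i", f"audio={device}",    # e.g. "Microphone (Realtek Audio)"
        "-t", str(seconds),
        "-ar", "16000",             # whisper.cpp wants 16 kHz
        "-ac", "1",                 # mono
        out_wav,
    ]

def whisper_cmd(model: str, wav: str) -> list[str]:
    """Build a whisper.cpp CLI call: transcribe `wav` with the given model."""
    return ["whisper-cli", "-m", model, "-f", wav, "--no-timestamps"]
```

Each list goes straight into `subprocess.run(cmd, check=True)`; keeping them as lists instead of strings sidesteps Windows quoting pain around device names with spaces.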
TTS: Piper from the active fork. Built it, picked an English voice. Sounded warm enough for short replies.
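Piper's CLI takes the text to speak on stdin and writes a WAV, so the wrapper is tiny. A sketch assuming the `piper` binary is on PATH; the `--model`/`--output_file` flags and the voice filename follow the project's README conventions, but double-check them against your build:

```python
import subprocess

def piper_cmd(model: str, out_wav: str) -> list[str]:
    """Build a Piper invocation; the text to speak arrives on stdin."""
    return ["piper", "--model", model, "--output_file", out_wav]

def speak(text: str, model: str = "en_US-lessac-medium.onnx",
          out_wav: str = "reply.wav") -> None:
    """Synthesize `text` to a WAV file (requires piper on PATH)."""
    subprocess.run(piper_cmd(model, out_wav),
                   input=text.encode("utf-8"), check=True)
```

After that, playing `reply.wav` with any audio player closes the loop.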
First full loop was pure slapstick. Whisper grabbed “Set timer for 5 minutes,” Ollama chewed for 12 seconds while fans started whining, Piper spat garbled nonsense because I botched the WAV headers. Then Ollama hit a longer context and—boom—full OOM freeze. My rig gasped like a marathon runner at mile 25, paging to disk, mouse frozen, SSD whimpering. I just sat there staring, then burst out laughing at how 16GB suddenly felt like pocket change.
Ripped it apart and rebuilt smarter. Switched to Phi-3-mini on llama.cpp with heavier quantization—smaller footprint, snappier. Times dropped to 3–6 seconds total, tok/s settled 15–25 on my Ryzen (solid CPU numbers from every bench I could find). Hooked in whisper.cpp’s Silero VAD v6.2.0—night-and-day better at ignoring keyboard clicks and fan whoosh, no extra Python cruft. Latency plunged because it stopped transcribing my breathing pauses.
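Most of that speedup is plain arithmetic: an 80-token reply at 10 tok/s costs 8 seconds of generation alone; at 20 tok/s it's 4. A toy budget calculator using my measured numbers (they're from my rig, not universal):

```python
def turn_latency(stt_s: float, reply_tokens: int,
                 tok_per_s: float, tts_s: float) -> float:
    """Rough end-to-end latency for one voice turn: STT + generation + TTS."""
    return stt_s + reply_tokens / tok_per_s + tts_s

# Gemma 3 12B at ~10 tok/s vs Phi-3-mini at ~20 tok/s, 80-token reply:
slow = turn_latency(1.5, 80, 10.0, 0.3)   # 9.8 s — past my 10 s budget
fast = turn_latency(1.5, 80, 20.0, 0.3)   # 5.8 s — comfortably inside it
```

It also shows why VAD mattered so much: shaving dead air off the STT term is free latency, no model change required.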
Windows default input kept dropping syllables until I cranked the ffmpeg buffer—suddenly it caught my mumbles like it was actually paying attention for once. Another gem: asked Piper for a quinoa recipe and it pronounced the word like a drunk phonetic experiment—“kwin-OH-ahhh.” Cracked up, fixed the phoneme map, and filed away the lesson: always test weird words first, dummy.
Tried Home Assistant Assist in a VirtualBox VM for extra polish. Wake word worked, but CPU maxed, fans spun into full panic mode. Wife poked her head in: “Your computer’s mad again.” Killed it quick. Settled on a simple PowerShell loop. Late-night win: llama.cpp’s --keep flag kept context light, and capping the window at 2048 tokens (-c 2048) killed most OOMs.
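The other half of the OOM fix is simply refusing to feed the model more history than the window holds. My loop does this in PowerShell, but the idea is language-agnostic — a sketch that approximates token count by whitespace words (crude; llama.cpp tokenizes properly on its side):

```python
def trim_history(turns: list[str], budget: int = 2048,
                 reserve: int = 512) -> list[str]:
    """Drop the oldest turns until the (crudely word-counted) history fits
    within the context budget minus room reserved for the next reply."""
    def words(s: str) -> int:
        return len(s.split())

    kept = list(turns)
    while kept and sum(words(t) for t in kept) > budget - reserve:
        kept.pop(0)  # forget the oldest exchange first
    return kept
```

Words undercount real tokens by maybe 25–30%, so the `reserve` margin is doing double duty here; a real tokenizer would let you cut it closer.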
Three nights, 8–10 hours, coffee rings on the desk. Progress.
Results & Reflection
What stuck: whisper.cpp (Silero v6.2.0 VAD keeping chunks clean), llama.cpp with Phi-3-mini Q4-quantized, Piper TTS, everything glued with scripts. Cost: zilch. Power: 60–80W while chatting, 40W idle. Heat never topped 70°C. Performance: STT 1–2s, LLM 3–6s (15–25 tok/s feels responsive enough on CPU), TTS near-instant. Loop latency low enough I don’t twitch waiting.
The lean pipeline that survived the chaos:
Mic → ffmpeg → whisper.cpp (Silero VAD) → text
↓
llama.cpp (Phi-3-mini Q4)
↓
Piper TTS → Speakers
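The glue holding that diagram together is a PowerShell loop on my box, but the control flow is the same in any language. A hypothetical sketch with the four stages as injectable functions (the names are mine, not from any library), which also makes each stage swappable and testable on its own:

```python
from typing import Callable

def run_turn(record: Callable[[], str],
             transcribe: Callable[[str], str],
             generate: Callable[[str], str],
             speak: Callable[[str], None]) -> str:
    """One pass through the pipeline: mic -> STT -> LLM -> TTS.
    Stages are injected so the loop doesn't care which binaries back them."""
    wav = record()              # ffmpeg writes a 16 kHz mono WAV
    text = transcribe(wav)      # whisper.cpp (with VAD) turns it into text
    if not text.strip():
        return ""               # silence or pure noise: skip the LLM entirely
    reply = generate(text)      # llama.cpp + Phi-3-mini
    speak(reply)                # Piper reads it back
    return reply
```

The empty-transcript early-out is the VAD payoff in code form: no tokens burned, no fans spun, when all the mic heard was keyboard clatter.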
Next tweaks: swap the built-in mic for something directional—too much room noise right now. Maybe poke at newer lightweight TTS clones if they stay CPU-friendly. A cheap used GPU down the road would obviously smoke this, but honestly, the setup already over-delivers for what it is.
Worth the hassle? If you want instant, polished, zero-brain cloud convenience, no way—Siri and friends still win there. But if privacy paranoia keeps you up at night, or you just love the feeling of owning your own little brain extension, hell yes. It’s quiet, it’s mine, and no corporation gets to log my grocery-list mutterings. The screw-ups taught me more about quantization and VAD than any readme ever could, and CPU inference in 2026 is legitimately impressive.
Readers, if you’ve chased something similar on Ryzen—or hit epic fails along the way—hit me with your experiments, dead ends, or random gotchas. Maybe we’ll spark the next hack together.