Look, I'm not going to pretend this was purely a practical decision. I like tech. I like new toys. Any excuse to justify new hardware is a good excuse for me. But the reason I could actually justify pulling the trigger on a local LLM workstation? Subscription fatigue and rate limit anxiety were grinding my gears.
You know the feeling. You're in the zone, iterating on an idea, working through the messy early thinking that AI tools are actually good at helping with. And then: "You've reached your limit. Come back in 5 hours."
Or worse - you're considering using the API for a project, doing the mental math on token costs, and realizing you could easily rack up a four-figure bill if you're not careful. The last thing I want is a surprise $1,000+ invoice from Google or Anthropic because I got a little too enthusiastic with my prompting.
I'm not building a SaaS product. I'm not selling AI apps in any store. I'm just trying to do my actual work - strategic planning, consulting, writing - and the foundation models (Claude, Gemini, ChatGPT) are genuinely useful for that work. But the constant mental overhead of "am I using too many tokens?" or "should I upgrade to the next tier?" was becoming its own productivity tax.
So I built an overflow system. A local LLM setup where I can iterate freely, work through messy ideas, and refine my thinking without worrying about rate limits or costs. Then, once I've got something more coherent, I take it to Claude or Gemini for the final polish.
And here's what surprised me: you don't need a $10,000 GPU rig or a server rack in your closet to make this work.
The Problem: Subscription Creep and Rate Limit Whiplash
Here's how the costs add up when you actually use AI tools for work:
- ChatGPT Plus: $20/month
- Claude Pro: $20/month
- Gemini Advanced: $20/month
That's $60/month just to have access to the good models. And even then, you're hitting rate limits. Claude Pro gives you better limits than the free tier, but if you're doing real work - iterating on documents, working through complex problems, having extended back-and-forth sessions - you can still hit the wall.
Then there's the API option. Pay-per-token sounds reasonable until you realize how quickly tokens add up during actual work. Running a few dozen multi-turn conversations through the API for a project? That could easily be $50-$200 depending on context windows and model choice. And the anxiety of "am I being efficient enough with my prompts?" kills the creative flow that makes these tools useful in the first place.
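For a rough sense of scale, here's a back-of-envelope using published Sonnet-class API pricing of about $3 per million input tokens and $15 per million output tokens (the usage pattern is illustrative, not a log of my actual sessions):

30 conversations × 20 turns × ~25K tokens of context resent per turn ≈ 15M input tokens ≈ $45
30 conversations × 20 turns × ~800 tokens per reply ≈ 0.5M output tokens ≈ $7

Push those contexts toward 100K tokens - easy to do once you start pasting in documents - and the same usage pattern heads toward $200.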
I'm not saying these subscriptions aren't worth it - they absolutely are for the work I do. But I wanted a middle ground: unlimited iteration space for the messy work, reserve the premium models for the refined work.
The workflow I was after:
- Take rough ideas to local LLM, iterate freely
- Work through multiple approaches without token anxiety
- Refine the best direction with unlimited back-and-forth
- Take the polished concept to Claude/Gemini for final quality
Think of it like drafting in Google Docs before finalizing in InDesign. The local LLM isn't better than Claude - it's just unlimited.
The Solution: Local LLM as Overflow Capacity
I didn't set out to replace Claude or Gemini. I just wanted somewhere I could think out loud without watching a usage meter.
What I was looking for:
- A machine that could run 20-30B parameter models locally (good enough for iteration work)
- Network-accessible so I could use it from any device in my house
- Small form factor (no server rack, no jet engine under my desk)
- Reasonable power consumption (not spinning up a gaming rig 24/7)
- Room to grow if I needed more capacity later
What I specifically wasn't looking for:
- Cutting-edge performance that rivals GPT-4 or Claude
- Maximum possible speed at any cost
- The ability to run 70B+ models (yet)
This was about "good enough to be useful" for iteration work, not "best possible performance for production workloads."
Hardware Decision: Why AMD Ryzen AI Max 395 (Strix Halo)
I evaluated four options:
1. Gaming Desktop
I already have a gaming tower for my main work. Building another one felt redundant, and even high-end consumer GPUs cap out at 16-24GB VRAM. Once you load a 20B model and try to give it a decent context window, you're hitting that ceiling fast.
Verdict: Redundant hardware, VRAM-limited, power-hungry.
2. Apple Mac Studio
Unified memory architecture is perfect for LLMs - the GPU can access all system RAM. Polished experience, quiet, power-efficient. And quite a few models are optimized for Apple silicon through MLX, so the ecosystem isn't as limited as I initially thought.
But the Apple Tax was too steep for a machine that would essentially sit on a shelf in my office running background tasks.
Verdict: Great hardware, close runner-up, but too expensive for this use case.
3. Nvidia DGX Spark
Purpose-built for AI, mature ecosystem, strong driver support. Also $$$$, and overkill for my "good enough for iteration" goal.
Verdict: Solving a bigger problem than I have.
4. AMD Ryzen AI Max 395 (Strix Halo)
Small form factor workstation with unified memory architecture (up to 128GB), active optimization community, and room for the ecosystem to improve.
There are a few different makers producing Strix Halo machines. The Framework workstation gets excellent reviews and is highly regarded in the community, but I'm impatient - the HP Z2 Mini G1a was in stock at Microcenter and I could walk out with it that day. Plus, I didn't find anything negative about the HP, and the price was as reasonable as anything with a chip in it these days.
Verdict: This is the one.
Why Strix Halo Made Sense
The unified memory advantage: Like the Mac Studio, the GPU can access all system RAM. No VRAM ceiling. I can run a 20B model with a 16K context window comfortably in 64GB, with headroom to scale up.
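A rough back-of-envelope for why that fits (illustrative numbers, not measurements):

20B parameters × ~0.6 bytes per weight at Q4_K_M ≈ 12 GB for the weights
KV cache for a 16K context on a typical grouped-query-attention model ≈ another 2-4 GB
Call it roughly 15 GB of GPU-addressable memory - not even a quarter of the unified RAM.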
Performance was comparable: For the 8-34B models I'd actually be running, token-per-second performance across Mac Studio, DGX Spark, and Strix Halo wasn't drastically different. I wasn't giving up meaningful speed by choosing Strix Halo.
Form factor and power: Mini workstation that fits on a desk, quiet operation, way less power than another gaming tower. I could even rack-mount multiple units later if I wanted to cluster for larger models.
The ecosystem bet: This is the part that mattered most to me. I'm not betting on Strix Halo because it's the fastest today. I'm betting on it because:
- The optimization trajectory is still steep (community actively improving performance)
- Active tooling development (kyuz0's toolboxes, llama.cpp AMD support improving)
- People already experimenting with clustering for 70B+ models
- Early adopter phase means I'm on the improvement curve, not at the plateau
Apple's ecosystem has good optimization through MLX, but it's locked and fairly static. Nvidia's ecosystem is mature but expensive and mostly optimized already. Strix Halo is early enough that the community is still figuring out how to squeeze more performance out of it.
Getting It Running: The Software Stack
This is where things got interesting. The hardware is plug-and-play, but getting a local LLM server accessible across my network took some assembly.
What I was aiming for:
- llama.cpp running models locally
- Open WebUI as the interface (think ChatGPT-style web interface)
- Accessible at llm.theemdash.ai from any device on my network
- SSL certificates so browsers don't complain
The sources I pieced together:
- kyuz0's YouTube series on Strix Halo optimization
- kyuz0's Toolboxes repository with pre-built containers
- TechnigmaAI's GTT optimization guide for memory tuning
- Open WebUI documentation for the interface
- A lot of trial and error
The gap: Nobody had documented how to wire all these pieces together into a complete, network-accessible system. Each guide covered one piece brilliantly, but the integration wasn't documented anywhere.
The Stack (High Level)
Here's what I ended up with:
Browser → llm.theemdash.ai (HTTPS)
↓
Nginx reverse proxy (with SSL)
↓
Open WebUI (web interface)
↓
llama-server (running in container)
↓
Strix Halo GPU
Layer 1: Hardware optimization - Followed TechnigmaAI's guide to optimize the Graphics Translation Table (GTT) memory for better iGPU performance. This was a "follow the instructions exactly" situation - I don't fully understand every kernel parameter, but the gist is that it raises how much system RAM the iGPU is allowed to map, which is what makes room for larger models and longer contexts.
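I won't rehash the guide here, but for flavor, the tuning boils down to kernel boot parameters along these lines. The values below are illustrative (roughly 56GB of GTT on a 64GB machine) - follow the guide for the right numbers for your RAM:

# /etc/default/grub - amdgpu.gttsize is in MiB, ttm.pages_limit is in 4KiB pages
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.gttsize=57344 ttm.pages_limit=14680064"

# rebuild the boot config and reboot for it to take effect
sudo update-grub
sudo reboot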
Layer 2: Container setup - Used distrobox (not toolbox despite what some docs say - GPU passthrough doesn't work properly in toolbox). Pulled kyuz0's pre-built container images that have llama.cpp already compiled with the right optimizations.
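Roughly, that looks like this - the image reference is a placeholder, so grab the exact one from the README of kyuz0's toolboxes repository:

# create a container with the iGPU devices passed through
# /dev/dri and /dev/kfd are the AMD GPU device nodes the runtimes need
distrobox create --name llama \
  --image <kyuz0-llama-toolbox-image> \
  --additional-flags "--device /dev/dri --device /dev/kfd --group-add video"

# hop inside - llama.cpp is already compiled in there
distrobox enter llama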
Layer 3: llama-server - Running inside the container, serving models on port 10000. Critical detail: use --host 0.0.0.0 not --host 127.0.0.1 or the container networking won't work.
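For reference, the launch command ends up looking something like this (the model path and sizes are placeholders - adjust for whatever you're running):

# serve an OpenAI-compatible API on port 10000, all layers offloaded to the iGPU
# --host 0.0.0.0 is the critical part: bind every interface, not just loopback
llama-server -m /models/some-model-Q4_K_M.gguf \
  --host 0.0.0.0 --port 10000 \
  -c 16384 -ngl 99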
Layer 4: Open WebUI - Separate container providing the ChatGPT-style interface, running on port 3000, connecting to llama-server.
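One way to wire that up, assuming Docker and the official Open WebUI image - point OPENAI_API_BASE_URL at wherever llama-server is actually listening:

# ChatGPT-style UI on port 3000, talking to llama-server's OpenAI-compatible API
docker run -d --name open-webui -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:10000/v1 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main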
Layer 5: Nginx - Reverse proxy with SSL certificates from Let's Encrypt, making everything accessible at a clean domain name.
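A minimal sketch of that server block, assuming certbot's default certificate paths (the streaming directives from the gotchas below also go inside the location block):

server {
    listen 443 ssl;
    server_name llm.theemdash.ai;

    ssl_certificate     /etc/letsencrypt/live/llm.theemdash.ai/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.theemdash.ai/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        # Open WebUI uses WebSockets, so pass the upgrade headers through
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}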
The Gotchas I Hit
1. toolbox vs distrobox
The documentation says use toolbox for containers. GPU passthrough doesn't work in toolbox on Ubuntu 24.04. Use distrobox instead. Same commands, just swap the tool name.
2. Host binding for llama-server
If you use --host 127.0.0.1, the Open WebUI container can't reach llama-server. Use --host 0.0.0.0 to make it accessible from other containers.
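A quick way to confirm the binding from another machine on the network (replace the placeholder with the workstation's LAN address):

# should come back with an "ok" status once the model has loaded
curl -i http://<workstation-ip>:10000/health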
3. Nginx streaming configuration
Without proper streaming config, every model response fails with a JSON parsing error. You need:
# don't buffer or cache - let tokens stream to the browser as they're generated
proxy_buffering off;
proxy_cache off;
# HTTP/1.1 is needed for chunked streaming (and WebSocket upgrades)
proxy_http_version 1.1;
This took me hours to figure out. The error message (Unexpected token 'd', "data: {ch"... is not valid JSON) was completely unhelpful.
What I Ended Up With
I'm currently running Qwen3-Coder-32B-Instruct (32 billion parameter model, Q4_K_M quantization).
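A quick sanity check that it's actually serving tokens, hitting llama-server's OpenAI-compatible endpoint directly (the address is a placeholder for the workstation's LAN address):

curl http://<workstation-ip>:10000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in five words."}]}'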
Once I could chat with it, that was good enough for the weekend. Now I could start playing with different models and digging into which ones actually performed well for my work - but that's a story for the next article.
What this gives me:
- Unlimited iteration space for working through ideas
- No rate limits, no usage meters, no subscription tier anxiety
- Network accessible from any device in my house
- Privacy for sensitive work (nothing leaves my network)
- A foundation to build on - I can swap models, experiment with different sizes, eventually cluster multiple units if needed
What this doesn't replace:
- Claude and Gemini for final quality work
- The polish and capabilities of frontier models
- The convenience of just opening a web browser
Think of it as a workshop - it's where you do the messy iteration work before taking your refined output to the production environment.
Was It Worth It?
For me? Yes, but with caveats.
This made sense because:
- I was already hitting rate limits regularly
- I like tinkering with tech (getting new hardware is always a good day)
- The roughly $3,000 up-front cost heads off further subscription creep and API bill anxiety
This probably doesn't make sense if:
- You're happy with free tier limits
- You don't mind paying for top-tier plans (Claude Pro, Gemini Advanced)
- You're comfortable with pay-as-you-go API usage and the potential costs
- You only use AI tools occasionally
- You don't have the technical comfort to troubleshoot container networking and reverse proxies
- $3,000 is a significant expense for a "nice to have" tool
The real value isn't in replacing the foundation models - it's in removing the mental overhead of "am I using too many tokens?" so I can actually think freely during the messy iteration phase.
And yeah, getting to play with Strix Halo hardware and contribute real-world usage data to an emerging ecosystem? That's a bonus.