In my last post, I walked through setting up a local LLM on an HP Z2 Mini. I got Qwen3-Coder-32B-Instruct running and called it success.

Then I actually tried to use it. [Cue lab coat and clipboard]

The Test That Broke Everything

I gave Qwen a simple task: "Here are four documents about my Z2 setup. Can you summarize what we discussed?"

Qwen's response: "There are no documents. I need to see them."

The documents were right there. Uploaded and visible in the interface. But the state-of-the-art, 32-billion parameter model hallucinated that nothing existed.

I needed something that actually worked, so I switched to GPT-OSS-20B. Smaller model, older architecture, but it had no problem reading the documents and providing a detailed summary.

Not exactly scientific testing. No controlled variables, no repeated trials, just a quick "does this work?" check. But it was meaningful because it forced me to ask the right question.

The Question I Should Have Asked First

Even if I got Qwen working perfectly, even if I optimized it to run faster than GPT-OSS... I already pay for GPT-4 through ChatGPT. [Facepalm moment]

The models available through OpenAI's API include GPT-4, which handles the same tasks as GPT-OSS-20B. So why was I spending time running a local version of essentially the same capability?

The unlimited queries? Nice, but not the real value.

The real question wasn't "can I run models locally?" I'd already answered that. The real question was: "What can I do locally that I CAN'T do with my existing subscriptions?"

What I Actually Need

My work isn't writing code agents or debugging Python. It's strategic planning and developing frameworks. Thinking through how organizations should approach technology adoption. The consulting and advisory side of AI enablement.

I needed a model optimized for that kind of work - not coding, not general chat, but strategic reasoning and planning.

So I did what anyone would do: I asked an AI.

Finding Mistral

I asked Gemini: "What model would you recommend for consulting and strategic planning work?"

The answer: Mistral-Small-3.2-24B.

Designed for reasoning, good at extended thinking, strong performance on strategic tasks. And here's the key part - not available through ChatGPT, Claude, or Gemini subscriptions.

This was the model I actually needed. Not because it was "better" on some leaderboard, but because it aligned with my actual use case AND I couldn't access it any other way.

The Optimization Problem

Here's where I hit a practical issue: I wanted to run Mistral locally, but I had no idea if my Z2 Mini could handle it well enough to be useful.

I relied heavily on kyuz0's benchmark data. His interactive grid at kyuz0.github.io/amd-strix-halo-toolboxes/ shows performance metrics for different models across different backends (ROCm 6.4.4, ROCm 7.1.1, Vulkan variants).

Problem: Mistral-Small-3.2-24B wasn't on his list.

Solution: Test it myself using his containers.

Let me be clear about something: I'm not a benchmarking expert. I didn't design sophisticated tests or tune parameters. I just ran the basic llama-bench command across each of kyuz0's pre-built containers to see which one performed best.

kyuz0's contribution was building optimized containers for Strix Halo. My contribution was running a model through them that he hadn't tested yet. Thank you, kyuz0, for doing the hard work so I didn't have to.

The Testing Process

I had four container options available:

  • ROCm 6.4.4
  • ROCm 7.1.1
  • Vulkan (RADV)
  • Vulkan (AMDVLK)

Each one is a different backend for running models on AMD hardware. Based on kyuz0's data, ROCm 6.4.4 crushed everything for most models. But would that hold true for Mistral?

The test command:

llama-bench -m Mistral-Small-3.2-24B-Q4_K_M.gguf -p 2048 -n 128 -fa 1

Breaking that down:

  • -p 2048: Simulates reading a document (prompt processing)
  • -n 128: Simulates writing a response (text generation)
  • -fa 1: Enables flash attention (kyuz0 says never omit this)

I ran this same command in each container and recorded the results.
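
In practice that was just a loop. Here's a rough sketch of what I ran - the container names and model path are placeholders rather than kyuz0's exact names, so swap in whatever your toolbox setup actually uses:

# Run the same llama-bench test in each backend container, logging each result.
# Container names and the model path are placeholders - adjust for your setup.
MODEL=~/models/Mistral-Small-3.2-24B-Q4_K_M.gguf

for box in llama-rocm-6.4.4 llama-rocm-7.1.1 llama-vulkan-radv llama-vulkan-amdvlk; do
  echo "=== $box ==="
  toolbox run --container "$box" \
    llama-bench -m "$MODEL" -p 2048 -n 128 -fa 1 | tee "bench-$box.log"
done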

The Results

Backend          Prompt Processing (t/s)   Text Generation (t/s)
ROCm 6.4.4       400.74                    14.42
ROCm 7.1.1       400.15                    14.43
Vulkan RADV      265.06                    15.07
Vulkan AMDVLK     91.59                    15.08

Key findings:

ROCm 6.4.4 and 7.1.1 essentially tied - within 0.15% of each other for prompt processing, virtually identical for text generation. For Mistral specifically, the newer ROCm version caught up to the older one.

ROCm crushed Vulkan RADV - 51% faster at prompt processing (reading documents). Text generation was only 4% slower, which is negligible.

Vulkan AMDVLK was terrible - 77% slower than ROCm for prompt processing with no meaningful benefit anywhere else. Avoid this backend. Just... no.
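
If you want to check those percentages yourself, they fall straight out of the table above - nothing fancier than a bit of awk arithmetic:

# Sanity-check the headline percentages against the raw throughput numbers.
awk 'BEGIN {
  printf "ROCm vs Vulkan RADV, prompt:   %.0f%% faster\n", (400.74/265.06 - 1) * 100
  printf "ROCm vs Vulkan RADV, gen:      %.0f%% slower\n", (1 - 14.42/15.07) * 100
  printf "Vulkan AMDVLK vs ROCm, prompt: %.0f%% slower\n", (1 - 91.59/400.74) * 100
}'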

What These Numbers Actually Mean

When you're using a local LLM for interactive work, there are two phases:

  1. Prompt processing: The model reading your input (documents, context, instructions)
  2. Text generation: The model writing its response

For chat and document work, prompt processing speed matters more than you'd think. It's the difference between "instant response" and "waiting a few seconds." Time to first token is what drives the perception of speed.

Text generation being 4% slower? Barely noticeable. You're reading the response as it streams anyway.
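
To put numbers on that waiting: for a 2048-token prompt like the one in the benchmark, the measured prompt-processing rates translate into roughly these times before the first token appears (simple division, ignoring any other overhead):

# Approximate time to first token for a 2048-token prompt at each measured rate.
awk 'BEGIN {
  printf "ROCm 6.4.4:    %.1f s\n", 2048/400.74
  printf "Vulkan RADV:   %.1f s\n", 2048/265.06
  printf "Vulkan AMDVLK: %.1f s\n", 2048/91.59
}'

That's roughly five seconds versus nearly eight versus over twenty - the difference between a tolerable pause and wondering whether something hung.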

The verdict: ROCm 6.4.4 for Mistral-Small-3.2-24B on Strix Halo.
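
Putting that verdict to work day to day just means serving the model from the winning container. Here's a rough sketch, not a definitive recipe - the container name and paths are the same kind of placeholders as before, and flags beyond these basics (including the flash-attention switch kyuz0 recommends keeping on) vary by llama.cpp build:

# Serve Mistral from the ROCm 6.4.4 container (names and paths are placeholders).
toolbox run --container llama-rocm-6.4.4 \
  llama-server -m ~/models/Mistral-Small-3.2-24B-Q4_K_M.gguf \
    -c 16384 -ngl 99 --host 127.0.0.1 --port 8080

# llama-server speaks an OpenAI-compatible API, so a quick smoke test is just:
curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Outline a technology adoption framework."}]}'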

What I Actually Gained

I now have a model running locally that:

  • Handles strategic reasoning and planning work
  • Isn't available through ChatGPT, Claude, or Gemini
  • Runs fast enough to be genuinely useful (400+ tokens/sec prompt processing)
  • Costs nothing per query once it's running

This is the actual value of local LLM deployment. Not unlimited queries to models I already pay for elsewhere. Access to capabilities I can't subscribe my way to.

The Bigger Pattern

Looking back at both posts, there's a consistent theme:

Step 1: Chose hardware (Strix Halo) based on architecture advantages and ecosystem trajectory, not peak benchmark numbers.

Step 2: Chose model (Mistral) based on my actual work needs and unique access, not what leaderboards say is "best."

Both decisions were pragmatic. Both leaned on available data (kyuz0's benchmarks) without any pretense of expert-level performance engineering. Both focused on "what do I actually need?" instead of "what's theoretically optimal?"

The Resources That Made This Work

kyuz0's benchmark grid: kyuz0.github.io/amd-strix-halo-toolboxes/
Essential for understanding backend performance differences.

kyuz0's toolbox repository: github.com/kyuz0/amd-strix-halo-toolboxes
Pre-built containers saved me from ROCm configuration hell.

Mistral AI documentation: For understanding what the model was designed to do and how to configure it properly.

Gemini: For pointing me toward the right model in the first place.


Step 1 got the hardware working. Step 2 made it useful.

Am I done optimizing this setup? Probably not. But if I find ways to push this further, I'll write about it. No cliffhangers, just whatever actually works next.

Since writing this, I've migrated from Ubuntu 24.04 to Fedora 43. Ubuntu's LTS status means packages and kernels lag in favor of stability, which is the opposite of what I needed for bleeding-edge ROCm and AI stack components. Fedora's much faster release cadence aligns better with kyuz0's latest optimizations and the pace of development in this space.