Anyway. I wouldn't recommend following the steps posted there. Poke around Google, or ask your friendly neighborhood LLM for advice on how to set up your Strix Halo laptop/desktop for the tasks described. A good resource to start with would probably be the Unsloth page for whichever model you are trying to run. (There are a few quantization groups competing for top place with GGUFs, and Unsloth is regularly at the top, with excellent documentation on inference, training, etc.)
Anyway, sorry to be harsh. I understand that this is just a blog for jotting down stuff you're doing, which is a great thing to do. I'm mostly just commenting on the fact that this is on the front page of hn for some reason.
For instance, the majority of desktops with DDR5 now have 4 channels rather than 2, but the channels are narrower (32-bit instead of 64-bit), so the total width of the memory interface is the same as before.
To avoid ambiguities, one should always write the width of the memory interface.
Most desktop computers and laptop computers have 128-bit memory interfaces.
The cheapest desktop computers and laptop computers, e.g. those with Intel Alder Lake N/Twin Lake CPUs, and also many smartphones and Arm-based SBCs, have 64-bit memory interfaces.
Cheaper smartphones and Arm-based SBCs have 32-bit memory interfaces.
Strix Halo and many older workstations and many cheaper servers have 256-bit memory interfaces.
High-end servers and workstations have 768-bit or 512-bit memory interfaces.
It is expected that future high-end servers will have 1024-bit memory interfaces per socket.
GPUs with private memory usually have memory interfaces between 192-bit and 1024-bit, but newer consumer GPUs usually have narrower memory interfaces than older consumer GPUs, to reduce cost. The narrower interface is compensated by faster memory, so the available bandwidth in consumer GPUs has grown much more slowly than the increase in GDDR memory speed alone would have allowed.
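To see why the interface width matters: peak theoretical bandwidth is just width times transfer rate. A quick sketch (decimal GB/s; sustained real-world numbers are lower):

```python
def peak_bandwidth_gbs(bus_width_bits, mega_transfers_per_s):
    # bytes/s = (width in bytes) * (transfers/s); reported in decimal GB/s
    return bus_width_bits / 8 * mega_transfers_per_s * 1e6 / 1e9

# 128-bit DDR5-5600 desktop:          ~89.6 GB/s
# 256-bit LPDDR5X-8000 (Strix Halo):  ~256 GB/s, the oft-quoted figure
print(peak_bandwidth_gbs(128, 5600))
print(peak_bandwidth_gbs(256, 8000))
```

This is also why doubling channel count while halving channel width (as DDR5 did) leaves peak bandwidth unchanged at a given transfer rate.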
They are higher quality than the quants you can make yourself due to their imatrix datasets and selective quantisation of different parts of the model.
For Qwen 3.5 Unsloth did 9 terabytes of quants to benchmark the effects of this:
In short, even lower quants leave some layers at the original precision, and llama.cpp, in its endless wisdom, does no conversion when loading weights and checking what your card supports. So every time you run inference, it hits a brick wall when there's no bf16 acceleration and has to convert to fp16 (or something else) on the fly, which can literally cut tg by half or more. I've seen fp16 models literally run faster than Q8 on Arc despite being twice the size with the same bandwidth, and it's expectedly similar [0] on AMD.
Models used to be released as fp16, which was fine. Then Gemma shipped native bf16, and Bartowski initially came up with a compatibility workaround: convert bf16 to fp32, then to fp16, and quantize from that. Most models are released as bf16 these days, though, and Bartowski has given up on doing that (while Unsloth never did it to begin with). So if you do want max speed, you pretty much have to make static quants yourself and follow the same multi-step process to remove all the bf16 weights from the model. I don't get why this can't be done once at model load, but this is what we've got.
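For what it's worth, the load-time conversion being asked for is cheap in principle: bf16 is just the top 16 bits of an IEEE-754 float32, so widening it is a bit shift, not real arithmetic. A minimal sketch in NumPy (which has no native bf16 dtype, so the raw bits are carried as uint16):

```python
import numpy as np

def bf16_to_fp32(bits):
    """Widen raw bf16 bit patterns (uint16) to float32 by shifting
    them into the top half of a 32-bit word."""
    return (bits.astype(np.uint32) << 16).view(np.float32)

def fp32_to_bf16(values):
    """Truncate float32 values to bf16 bit patterns. This rounds toward
    zero; real converters use round-to-nearest-even, omitted for brevity."""
    return (values.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)
```

A one-time pass like this over bf16 tensors at load would trade a few seconds of startup for avoiding the per-inference conversion penalty described above.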
[0] https://old.reddit.com/r/LocalLLaMA/comments/1r0b7p8/free_st...
I do not know about its GPU, which might have only FP16.
So it is likely that the right inference strategy would be to run any BF16 computations on the Strix Halo CPU, while running the quantized computations on its GPU.
[0] https://www.amd.com/en/developer/resources/technical-article...
Call me traditional, but I find it a bit scary for my BIOS to be connecting to WiFi and doing the downloading. It makes me wonder whether the new BIOS blob would be secure: did the BIOS connect securely over HTTPS? Did it check the appropriate hash/signature, etc.? I would suppose all of this is harder to do in the BIOS. I would expect better security if this were done in user space in the OS.
I'd much prefer if the OS did the actual downloading, with the BIOS just doing the installation of the update.
BIOSes recent enough to connect to the Internet normally also have the option to use a USB memory from inside the BIOS setup.
Some motherboards can update the BIOS from a USB memory even without a CPU in the socket.
If something is "standard" nowadays, does that mean it is the right way to go?
One of my main issues is that this means your BIOS has to have a WiFi software stack in it, have a TLS stack in it etc. Basically millions of lines of extra code. Most of it in a blob never to be seen by more than a few engineers.
Though in another way, allowing the BIOS to perform self-updates is good, because it doesn't matter whether you've installed FreeBSD, OpenBSD, Linux, Windows, or any other OS: you will still be able to update your BIOS.
- gemma4-31b normal q8 -> 5.1 t/s
- gemma4-31b normal q16 -> 3.7 t/s
- gemma4-31b distil q16 -> 3.6 t/s
- gemma4-31b distil q8 -> 5.7 t/s (!)
- gemma4-26b-a4b ud q8kxl -> 38 t/s (!)
- gemma4-26b-a4b ud q16 -> 12 t/s
- gemma4-26b-a4b cl q8 -> 42 t/s (!)
- gemma4-26b-a4b cl q16 -> 12 t/s
- qwen3.5-35b-a3b-UD@q6_k -> 52 t/s (!)
- qwen3.5-35b-a3b-uncensored-hauhaucs-aggressive@q8_0 -> 34 t/s (!)
- qwen3.5-35b-a3b-uncensored-hauhaucs-aggressive@bf16 -> 11 t/s
- qwen3.5-27b-claude-4.6-opus-reasoning-distilled-v2 q8 -> 8 t/s
- qwen3.5-122b-a10b MXFP4 MoE (q4) -> 11 t/s
- qwen3.5-122b-a10b-uncensored-hauhaucs-aggressive (q6) -> 10 t/s
prompt eval time = 315.66 ms / 221 tokens ( 1.43 ms per token, 700.13 tokens per second)
eval time = 1431.96 ms / 58 tokens ( 24.69 ms per token, 40.50 tokens per second)
total time = 1747.62 ms / 279 tokens
With reasoning enabled, it's about a quarter or a fifth of that performance: quite a lot slower, but still reasonably comfortable to use interactively. The dense model is even slower. For some reason, Gemma 4 is pretty slow on the Strix Halo with reasoning enabled, compared to other models of similar size. It reasons really hard, I guess. I don't understand what makes models of similar size slower or faster; it surprised me.
Qwen 3.5 and 3.6, in similarly sized MoE versions at 8-bit quantization, are notably faster on this hardware. If I were using Gemma 4 31B with reasoning interactively, I'd use a smaller 6-bit or even 5-bit quantization to speed it up to something sort of comfortable to use, because it is dog slow at 8-bit quantization, but shockingly smart and effective for such a small model.
Edit: Here are some benchmarks that feel right, based on my own experience: https://kyuz0.github.io/amd-strix-halo-toolboxes/
It looks like context is set to 32k, which is the bare minimum needed for OpenCode with its ~10k initial system prompt. So overall, something like Unsloth's UD q8 XL or q6 XL quants would free up a lot of memory and bandwidth, moving this into the next tier of usefulness.
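To put rough numbers on the memory side, here's a back-of-the-envelope sketch of weight-only footprint (it deliberately ignores KV cache and activations, which add several more GB at 32k context):

```python
def weight_footprint_gb(params_billions, bits_per_weight):
    # Weight-only footprint: parameters * bits per weight, in decimal GB.
    # Ignores KV cache, activations, and runtime overhead.
    return params_billions * bits_per_weight / 8

# A 31B model: ~31 GB at 8-bit, ~23 GB at 6-bit. On a 128 GB Strix
# Halo, the ~8 GB saved goes straight to a bigger context window.
print(weight_footprint_gb(31, 8))
print(weight_footprint_gb(31, 6))
```

Less data read per token also means proportionally higher t/s on a bandwidth-bound machine, which is the other half of the win from dropping a quant tier.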
The industry looks like it's started to move towards Vulkan. If AMD cards have figured out how to reliably run compute shaders without locking up (never a given in my experience, though that was some time ago), then there shouldn't be a reason to use specialty APIs or software written by AMD outside of drivers.
ROCm was always a bit problematic, but the issue was that if AMD's cards weren't good enough for AMD's engineers to reliably support tensor multiplication, then there was no way anyone else was going to be able to do it. It isn't as if anyone is confused about multiplying matrices together; it isn't for everyone, but the naive algorithm is a core undergrad topic, and the advanced algorithms surely aren't that crazy to implement. It was never a library problem.
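For reference, the naive algorithm in question really is this small; the hard part on a GPU is memory tiling and scheduling, not the math. The textbook triple loop, in plain Python:

```python
def matmul(a, b):
    """Naive matrix multiply of a (n x k) by b (k x m), returning n x m."""
    n, k, m = len(a), len(b), len(b[0])
    assert len(a[0]) == k, "inner dimensions must match"
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c
```

Fast GPU kernels compute exactly these sums; they just reorder the loops into tiles so each operand is fetched from slow memory as few times as possible.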
You can use Vulkan instead of ROCm on Radeon GPUs, including on the Strix Halo (and for a while, Vulkan was more likely to work on the Strix Halo, as ROCm support was slow to arrive and stabilize), but you need something that talks to the GPU.
Current ROCm, 7.2.1, works quite well on the Strix Halo. Vulkan does, too. ROCm tends to be a little faster, though; not always, but mostly. People used to benchmark to figure out which was best for a given model/workload, but now I think most folks just assume ROCm is the better choice and use it exclusively. That's what I do, though I did find Gemma 4 wouldn't work on ROCm for a little while after release (I think that was a llama.cpp issue, though).
But we already have software that talks to the GPU; mesa3d and the ecosystem around that. It has existed for decades. My understanding was that the main reasons not to use it was that memory management was too complicated and CUDA solved that problem.
If memory gets unified, what is the value proposition of ROCm supposed to be over mesa3d? Why does AMD need to invent some new way to communicate with GPUs? Why would it be faster?
CUDA is a proprietary Nvidia product. CUDA solved the problem for Nvidia chips.
On AMD GPUs, you use ROCm. On Intel, you use OpenVINO. On Apple silicon you use MLX. All work fine with all the common AI tasks you'd want to do on self-hosted hardware. CUDA was there first and so it has a more mature ecosystem, but, so far, I've found 0 models or tasks I haven't been able to use with ROCm. llama.cpp works fine. ComfyUI works fine. Transformers library works fine. LM Studio works fine.
Unless you believe Nvidia having a monopoly on inference or training AI models is good for the world, you can't oppose all the other GPU makers having a way for their chips to be used for those purposes. CUDA is a proprietary vendor-specific solution.
Edit: But, also, Vulkan works fine on the Strix Halo. It is reliable and usually not that much slower than ROCm (and occasionally faster, somehow). Here are some benchmarks: https://kyuz0.github.io/amd-strix-halo-toolboxes/
That has been one of the big themes in GPU hardware since around 2010, when AMD committed to the ATI approach. Nvidia tried to solve the memory management problem in the software layer; AMD committed to doing it in hardware. Software was the better bet by around a trillion dollars so far, but if the hardware solutions have finally come to fruition, then why the focus on ROCm?
And the memory barriers? How do you sync up the L1/L2 cache of a CPU core with the GPU's cache?
Exactly. With a ROCm memory barrier, letting the CPU and GPU work in parallel while also providing a mechanism for synchronization.
GPU and CPU can share memory, but they do not share caches. You need programming effort to make ANY of this work.
I'll give a specific example in my feedback. You said:
```
so far, so good, I was able to play with PyTorch and run Qwen3.6 on llama.cpp with a large context window
```
But there are no numbers, results, or output pastes: no performance figures or timings.
Anyone with enough RAM can run these models; it will just be impractically slow. The Strix Halo is about getting decent performance, so sharing your numbers would be valuable here.
Do you mind sharing these? Thanks!