I love the ASUS ROG Flow Z13 Strix Halo Laptop

Nick Antonaccio
May 09, 2026 at 03:09 (edited, 9 revisions)
#1

NOTE from future Nick: the ASUS GX10 (DGX Spark) is better for agentic workflows. See https://aibynick.com/thread/31 TL;DR: Apple Mac laptops cost at least twice as much as Strix Halo laptops for the same 128GB of shared memory, so the price/performance of the ASUS ROG Flow Z13 is hard to beat. But if you're building a self-hosted GPU server for long-context workflows, the prompt processing speed of the Nvidia GB10 DGX Spark machines, and the ability to cluster multiple machines easily, is worth the extra $1000.

I got another ASUS ROG Flow Z13 laptop today. For $2,572.59 on Amazon, you get a complete AMD Ryzen AI MAX+ 395 (Strix Halo) machine - not just a GPU - with 128GB of LPDDR5X 8000MHz shareable RAM. That price is for a 'used like new' machine (which is what I bought), or you can buy a pristine new one for $2707, but I think buying brand new is just a wasted few hundred dollars (even new, it's still more than $1000 less than any machine that can run anything close to comparably sized LLM models).

For that price I'm willing to deal with the small 1TB PCIe Gen 4 SSD. That's big enough to load up all the models I want to use.

Out of the box, this thing runs some really capable LLM models (70b dense and 120b MOE class) quickly enough to be very useful for real work. The laptop is small and light enough to carry easily in a little backpack or briefcase, for totally offline inference tasks while traveling, or anywhere no Internet is available.

Just download LM Studio, Jan, Ollama, or KoboldCPP, etc.

For coding and agentic tasks, download the following models - you can run all of these at 8-bit quantization on the Strix Halo!:

  • Gemma4: 31B dense and 26B MOE models (and smaller versions for speed)

  • Qwen 3.5: 122B-A10B MOE and 27B dense models (and 35B-A3B MOE for speed)

  • GPT-OSS: 120b (and 20b for speed)

For knowledge work, try highly quantized versions of Nemotron Super and Minimax 2.5.

You're able to work completely offline with those models. They're legitimately useful, not just toys.
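Once one of those apps is serving a model locally, you can script against it. LM Studio, Ollama, Jan, and KoboldCPP all expose an OpenAI-compatible chat endpoint; this is a minimal sketch assuming LM Studio's default local port (1234) - adjust the URL and model tag for whichever app and model you actually loaded:

```python
import json
import urllib.request

# Assumption: a local inference app (e.g. LM Studio) serving an
# OpenAI-compatible /v1/chat/completions endpoint on its default port.
LOCAL_ENDPOINT = "http://localhost:1234/v1/chat/completions"

def build_chat_request(model: str, prompt: str, temperature: float = 0.7) -> dict:
    """Build an OpenAI-style chat completion payload for a local server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def ask_local_model(model: str, prompt: str) -> str:
    """POST the request to the local server and return the reply text."""
    payload = json.dumps(build_chat_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        LOCAL_ENDPOINT, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (requires a model loaded in your local app; tag is illustrative):
# print(ask_local_model("gpt-oss-20b", "Summarize my packing checklist."))
```

This works fully offline, since the request never leaves localhost.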

Nick Antonaccio
Apr 07, 2026 at 19:18 (edited, 3 revisions)
#2

Here are some notes from the old rebolforum related to the ASUS ROG Flow Z13, in comparison to other machines I have with little consumer Nvidia GPUs:

2026-03-19 23:15:39

I've got a couple of laptops with RTX 3080 Ti mobile GPUs (16GB VRAM), which run all the commonly used smaller local models, such as GPT-OSS:20b, just as well as my tower with an RTX 3090.

My current favorite machine, though, is a little Strix Halo laptop with 128GB shareable RAM. There are piles of these ASUS ROG Flow Z13s available for around $2500 (NEW, even with the current RAM-shortage prices!). There's no machine close to this price range that can run such big models (for example, dense models in the 70 billion parameter range, and MOE models in the 120 billion range). You can bring one of these little laptops on an airplane, or use it while camping, or anywhere without Internet access, and do some serious inference work. Even a model like Qwen3.5-122B-A10B gets 20+ tokens per second on this machine, in low power mode. That's an amazing amount of power for far less money than anything else!

I think they've gotten a bad rap because when they first came out the software drivers weren't ready for the mainstream, and people had problems, so now there are tons of bad reviews online. But that's not the current situation. Get one out of the box, install any of the common inference apps (LM Studio, Ollama, Jan, Koboldcpp, etc.), download some models, and it just works - at about 1/4 the price of a single RTX 6000 Pro GPU alone (not a full server machine, just the GPU), for the same amount of GPU memory. You can buy small-form-factor Strix Halo desktop units, but those are getting to be more expensive than the ASUS ROG Flow Z13, without a monitor, keyboard, mouse, etc. - and they aren't mobile laptop units, which is where this machine really shines (real disconnected inference capability while you travel away from the Internet).

2026-03-21 06:32:26

Compared to any of my machines with Nvidia GPUs, the Strix Halo runs small models more slowly. For example, qwen3.5:35b-a3b 4bit runs at ~70 tokens per second on my RTX 3090, and ~50 tokens per second on the Strix Halo. My Strix Halo laptop is also a bit slower to load a new model, but I think that's because the ASUS ROG Flow Z13 likely has a less performant SSD (the ROG laptop model I'm using only has a 1TB drive; the Nvidia machines all have at least 2TB). In practice, everything about the Strix Halo is snappy. 50 tps is not slow by any means, and the real benefit is that you can comfortably run bigger models like MOE models with 100 billion+ parameters, and/or smaller models at higher bit precision.

Running qwen3.5:122b-a10b at 20+ tps on a portable local machine that sips power is fantastic. I've even run Minimax 2.5 and Nemotron Super at very low precision, and they retain a surprising amount of useful knowledge, even at those very low precisions.

I'm thinking of getting a few more of the Strix Halo machines just because prices are going up so dramatically on all the other options. I'd have to buy a used laptop with an Nvidia GPU that has 16-24GB VRAM for the same price as a new Strix Halo laptop that has 128GB unified memory. For me that battle is easily won by the Strix Halo. The Apple machines are at least twice as much for something in the same ballpark (and I'm not a big Mac fan).

2026-03-21 13:10:00

I haven't tried it yet, but I think Nvidia GPUs will likely perform much better in situations where more than one user is performing inference simultaneously (meaning, I wouldn't plan to build a multi-user inference server with a Strix Halo machine).

2026-03-21 15:09:43

I should point out that the Strix Halo has 256 GB/s memory bandwidth, the Apple M3 Max has 400 GB/s, and the M4 Max has 546 GB/s (beware that the Apple Max versions are far more performant than the Pro versions of the same name). The DGX Spark has 273 GB/s, but it tends to perform about twice as fast as the Strix Halo because it has specialized hardware accelerators for 4bit and 8bit quants, more mature software (CUDA vs ROCm), more powerful prefill processing, faster image generation, etc.
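Those bandwidth numbers can be sanity-checked with a back-of-the-envelope formula: token generation on these machines is mostly memory-bound, so the theoretical ceiling on tokens per second is roughly bandwidth divided by the bytes of active weights streamed per token. A minimal sketch, with the caveat that real throughput lands well below this ceiling (KV cache reads, activations, and overhead all eat into it):

```python
def decode_tps_upper_bound(bandwidth_gb_s: float, active_params_b: float, bits: int) -> float:
    """Memory-bound decode ceiling: bandwidth / bytes read per token.
    active_params_b is in billions; bytes per token ~= active params * bits/8."""
    bytes_per_token_gb = active_params_b * bits / 8  # GB streamed per generated token
    return bandwidth_gb_s / bytes_per_token_gb

# Strix Halo (256 GB/s) running a 122B-A10B MOE at 4-bit:
# only the ~10B active params are read per token.
print(round(decode_tps_upper_bound(256, 10, 4), 1))  # → 51.2
```

A ~51 tps theoretical ceiling is consistent with the 20+ tps observed in practice on the 122B-A10B model, and it also shows why MOE models are the sweet spot here: a 70B dense model at 4-bit would have a ceiling of only ~7 tps on the same bandwidth.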

I think everybody just looks at all those numbers and figures the Strix Halo can't realistically be very useful, but for the price, I'm very surprised at how well that little ASUS ROG Flow Z13 laptop can do some actually productive inference work. The new 3.5 models from Qwen really help to make it more useful, and I'm expecting smarter and more reliable small models to continue to be developed, which should make less powerful hardware more and more useful in general. The 128GB RAM is nowhere near big enough for frontier models like Kimi or GLM, but Qwen3.5-122B-A10B is pretty darn smart for agentic tool calling roles, Qwen3-Coder-Next can get a lot of actual coding tasks completed, and the super-quantized versions of Minimax 2.5 and Nemotron Super contain an amazing amount of knowledge - even obscure info.

For example, I can submit queries to very low precision Minimax and Nemotron versions about important people in paramotoring, popular wing and engine brands, and questions about jam.py - topics which none of the smaller models know much about, if anything - and those very quantized versions approach the sorts of expectations we've gotten used to with all-knowing trillion parameter frontier models. You'll never be able to fit all human knowledge into a 200 billion parameter model, but it's very impressive how a low quant of a very large model like Minimax 2.5 enables lots of information to be stored in a model that is less than 80GB on disk.

Smaller models still need at least 4bit quantization, and those same models make far fewer mistakes at 8bit quantization. I typically expect a larger parameter model to perform better at 4bit quant than a smaller model at 8bit quant, given the same size on disk. That's why I'm happy with the Strix Halo - it can run the models that are currently very reliable - especially 100B+ parameter MOE models at 4bit. That's a useful sweet spot.
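The same-size-on-disk comparison is easy to make concrete: weight storage is roughly parameter count times bits per weight, divided by 8. A quick sketch (ignoring quantization block overhead and embedding tables, so real GGUF files run somewhat larger):

```python
def approx_size_gb(params_billions: float, bits: int) -> float:
    """Approximate weight size on disk: params * bits / 8 bytes, in GB."""
    return params_billions * bits / 8

# Same footprint, different tradeoff: a 70B model at 4-bit and a
# 35B model at 8-bit both land around 35 GB on disk.
print(approx_size_gb(70, 4))   # → 35.0
print(approx_size_gb(35, 8))   # → 35.0
# And a 120B-class MOE at 4-bit (~60 GB) fits comfortably in 128GB RAM.
print(approx_size_gb(120, 4))  # → 60.0
```

That ~60 GB figure is why the 96GB GPU memory allocation on the Strix Halo is enough headroom for the 100B+ MOE models plus context.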

By the way, the good old standby GPT-OSS:20b is still very useful for gathering information with web search and performing tool calls - even on smaller GPUs. That thing is a workhorse for cheap GPUs. I want to try it with the Intel Arc Pro B50 16GB GPU that costs ~$350 (in fact I'd love to build a system with 4 of those GPUs, if it turns out they work well together in current llama.cpp). But I think qwen3.5:35b-a3b 4bit and others coming are stronger contenders for all the little consumer GPUs.

Nick Antonaccio
Apr 10, 2026 at 14:40
#3

Here's a video review of this laptop:

https://www.youtube.com/watch?v=49AMhhzVJiw

Nick Antonaccio
Apr 10, 2026 at 14:50
#4

This machine has been around quite a while (that review video was posted 11 months ago), but they haven't gone up as much in price as other options - and it was already a great price/performance option back before the RAM apocalypse. I think most people are just looking away from anything that isn't Nvidia.

Nick Antonaccio
Apr 12, 2026 at 18:18 (edited, 2 revisions)
#6

A few things about the basics of using Strix Halo:

  • Memory usage is shared between the CPU and GPU. Depending on your system, you can change these settings in the BIOS and/or in software that ships with your unit. I typically use the ASUS Armoury Crate software to set GPU usage to 96GB (this shows up as ~108GB GPU VRAM in LM Studio).

  • MOST IMPORTANT: Use ROCm llama.cpp instead of Vulkan, if you have any problems with Vulkan. I've noticed that ROCm typically works much faster, can access more GPU memory, and has fewer issues in general. In LM Studio, click Settings -> Runtime -> download the ROCm runtime, and select it to be used for GGUF.

  • Reduce the default number of layers offloaded to the GPU if a model doesn't load (I tend to limit GPU use to 66GB or less).

  • I've heard that disabling 'Try mmap()' is sometimes necessary if you have issues loading a model.

  • I also updated drivers from https://www.amd.com/en/support/download/drivers.html (not sure if this is needed, but the system sent me a notification to do it).
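For the layer-offload tip above, a rough way to pick a GPU layer count is to divide your GPU memory budget by the approximate per-layer weight size. This is a hypothetical helper - the model size and layer count below are illustrative, not measured from any specific GGUF, and it ignores KV cache growth:

```python
import math

def layers_that_fit(model_size_gb: float, total_layers: int, gpu_budget_gb: float) -> int:
    """Estimate how many transformer layers fit in a GPU memory budget,
    assuming weights are spread roughly evenly across layers."""
    per_layer_gb = model_size_gb / total_layers
    return min(total_layers, math.floor(gpu_budget_gb / per_layer_gb))

# e.g. a ~60 GB model with 48 layers against a 66 GB budget: all layers fit.
print(layers_that_fit(60, 48, 66))  # → 48
# The same model against a tighter 40 GB budget:
print(layers_that_fit(60, 48, 40))  # → 32
```

If a model still fails to load at the computed count, drop a few more layers to leave room for context.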

Nick Antonaccio
Apr 17, 2026 at 21:28 (edited, 2 revisions)
#7

Testing the most recent version of Vulkan in LM Studio on the Strix Halo, the previous issues seem to be fixed, and performance is now actually better with it than with the ROCm runtime for some models. I expect issues like this will continue to evolve!

Nick Antonaccio
May 06, 2026 at 15:19
#8

I'm now using the Vulkan runtime for all models. It's currently significantly faster than ROCm, and it has been 100% stable in the most recent releases.


© 2026 AI By Nick.