Qwen 3.6 35a3

71 views
Nick Antonaccio (Admin)
May 09, 2026 at 17:23 (edited, 3 revisions)
#1

I'm really excited to see what the Qwen 3.6 35a3 MOE model can do for coding and agentic tasks on small consumer GPUs.

On my laptop with the mobile RTX 3080ti 16GB VRAM, using LM Studio default settings for the Cuda runtime, this model ran at:

  • 13 tokens per second with 8 bit quantization
  • 16 tokens per second with 6 bit quantization (6 bit is likely close to 8 bit in reliability for most purposes)
  • 24 tokens per second with 4 bit quantization

On my Strix Halo machines, using LM Studio default settings for the Vulkan 2.13.0 runtime:

  • 46 tokens per second with 8 bit quantization

At that speed, there's hardly a need to use a more heavily quantized version (Strix Halo is really turning out to be a great machine for the money, keeping in mind that it can handle much bigger models than Qwen 3.6 35a3).
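The tokens-per-second figures above are just tokens generated divided by wall-clock decode time. Here's a minimal sketch of measuring that yourself; the `fake_stream` generator is a stand-in for a real model call (in practice you'd wrap LM Studio's OpenAI-compatible streaming endpoint, whose client wiring is left out here):

```python
import time

def measure_tps(stream, prompt):
    """Time a token stream and report decode throughput.

    `stream` is any callable that yields tokens for a prompt. Here it's
    a stub; substitute your own client around LM Studio's local server.
    Returns (token_count, tokens_per_second).
    """
    start = time.perf_counter()
    n_tokens = sum(1 for _ in stream(prompt))
    elapsed = time.perf_counter() - start
    return n_tokens, (n_tokens / elapsed if elapsed > 0 else 0.0)

# Stub generator standing in for a real streaming model response:
def fake_stream(prompt):
    for tok in prompt.split():
        time.sleep(0.001)  # pretend each token takes time to decode
        yield tok

count, tps = measure_tps(fake_stream, "the quick brown fox jumps")
```

Note that this measures end-to-end throughput including prompt processing, so for short prompts it will roughly match the decode speeds LM Studio reports, while long prompts will pull the number down.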

Running a knowledge query with Qwen 3.6 35a3 connected to the Internet (in Jan) yielded truly great results. This little model can do very impressive research.

Disconnected from the Internet, Qwen 3.6 35a3 is not a 'world knowledge' model, but it immediately seems better than Qwen 3.5 35a3 at writing code, and my first tests indicate it also generally does better than Gemma 4 31b dense and 26b MOE for text-based tasks. I have a sense this may be the current best all-around task model for small local GPUs, especially for tasks that involve writing code. As a big bonus, Qwen 3.6 35a3 also supports images, audio, and video (though Qwen's Omni models have deeper multi-modal capabilities).

Over the next few weeks, I'm most inclined to test the Qwen 3.6 35a3 model, together with Pi & Nullclaw, on local lightweight GPUs. I'll put it up head-to-head against Gemma 4 31b and 26b, as well as GPT-OSS:120b and 20b. Those are the current leading players in the small-GPU LLM market, and I'm excited to see some very strong models that can run on sub-$1000 used laptops.

BTW, for 'world knowledge' on a small GPU, Nemotron 3 Super at IQ3_XXS is an impressive little self-contained encyclopedia (it only runs at 4.5 tps on that 3080ti mobile, but that's usable for knowledge lookup). GPT-OSS:120b (11.5 tps on the 3080ti mobile) and heavily quantized Minimax (iq3_s) are also good little knowledge LLMs on GPUs such as Strix Halo and DGX Spark. And of course, even an older small model like GPT-OSS:20b can do a great job researching knowledge, if it has access to the Internet (GPT-OSS:20b is blazing fast on those small GPUs and can do an impressive job with web research).

Nick Antonaccio (Admin)
Apr 27, 2026 at 04:52
#2

Update: the Qwen 3.6 35a3 and Gemma 4 26a4 MOE models have become my workhorse self-hosted LLMs for local software development.

This example was completed as a single task, entirely with Qwen 3.6 35a3 on a laptop with only a mobile 3080 with 16GB VRAM:

http://1y1z.com:5993

This was done with Qwen 3.6 35a3 and Gemma 4 26a4 MOE on a Strix Halo laptop:

http://1y1z.com:5994

Nick Antonaccio (Admin)
May 09, 2026 at 17:13 (edited, 2 revisions)
#3

OK, if you haven't tried the newest Qwen 3.6 and Gemma 4 models, you need to give them a shot. The Qwen 3.6 models feel like the first truly capable, generally useful LLMs for small GPUs.

I added some more demo examples created by Qwen 3.6 35a3 and Gemma 4 26a4 to the quick start at https://aibynick.com/thread/29

The more I use Qwen 3.6 to write actually useful code and to perform useful agentic tasks, the more it becomes blindingly obvious that this model is head and shoulders above any other small locally usable models out there. It's truly incredible what both of the Qwen 3.6 models can accomplish (35a3 is an MOE model, and 27b is a dense architecture). The 27b dense version is even more capable than 35a3, but it does need a fast GPU for anything but the smallest workloads. 27b is particularly useful for helping out on portions of tasks that 35a3 gets stuck on: use 35a3 to productively accomplish 95% of a workflow, then call in 27b (or some other big model) when extra help is needed.
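That escalation workflow (fast model first, bigger model only when stuck) can be sketched as a tiny dispatcher. The model callables and the "stuck" heuristic below are toy placeholders, not real clients; in practice each would wrap a request to a different model loaded in LM Studio:

```python
def run_with_fallback(task, fast_model, strong_model, is_stuck):
    """Try the fast local model first; escalate to the stronger
    (slower) model only when the draft looks stuck or unusable."""
    draft = fast_model(task)
    if is_stuck(draft):
        return strong_model(task)
    return draft

# Toy stand-ins: the fast model "fails" on hard tasks by returning nothing.
fast = lambda t: "" if "hard" in t else f"fast:{t}"
strong = lambda t: f"strong:{t}"
stuck = lambda out: out.strip() == ""

easy = run_with_fallback("easy job", fast, strong, stuck)  # -> "fast:easy job"
hard = run_with_fallback("hard job", fast, strong, stuck)  # -> "strong:hard job"
```

A real `is_stuck` check might look for empty output, repeated loops, or failing tests on generated code; the point is just that the cheap model handles most calls and the expensive one is only invoked on the remainder.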

The Qwen 35a3 version runs super quick even on old, small GPUs with little VRAM. That makes it really useful on smaller 3080 & 3090 class GPUs, on Apple Mac processors with less RAM, etc. I run it at both 8-bit and 6-bit quantization on the Strix Halo and DGX Spark machines, but it also seems to produce fantastic quality output at 4-bit quantization on smaller GPUs.

These demos were built with Qwen 3.6 35a3 on a Strix Halo laptop (using q6 quant):

And these were completed on a laptop with a lowly mobile RTX3080 with 16GB VRAM (using 4 bit quant):

That absolutely blows away the sort of output we got from much bigger models such as GPT-OSS:120b and many other models that require much beefier hardware. For example, this is the sort of disappointing HTML layout output that GPT-OSS:120b would produce (fine as a start, but you'd want to put in lots of additional work to polish up the styling for a public-facing web site):

The output of the minuscule local Qwen 3.6 35a3 model even approaches the quality of huge bleeding-edge models which require datacenter-class hardware to run. Here's an example made by mimo25pro (via OpenRouter):

http://1y1z.com:8284/demo-website-mimo25pro/

Those are pure visual layout examples, but the same quality improvement is apparent in logic and other classes of work. 35a3 is capable of producing fantastic Internet research results, for example.

Qwen tends to best even the local Gemma 4 models in both quality and performance speed, but the Gemma 4 versions (MOE 26a4b and dense 31b) are also no slouches. Here's a little dashboard/web site example:

http://1y1z.com:8284/flashy-site--gemma4-26ba4/

I keep the Gemma 4 versions, along with a stable of other bigger models loaded in LM Studio, to offer different perspectives on tasks. This includes some bigger models which run slowly even on DGX Spark and Strix Halo hardware (Qwen 3.5 122B A10B q5, Minimax 2.7 iq3, etc.), but I've been tending to prefer the output of the fast Qwen 3.6 MOE model to those large/slow alternatives.

Please login to post a reply.

© 2026 AI By Nick.