Qwen 3.6 35a3

Nick Antonaccio (Admin)
Apr 20, 2026 at 18:09
#1

I'm really excited to see what the Qwen 3.6 35a3 MOE model can do for coding and agentic tasks on small consumer GPUs.

On my laptop with a mobile RTX 3080 Ti (16GB VRAM), using LM Studio's default settings for the CUDA runtime, this model ran at:

  • 13 tokens per second with 8-bit quantization
  • 16 tokens per second with 6-bit quantization (6-bit is likely just as reliable as 8-bit for most purposes)
  • 24 tokens per second with 4-bit quantization

On my Strix Halo machines, the 8-bit quant ran at 46 tokens per second, using LM Studio's default settings for the Vulkan 2.13.0 runtime (Strix Halo is really turning out to be a great machine for the money, especially since it can handle much bigger models than Qwen 3.6 35a3).
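
If you want to sanity-check numbers like these on your own hardware, a rough sketch along these lines works against LM Studio's OpenAI-compatible local server (this assumes the default port, 1234, and the model identifier is a placeholder; use whatever name your LM Studio build actually lists):

  # Rough generation-throughput check against LM Studio's local server.
  # Timing is end-to-end (includes prompt processing), so it slightly
  # understates pure generation speed.
  import time
  import requests

  URL = "http://localhost:1234/v1/chat/completions"
  MODEL = "qwen3.6-35a3"  # placeholder id; check your LM Studio model list

  payload = {
      "model": MODEL,
      "messages": [{"role": "user", "content": "Write bubble sort in Python."}],
      "max_tokens": 256,
  }

  start = time.time()
  resp = requests.post(URL, json=payload, timeout=300)
  resp.raise_for_status()
  elapsed = time.time() - start

  # OpenAI-style responses report completion token counts in "usage"
  tokens = resp.json()["usage"]["completion_tokens"]
  print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")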

Running a knowledge query with Qwen 3.6 35a3 connected to the Internet (in the Jan app) yielded truly great results. This little model can do very impressive research.

Disconnected from the Internet, Qwen 3.6 35a3 is not a 'world knowledge' model, but it immediately seems to be better than Qwen 3.5 35a3 at writing code, and my first tests indicate it also generally does better than Gemma 4 31b dense and 26b MOE for text-based tasks. I have a sense this may be the current best all-around task model for small local GPUs, especially for tasks that involve writing code. As a big bonus, Qwen 3.6 35a3 also supports images, audio, and video (though Qwen's Omni models have deeper multi-modal capabilities).

Over the next few weeks, I'm most inclined to test the Qwen 3.6 35a3 model, together with Nullclaw, on local lightweight GPUs. I'll put it up head-to-head against Gemma 4 31b and 26b, as well as GPT-OSS:120b and 20b. Those are the current leading players in the small-GPU LLM market, and I'm excited to see some very strong models that can run on sub-$1000 used laptops.
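
The head-to-head harness doesn't need to be fancy; a sketch like this is roughly the shape of it: the same prompts run against several models through one OpenAI-compatible endpoint (the model ids here are placeholders, not the exact names a given runtime will report):

  # Tiny head-to-head harness sketch: same prompts, several local models,
  # one OpenAI-compatible endpoint. Model ids below are placeholders.
  import requests

  URL = "http://localhost:1234/v1/chat/completions"
  MODELS = ["qwen3.6-35a3", "gemma4-31b", "gpt-oss-20b"]  # placeholder ids
  PROMPTS = [
      "Write a Python function that parses an ISO 8601 date string.",
      "Explain the tradeoffs between MOE and dense transformer models.",
  ]

  for model in MODELS:
      for prompt in PROMPTS:
          resp = requests.post(URL, json={
              "model": model,
              "messages": [{"role": "user", "content": prompt}],
              "max_tokens": 512,
          }, timeout=600)
          resp.raise_for_status()
          text = resp.json()["choices"][0]["message"]["content"]
          # Print the first couple hundred characters for a quick skim
          print(f"--- {model} ---\n{text[:200]}\n")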

BTW, for 'world knowledge' on a small GPU, Nemotron 3 Super at IQ3_XXS is an impressive little self-contained encyclopedia (it only runs at 4.5 tps on that 3080ti mobile, but that's usable for knowledge lookup). GPT-OSS:120b (11.5 tps on the 3080ti mobile) and heavily quantized Minimax are also good little knowledge LLMs on small GPUs. And of course, even GPT-OSS:20b can do a great job researching knowledge, if it has access to the Internet (GPT-OSS:20b is blazing fast on those small GPUs and can do an impressive job with web research).

Nick Antonaccio (Admin)
Apr 27, 2026 at 04:52
#2

Update: the Qwen 3.6 35a3 and Gemma 4 26a4 MOE models have become my workhorse self-hosted LLMs for local software development.

This example was completed as a single task, entirely with Qwen 3.6 35a3, on a laptop with only a mobile RTX 3080 (16GB VRAM):

http://1y1z.com:5993

This was done with Qwen 3.6 35a3 and Gemma 4 26a4 MOE on a Strix Halo laptop:

http://1y1z.com:5994

Nick Antonaccio (Admin)
May 09, 2026 at 15:56
#3

I added some more demo examples created by Qwen 3.6 35a3 to the quick start at https://aibynick.com/thread/29

The more I use this model to write actually useful code and to perform useful agentic tasks, the more it becomes just blindingly obvious that this model is head and shoulders above any other small local model out there. It's truly incredible what both of the Qwen 3.6 models can accomplish (35a3 is an MOE model, and 27b is a dense architecture). The 27b dense version is even more capable than 35a3, but it does need a fast GPU for anything but the smallest workloads. It's particularly useful for helping out with portions of a task that 35a3 gets stuck on: use 35a3 to do 95% of a workflow, then call in 27b when extra help is needed.
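
In practice, that escalation pattern can be as simple as this sketch (endpoint and model ids are placeholders, and the "stuck" check is just an illustration; substitute whatever criteria fit your workflow):

  # Minimal sketch of the "35a3 first, escalate to 27b" pattern, again via
  # an OpenAI-compatible endpoint (LM Studio, llama.cpp server, etc.).
  import requests

  URL = "http://localhost:1234/v1/chat/completions"

  def ask(model: str, prompt: str) -> str:
      resp = requests.post(URL, json={
          "model": model,
          "messages": [{"role": "user", "content": prompt}],
          "max_tokens": 1024,
      }, timeout=600)
      resp.raise_for_status()
      return resp.json()["choices"][0]["message"]["content"]

  def solve(prompt: str) -> str:
      # The fast MOE model handles the bulk of the work...
      answer = ask("qwen3.6-35a3", prompt)  # placeholder model id
      # ...and the slower dense model is only called in when the first
      # attempt looks stuck (placeholder heuristic; use your own checks).
      if "not sure" in answer.lower() or len(answer) < 40:
          answer = ask("qwen3.6-27b", prompt)  # placeholder model id
      return answer

  print(solve("Write a CLI todo app in Python."))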

The 35a3 version runs super quickly even on old, small GPUs with little VRAM. That makes it really useful on smaller 3080- and 3090-class GPUs, on Apple Silicon Macs with less RAM, etc.

These were built with Qwen 3.6 35a3 on a Strix Halo laptop:

And these were completed on a laptop with a lowly mobile RTX 3080 with 16GB VRAM:

That absolutely blows away the sort of output we got from much bigger models such as GPT-OSS:120b, which require much beefier hardware. For example, this is the sort of disappointing output that GPT-OSS:120b would produce (fine as a start, but you'd want to put in lots of additional work for a public-facing site):

The output of that tiny Qwen 3.6 35a3 model even approaches the quality of huge bleeding-edge models which require datacenter-class hardware to run. Here's an example made by mimo25pro (on OpenRouter):

http://1y1z.com:8284/demo-website-mimo25pro/

It's incredible to see Qwen besting the Gemma 4 models in both quality and speed. If you haven't tried the newest Qwen models, you need to give them a shot. I keep a stable of other models loaded to offer different perspectives - especially some bigger models that run slowly even on DGX Spark and Strix Halo hardware (Qwen 3.5 122B A10B q5, Minimax 2.7 iq3, etc.) - but I've been tending to prefer the output of the Qwen 3.6 models, even compared to those very large and slow alternatives. The Qwen 3.6 models feel like the first truly capable, generally useful LLMs for small machines.
