Post History

Current version by Nick Antonaccio

Current Version: May 09, 2026 at 17:23

I'm really excited to see what the Qwen 3.6 35a3 MOE model can do for coding and agentic tasks on small consumer GPUs.

On my laptop with the mobile RTX 3080 Ti (16 GB VRAM), using LM Studio default settings for the CUDA runtime, this model ran at the following speeds (a quick way to reproduce these numbers is sketched after the list):

  • 13 tokens per second with 8-bit quantization
  • 16 tokens per second with 6-bit quantization (6-bit is likely about as reliable as 8-bit for most purposes)
  • 24 tokens per second with 4-bit quantization
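
For anyone who wants to reproduce these numbers outside the LM Studio UI, here's a minimal timing sketch in Python against LM Studio's local OpenAI-compatible server (default http://localhost:1234/v1). The model identifier and prompt are placeholders; use whatever name LM Studio reports for the quant you have loaded.

```python
# Rough tokens-per-second check against LM Studio's local OpenAI-compatible
# server (default http://localhost:1234/v1). Requires: pip install openai
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="qwen3.6-35a3",  # placeholder; use the identifier LM Studio shows
    messages=[{"role": "user", "content": "Write bubble sort in Python."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start

tokens = resp.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")
# Wall time includes prompt processing, so this slightly understates pure
# generation speed compared with LM Studio's own readout.
```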

On my Strix Halo machines, using LM Studio default settings for the Vulkan 2.13.0 runtime:

  • 46 tokens per second with 8-bit quantization

At that speed, there's hardly any need for a more heavily quantized version (Strix Halo is really turning out to be a great machine for the money, keeping in mind that it can handle much bigger models than Qwen 3.6 35a3).
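
Some back-of-the-envelope arithmetic explains the gap between the two machines: weight size is roughly total parameters times bits per weight divided by 8, so none of these quants fit entirely in 16 GB of laptop VRAM, while all of them fit easily in Strix Halo's unified memory. (The "35a3" naming suggests roughly 35B total parameters with only about 3B active per token, which is presumably why the laptop stays usable even with weights spilling into system RAM.)

```python
# Back-of-the-envelope weight footprint for a ~35B-parameter model. This
# ignores KV cache and runtime overhead, and real GGUF quants mix bit
# widths, so treat the results as rough estimates only.
PARAMS = 35e9  # assumed total parameter count from the "35a3" naming

for bits in (8, 6, 4):
    gb = PARAMS * bits / 8 / 1e9
    print(f"{bits}-bit: ~{gb:.0f} GB of weights")

# ~35 GB, ~26 GB, and ~18 GB respectively: all larger than 16 GB of VRAM
# (hence partial offload on the laptop), all comfortably inside Strix
# Halo's unified memory.
```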

Running a knowledge query with Qwen 3.6 35a3 connected to the Internet (in Jan) yielded truly great results. This little model can do very impressive research.

Disconnected from the Internet, Qwen 3.6 35a3 is not a 'world knowledge' model, but it immediately seems better than Qwen 3.5 35a3 at writing code, and my first tests indicate it also generally does better than Gemma 4 31b dense and 26b MOE at text-based tasks. I have a sense this may be the current best all-around task model for small local GPUs, especially for tasks that involve writing code. As a big bonus, Qwen 3.6 35a3 also supports images, audio, and video (though Qwen's Omni models have deeper multi-modal capabilities).
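
Since it accepts images, the same local endpoint can be exercised with a standard vision-style chat request. Whether image (let alone audio or video) input actually works depends on the particular GGUF build and runtime, so treat this as a sketch of the request shape, not a guarantee; the model name and file path are placeholders.

```python
# Sketch: sending an image to the model through LM Studio's local
# OpenAI-compatible endpoint using the standard vision-style message format.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

with open("screenshot.png", "rb") as f:  # any local test image
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="qwen3.6-35a3",  # placeholder identifier, as above
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what's in this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```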

Over the next few weeks, I'm most inclined to test the Qwen 3.6 35a3 model, together with Pi & Nullclaw, on local lightweight GPUs. I'll put it up head-to-head against Gemma 4 31b and 26b, as well as GPT-OSS:120b and 20b (a simple comparison harness is sketched below). Those are the current leading players in the small-GPU LLM market, and I'm excited to see some very strong models that can run on sub-$1000 used laptops.
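
For the head-to-head runs, a loop like this minimal sketch is enough for a first pass: send the same prompt set to each candidate through the local server and save the outputs (with timings) for side-by-side review. All the model identifiers here are placeholders for whatever names the runtimes actually report.

```python
# Minimal head-to-head harness: same prompts through each model via the
# local OpenAI-compatible server, results dumped to JSON for review.
import json
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

MODELS = ["qwen3.6-35a3", "gemma-4-31b", "gemma-4-26b-moe",
          "gpt-oss-120b", "gpt-oss-20b"]  # placeholder identifiers
PROMPTS = ["Write a CSV parser in Python with tests.",
           "Explain the tradeoffs of MoE versus dense models."]

results = []
for model in MODELS:  # each model must be loaded (or JIT-loadable) first
    for prompt in PROMPTS:
        start = time.perf_counter()
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}])
        results.append({"model": model, "prompt": prompt,
                        "seconds": round(time.perf_counter() - start, 1),
                        "answer": resp.choices[0].message.content})

with open("head_to_head.json", "w") as f:
    json.dump(results, f, indent=2)
```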

BTW, for 'world knowledge' on a small GPU, Nemotron 3 Super at IQ3_XXS is an impressive little self-contained encyclopedia (it only runs at 4.5 tps on the mobile 3080 Ti, but that's usable for knowledge lookup). GPT-OSS:120b (11.5 tps on the mobile 3080 Ti) and heavily quantized Minimax (IQ3_S) are also good little knowledge LLMs on GPUs such as Strix Halo and DGX Spark. And of course, even an older small model like GPT-OSS:20b can do a great job researching knowledge if it has access to the Internet (GPT-OSS:20b is blazing fast on those small GPUs and can do an impressive job with web research).

Previous Versions
Version 3: May 09, 2026 at 17:23

I'm really excited to see what the Qwen 3.6 35a3 MOE model can do for coding and agentic tasks on small consumer GPUs.

On my laptop with the mobile RTX 3080 Ti (16 GB VRAM), using LM Studio default settings for the CUDA runtime, this model ran at:

  • 13 tokens per second with 8-bit quantization
  • 16 tokens per second with 6-bit quantization (6-bit is likely about as reliable as 8-bit for most purposes)
  • 24 tokens per second with 4-bit quantization

On my Strix Halo machines, the 8-bit quant ran at 46 tokens per second, using LM Studio default settings for the Vulkan 2.13.0 runtime (Strix Halo is really turning out to be a great machine for the money, keeping in mind that it can handle much bigger models than Qwen 3.6 35a3).

Running a knowledge query with Qwen 3.6 35a3 connected to the Internet (in Jan) yielded truly great results. This little model can do very impressive research.

Disconnected from the Internet, Qwen 3.6 35a3 is not a 'world knowledge' model, but it immediately seems better than Qwen 3.5 35a3 at writing code, and my first tests indicate it also generally does better than Gemma 4 31b dense and 26b MOE at text-based tasks. I have a sense this may be the current best all-around task model for small local GPUs, especially for tasks that involve writing code. As a big bonus, Qwen 3.6 35a3 also supports images, audio, and video (though Qwen's Omni models have deeper multi-modal capabilities).

Over the next few weeks, I'm most inclined to test the Qwen 3.6 35a3 model, together with Nullclaw, on local lightweight GPUs. I'll put it up head-to-head against Gemma 4 31b and 26b, as well as GPT-OSS:120b and 20b. Those are the current leading players in the small-GPU LLM market, and I'm excited to see some very strong models that can run on sub-$1000 used laptops.

BTW, for 'world knowledge' on a small GPU, Nemotron 3 Super at IQ3_XXS is an impressive little self-contained encyclopedia (it only runs at 4.5 tps on the mobile 3080 Ti, but that's usable for knowledge lookup). GPT-OSS:120b (11.5 tps on the mobile 3080 Ti) and heavily quantized Minimax are also good little knowledge LLMs on small GPUs. And of course, even GPT-OSS:20b can do a great job researching knowledge if it has access to the Internet (GPT-OSS:20b is blazing fast on those small GPUs and can do an impressive job with web research).

Version 2: Apr 20, 2026 at 18:09

I'm really excited to see what this model can do for coding and agentic tasks on small consumer GPUs.

On my laptop with the mobile RTX 3080 Ti (16 GB VRAM), using LM Studio default settings for the CUDA runtime, this model ran at:

  • 13 tokens per second with 8-bit quantization
  • 16 tokens per second with 6-bit quantization (6-bit is likely about as reliable as 8-bit for most purposes)
  • 24 tokens per second with 4-bit quantization

On the Strix Halo, the 8-bit quant ran at 46 tokens per second, using LM Studio default settings for the Vulkan 2.13.0 runtime (Strix Halo is really turning out to be a great machine for the money).

Qwen 3.6 35a3 is not a 'world knowledge' model, but it immediately seems better than Qwen 3.5 35a3 at writing code, and my first tests seem to show it also generally doing better than Gemma 4 31b dense and 26b MOE at text-based tasks. I have a sense this may be the best all-around task model for small local GPUs. As a big bonus, Qwen 3.6 35a3 also supports images, audio, and video (though Qwen's Omni models have deeper multi-modal capabilities).

This model, together with Nullclaw, is what I'll likely spend the most time testing for local inference on lightweight GPUs in the immediate future. I'll put it up head-to-head against Gemma 4 31b and 26b, as well as GPT-OSS:120b. Those are the current leading players in the small-GPU LLM market.

BTW, for world knowledge on a small GPU, Nemotron 3 Super at IQ3_XXS is an impressive little self-contained encyclopedia (it only runs at 4.5 tps on the mobile 3080 Ti, but that's usable for knowledge lookup). GPT-OSS:120b (11.5 tps on the mobile 3080 Ti) and heavily quantized Minimax are also good little knowledge LLMs on small GPUs. And of course, even GPT-OSS:20b can do a great job researching knowledge if it has access to the Internet (GPT-OSS:20b is blazing fast on those small GPUs and can do an impressive job with web research).

Version 1: Apr 20, 2026 at 18:00

I'm really excited to see what this model can do for coding and agentic tasks on small consumer GPUs.

On my laptop with the mobile RTX 3080 Ti (16 GB VRAM), using LM Studio default settings for the CUDA runtime, this model ran at:

  • 13 tokens per second with 8-bit quantization
  • 24 tokens per second with 4-bit quantization

Assume 6-bit lands somewhere in between, and is probably about as reliable as 8-bit for most purposes.

On the Strix Halo, the 8-bit quant ran at 46 tokens per second, using LM Studio default settings for the Vulkan 2.13.0 runtime.

This is not a 'world knowledge' model, but it immediately seems better than Qwen 3.5 35a3 at writing code. I have a sense this may be the best all-around task model for small local GPUs. As a big bonus, Qwen 3.6 35a3 also supports images, audio, and video (though Qwen's Omni models have deeper multi-modal capabilities).

This model, together with Nullclaw, is what I'll likely spend the most time testing for local inference on lightweight GPUs in the immediate future.

BTW, for world knowledge on a small GPU, Nemotron 3 Super at IQ3_XXS is an impressive little self-contained encyclopedia (it only runs at 4.5 tps on the mobile 3080 Ti, but that's usable for knowledge lookup). GPT-OSS:120b (11.5 tps on the mobile 3080 Ti) and heavily quantized Minimax are also good little knowledge LLMs on small GPUs. And of course, even GPT-OSS:20b can do a great job researching knowledge if it has access to the Internet (GPT-OSS:20b is blazing fast on those small GPUs and can do an impressive job with web research).