Post History

Current version by Nick Antonaccio

Current Version: May 09, 2026 at 17:13

Ok, if you haven't tried the newest Qwen 3.6 and Gemma 4 models, you need to give them a shot. The Qwen 3.6 models feel like the first truly capable, generally useful LLMs for small GPUs.

I added some more demo examples created by Qwen 3.6 35a3 and Gemma 4 26a4 to the quick start at https://aibynick.com/thread/29

The more I use Qwen 3.6 to write actually useful code and to perform useful agentic tasks, the more it becomes blindingly obvious that this model is head and shoulders above any other small locally usable model out there. It's truly incredible what both of the Qwen 3.6 models can accomplish (35a3 is an MOE model, and 27b is a dense architecture). The 27b dense version is even more capable than 35a3, but it does need a fast GPU for anything but the smallest workloads. 27b is particularly useful when you use it to help out on portions of tasks which 35a3 gets stuck on. Use 35a3 to productively accomplish 95% of a workflow, then call in 27b (or some other big model) when extra help is needed.
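That "fast model first, escalate when stuck" workflow can be sketched as a small fallback wrapper. This is just a sketch of the routing logic, not any particular tool's API: the model calls are passed in as plain callables, and in practice each one would hit a local server (LM Studio exposes an OpenAI-compatible endpoint) with the appropriate model name. The names and the "stuck" heuristic here are illustrative assumptions.

```python
# Sketch of the escalation workflow described above: try the fast small
# model, and only hand the task to a stronger (slower) model when the
# result looks stuck. Model calls are abstracted as callables so the
# routing logic itself needs no running server.

def run_with_escalation(task, fast_model, strong_model, is_stuck):
    """Run task on fast_model; escalate to strong_model if the
    result fails the is_stuck check."""
    result = fast_model(task)
    if is_stuck(result):
        result = strong_model(task)
    return result

# Toy stand-ins for a small local model and a bigger fallback model:
fast = lambda t: "" if "hard" in t else f"fast:{t}"
strong = lambda t: f"strong:{t}"
stuck = lambda r: not r  # treat an empty reply as "stuck"

print(run_with_escalation("easy task", fast, strong, stuck))  # fast:easy task
print(run_with_escalation("hard task", fast, strong, stuck))  # strong:hard task
```

The point of the design is that 95% of calls never pay the cost of the big model; only failures get re-routed.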

The Qwen 35a3 version runs super quick on even old, small GPUs with little VRAM. That makes it really useful on smaller 3080 & 3090 class GPUs, on Apple Mac processors with less RAM, etc. I run it at both 8-bit and 6-bit quantization on the Strix Halo and DGX Spark machines, but it also seems to produce fantastic quality output at 4-bit quantization on smaller GPUs.
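For a rough sense of why the quantization level matters for which machines can run the model, the weight footprint is approximately parameter count times bits per weight divided by 8. This back-of-the-envelope estimate ignores KV cache, activations, and per-tensor quantization overhead, so real quantized files run somewhat larger:

```python
# Rough weight-memory estimate for a quantized model:
# weights ≈ parameter_count × bits_per_weight / 8 bytes.
# Ignores KV cache and quantization overhead; a rough sizing guide only.

def weight_gb(params_billion: float, bits: int) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits / 8 / 1e9

for bits in (8, 6, 4):
    print(f"35B params @ {bits}-bit ≈ {weight_gb(35, bits):.1f} GB")
```

So a 35B-parameter model drops from about 35 GB of weights at 8-bit to roughly half that at 4-bit, which is why the lower quants open the model up to much smaller GPUs (with MOE models, the small active-parameter count also keeps partially offloaded runs usable).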

These demos were built with Qwen 3.6 35a3 on a Strix Halo laptop (using q6 quant):

And these were completed on a laptop with a lowly mobile RTX 3080 with 16GB VRAM (using a 4-bit quant):

That absolutely blows away the sorts of output we got from much bigger models such as GPT-OSS:120b, and from many other models which require much beefier hardware. For example, this is the sort of disappointing HTML layout output which GPT-OSS:120b would produce (fine as a start, but you'd want to put in lots of additional work to polish up the styling for a public-facing web site):

The output of the minuscule local Qwen 3.6 35a3 model even approaches the quality of huge bleeding-edge models which require datacenter-class hardware to run. Here's an example made by mimo25pro (via OpenRouter):

http://1y1z.com:8284/demo-website-mimo25pro/

Those are pure visual layout examples, but the same quality improvement is apparent in logic and other classes of work. 35a3 is capable of producing fantastic Internet research results, for example.

Qwen tends to best even the local Gemma 4 models in both quality and speed, but the Gemma 4 versions (MOE 26a4 and dense 31b) are also no slouches. Here's a little dashboard/web site example:

http://1y1z.com:8284/flashy-site--gemma4-26ba4/

I keep the Gemma 4 versions, along with a stable of other bigger models, loaded in LM Studio to offer different perspectives on tasks. This includes some big models which run slowly even on DGX Spark and Strix Halo hardware (Qwen3.5 122B A10B q5, Minimax 2.7 iq3, etc.), but I've been tending to prefer the output of the fast Qwen 3.6 MOE model to those large/slow alternatives.
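The "different perspectives" setup amounts to sending one prompt to several loaded models and comparing the replies. A minimal sketch, assuming an OpenAI-compatible chat endpoint like the one LM Studio serves locally (the model names below are illustrative placeholders, not exact identifiers):

```python
# Sketch of asking several locally loaded models for different
# perspectives on the same prompt. Each payload would be POSTed to
# {base_url}/chat/completions on a local OpenAI-compatible server.

def build_request(model: str, prompt: str) -> dict:
    """Chat-completion payload for one model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }

# Placeholder names for the models kept loaded:
models = ["qwen3.6-35a3", "gemma4-26a4", "qwen3.5-122b-a10b"]
prompt = "Review this function for correctness."
requests_to_send = [build_request(m, prompt) for m in models]
print(len(requests_to_send))  # 3
```

Each reply then gives an independent take on the task, which is the whole value of keeping multiple models resident.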

Previous Versions
Version 2: May 09, 2026 at 17:13

I added some more demo examples created by qwen 3.6 35a3, to the quick start at https://aibynick.com/thread/29

The more I use this model to write actually useful code and to perform useful agentic tasks, the more it becomes just blindingly obvious that this model is head and shoulders above any other small local model out there. It's truly incredible what both of the qwen 3.6 models can accomplish (35a3 is an MOE model, and 27b is a dense architecture). The 27b dense version is even more capable than 35a3, but does need a fast GPU for anything but the smallest workloads. It's particularly useful when you use it to help out with portions of tasks which the 35a3 gets stuck on. Use 35a3 to do 95% of a workflow, then call in 27b when extra help is needed.

The 35a3 version runs super quick on even old small GPUs with little VRAM. That makes it really useful on smaller 3080 and 3090 class GPUs, on Apple Mac processors with less RAM, etc.

These were built with qwen 3.6 35a3 on a Strix Halo laptop:

And these were completed on a laptop with a lowly mobile RTX3080 with 16GB VRAM:

That absolutely blows away the sorts of output we got from much bigger models such as GPT-OSS:120b, which require much beefier hardware. For example, this is the sort of disappointing output which GPT-OSS:120b would produce (fine as a start, but you'd want to put in lots of additional work for a public facing site):

The output of that tiny qwen 3.6 35a3 model even approaches the quality of huge bleeding edge models which require datacenter-class hardware to run. Here's an example made by mimo25pro (on Openrouter):

http://1y1z.com:8284/demo-website-mimo25pro/

It's incredible to see Qwen besting the Gemma 4 models in both quality and speed. If you haven't tried the newest qwen models, you need to give them a shot. I keep a stable of other models loaded to offer different perspectives - especially some bigger models that run slowly even on DGX Spark and Strix Halo hardware (Qwen3.5 122B A10B q5, Minimax 2.7 iq3, etc.) - but I've been tending to prefer the output of the 3.6 qwen models, even compared to those very large and slow alternatives. The qwen 3.6 models feel like the first truly capable, generally useful LLMs for small machines.

Version 1: May 09, 2026 at 15:56

I added some more demo examples created by qwen 3.6 35a3, to the quick start at https://aibynick.com/thread/29

The more I use this model to write actually useful code and to perform useful agentic tasks, the more it becomes just blindingly obvious that this model is head and shoulders above any other small local model out there. It's truly incredible what both of the qwen 3.6 models can accomplish (35a3 is an MOE model, and 27b is a dense architecture). The 27b dense version is even more capable than 35a3, but does need a fast GPU for anything but the smallest workloads.

The 35a3 version runs super quick on even old small GPUs with little VRAM. That makes it really useful on smaller 3080 and 3090 class GPUs, on Apple Mac processors with less RAM, etc.

These demos were built with qwen 3.6 35a3 on a Strix Halo laptop:

And these were completed on a laptop with a lowly mobile RTX3080 with 16GB VRAM:

That absolutely blows away the sorts of output we got from much bigger models such as GPT-OSS:120b, which require much beefier hardware:

And the output of that tiny qwen 3.6 35a3 model even approaches the quality of huge bleeding-edge models which need datacenter hardware to run. Here's an example made by mimo25pro (on Openrouter):

http://1y1z.com:8284/demo-website-mimo25pro/

It's incredible to see Qwen besting the Gemma 4 models in both quality and performance speed. If you haven't tried the newest qwen models, you need to give them a shot. I keep a stable of other models loaded to offer different perspectives - especially some bigger models that run slowly even on DGX Spark and Strix Halo hardware - but the 3.6 qwen models feel like the first truly capable, generally useful LLMs for small machines.