Post History

Current version by Nick Antonaccio

Current Version: Apr 28, 2026 at 14:46

For writing code, expect the experience of using Qwen 3.6 35b and Gemma 4 26b to feel a bit retro, perhaps reminiscent of GPT-4o in some ways. Benchmarks show GPT-4o and Qwen 3.6 35b to be similarly capable, but I have a sense that some of Qwen's capabilities are more modern. That self-hosted Qwen model won't always write perfect code first-shot the way the newest frontier models do, but that situation is handled more effectively than it was with GPT-4o, because an agentic harness like Pi can iterate autonomously through debug revisions.
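To make that concrete, here is a minimal sketch of the iterate-through-debug-revisions pattern a harness like Pi automates. The callables `generate` and `run_tests` are hypothetical stand-ins: in practice `generate` would call your local model endpoint and `run_tests` would execute the code and capture errors.

```python
def debug_loop(generate, run_tests, max_rounds=5):
    """Iterate: ask the model for code, run it, feed failures back.

    generate(feedback) -> str       returns a new code attempt
    run_tests(code)    -> str|None  returns error text, or None on success
    """
    feedback = None
    for attempt in range(1, max_rounds + 1):
        code = generate(feedback)       # model sees the last error, if any
        feedback = run_tests(code)
        if feedback is None:
            return code, attempt        # working code after N rounds
    return None, max_rounds             # gave up; needs human help
```

The point is that a small local model making two or three imperfect attempts inside a loop like this can still land on working code, which is what makes the GPT-4o-class capability level usable again.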

Qwen 3.6 35b nearly always produces more effective output than Gemma 4 26b, but it's sometimes nice to have one model check the other's work, to help break out of loops when one model can't seem to solve an issue. The dense Qwen 3.6 27b model is significantly more capable than either the Qwen or Gemma MoE models, but it runs much more slowly on small GPUs (it's even slow on the Strix Halo).
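One way to wire up that "second opinion" idea, sketched with hypothetical callables (each wrapping one of your two local models, plus a `run_tests` runner as before):

```python
def solve_with_fallback(primary, reviewer, run_tests, max_primary=3):
    """If the primary model keeps failing, hand its last attempt and the
    error to a second model for a fresh perspective.

    primary(feedback)      -> str  new attempt from model #1
    reviewer(code, error)  -> str  model #2 critiques/fixes model #1's code
    run_tests(code)        -> str|None  error text, or None on success
    """
    feedback, code = None, ""
    for _ in range(max_primary):
        code = primary(feedback)
        feedback = run_tests(code)
        if feedback is None:
            return code                 # primary solved it alone
    # Primary is looping on the same failure: consult the other model.
    code = reviewer(code, feedback)
    return code if run_tests(code) is None else None
```

A different model often breaks the loop simply because it makes different mistakes.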

Basically every other model currently available for the 16-24 GB VRAM class of GPUs will provide lower-quality output at the same speed as those current models.

Also be aware that those models don't have the deep world knowledge that GPT-4o and other huge models have. Nothing that runs on small consumer GPUs will have that sort of tremendous world knowledge (endless info about obscure topics). There's only so much information that can be stored in 35 billion parameters (models like Kimi k2.6 are over a trillion parameters, and each of its 384 active experts is bigger than the entire Qwen 27b model!).
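The parameter-count constraint also explains why a 35b model fits a consumer card at all. A rough back-of-the-envelope (weights only, ignoring KV cache and runtime overhead):

```python
def weight_footprint_gb(params_billion, bits_per_weight):
    """Approximate VRAM needed just to hold the weights."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 35B model quantized to 4 bits per weight:
print(weight_footprint_gb(35, 4))    # 17.5 GB -- squeezes onto a 24 GB card
# The same model at full 16-bit precision:
print(weight_footprint_gb(35, 16))   # 70.0 GB -- far beyond consumer GPUs
# A trillion-parameter model at 4 bits:
print(weight_footprint_gb(1000, 4))  # 500.0 GB -- datacenter territory
```

Quantization is what makes the 16-24 GB class viable, but no amount of it fits trillion-parameter world knowledge into a consumer card.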

Those little Qwen and Gemma models are really optimized for coding and agentic tasks.

Finally, be aware that you want an agentic harness that sends the least token overhead with each prompt to the LLM, so you aren't burning tons of extra tokens on every request. Pi is the king in that regard for use with local LLMs. If you want to do local inference, get to know Pi. Hermes, for example, requires well over 100K of context just to operate. With a heavy agent like that driving a self-hosted LLM on a small consumer GPU, you'll constantly be waiting just for your prompts to process.
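The waiting comes from prefill: before the model emits a single output token, it must process every token in the prompt. A quick estimate with hypothetical numbers (actual prefill speed varies widely by GPU, model, and quantization):

```python
def prefill_seconds(context_tokens, prefill_tok_per_s):
    """Time spent processing the prompt before any output appears."""
    return context_tokens / prefill_tok_per_s

# Assume a small GPU prefilling ~500 tokens/second:
print(prefill_seconds(2_000, 500))    # 4.0   -- a lean harness's overhead
print(prefill_seconds(100_000, 500))  # 200.0 -- over 3 minutes, per prompt
```

At 100K context per turn you're paying minutes of dead time before generation even begins, which is why harness overhead matters far more for local inference than for hosted APIs.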
