My current favorite combinations of agents and models

Things are changing so quickly, but my current favorite combinations of agents and models to use together for various classes of tasks, are below. I'll update in this thread as these preferences evolve:

For most daily software development work, I use ChatGPT, with the $20/month zip file routine that I've covered in depth in a pinned thread on this forum. ChatGPT has all the built-in code writing and internal agentic capabilities I need to complete any sort of complex development project. That workflow is still my daily driver - it's been fantastically successful with some absolutely huge projects. I can use it on any platform, without installing anything on a local machine. I love that for portability when I travel - I don't need any sort of dedicated 'development' machine to do any of my normal work. I mostly use small, lightweight netbooks that cost less than $100, and my total required hosted service fees for all inference and VPS services are less than $300 per year.

For very large, complicated, long running software development tasks which involve lots of unattended iteration/revision/testing cycles, Hermes agent together with Google/gemini-3.1-flash-lite-preview is an absolutely killer combination. It's like having a high quality replacement for Claude Code and Claude Opus, at 1/20 the cost, without any rate limiting, and with performance that is much, much faster (I use that Gemini 3.1 Flash Lite model, and all my other LLM APIs on Openrouter). Recently I completed a massively complex task overnight which burned 23 millions tokens, at a total cost of $2.50 with Gemini 3.1 Flash Lite - it's staggeringly capable, knowledgeable, and fast, for the money.

The Hermes agent has many useful tools and skills built in, and it does a great job of getting better as you use it (Hermes advertises 'self-improving skills'). I tend to install it directly on Unbuntu VPS servers, and interact with it at the SSH command line (if you want to use it in Windows, you need to install WSL2).

For software development using locally hosted LLMs, my current favorite setup is Qwen3.6-35b-a3b (q4_k_m on small GPUs and q6_k_m on the Strix Halo), running in LM Studio, and using the Pi (pi.dev) agent with it.

To give you an idea of the output of this combination, the demo forum app below was built in a single short unattended session, using that stack, on my little Linux laptop that has only an RTX 3080 with 16GB VRAM:

http://1y1z.com:5993

That's a far better result than GPT could provide last year, with much less human work involved. Qwen3.6:35b-a3b is their newest MOE model, which runs plenty fast even on very inexpensive GPUs (that forum app was built in one sitting on a laptop I bought complete for ~$800).

Qwen's 3.6 27b dense model is even higher quality, but it runs much more slowly. Save it for when you have a really tricky problem/detail that Qwen 35b needs help with.

I also like the Gemma4 26b MOE and 31b dense models. They're great at adding certain specialized capabilities on small GPUs, and that Google MOE model also runs very quickly.

This was done with Qwen 3.6 35a3 and Gemma 4 26a4 MOE on a Strix Halo laptop, also using Pi as the agent:

http://1y1z.com:5994

Pi agent is especially good to use with locally hosted models because it sends the smallest volume of prompt token boilerplate/overhead, of any of the really capable agents that I'm aware of. It's fast, easy to use, and guardrails-free out of the gate, so it doesn't get in the way of any tasks your intend to perform (be careful with it!). Pi is the agent that Openclaw is built on (Openclaw just adds a bunch of features and skills), it's extremely capable, but without the bloat of Openclaw, Hermes, and those other big systems which are more suited to using hosted LLM APIs. I've prepared single-file copy/paste PI install scripts for both Linux and Windows, so setting it up is instant.

Be careful not to discount what the Qwen3.6 + Pi stack can accomplish. I don't even need my Strix Halo machines to run that combo - it turns inexpensive 'hobby' GPUs into very capable AI software development workhorses. And on the Strix Halo machines it's even faster (~45 TPS!). Using that combination is far more effective than using GPT-4o was last year.

BTW, I have all my locally hosted LLM infrastructure running on machines which are entirely remotely accessible. That little demo forum app was in fact created remotely by logging into the least expensive Linux laptop I own (the one with the 16GB RTX 3080), using a fully self-hosted version of Rustdesk and the open source Rustdesk ID server installed on a cheap VPS. Nothing in that setup requires any commercially hosted service whatsoever (I could even run the ID service on a local machine), and the whole thing is completely resilient to remote restarts, power outages, etc. (everything just restarts if there ever is an issue). That RTX 3080 machine uses something under 200 watts of power, and all my other little GPU servers similarly absolutely sip power. They are all running idle all the time, and I really haven't seen an noticeable increase in my power use - that might change if they were all running full tilt all the time, but all 5 machines are fully usable together on normal light home electrical circuits. Also, all but one of them are laptops, which I keep the lids shut on, and out of the way in back rooms - they all get accessed remotely, so I don't have any big unsightly server running anywhere. I could bring all 4 of the laptop machines together in a little backpack, anywhere. The remote development process is as smooth and simple as can be. When an application is built, it typically gets zipped up and SCPd to a VPS server to run, in the same way each version of any app I create with the GPT zip file workflow does.

Finally, for small local tasks, Nullclaw is an amazingly capable tiny tool. You can use it to speed up manual work routines on any device. Nullclaw generally doesn't require any prerequisite OS or framework dependencies, and you can run it instantly on virtually any hardware (the Nullclaw web site advertises that it can be installed on $5 boards). I've used it on hosted VPS accounts, local Linux and Windows machines, and on my Android phone. Just copy the single binary file, run the built-in config (and/or manually edit generated JSON config files), choose your LLM provider (for Nullclaw on any device with any Internet connection, I'll currently pick Google/gemini-3.1-flash-lite-preview on Openrouter). In a few seconds, it enables a local assistant that can accomplish real work directly on your device, in any connected service you give it credentials to access, etc.

And Nullclaw is capable of completing long tasks - I've used it to build some very complex code over 24 hours of unattended running. I definitely prefer Pi and Hermes for more complex development workflows, but Nullclaw is certainly able to finish legitimate work, if for some reason you can't install a more full featured agentic system on some device, or if you need a quick one-off install.

So these are just my current preferences for mixing agents with LLMs. The GPT zip file process is still the backbone of my daily development work, but I'm beginning to use locally hosted agents to get more long running development tasks completed, especially the kinds that require lots of iterative interactions, testing loops, and more tasks that involve time consuming installations, server cleanup, etc.

I feel like the Qwen and Gemma open source models, together with Pi and the entire remote access pipeline routine, is now finally becoming a truly viable alternative to commercially hosted frontier LLM models and harnesses. I'll keep using ChatGPT, because it's so smart & inexpensive, and because it has been so reliably effective for big projects over the past 6 months - and I'll keep using hosted models like Gemini 3.1 Flash Lite - but I'm no longer worried that my entire development process might come to a halt if any particular commercial hosted service disappears.

The open source self-hosted ecosystem finally feels like it's good enough to use for real work. The state of the art has come a long way from copying/pasting code from chat sessions with GPT-OSS:20b and OpenWebUI!

I'm looking forward to exploring all the features available in the newest Openclaw (things like build in connectivity with Google Meet...), as well as continuing to use other agents like Goose (which I've embedded in several projects already). And hold on to your seats this year for local LLM models to get much faster, smaller, and more capable. Keep watching as new big models like Deepseek 4 get more mature - and if improvements like Turboquant actually pan out, we could see huge gains in size, speed, and performance on GPUs of all sizes.