There really isn't much need to rely on locally hosted models, or to buy server hardware, right now. The only use case where local inference is critical for me is processing private data in HIPAA-compliant ways, and if needed, I could move even those tasks to, for example, Azure OpenAI Service, Google Vertex AI, AWS Bedrock, or other APIs that provide fully HIPAA-compliant offerings, signed BAAs, etc.
For everything else - all my real work - I rely on ChatGPT and hosted LLM APIs (the Gemini 3.1 Flash Lite preview has been an amazing workhorse lately).
There are so many companies competing all over the world that I don't expect cheap, high-quality inference to just disappear any time soon. I do expect plenty of corporate financial casualties at some point, and at least some providers will probably go out of business.
Anthropic's inference prices have skyrocketed recently, and they keep locking down their APIs to be used primarily with their own tools. Maybe that's a sign of things to come, but at the same time, the Chinese companies keep putting out extraordinarily high-quality models with APIs at a fraction of the cost. GLM, Kimi, Xiaomi, and DeepSeek prices are all low - and I don't think Google will disappear or stop competing on price any time soon. Google owns its own TPUs, controls its entire research stack from the ground up, is incredibly well funded, and is in it for the long run.
And there are always big new models released on OpenRouter that you can use for a while entirely for free (with data caps). The big Qwen 3.6 Plus model was available for free for weeks, and OpenRouter continues to provide consistently free access to models that compete with, or utterly beat, anything that would run on the kind of consumer GPU hardware we're discussing in this thread.
So if you really want to maximize a local Mac workflow, right now is a great time to test out all the local agentic harnesses using the cheap/free LLM APIs on OpenRouter. Get to know the strengths and weaknesses of those cheap hosted models, and explore the exact models you'd run locally, many of which have been consistently free on OpenRouter for a long time (such as GPT-OSS 120B, which has had a free offering since last summer):
https://openrouter.ai/models?q=free
Many of the free models have usage caps, but you can use them to get to know which ones you like, and actually get some work done. Here are just a few of the current freebies (a minimal sketch for calling one of them follows the list):
OpenRouter Official Collection
openrouter/free - Free Models Router (auto-selects from available free models)
Tencent
tencent/hy3-preview:free - Hy3 preview (Going away May 8, 2026)
InclusionAI
inclusionai/ling-2.6-1t:free - Ling-2.6-1T (Going away April 30, 2026)
inclusionai/ling-2.6-flash:free - Ling-2.6-flash (Going away April 29, 2026)
NVIDIA
nvidia/nemotron-3-super-120b-a12b:free - Nemotron 3 Super (120B MoE)
nvidia/nemotron-3-nano-30b-a3b:free - Nemotron 3 Nano 30B A3B
nvidia/nemotron-nano-9b-v2:free - Nemotron Nano 9B V2
nvidia/nemotron-nano-12b-v2-vl:free - Nemotron Nano 12B 2 VL
nvidia/llama-nemotron-embed-vl-1b-v2:free - Llama Nemotron Embed VL 1B V2
OpenAI
openai/gpt-oss-120b:free - GPT-OSS 120B (MoE, 5.1B activated parameters)
openai/gpt-oss-20b:free - GPT-OSS 20B
Google
google/lyria-3-pro-preview:free - Lyria 3 Pro Preview
google/lyria-3-clip-preview:free - Lyria 3 Clip Preview
google/gemma-4-31b-it:free - Gemma 4 31B IT
google/gemma-4-26b-a4b-it:free - Gemma 4 26B A4B IT
google/gemma-3-27b-it:free - Gemma 3 27B IT
google/gemma-3-12b-it:free - Gemma 3 12B IT
google/gemma-3-4b-it:free - Gemma 3 4B IT
google/gemma-3n-e2b-it:free - Gemma 3N E2B IT
google/gemma-3n-e4b-it:free - Gemma 3N E4B IT
Z.ai
z-ai/glm-4.5-air:free - GLM 4.5 Air
MiniMax
minimax/minimax-m2.5:free - MiniMax M2.5
Qwen (Alibaba)
qwen/qwen3-coder-480b-a35b-instruct:free - Qwen3 Coder 480B A35B
qwen/qwen3-next-80b-a3b-instruct:free - Qwen3 Next 80B A3B
Meta (Llama)
meta-llama/llama-3.3-70b-instruct:free - Llama 3.3 70B Instruct
meta-llama/llama-3.2-3b-instruct:free - Llama 3.2 3B Instruct
Nous Research
nousresearch/hermes-3-llama-3.1-405b:free - Hermes 3 Llama 3.1 405B
LiquidAI
liquid/lfm-2.5-1.2b-thinking:free - LFM 2.5-1.2B Thinking
liquid/lfm-2.5-1.2b-instruct:free - LFM 2.5-1.2B Instruct
Cognitive Computations
cognitivecomputations/dolphin-mistral-24b-venice-edition:free - Dolphin Mistral 24B Venice Edition
ByteDance (Experimental)
bytedance/seedance-1-5-pro:free - Seedance 1.5 Pro (experimental)
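To try one of these, here's a minimal sketch in Python using OpenRouter's OpenAI-compatible endpoint. It assumes you have the openai package installed and an OPENROUTER_API_KEY set in your environment; the model ID is just one freebie from the list above, and any other :free ID works the same way:

    # Minimal sketch: call a free OpenRouter model through the
    # OpenAI-compatible endpoint. Assumes `pip install openai` and
    # OPENROUTER_API_KEY in the environment.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible API
        api_key=os.environ["OPENROUTER_API_KEY"],
    )

    response = client.chat.completions.create(
        model="openai/gpt-oss-20b:free",  # any :free model ID from the list above
        messages=[{"role": "user", "content": "Give me one paragraph on local vs hosted inference tradeoffs."}],
    )
    print(response.choices[0].message.content)

Free endpoints are rate limited and occasionally disappear, so expect the odd 429 and be ready to swap the model string.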
So it's a great time to get really good at using Hermes, Pi, and all the claw agents with all those models. Then, if there's ever a reason to rely on a locally hosted LLM, you can just switch your agents over to a model running in LM Studio, Ollama, Jan, etc., and your entire pipeline, together with all your established workflows, can otherwise stay completely in place. Changing LLM APIs takes just a few seconds with OpenRouter, and nothing about the rest of your workflow needs to change - the sketch below shows the same client code pointed at hosted and local backends.
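Here's what that backend swap looks like in practice - a minimal sketch, with one OpenAI-compatible client pointed at OpenRouter, LM Studio, or Ollama. The base URLs are those servers' usual defaults, and the local model names are placeholders for whatever you've actually loaded:

    # Minimal sketch of swapping LLM backends without touching the rest
    # of the pipeline. Base URLs are LM Studio's and Ollama's default
    # OpenAI-compatible endpoints; local model names are assumptions.
    import os
    from openai import OpenAI

    BACKENDS = {
        "openrouter": {
            "base_url": "https://openrouter.ai/api/v1",
            "api_key": os.environ.get("OPENROUTER_API_KEY", ""),
            "model": "openai/gpt-oss-120b:free",
        },
        "lmstudio": {
            "base_url": "http://localhost:1234/v1",   # LM Studio local server default
            "api_key": "lm-studio",                   # local servers ignore the key, but the SDK wants one
            "model": "gpt-oss-120b",                  # whatever model you've loaded locally
        },
        "ollama": {
            "base_url": "http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
            "api_key": "ollama",
            "model": "gpt-oss:120b",
        },
    }

    def make_client(name):
        cfg = BACKENDS[name]
        return OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"]), cfg["model"]

    client, model = make_client("openrouter")  # flip this one string to go local
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Hello from my agent pipeline."}],
    )
    print(reply.choices[0].message.content)

The point is that the agent harness never knows or cares which backend answered; only the base URL and model string change.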
You can run most agent applications on $100 netbooks, so any Mac should work if you really prefer to stay in the Mac ecosystem for local hardware. Really, you can run agents on almost any local machine (even Raspberry Pis and mobile phones), so just hook them up to a good, cheap, sustainable LLM API and get work done:
https://openrouter.ai/models
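That catalog is also queryable programmatically, so you can list the current freebies yourself rather than trusting my snapshot above - a minimal sketch, assuming the requests package and OpenRouter's public /api/v1/models endpoint:

    # Minimal sketch: fetch OpenRouter's public model catalog and keep
    # the :free variants. Assumes `pip install requests`.
    import requests

    data = requests.get("https://openrouter.ai/api/v1/models", timeout=30).json()["data"]
    free_ids = sorted(m["id"] for m in data if m["id"].endswith(":free"))
    print(f"{len(free_ids)} free models currently listed:")
    for model_id in free_ids:
        print(" ", model_id)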
I've been using a $20/month ChatGPT subscription for years to complete absolutely huge volumes of production code development, and OpenAI has (amazingly) never imposed rate limits or data caps on that work. But that could end abruptly at any moment, just like Anthropic recently and abruptly cut off API access to Claude for Openclaw. And OpenAI recently put an end to Sora, which seemed world-changing just a year ago. I fully expect more companies to follow suit, stopping or capping access to loss leaders like the ChatGPT interface, which they use to acquire users in the hope of funneling organizations into paid API access. Eventually all those free tokens need to make money, or come to an end.
I've prepared for the inevitability of services being cut off by getting local agents like Hermes and Pi set up and ready to go with models like google/gemini-3.1-flash-lite-preview (or any others that work perfectly well for my needs). I've used those tools to build several significant production software solutions, so I trust them, know how to use them well, and can instantly drop any project I'm currently working on with ChatGPT directly into those environments and continue without even a hiccup.
But all the companies won't just go out of business at once, or stop every inexpensive API service overnight. As long as I can connect to a high-quality LLM API somewhere, the whole local agent workflow works just fine and is very productive - and I can switch to any locally hosted LLM I trust if/when it's ever needed (perhaps I want to travel somewhere with no Internet available). And if the whole AI ecosystem were to come crumbling down around us, I'd push forward with those local tools, but that's not on the foreseeable horizon.
So for now, I do all my local LLM inference configuration and testing mostly out of my own interest, to keep up with ecosystem developments, and to have some usable self-hosted tools on hand - but I'm not going to stop using hosted APIs for my core daily work any time soon, especially as long as outrageously cheap options like ChatGPT and google/gemini-3.1-flash-lite-preview are available.
So, all that discussion is to convey: don't feel pressured to buy a server right now. Hardware is stupidly expensive at the moment. Put effort instead into running local agents on the machines you already own. Use Macs or whatever else you've got. Get to know OpenRouter and the hundreds of models you can use there. Take advantage of free preview models whenever they come out. And if a company does happen to go out of business, watch for fire sales on all that datacenter hardware ;)