Clustering MacBook Pro laptops

swampstream
Apr 20, 2026 at 21:37 (edited, 1 revision)
#1

Off topic: I'd first like to say, Nick, I really appreciate your nonstop effort to share your perspectives on code / app development and now AI. Thank you!

===

Hardware: I've been looking for new gear, and I thought MacBooks might be worth considering for some people, given prior experience with the platform (experience I don't have).

Price: I also noticed that pricing on the MacBook Pro with the M1 Pro / 32GB is dropping significantly (around 800 USD in some places). It has 204 GB/s memory bandwidth, and Thunderbolt / USB inter-machine communication is possible. That might offer an alternative path to clustering, vs. using an ASUS / Tensorbook / etc.

But then again, I may be in over my head in thinking that MacBooks are as good as RTX #### GPUs. So I'll leave it at that for now.

Nick Antonaccio (Admin)
Apr 20, 2026 at 14:48 (edited, 4 revisions)
#2

Thank you for posting :)

It looks like a single M1 Pro with 32GB of RAM should be expected to perform in these ballparks:

7B - 8B models: ~39 tokens per second (4-bit quantization)

14B - 20B: ~13–19 tokens per second

30B - 32B: ~9 tokens per second (8-bit quantization would likely push the 32GB RAM limit to the point you could experience crashes)

From what I've seen, when clustering Macs, the Exo framework provides the absolute best performance. Be sure to connect with Thunderbolt cables, and beware that Thunderbolt 4 on the M1 Pro is limited to 40Gb/s. I'd expect you might be able to run a 70B parameter model at 4-bit or 5-bit on 2 M1 Pros with 32GB each, but performance will likely be pretty darn slow.
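
A quick back-of-the-envelope check is useful before buying anything: weights take roughly (parameters × bits per weight) / 8 bytes, plus working room for the KV cache and the OS. Here's a minimal sketch of that arithmetic - the OS and KV cache overhead figures are rough assumptions for illustration, not measurements:

```python
# Back-of-the-envelope memory check for quantized models.
# The OS and KV-cache overhead numbers are rough assumptions, not measurements.

def fits_in_ram(params_billions, bits_per_weight, ram_gb,
                os_overhead_gb=6.0, kv_cache_gb=4.0):
    """Return (weights_gb, fits) for a quantized model on a machine with ram_gb of memory."""
    weights_gb = params_billions * bits_per_weight / 8  # 1B params at 8-bit ~ 1 GB
    return weights_gb, (weights_gb + kv_cache_gb + os_overhead_gb) <= ram_gb

# Single M1 Pro with 32GB unified memory:
for params, bits in [(8, 4), (20, 4), (32, 4), (32, 8)]:
    weights, ok = fits_in_ram(params, bits, ram_gb=32)
    print(f"{params}B @ {bits}-bit: ~{weights:.0f}GB of weights, fits in 32GB: {ok}")

# Two M1 Pros clustered with Exo (layers split across machines, so roughly 2x the weight budget):
weights, ok = fits_in_ram(70, 4, ram_gb=64)
print(f"70B @ 4-bit across two machines: ~{weights:.0f}GB of weights, fits in ~64GB total: {ok}")
```

By that rough math, 32B at 8-bit blows past a single 32GB machine (hence the crash warning above), while 70B at 4-bit only becomes plausible once two machines are pooled.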

Although not the same as what you're considering, Alex Ziskind did cluster a bunch of different Mac laptops last year:

https://www.youtube.com/watch?v=uuRkRmM9XMc

I'm really betting on better small models that run on 16-24GB VRAM GPUs this year, especially for coding and agentic tasks, which would make a single M1 Pro, or any of the inexpensive 3080ti mobile and similar machines, potentially useful (I got one of those 16GB VRAM RTX laptops used for just over $800 - that machine has been very impressive - it even ran Nemotron Super!).

Seeing how much Qwen 3.6 improved over previous models is really encouraging. I'm going to put it through its paces on a little RTX laptop over the next few weeks, for coding and agentic tasks especially.

For little models like Qwen 3.6, a single laptop with an RTX GPU will likely run faster than the M1 Pro, but the M1 Pro has the benefit of Thunderbolt networking speeds, so clustering with it is faster out of the box, unless you spend a lot of money on network hardware for the RTX machines. I'd like to try Exo with a couple of RTX machines, using the best networking hardware that's reasonable to set up between them...

swampstream
Apr 27, 2026 at 23:30
#3

I noticed a MacBook Pro with the plain M1, so a little less beefy than the M1 Pro chip: 16GB unified memory, for around 500 USD. I'm guessing the unified memory also needs to be used for normal processes like running a browser... so the LLM can't fully consume the unified memory the way it can consume VRAM on a dedicated GPU?

It got me thinking about unified memory vs. shared memory + GPU vs. just a GPU + RAM; these seem to be the different options.

It's pretty tough to decide what works best per USD/EUR.

Nick Antonaccio (Admin)
Apr 28, 2026 at 01:38
#4

You'd be extremely limited with that machine. Yes, you do need some of that unified memory to run the OS. You'll be stuck running tiny, basically toy models, with very little working space for KV cache. Something like Qwen 3.5 9b will max it out, and don't plan on writing any production code with that model.

Do a Google search, or use ChatGPT, to help find machines which will run at least Qwen 3.6 35b and Gemma 4 26b at 4-bit quantization. That's the minimum barrier of entry for getting any actual work completed with self-hosted models. You may still be able to find some sub-$1000 Windows laptops, for example, with a dedicated mobile RTX 3080ti GPU (not the desktop 3080 - the desktop version of that GPU does not have enough VRAM). Those are at the lowest end of usable hardware for any sort of practical LLM inference.

You need a bare minimum of 16GB VRAM on a dedicated GPU, or at least 32GB shared RAM on an M-series Mac, to do anything useful, but even 32GB on any of the Macs with unified memory won't leave you any real room for context, even with heavily compressed models. If you want to stick with Mac, you should really have at least 64GB unified memory to get any actual inference work done. Expect $1350-$1600 for a Mac Studio (M1 Max) with 64GB, or a MacBook Pro 14" (M1 Max).

Alternatively, consider looking for a tower with an RTX 3090 or dual RTX 3060s. You can find RTX 3060s all over the place for less than $300 each, and 2 of them give you 24GB VRAM - that's probably still the best buy on the market for running some of the smallest useful models. If you happen to have a motherboard that can support them, that GPU is a no-brainer for price/performance. You're still going to be really constrained with those small GPUs, but the newest Qwen 3.6 and Gemma 4 Mixture of Experts models can run usably fast on them, and they can get tasks done, especially if you're using a lightweight agentic harness like Pi.
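
If you do go the dual-3060 route, splitting a model across both cards is a one-line setting in most runtimes. Here's a minimal llama-cpp-python sketch, purely as an illustration - the GGUF filename is a placeholder, and Ollama / LM Studio expose equivalent GPU-split settings in their own configs:

```python
from llama_cpp import Llama

# Placeholder GGUF path - substitute whatever quantized model you actually download.
llm = Llama(
    model_path="./qwen-moe-q4_k_m.gguf",
    n_gpu_layers=-1,          # offload every layer to the GPUs
    tensor_split=[0.5, 0.5],  # divide the weights evenly across the two 12GB cards
    n_ctx=16384,              # keep context modest so the KV cache fits in the remaining VRAM
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that reverses a linked list."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```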

A Strix Halo machine like https://www.amazon.com/gp/product/B0DW238TXK will put you into an entirely different class of LLM inference. It's a world of difference. Only look at the 128GB shared RAM models if you want the step up in LLM inference performance (do not get a 32GB model - it won't be able to run much).

If you're planning on doing image and video generation, and potentially considering Strix Halo, then read up on exactly which models run well on ROCm, as opposed to Nvidia CUDA. ROCm is getting better and more widely supported, but I haven't tested many of the other types of AI models aside from LLMs on Strix Halo yet.

Nick Antonaccio (Admin)
Apr 28, 2026 at 14:46 (edited, 5 revisions)
#5

For writing code, expect the experience using Qwen 3.6 35b and Gemma 4 26b to feel a bit retro, perhaps reminiscent of using GPT4o in some ways. Benchmarks show GPT4o and Qwen 3.6 35b to be similarly capable, but I have a sense that some of Qwen's capabilities are more modern. That self-hosted Qwen model won't always write perfect code first-shot like the newest frontier models, but that situation is handled more effectively than it could have been in GPT4o, because an agentic harness like Pi can be used to iterate autonomously through debug revisions.
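
To make that "iterate autonomously through debug revisions" idea concrete, the core of such a loop is tiny: generate code, run it, and feed any traceback back into the conversation. This sketch is not how Pi is actually implemented - it just assumes a local OpenAI-compatible server (LM Studio, llama.cpp server, Ollama, etc.) at a placeholder address, with a placeholder model name:

```python
# Minimal "generate -> run -> feed the error back" loop, in the spirit of what an
# agentic harness does for you automatically.
import subprocess
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")  # placeholder endpoint
MODEL = "qwen-3.6-35b"  # placeholder model name

task = "Write a Python script that prints the first 20 Fibonacci numbers, one per line."
messages = [{"role": "user", "content": task + " Reply with only the code."}]

for attempt in range(1, 6):
    reply = client.chat.completions.create(model=MODEL, messages=messages)
    code = reply.choices[0].message.content.strip()
    # Crude markdown-fence stripping; a real harness does this more carefully.
    if code.startswith("```"):
        code = "\n".join(code.split("\n")[1:]).rsplit("```", 1)[0]
    with open("attempt.py", "w") as f:
        f.write(code)
    result = subprocess.run(["python", "attempt.py"], capture_output=True, text=True, timeout=30)
    if result.returncode == 0:
        print(f"Succeeded on attempt {attempt}:\n{result.stdout}")
        break
    # Hand the traceback back to the model so it can revise its own code.
    messages.append({"role": "assistant", "content": code})
    messages.append({"role": "user", "content": f"That failed with:\n{result.stderr}\nPlease fix it."})
```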

Qwen 3.6 35b nearly always provides more effective output than Gemma 4 26b, but it's sometimes nice to have one model check the other's work, to help break out of loops when one model can't seem to solve an issue. The dense Qwen 3.6 27b model is significantly more capable than either the Qwen or Gemma MoE models, but it runs much more slowly on small GPUs (it's even slow on the Strix Halo).

At the same speed, basically every other model currently available for GPUs in the 16-24GB VRAM class will provide lower quality output than those current models.

Also be aware that those models don't have the deep world knowledge that GPT4o and other huge models have. Nothing that runs on small consumer GPUs will have that sort of tremendous world knowledge (endless info about obscure topics). There's only so much information that can be stored in 35 billion parameters (models like Kimi k2.6 are over a trillion parameters, and each of its 384 active experts is bigger than the entire Qwen 27b model!).

Those little Qwen and Gemma models are really optimized for coding and agentic tasks.

Finally, be aware that you want to use an agentic harness that sends the smallest amount of token overhead to the LLM with each prompt, to avoid burning tons of extra tokens. Pi is the king in that regard for use with local LLMs. If you want to do local inference, get to know Pi. Hermes, for example, requires well over 100K of context just to operate. You'll be waiting constantly for your prompts just to process if you use a heavy agent like that with a self-hosted LLM on a small consumer GPU.
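
One easy way to see how heavy a given harness really is: point it at a local OpenAI-compatible server and compare the prompt_tokens it produces (most local servers log this per request) against a bare request sent directly. A minimal sketch of the bare-request baseline, with placeholder server address and model name:

```python
# Baseline: what does a *bare* request cost in prompt tokens? Compare this with the
# prompt_tokens your local server logs when a harness like Pi or Hermes is driving it.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")  # placeholder endpoint

resp = client.chat.completions.create(
    model="qwen-3.6-35b",  # placeholder model name
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
# usage.prompt_tokens counts everything sent ahead of the reply: your message plus any
# system prompt, tool schemas, and history the harness stuffed in front of it.
print("prompt tokens:", resp.usage.prompt_tokens)
print("completion tokens:", resp.usage.completion_tokens)
```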

Nick Antonaccio (Admin)
Apr 28, 2026 at 23:29 (edited, 3 revisions)
#6

Wow, I just did an eBay search for 'laptop 3080ti', and RAM prices are continuing to drive total machine costs significantly higher:

https://www.ebay.com/sch/i.html?_nkw=laptop+RTX+3080ti+&_sacat=0&_from=R40&_trksid=p2553889.m570.l1313

There was just a single Buy It Now listing for $1099:

https://www.ebay.com/itm/358479918274?_skw=laptop+RTX+3080ti&itmmeta=01KQA623BY0PAY0YW104F7XEZV&hash=item53771174c2:g:2eEAAeSwVNVp6pcA&itmprp=enc%3AAQALAAAA0GfYFPkwiKCW4ZNSs2u11xCpy8oUQ%2FqkPAfu%2FFRP7MB6R8Gfo5HauTomDGMzS%2FNjQFW3A%2BMZ5LYntlfOCpKU%2BKRp%2FJ6W5gj1nVqv2A4qKZ%2BRNZpjzAvV6iasa85D8PgY%2FzSoiQLEZG%2BphJ%2BthWkj%2FAQURTDVTgfluk9sfhN9lhZBDLTvuIW2QgjEIBC7YA0w2pZKAlfxjJPhzI6lw%2F4trcZYPBkrXGVxx6uq01coyVVZJNACldSWXDARUXDxBKKp8GECCiIZXkjBvcBV9TPAt7o%3D%7Ctkp%3ABk9SR5C2iMa6Zw

There were a bunch of similar machines for around $1300, but all of those current listings come with only 32GB RAM (along with 16GB VRAM in the GPU).

Everything I bought even a month ago had at least 64GB RAM, and all those machines were less than $1000 😲

If you're using models that fit entirely into VRAM, I'm not sure how much of a difference that RAM situation would make - I'd expect that as long as you're not running a bunch of other applications while performing inference, and you're offloading all the model layers onto the GPU, inference performance should not take a hit. But I don't have a machine currently set up with 32GB RAM to test that.
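
If anyone does have both a 32GB-RAM and a 64GB-RAM machine with the same GPU, the comparison is easy to run: load the model with all layers on the GPU and measure tokens per second on each box. A minimal sketch against a local OpenAI-compatible server (address and model name are placeholders):

```python
# Rough throughput check - run the same thing on a 32GB-RAM and a 64GB-RAM machine
# (same GPU, all layers offloaded) to see whether system RAM actually matters.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")  # placeholder endpoint

start = time.time()
resp = client.chat.completions.create(
    model="qwen-3.6-35b",  # placeholder model name
    messages=[{"role": "user", "content": "Explain how a hash map handles collisions."}],
    max_tokens=400,
)
elapsed = time.time() - start
generated = resp.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```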

So that makes the current prices for Strix Halo even more impressive. For anything more performant than those sorts of machines with small Nvidia GPUs, I'd take a serious look at Strix Halo. A like-new ASUS ROG Flow Z13 with 128GB shared memory is holding steady at $2572 (+ tax) on Amazon:

https://www.amazon.com/gp/product/B0DW238TXK/ref=ox_sc_saved_title_1?smid=A2L77EE7U53NWQ&psc=1

I don't think there's currently any better bang for the buck than Strix Halo, for LLM inference.

The next closest competitor is probably an ASUS Ascent GX10 for $3493 (+ tax):

https://www.amazon.com/gp/product/B0G1MQYHRD/ref=ox_sc_act_title_2?smid=ATVPDKIKX0DER&th=1

You're looking at $4000+ for a comparable Apple silicon product, and that ASUS does have a real Nvidia GPU, so you can use a genuine CUDA stack for any models which require CUDA (video generation models, for example, perform much better with CUDA). All those little mini machines with an Nvidia GB10 chip are made for clustering, with high-speed networking built in. I am really interested in those machines for that reason, but for really big LLM models, a Mac Studio can also be clustered natively:

https://www.apple.com/shop/buy-mac/mac-studio/m3-ultra-chip-32-core-cpu-80-core-gpu-256gb-memory-4tb-storage

I wrote a post about why I'm considering clustering with those big rig machines, at some point:

https://aibynick.com/thread/26

For now, though, I'm still using APIs for all my commercial code generation work. ChatGPT still costs me only $20 per month for an absolutely outrageous volume of inference, and Google/gemini-3.1-flash-lite-preview is ridiculously inexpensive, fast, and effective to use in agents. That Gemini model feels almost free to use - I've been averaging about $1 per 10 million tokens (combined in/out for the particular tasks I've run on it lately). It's much smarter, more capable, more knowledgeable, and dramatically faster than any local LLM you could self-host.

So don't sweat getting hardware now. Qwen 3.6 35b has been the first local model that actually makes production coding work with a self-hosted GPU seem doable, but it's still not anywhere near as good as basically any huge model. I'm amazed at what that model can achieve with only a 16GB GPU, but the testing I do with it is just so that I have a workable system if/when any of those services were to evaporate, or, for example, if/when those services were to experience outages. If they were to ever disappear completely, I'd immediately buy a big clustered setup like the ones in the link above, and run GLM, Kimi, Minimax, etc.

Nick Antonaccio (Admin)
Apr 29, 2026 at 00:06 (edited, 5 revisions)
#7

There really isn't much need to rely on locally hosted models, or to buy server hardware right now. The only critical local inference use for me is processing private data in HIPAA compliant ways, but if needed, I could replace those tasks using, for example, Azure OpenAI Service, Google Vertex AI, AWS Bedrock, or other APIs that provide fully HIPAA compliant offerings, signed BAAs, etc.

For everything else - all my real work - I rely on ChatGPT and hosted LLM APIs (the Gemini 3.1 Flash Lite preview has been an amazing workhorse lately).

There are so many companies all over the world competing that I don't expect cheap, high-quality inference to just disappear any time soon. I do expect there will be plenty of corporate financial casualties at some point, and probably at least some providers will go out of business.

You can see Anthropic's inference prices skyrocketing recently, and they continue to lock down their APIs more and more to be used primarily with their own tools. Maybe that's a sign of things to come, but at the same time, the Chinese companies keep putting out extraordinarily high quality models, with APIs at a fraction of the cost. GLM, Kimi, Xiaomi, and Deepseek's prices are all low - and I don't think Google will disappear or stop competing on price any time soon. Google owns their own TPUs, they control their entire research stack from the ground up, they're incredibly well funded, and they're in it for the long run.

And there are always big new models released on Openrouter which you can use for a while entirely for free (with data caps). The big Qwen 3.6 Plus model was available for weeks for free, and Openrouter continues to provide consistently free access to models which compete with or utterly beat anything that would run on the kind of consumer GPU hardware we're discussing in this thread.

So if you really want to maximize a local Mac workflow, right now is a great time to test out all the local agentic harnesses, and use the cheap/free LLM APIs on Openrouter. Get to know the strengths and weaknesses of all those cheap hosted models, and explore the exact models you'd run locally, many of which have been consistently free on Openrouter for a long time (such as GPT-OSS:120b, which has had a free offering since last summer):

https://openrouter.ai/models?q=free

Many of the free models have usage caps, but you can use them to get to know which ones you like, and actually get some work done. Here are just a few of the current freebies:

OpenRouter Official Collection
openrouter/free - Free Models Router (auto-selects from available free models)
Tencent
tencent/hy3-preview:free - Hy3 preview (Going away May 8, 2026)
InclusionAI
inclusionai/ling-2.6-1t:free - Ling-2.6-1T (Going away April 30, 2026)
inclusionai/ling-2.6-flash:free - Ling-2.6-flash (Going away April 29, 2026)
NVIDIA
nvidia/nemotron-3-super-120b-a12b:free - Nemotron 3 Super (120B MoE)
nvidia/nemotron-3-nano-30b-a3b:free - Nemotron 3 Nano 30B A3B
nvidia/nemotron-nano-9b-v2:free - Nemotron Nano 9B V2
nvidia/nemotron-nano-12b-v2-vl:free - Nemotron Nano 12B 2 VL
nvidia/llama-nemotron-embed-vl-1b-v2:free - Llama Nemotron Embed VL 1B V2
OpenAI
openai/gpt-oss-120b:free - GPT-OSS 120B (MoE, 5.1B activated parameters)
openai/gpt-oss-20b:free - GPT-OSS 20B
Google
google/lyria-3-pro-preview:free - Lyria 3 Pro Preview
google/lyria-3-clip-preview:free - Lyria 3 Clip Preview
google/gemma-4-31b-it:free - Gemma 4 31B IT
google/gemma-4-26b-a4b-it:free - Gemma 4 26B A4B IT
google/gemma-3-27b-it:free - Gemma 3 27B IT
google/gemma-3-12b-it:free - Gemma 3 12B IT
google/gemma-3-4b-it:free - Gemma 3 4B IT
google/gemma-3n-e2b-it:free - Gemma 3N E2B IT
google/gemma-3n-e4b-it:free - Gemma 3N E4B IT
Z.ai
z-ai/glm-4.5-air:free - GLM 4.5 Air
MiniMax
minimax/minimax-m2.5:free - MiniMax M2.5
Qwen (Alibaba)
qwen/qwen3-coder-480b-a35b-instruct:free - Qwen3 Coder 480B A35B
qwen/qwen3-next-80b-a3b-instruct:free - Qwen3 Next 80B A3B
Meta (Llama)
meta-llama/llama-3.3-70b-instruct:free - Llama 3.3 70B Instruct
meta-llama/llama-3.2-3b-instruct:free - Llama 3.2 3B Instruct
Nous Research
nousresearch/hermes-3-llama-3.1-405b:free - Hermes 3 Llama 3.1 405B
LiquidAI
liquid/lfm-2.5-1.2b-thinking:free - LFM 2.5-1.2B Thinking
liquid/lfm-2.5-1.2b-instruct:free - LFM 2.5-1.2B Instruct
Cognitive Computations
cognitivecomputations/dolphin-mistral-24b-venice-edition:free - Dolphin Mistral 24B Venice Edition
ByteDance (Experimental)
bytedance/seedance-1-5-pro:free - Seedance 1.5 Pro (experimental)

So it's a great time to get really good at using Hermes, Pi, and all the claw agents, with all those models. Then, if there's ever a reason to rely on a locally hosted LLM, you can just switch your agents over to using a model running in LM Studio, Ollama, Jan, etc., and your entire pipeline, together with all your established workflows, can otherwise stay completely in place. Changing LLM APIs takes just a few seconds with Openrouter, and nothing about the rest of your workflow needs to change.
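
To make the "switch your agents over" point concrete: with anything that speaks the OpenAI-compatible API, moving between OpenRouter and a local LM Studio / Ollama model is a two-line change. A minimal sketch - the model IDs are examples from the list above, and the local server address and model name are placeholders:

```python
import os
from openai import OpenAI

USE_LOCAL = os.environ.get("USE_LOCAL") == "1"

if USE_LOCAL:
    # Local model served by LM Studio / Ollama / llama.cpp (placeholder name and port).
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
    model = "qwen-3.6-35b"
else:
    # Hosted model via OpenRouter - same client, different base_url and key.
    client = OpenAI(base_url="https://openrouter.ai/api/v1",
                    api_key=os.environ["OPENROUTER_API_KEY"])
    model = "openai/gpt-oss-120b:free"

resp = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Summarize what a KV cache does in one sentence."}],
)
print(resp.choices[0].message.content)
```

Everything else in the agent - prompts, tools, file handling - stays exactly the same, which is the whole point.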

You can run most agent applications on $100 netbooks, so any Mac should work, if you really prefer to stay in the Mac ecosystem for local hardware. Really, you can run agents on almost any local machine (even Raspberry Pis and mobile phones), so just hook them up to a good, cheap, sustainable LLM API and get work done:

https://openrouter.ai/models

I've been using a $20/month ChatGPT subscription for years to complete absolutely huge volumes of production code development, and OpenAI has never imposed rate limits or data caps on that work (amazingly), but that could get blocked abruptly at any moment, just like API access to Claude for Openclaw was abruptly stopped by Anthropic recently. And OpenAI recently put an end to Sora, which seemed world-changing just a year ago. I fully expect more companies to follow suit, stopping or capping access to loss leaders like the ChatGPT interface, which they use to acquire users, hoping to funnel organizations into paid API access. Eventually all those free tokens need to make money, or come to an end.

I've prepared for the eventuality of services being cut off by getting local agents like Hermes and Pi set up and ready to go with models like Google/gemini-3.1-flash-lite-preview (or any others that work perfectly well for my needs). I've used those tools to build several significant production software solutions, so I trust them, know how to use them well, and can instantly drop any project that I'm currently working on with ChatGPT directly into those environments and continue without even a hiccup.

But all the companies won't just go out of business at once, or stop every inexpensive API service just like that. As long as I can connect to a high-quality LLM API somewhere, the whole local agent workflow works just fine, and is very productive - and I can switch to any locally hosted LLM that I trust, if/when it's ever needed (perhaps I want to travel somewhere with no Internet available). And if the whole AI ecosystem were to come crumbling down around us, I'd push forward with those tools, but that's not on the foreseeable horizon.

So for now, I do all my local LLM inference configuration/testing basically for my own interest, and of course to keep up with the ecosystem developments, and to actually have some usable self-hosted tools available, but I'm not going to stop using hosted APIs for my core daily work, any time soon, especially as long as outrageously cheap solutions like ChatGPT and google/gemini-3.1-flash-lite-preview are available.

So, all that discussion is to convey: don't feel pressed to buy a server right now. Hardware is stupidly expensive at the moment. Put effort instead into running local agents on any machines you own already. Use Macs or whatever else you've got. Get to know Openrouter and all the hundreds of models you can use there. Take advantage of free preview models whenever they come out. And if a company does happen to go out of business, watch for fire sales of all that datacenter hardware ;)
