Post History

Current version by Nick Antonaccio

Current Version: Apr 28, 2026 at 23:29

Wow, I just did an eBay search for 'laptop 3080ti' and the RAM prices are continuing to drive total machine cost significantly higher:

https://www.ebay.com/sch/i.html?_nkw=laptop+RTX+3080ti+&_sacat=0&_from=R40&_trksid=p2553889.m570.l1313

There was just a single Buy It Now listing for $1099:

https://www.ebay.com/itm/358479918274?_skw=laptop+RTX+3080ti&itmmeta=01KQA623BY0PAY0YW104F7XEZV&hash=item53771174c2:g:2eEAAeSwVNVp6pcA&itmprp=enc%3AAQALAAAA0GfYFPkwiKCW4ZNSs2u11xCpy8oUQ%2FqkPAfu%2FFRP7MB6R8Gfo5HauTomDGMzS%2FNjQFW3A%2BMZ5LYntlfOCpKU%2BKRp%2FJ6W5gj1nVqv2A4qKZ%2BRNZpjzAvV6iasa85D8PgY%2FzSoiQLEZG%2BphJ%2BthWkj%2FAQURTDVTgfluk9sfhN9lhZBDLTvuIW2QgjEIBC7YA0w2pZKAlfxjJPhzI6lw%2F4trcZYPBkrXGVxx6uq01coyVVZJNACldSWXDARUXDxBKKp8GECCiIZXkjBvcBV9TPAt7o%3D%7Ctkp%3ABk9SR5C2iMa6Zw

There were a bunch of similar machines for around $1300, but all of those current listings come with only 32GB RAM (along with 16GB VRAM in the GPU).

Everything I bought even a month ago had at least 64GB RAM, and all those machines were less than $1000 😲

If you're using models that fit entirely into VRAM, I'm not sure how much of a difference that RAM situation would make - I'd expect that as long as you're not running a bunch of other applications during inference, and you're offloading all the model layers onto the GPU, inference performance shouldn't take a hit. But I don't currently have a machine set up with 32GB RAM to test that.
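For reference, here's a minimal llama-cpp-python sketch of what I mean by offloading every layer onto the GPU - the model path is just a placeholder, any GGUF quant that fits in 16GB of VRAM would do:

```python
# Minimal sketch: load a GGUF model with every layer offloaded to the GPU,
# so system RAM mostly just holds the OS and the inference process itself.
# The model path is a placeholder, not a specific recommendation.
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-coder-model-q4_k_m.gguf",  # any GGUF that fits in VRAM
    n_gpu_layers=-1,  # -1 offloads all layers to the GPU
    n_ctx=8192,       # context length; bigger contexts use more VRAM for the KV cache
)

out = llm("Write a Python function that reverses a string.", max_tokens=256)
print(out["choices"][0]["text"])
```

With every layer on the GPU, system RAM mostly just needs to hold the OS and whatever else you run alongside the server, which is why I'd expect 32GB vs 64GB to matter less in that specific case.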

So that makes the current prices for Strix Halo even more impressive. For anything more performant than those sorts of machines with small Nvidia GPUs, I'd take a serious look at it. A like-new ASUS ROG Flow Z13 with 128GB of shared memory is holding steady at $2572 (+ tax) on Amazon:

https://www.amazon.com/gp/product/B0DW238TXK/ref=ox_sc_saved_title_1?smid=A2L77EE7U53NWQ&psc=1

I don't think there's currently any better bang for the buck for LLM inference than Strix Halo.

The next closest competitor is probably an ASUS Ascent GX10 for $3493 (+ tax):

https://www.amazon.com/gp/product/B0G1MQYHRD/ref=ox_sc_act_title_2?smid=ATVPDKIKX0DER&th=1

You're looking at $4000+ for a comparable Apple silicon product, and that ASUS does have a real Nvidia GPU, so you can use a genuine CUDA stack for any models that require CUDA (video generation models, for example, perform much better with CUDA). All of those little mini machines built around the Nvidia GB10 chip are also made for clustering, with high-speed ConnectX networking built in. I'm really interested in those machines for that reason, but for really big LLM models, a Mac Studio can also be clustered natively:

https://www.apple.com/shop/buy-mac/mac-studio/m3-ultra-chip-32-core-cpu-80-core-gpu-256gb-memory-4tb-storage

I wrote a post about why I'm considering eventually clustering some of those big rig machines:

https://aibynick.com/thread/26

For now, though, I'm still using APIs for all my commercial code generation work. ChatGPT still costs me only $20 per month for an absolutely outrageous volume of inference, and Google/gemini-3.1-flash-lite-preview is ridiculously inexpensive, fast, and effective to use in agents. That Gemini model feels almost free to use - I've been averaging about $1 per 10 million tokens (combined in/out for the particular tasks I've run on it lately). It's much smarter, more capable and knowledgeable, and dramatically faster than any local LLM you could self-host.
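To put that rate in perspective, here's the back-of-the-envelope arithmetic - the monthly token volume is an illustrative assumption, not a measurement from my actual jobs:

```python
# Rough cost arithmetic at ~$1 per 10 million combined in/out tokens.
# The monthly token volume is an illustrative assumption.
cost_per_token = 1.00 / 10_000_000   # ~ $0.0000001 per token

monthly_tokens = 250_000_000         # a heavy month of agent runs (assumed)
monthly_cost = monthly_tokens * cost_per_token

print(f"{monthly_tokens:,} tokens -> ${monthly_cost:.2f}/month")  # -> $25.00/month
```

Even a quarter-billion tokens a month at that rate works out to roughly the price of the ChatGPT subscription.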

So don't sweat getting hardware now. Qwen 3.6 35b has been the first local model that actually makes the thought of doing production coding work on a self-hosted GPU seem doable, but it's still nowhere near as good as basically any of the huge models. I'm amazed at what that model can achieve with only a 16GB GPU, but the testing I do with it is really just about having a workable fallback if/when any of those hosted services experience outages, or if they were to evaporate entirely. If they ever did disappear completely, I'd immediately buy a big clustered setup like the ones linked above, and run GLM, Kimi, Minimax, etc.
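As a sketch of what that fallback looks like in practice, the same OpenAI-style client code can point at either a hosted API or a local server (llama.cpp, vLLM, LM Studio, etc.) - the endpoints, model names, and keys below are placeholders, not my actual setup:

```python
# Sketch of the fallback idea: try the hosted API first, and if it's
# unreachable or erroring, point the same OpenAI-compatible client at a
# local server running a small model like Qwen. Endpoints, model names,
# and keys are placeholders.
from openai import OpenAI, OpenAIError

BACKENDS = [
    # (base_url, api_key, model) - hosted first, local fallback second
    ("https://api.example-hosted-llm.com/v1", "HOSTED_API_KEY", "hosted-model"),
    ("http://localhost:8080/v1", "not-needed", "qwen-coder-local"),
]

def ask(prompt: str) -> str:
    for base_url, api_key, model in BACKENDS:
        try:
            client = OpenAI(base_url=base_url, api_key=api_key)
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content
        except OpenAIError:
            continue  # this backend is down or misconfigured; try the next one
    raise RuntimeError("No LLM backend available")

print(ask("Refactor this function to remove the global state: ..."))
```

Nothing about the agent code has to change when the backend flips over - that's the whole point of keeping the local setup tested and ready.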
