By Nick Antonaccio, Apr 09, 2026

I've written many times previously on the old rebolforum about my machines with small consumer GPUs that can actually perform useful AI inference. I'm a big fan of older GPUs with lots of VRAM. The RTX 3090 is still one of the best price/performance buys for running small LLMs locally.

I think the 3090 is pretty well known as a good buy, but what I don't think is as well known is the RTX 3080ti mobile GPU. The mobile version of the 3080ti (it has to be mobile, and it has to be the 'ti' version) actually has more VRAM than the desktop 3080, so I don't think as many people search for it. That means there's a pile of used machines with this GPU available on eBay in the $1000 range. The mobile ti version has 16GB of VRAM. I've found that for all practical purposes, those mobile 3080ti GPUs can handle basically any task you can throw at a 24GB RTX 3090 desktop GPU.

GPT-OSS:20b runs super fast (typically 50-100tps) on the 3080ti mobile, and GPT-OSS:120b is usable, though much, much slower (7ish tps). A wide variety of the Qwen 3.5 and Gemma 4 models run well on this GPU. I've gotten a couple of laptops with 64GB RAM (on top of the dedicated 16GB VRAM) and 2TB hard drives for around $1000 - these are killer buys for this performance bracket.
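If you want to reproduce these tokens-per-second numbers on your own hardware, here's a minimal sketch of how I'd measure them, assuming the models are served through Ollama's default local API (the model tag and prompt below are just placeholders - substitute whatever you've pulled):

```python
# Rough tokens-per-second check against a local Ollama server.
# Assumes Ollama is running on its default port (11434) and the
# model tag below has already been pulled; adjust both as needed.
import requests

MODEL = "gpt-oss:20b"  # placeholder tag; substitute your own model
PROMPT = "Explain the difference between VRAM and system RAM."

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": MODEL, "prompt": PROMPT, "stream": False},
    timeout=600,
)
data = resp.json()

# Ollama reports eval_count (generated tokens) and eval_duration
# (nanoseconds spent generating), so tokens/sec falls right out:
tps = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"{MODEL}: {tps:.1f} tokens/sec over {data['eval_count']} tokens")
```

The same figures show up in the terminal if you just run `ollama run <model> --verbose` and read the "eval rate" line it prints after each response.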

One other laptop I bought recently was a Clevo Lambda Tensorbook:RTX3080 - I've been blown away by this purchase. This machine comes loaded with Ubuntu 24.04 (where I think the Nvidia drivers perform better than they do on Windows) and is listed with a plain old RTX3080 (no ti), but I did some research before purchasing, and this laptop includes a special model of the 3080 with 16GB of VRAM(!)

The Clevo laptop was actually built specifically for AI workloads, and boy does it rock for the price. I've even been able to use it to run nvidia-nemotron-3-super-120b-a12b at IQ3_XXS quantization - albeit only at 3.7 tokens per second - but that's still actually useful for some knowledge tasks, which is kind of unheard of on inexpensive machines (Nemotron Super is currently one of my favorite medium-sized models for knowledge work on inexpensive hardware). Gemma-4-26b-a4b with Q4_K_M quantization runs at 26tps, so very usable; qwen3.5-35b-a3b with Q8_0 quant runs at 12tps, which is also very useful (and of course a lower quant will run much faster). Good ole GPT-OSS:20b runs at 56tps, and gemma-4-e2b-it goes 90tps.
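A quick back-of-the-envelope calculation shows why the choice of quant matters so much on a 16GB card. This is just the standard weights-only estimate (parameter count times bits per weight, divided by 8), ignoring KV cache and runtime overhead, and the bits-per-weight values are approximate averages for each llama.cpp quant family:

```python
# Weights-only VRAM estimate: params (billions) * bits-per-weight / 8
# gives gigabytes. Bit widths are approximate averages for each
# llama.cpp quant family; KV cache and overhead add a few GB on top.
QUANT_BITS = {"Q8_0": 8.5, "Q4_K_M": 4.8, "IQ3_XXS": 3.1}

def est_vram_gb(params_billion: float, quant: str) -> float:
    return params_billion * QUANT_BITS[quant] / 8

for name, params, quant in [
    ("gemma-4-26b", 26, "Q4_K_M"),
    ("qwen3.5-35b", 35, "Q8_0"),
    ("nemotron-3-super-120b", 120, "IQ3_XXS"),
]:
    gb = est_vram_gb(params, quant)
    fits = "fits in 16GB" if gb <= 16 else "spills into system RAM"
    print(f"{name} @ {quant}: ~{gb:.1f} GB of weights ({fits})")
```

The bigger models clearly spill over into system RAM, which is exactly why the 64GB of RAM on these laptops matters almost as much as the 16GB of VRAM, and why the offloaded models run at 12tps and 3.7tps rather than GPU-only speeds.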

For a complete, portable computer which cost only $850 on eBay, that little Clevo gem has probably been my best GPU-enabled purchase for small local AI inference tasks.

The Strix Halo ASUS ROG Flow Z13 machine can load significantly bigger models, but it also costs about 3x as much (still an absolutely killer buy for the power), and Nvidia chips still have some advantages that come from the CUDA stack. So do your research to confirm the 16GB of VRAM (a quick check is sketched below), but don't hesitate to snatch up one of those Clevo machines, or any of the 3080ti mobile models. They can actually be used to get coding, agentic work, and even some knowledge work done, at a fraction of the cost of other hardware solutions.
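For that "confirm 16GB" step, a one-liner over nvidia-smi settles it the moment the machine arrives. This sketch just shells out to nvidia-smi's standard query flags, so it needs nothing beyond the Nvidia driver:

```python
# Confirm the GPU model and VRAM size via nvidia-smi's query interface.
# Requires the Nvidia driver to be installed; works on Linux and Windows.
import subprocess

out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
    text=True,
)
print(out.strip())  # e.g. "NVIDIA GeForce RTX 3080 Laptop GPU, 16384 MiB"
```

If that second column reads 16384 MiB rather than 8192, you've got the 16GB variant.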
