Post History

Current VersionJun 09, 2026 at 22:32

I have clients who use their local LLM server machines to power agents on remote client machines, and they require a truly scalable solution to serve multiple API services to many connected client machines.

For my own needs, I also regularly travel between multiple locations, and have LLM servers set up at several different physical sites. I use the same solution to access LLM APIs on all those machines (each of which is typically configured to serve a different collection of LLM models).

Remote desktop to manage the servers

To manage the servers remotely (not to power client API connections), I use Ruskdesk, to share 3 different Linux desktops, and Chrome Remote Desktop to access 4 Windows server machines.

Rustdesk is great because it's open source and doesn't require any 3rd party services whatsoever - I keep a lightweight discovery server set up on one of my inexpensive VPS accounts - and it works on machines where Chrome can't be installed, such as the Asus GX10s (and any version of the DGX Spark machines, which come with Nvidia's own version of Ubuntu for Arm, which currently doesn't support Chrome).

For inference, however, desktop sharing is just a quick hack that lets you log in and run a chat interaction. It doesn't help when you need to provide inference to a number of users, each running agents over an API connection on their local PCs.

LM Link is fine for small environments

LM Studio does make their proprietary LM Link sharing service available for free, which can be used on up to 5 machines. It's a convenient, simple solution which works well in private environments, but to use it you need to install LM Studio on any remote machine where you want to connect to the API served on another computer - and that 5 machine limit is a no-go for me.

Sharing needs to work without port forwarding

One of my issues is that the most used servers in my stable are at sites where T-mobile Home Internet is the ISP, and they don't enable port forwarding (plus port forwarding is messy and can potentially open up more surface area for security threats).

Ngrok may be useful for quick tests and ephemeral configurations

Ngrok is a popular solution for these sorts of situations, but for the number of connections I need, and for the ability to control domain names, to maintain persistent connections, etc., I'd need to pay for a tier of service which is fine for low usage requirements, but which make Ngrok not so ideally suited for clients who have more users and more demanding requirements to control multiple service connections.

Cloudflared is the best solution I've found

My solution has been to use Cloudflared. I love it. To use it, you just need a single domain name hosted with Cloudflare.

To configure Cloudflared:

Buy a single domain from Cloudflare(just a few dollars per year)
Log into your Cloudflare dashboard at https://dash.cloudflare.com
Select: Networking > Tunnels > Create Tunnel. This operation should be performed in a browser, directly on each server machine where you want to share an API. Enter a name for the tunnel (perform this configuration step in your browser at https://dash.cloudflare.com, on each server machine).
Download the 'Cloudflared' app for your server's OS/architecture, install it, and run the provided command line in your OS terminal, to start the Cloudflared service.
Set up a 'route' by providing a subdomain name for the particular application host and port you want to share (make up a name for the service you want to share) - you can add this service as a subdomain of any URL you have hosted with Cloudflare. For example, LM Studio typically serves it's API at http://localhost:1234. That URL should be entered as the 'Service URL' value in your Cloudflared route configuration (you perform this configuration step in your browser at https://dash.cloudflare.com, on each server machine).

That's it. Your API is now available to any remote machine. The Cloudflared app running on each server tunnels that server's connection from any local area network, to a subdomain which is available anywhere on the Internet. And you can change the local IP address of the machine, move the server to a different local WIFI connection, move it to an entirely different physical location, etc., without having to reconfigure anything.

You can now use any of the connected API routes, in any of your agents, on any remote client machine. For example, my Pi models.json file on a remote client PC will include something like the following, where 'mydomain' is the domain I have hosted with Cloudflare, and 'mysubdomain' is the subdomain I created in the 5th step above:

{
  "providers": {
    "lmstudio": {
      "baseUrl": "https://mysubdomain.mydomain.com/v1",
      "api": "openai-completions",
      "apiKey": "lm-studio",
      "compat": {
        "supportsDeveloperRole": false
      },
      "models": [
        {
          "id": "qwen3.6-27b-mtp@q6_k_xl",
          "reasoning": true,
          "compat": {
            "thinkingFormat": "qwen"
          }
        },
        {
          "id": "google/gemma-4-31b-qat",
          "input": [
            "text",
            "image"
          ]
        },
        {
          "id": "step-3.7-flash@iq3_xxs"
        },
      ]
    },
  }
}

Of course, you can prompt Pi to search for and add any remote models you've configured with Cloudflared. I used the prompt:

I have LM Studio API running at http://localhost:1234 on a remote Asus GX10 computer which has Cloudflared running. I have a route set up to that application at https://mysubdomain.mydomain.com Please add the models in that LM Studio instance, to the local install of Pi.

I added models from additional servers by running this prompt for every remote server (this runs quickly - I configured 6 servers right away):

I have another instance of LM Studio running at https://mysubdomain.mydomain.com/ Please add the models available on that server to the list of models available to this installation of Pi, without removing the models already configured (some models are the same on that server, perhaps we can rename them with a prefix of 'mysubdomain')

Then I used the following prompt to collect all that work into a single .json file which I can copy to any remote machine:

please copy the config file(s) needed to install these models into the Pi environment on other machines, to my downloads directory

And I went a step further to create a standalone prompt which I can simply paste into any running instance of Pi on any client computer:

Please generate a standalone prompt that can be used to install these models into Pi on other machines

With that single generated prompt, I can now go to any computer which has Pi installed, run the prompt, and I have direct API access to every single model installed on every one of my 6 servers, all running at different locations.

It just takes a few seconds on each new client machine. I even ran that prompt in Pi on my Android phone 😎 This is an operation even my non-technical clients have been able to perform (run CMD, type 'Pi', paste the prompt).

There are other options besides Cloudflared, which include using a totally self-hosted service on a VPS, to perform routing, but I trust Cloudflare's service to scale well and to reliably perform fast.

As far as I see, you can run as many Cloudflared instances, and as many subdomain routes, to as many server apps as you want, all using a single domain, all for free once you have a domain registered with Cloudflare, so this ends up being a ridiculously inexpensive and scalable solution for my clients who have demanding usage requirements. Just buy one domain from Cloudflare, and you've got unlimited self-hosted LLM API routing for all your AI servers/apps.

To be clear, the example above focused on LM Studio, but you can route the output of any application (including any other non-AI oriented service, on any computer), using Cloudflared. So set up LlamaCPP, vLLM, etc. and route any inference APIs using this technique, for any server you have set up, for any purpose. It's a wickedly performant, simple, and inexpensive solution.

Version 7Jun 09, 2026 at 22:32