What would be the cheapest and most cost-efficient way of self-hosting LLMs?
I've a minipc running an AMD 5700U where I host some services, including ollama and openwebui.
Unfortunately ROCm support isn't quite there yet, not to mention support for mobile GPUs.
Surprisingly the prompts work when configured to use the CPU, but the speed is just... well, not good.
So, what would be a cheap and energy-efficient setup to run some kind of LLM for personal use, but still get decent speed?
I was thinking about getting an eGPU enclosure, but I'm not sure how solid that would end up being.
I don't have an answer for you, partly because there isn't enough information about your aims. However, you can probably work this out yourself, compare prices for different hardware. You'd need some of that missing information to run the numbers, though.
I would imagine that an important input here is your expected usage.
If you just want to set up a box to run a chatbot occasionally, at maybe 1% utilization, the costs look different from a machine doing batch-processing jobs 24/7. The GPU is probably the dominant energy consumer in the build, so if it's running 24/7, the GPU's energy efficiency becomes a lot more important.
If you have that usage figure, you can estimate the electricity consumption of your GPU.
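As a sketch, that estimate is just wattage x hours x utilization x electricity price. All the numbers below are assumptions — substitute your own GPU wattage, duty cycle, and local rate:

```python
# Back-of-the-envelope yearly electricity cost for a GPU at a given duty cycle.
# Every figure here (300 W card, $0.30/kWh, 1% vs. 100% utilization) is an
# illustrative assumption, not a measurement.
def annual_energy_cost(gpu_watts, utilization, price_per_kwh, idle_watts=0.0):
    """Estimated cost in currency units per year."""
    hours_per_year = 24 * 365
    active_kwh = gpu_watts / 1000 * hours_per_year * utilization
    idle_kwh = idle_watts / 1000 * hours_per_year * (1 - utilization)
    return (active_kwh + idle_kwh) * price_per_kwh

# Occasional chatbot use (~1% utilization) vs. sustained 24/7 compute:
print(round(annual_energy_cost(300, 0.01, 0.30), 2))  # occasional use
print(round(annual_energy_cost(300, 1.00, 0.30), 2))  # running flat out
```

The two cases differ by two orders of magnitude, which is why the utilization figure matters so much before picking hardware.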
A second factor here, especially if you want interactive use, is what level of performance is acceptable to you. That may, depending upon your budget and use, be the dominant concern. You've got a baseline to work with.
If you have those figures -- how much performance you want, and what your usage rate is -- you can probably estimate and compare various hardware possibilities.
I'd throw a couple of thoughts out there.
First, if what you want is sustained, 24/7 compute, you probably can look at what's in existing, commercial data centers as a starting point, since people will have similar constraints. If what you care about is much less frequent, it may look different.
Second, if you intend to use this for intermittent LLM use and have the budget and interest in playing games, you may want to make a game-oriented machine. Having a beefy GPU is useful both for running LLMs and playing games. That may differ radically from a build intended just to run LLMs. If you already have a desktop, just sticking a more-powerful GPU in may be the "best" route.
Third, if performance is paramount, your application may, depending on what it is, be able to make use of multiple GPUs.
Fourth, the applications you want to run (it sounds like you may have decided on Nvidia already) may affect what hardware is acceptable. AMD vs. Nvidia, first of all, but also VRAM: many applications have minimum VRAM requirements, and the size of the model imposes constraints. Get a GPU without enough VRAM for what you want to run, and you can't run the model at all.
Fifth, if you have not already, you may want to consider the possibility of not self-hosting at all, if you expect your use to be particularly intermittent and you have high hardware requirements. Something like vast.ai lets you rent hardware with beefy compute cards, which can be cheaper if your demands are intermittent, because the costs are spread across multiple users. If your use is to run a very occasional chatbot and you care a lot about performance and want to run very large models, you could, for example, use a system with an H100 for about $3/hour. An H100 costs about $30k and has 80GB of VRAM. If you want to run a chatbot one weekend a month for fun and want to run a model that requires 80GB -- an extreme case -- renting is going to be a lot more economical than buying the same hardware yourself.
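The break-even arithmetic behind that rent-vs-buy comparison is simple. The $3/hour and ~$30k figures come from above; the usage hours are an assumption:

```python
# How many years of a given usage pattern before buying beats renting?
# (Ignores electricity, depreciation, and resale value, so it favors buying.)
def break_even_years(purchase_price, hourly_rate, hours_per_year):
    hours_to_break_even = purchase_price / hourly_rate
    return hours_to_break_even / hours_per_year

# One weekend a month at ~16 hours of actual use per weekend:
print(break_even_years(30_000, 3.0, 16 * 12))  # decades before buying wins
```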
Sixth, electricity costs where you are are going to be a factor. And if this system is going to be indoors and you live somewhere warm, you can multiply the cost for increased air conditioning load.
It would be the first scenario you described... I'd just interact with a chatbot occasionally, like I do with ChatGPT now... but I'd also like to experiment with Copilot-like models to test and use with VS Code. So no training of models or 24/7 batch operations.
I was wondering whether a custom-built gaming PC is the only solution here, or if there are cheaper alternatives that get the job decently done.
A Chinese mining motherboard with any Xeon CPU (because of the abundance of PCIe lanes on those CPUs), 12x 16GB Nvidia P100s with NVLink bridges (the P40 gives 24GB of VRAM but lacks NVLink and has slower GDDR5 memory; the P100, on the contrary, has HBM memory, which is good for LLMs), and 8x 64GB of DDR4 ECC RAM. If you buy the GPUs from eBay and the other components (motherboard, CPU, RAM, SSD) from AliExpress, it comes out quite cheap.
Aren't those all old-ish sockets with Xeons that top out at like 40 PCIe lanes?
Problem is, new i7/i9 CPUs top out at 24 PCIe lanes, so old Xeons are still the bang-for-the-buck option in the homelab sector.
Mining boards give about 1-2 PCIe lanes per GPU because there are so many GPUs. Also look up Nvidia mining GPUs: they're restricted to PCIe x1 and/or x4 even if you put them into an x16 slot.
Jesus. Kinda overkill, depending on how many parameters the model has and the float precision.
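The arithmetic behind that: a rough lower bound on VRAM is parameter count times bytes per parameter (fp16 = 2 bytes, 8-bit = 1, 4-bit = 0.5). Real usage runs higher because of the KV cache and activations; the model sizes below are just illustrative:

```python
# Rough VRAM needed just to hold a model's weights at a given precision.
def weights_gb(params_billion, bytes_per_param):
    return params_billion * 1e9 * bytes_per_param / 1024**3

# Common hobbyist configurations (illustrative, not exhaustive):
for params, bits in [(7, 16), (13, 8), (70, 4)]:
    print(f"{params}B @ {bits}-bit: about {weights_gb(params, bits / 8):.1f} GB")
```

Even a 70B model at 4-bit fits in well under the 192GB of a 12x P100 rig, which is why it reads as overkill for personal chatbot use.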
I have my gaming PC running as an ollama host when I need it (RX 6700XT with ROCm doing the heavy lifting). The PC idles at ~50W and draws up to 200W when generating an answer. It is plenty fast though.
My mini PC home server runs openwebui with access to this "ollama instance", but also OpenAI's API for when I just need a quick answer and therefore don't turn on my PC.
I have the exact same GPU and tried that, but couldn't get the ollama Docker version (ROCm) to work with the GPU, even after changing the env variable to 10.30.1 (rocminfo reports gfx1031).
Would you mind giving some instructions or a link?
If you're lucky, you just set it to the wrong version; mine uses 10.3.0 (see below).
I tried running the Docker container first as well, but gave up since there are separate versions for CUDA and ROCm, and the ROCm runtime comes packaged with the image too, which makes it unnecessarily big.
I am running it on Fedora natively. I installed it with the setup script from the top of the docs:
curl -fsSL https://ollama.com/install.sh | sh
After that I created a service file (also described in the linked docs) so that it starts at boot time (so I can just boot my PC and forget it without needing to log in).
The crucial part for the GPU in question (RX 6700XT) was this line under the [Service] section:
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.0"
As you stated, this sets the environment variable for ROCm. Also, to be able to reach it from outside localhost (for my server):
Environment="OLLAMA_HOST=0.0.0.0"
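Put together, the [Service] section of the unit file ends up looking something like this. This is a sketch based on the service file from the ollama Linux docs; the ExecStart path and user/group may differ on your system:

```ini
[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
# Tell ROCm to treat the RX 6700XT (gfx1031) as gfx1030:
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.0"
# Listen on all interfaces so other machines on the network can reach it:
Environment="OLLAMA_HOST=0.0.0.0"
```

After editing, reload with `systemctl daemon-reload` and restart the service so the new environment variables take effect.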
Oh man... I'm such a dumb-dumb... didn't even try 10.3.0. Now I did, and the Docker version works and is extremely fast compared to the CPU... Thank you so much.
Glad I could help ;)
For me it was 0.0.0.0:11434
Just a noob question: is there any advantage (besides privacy) of using that setup instead of using GPT-4 on OpenAI's website?
You can get different results, sometimes better, sometimes worse, most of the time differently phrased (e.g. the Gemma models by Google like to make bullet lists and sometimes tell me where they got their information from). There are models specifically trained/finetuned for different tasks (mostly coding, but also writing stories, answering medical questions, telling me what is in a picture, speaking different languages, running on smaller/bigger hardware, etc.). Have a look at ollama's library of models, which is outright tiny compared to e.g. Hugging Face.
Also, I don't trust OpenAI and others to be confidential with company data or code snippets from work that I feed them.
You can use OpenCL instead of ROCm for GPU offloading. In my tests with llama.cpp that improved performance massively.
Definitely do benchmarks for how many layers you can offload to the GPU. You'll see when it's too many, as performance will crater.
By launching llama.cpp as a server you'll actually be able to continue to use openwebui as you currently have.
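A minimal launch sketch under those assumptions (`llama-server` is llama.cpp's bundled server binary; the model path and the `-ngl` layer count are placeholders you'd benchmark yourself):

```shell
# Serve a local GGUF model over HTTP so openwebui (or curl) can reach it.
# Assumptions: llama.cpp built with GPU (OpenCL/ROCm) support;
# ./models/your-model.gguf is a placeholder path; tune -ngl (the number of
# layers offloaded to the GPU) downward if performance craters.
./llama-server -m ./models/your-model.gguf -ngl 20 --host 0.0.0.0 --port 8080
```

Then point openwebui at the server's address instead of (or alongside) the ollama instance.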
I’m using rocm with ollama and it works out of the box on 6900XT
Not sure about the setup, but I believe the 7600XT would be the best buy; it has 16GB of VRAM and it's supported now.
You could try llama.cpp; I think it's configured to run better on CPUs.
You could also try the ROCm fork of KoboldCpp
KoboldCpp bundles an interface on top of llama.cpp, and generally it's relatively easy to get running.
i.e. Ollama