Avoiding the Pitfalls of Big AI Inference on a Virtual Machine Host
There were quite a few times when we set up tests that ended up in the 0.1 tokens per second range. They ran, but we lost an order of magnitude on the performance side. Those results tended to come from configurations that were not well-mapped to the underlying hardware.
The first big one was failing to properly map the application to the underlying NUMA architecture. A server like this has two AMD EPYC 9965 CPUs for 384 cores total, but each 192-core CPU has direct access to only about 768GB of memory (12x 64GB DDR5 DIMMs.) We tried a configuration that used memory attached to CPU0 with cores from CPU1, and that, as expected, performed poorly.
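For readers who want to check how their own system is laid out, a quick look (assuming a Linux host with the numactl package installed) shows which cores and how much memory hang off each NUMA node, along with the inter-node distances:

numactl --hardware
lscpu | grep -i numa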

What performed better was running workloads on a single CPU. We could run smaller models on one CPU while the other handled general-purpose virtualization, and then we were not getting contention on the memory side. This worked well. With 128GB DDR5 DIMMs, a single CPU has the capacity to run the DeepSeek-R1 671B FP16 model. The 1.5TB server was good, but a 3TB server would be immensely more flexible to use as a virtualization host plus AI server.
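As a minimal sketch of what that pinning looks like (assuming a Linux host where Ollama is launched by hand rather than as a systemd service), numactl can bind the inference process to one socket's cores and local memory, leaving the other socket free for VMs:

numactl --cpunodebind=0 --membind=0 ollama serve

If the BIOS is set to NPS2 or NPS4, each socket shows up as multiple NUMA nodes, so the bindings need to list all of that socket's nodes instead.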
We already discussed the challenge of running HPC VMs that use a lot of memory bandwidth alongside this. The same goes for any VMs that are limited by memory bandwidth.
Perhaps the other interesting one was SMT. When we ran our SMT testing for the recent The Key Feature for Server CPUs is Still Two Threads Per Core piece, turning SMT off and letting Ollama run with a default number of threads absolutely shredded our ability to run VMs alongside the model. Response latencies on the other VMs spiked, and performance of the 671B FP16 model dropped. Ollama seems to use physical cores rather than SMT threads, but SMT=off plus additional VMs dropped our performance by an order of magnitude.
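One mitigation, shown here only as a sketch (the model tag and thread count are illustrative), is to explicitly cap Ollama's num_thread option so the model does not claim every physical core the VMs also need:

curl http://localhost:11434/api/generate -d '{"model": "deepseek-r1:671b", "prompt": "Why is the sky blue?", "options": {"num_thread": 320}}'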

Another smaller one was what happened when we started running multiple models simultaneously on the server. We got better performance on smaller models by pinning each one to a specific CPU (CPU0 or CPU1) than by running them on the same CPU, because of memory contention. This makes sense from an architecture standpoint, and we are focused on the large FP16 model. At the same time, folks will naturally see this and want to run smaller and faster models. That is possible, but memory bandwidth becomes a key limiter.
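As a sketch of how that pinning might look with two manually started Ollama instances (OLLAMA_HOST simply puts each one on its own port), each model then only touches its own socket's memory controllers:

OLLAMA_HOST=127.0.0.1:11434 numactl --cpunodebind=0 --membind=0 ollama serve
OLLAMA_HOST=127.0.0.1:11435 numactl --cpunodebind=1 --membind=1 ollama serve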
Final Words
The server we are working on is clearly a high-end virtualization server, not a bargain-basement platform. At the same time, many organizations are deploying $350-400K AI servers just to start using this class of large models, and we can do that here with a server that is roughly 1/10th the cost while also delivering value running other applications. Some of those $350-400K servers actually do not have enough GPU memory across eight GPUs to hold a model this large in GPU memory.

This is clearly not the answer if you want to compete in agent-to-agent transactions or with large-scale AI providers. On the other hand, for many of our readers who have servers from Q1 2021 and earlier, which are mostly 1st and 2nd Gen Intel Xeon Scalable in the installed base, this is a really neat way not just to pitch a 7:1 or 12:1 server consolidation project, but also to add AI inference capabilities with the upgrade. It turns out that having more memory and cores is good both for virtualization and for running these large AI models that each use over 1.2TB of memory.
For others, if you have extra server CPU capacity and memory available, especially in the current generation of 12-channel DDR5 servers, this is a “free” way to start working with new large AI models without having to sacrifice quality by running smaller models or heading down the quantization path.
With the stock market crash today, I’d wager that there are lots of us who will only experience high-end AI like this… at least in the near term
Excellence STH
I don’t think you’re grasping just how big this is. I wish AMD did AMX too, but I followed the same steps just running llama3.3 instead and it took exactly 4m 21s. Now I can say I’ve built an AI tool by using our ESXi virt cluster
count me in for the newly minted dozen vms with 70b models going head2head now
good work
Well done guys!
STH is and has been for a long time my absolute favourite Tech Site!
I love hardware so this is my cup of tea!
Run an FP8 model on one CPU and its memory; that would have been interesting to see performance-wise.
For a setup like this one could use QEMU’s ability to map physical characteristics of the underlying CPU, including L3 cache, and to pass the NUMA structure along. With this, the OS in the VM will be able to make sensible decisions about thread placement.
This requires digging into QEMU’s command line, which can be easily done through Proxmox’s VM configuration files (args:).
SMT needs to be correctly mapped by passing in the relevant SMP configuration: -smp 128,sockets=2,cores=64,threads=2,maxcpus=128
The above requires binding each vCPU to the host thread, and it has to be done after the VM starts up. For Proxmox the hookscript: directive can achieve this. Unfortunately it takes quite a bit of digging to get the exact QEMU process ids. Example scripts can be found on Proxmox’s forum.
Topology needs to be translated, together with host CPU physical bits and hugepages, and the host should pass through all of the relevant CPU flags; it is worth verifying in the VM that things like mitigations come through: -cpu host,phys-bits=0,topoext=on,+pdpe1gb
Unfortunately QEMU is not able to directly map NUMA to the guest. Even if presented with a single socket, the underlying configuration might still utilize NUMA via NPS or SNC. For this it’s optimal to hard-reserve memory in each NUMA node and bind it to a range of CPUs, for example binding 128GB for the first 16 vCPUs: -object memory-backend-ram,id=ram-node0,size=131072M,host-nodes=0,policy=bind,prealloc=on -numa node,nodeid=0,cpus=0-15,memdev=ram-node0
The above has to be repeated and adjusted to the host NUMA configuration. If using multiple sockets the distance between NUMA nodes also has to be specified once per direction: -numa dist,src=0,dst=1,val=11
It’s best to just use what numactl -H shows.
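Pieced together, and assuming NPS1 so each socket shows up as a single host node, the args: line ends up looking roughly like this (the vCPU counts, memory sizes, CPU ranges, and distances all need adjusting to whatever numactl -H reports):

args: -smp 128,sockets=2,cores=64,threads=2,maxcpus=128 -cpu host,topoext=on,+pdpe1gb -object memory-backend-ram,id=ram-node0,size=524288M,host-nodes=0,policy=bind,prealloc=on -numa node,nodeid=0,cpus=0-63,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=524288M,host-nodes=1,policy=bind,prealloc=on -numa node,nodeid=1,cpus=64-127,memdev=ram-node1 -numa dist,src=0,dst=1,val=11 -numa dist,src=1,dst=0,val=11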
Sadly that’s a lot of manual tuning, which could theoretically be introduced into Proxmox or QEMU itself (but that is a really hard problem to get right for every configuration). There’s also further advanced tuning, like moving interrupt handlers, kernel threads, and other processes onto dedicated host CPUs that won’t be used by the VM, so as to minimize cache thrashing from uses that are irrelevant from the VM’s perspective. QEMU IO threads should also be bound to the NUMA node containing the underlying storage. If possible, networking should be done with a NIC VF passed directly to the VM, also bound to the VM’s NUMA configuration (via -device pxb for a PCI bridge bound to a VM NUMA node, and -device vfio-pci for the NIC VF attached to said pxb device).
With this I was able to achieve ~95% performance running huge LLMs in a VM, at the price of the VM having a rigid configuration, since scaling resources isn’t trivial and requires recalculating parameters.
Hope this wall of text can be useful for someone to tinker with, or even a more advanced STH article :)
1.9 token/s? Why bother…
DeepSeek is natively trained at 8-bit; the 16- and 32-bit versions of the model are just for compatibility in case your accelerator/tools only support those formats. By running it that way here you just waste half the RAM capacity and memory bandwidth, with no gain in quality like a natively FP16-trained model would have.
I really appreciate the “all CPU” approach, but it’s possible to use LM Studio, FlexLLMGen and Genv with a single consumer graphics card, like a 5090, to get much faster performance.
I’d go with one CPU and 12x96GB RDIMMs (holds the 8-bit model) or 12x128GB (holds the 16-bit model) if you can afford it. Dual CPUs slow down a fundamentally sequential inferencing process substantially. Using a relatively small 48GB GPU for the KV cache / context buffers and letting the CPU and RAM handle the model layers should improve performance a bit.