Avoiding the Pitfalls of Big AI Inference on a Virtual Machine Host
There were quite a few times when we set up tests that ended up in the 0.1 tokens per second range. They ran, but we lost an order of magnitude on the performance side. Those results tended to come from configurations that were not well-mapped to the underlying hardware.
The first big one was failing to properly map the application to the underlying NUMA architecture. A server like this has two AMD EPYC 9965 CPUs for 384 cores total, but each 192-core CPU has direct access to only about 768GB of memory (12x 64GB DDR5 DIMMs.) We tried a configuration that used memory attached to CPU0 with cores from CPU1, and that, as expected, performed poorly.
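For readers who want to check how their own system is laid out, a quick look (assuming a Linux host with the numactl package installed) shows which cores and how much memory hang off each NUMA node, along with the inter-node distances:

numactl --hardware
lscpu | grep -i numa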

What performed better was running workloads on a single CPU. We could run smaller models on one CPU while the other handled general-purpose virtualization, and then we were not getting contention on the memory side. This worked well. With 128GB DDR5 DIMMs, a single CPU has the capacity to run the DeepSeek-R1 671B FP16 model. The 1.5TB server was good, but a 3TB server would be immensely more flexible to use as a virtualization host plus AI server.
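As a minimal sketch of what that pinning looks like (assuming a Linux host where Ollama is launched by hand rather than as a systemd service), numactl can bind the inference process to one socket's cores and local memory, leaving the other socket free for VMs:

numactl --cpunodebind=0 --membind=0 ollama serve

If the BIOS is set to NPS2 or NPS4, each socket shows up as multiple NUMA nodes, so the bindings need to list all of that socket's nodes instead.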
We already discussed the challenge of running HPC VMs that use a lot of memory bandwidth alongside this. The same goes for any VMs that are limited by memory bandwidth.
Perhaps the other interesting one was SMT. When we ran our SMT testing for the recent The Key Feature for Server CPUs is Still Two Threads Per Core piece, turning SMT off and letting Ollama run with a default number of threads absolutely shredded our ability to run VMs alongside the model. Response latencies on the other VMs spiked, and performance of the 671B FP16 model dropped. Ollama seems to use physical cores rather than SMT threads, but SMT=off plus additional VMs dropped our performance by an order of magnitude.
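One mitigation, shown here only as a sketch (the model tag and thread count are illustrative), is to explicitly cap Ollama's num_thread option so the model does not claim every physical core the VMs also need:

curl http://localhost:11434/api/generate -d '{"model": "deepseek-r1:671b", "prompt": "Why is the sky blue?", "options": {"num_thread": 320}}'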

Another smaller one was what happened when we started running multiple models simultaneously on the server. We got better performance on smaller models by pinning each one to a specific CPU (CPU0 or CPU1) than by running them on the same CPU, because of memory contention. This makes sense from an architecture standpoint, and we are focused on the large FP16 model. At the same time, folks will naturally see this and want to run smaller and faster models. That is possible, but memory bandwidth becomes a key limiter.
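As a sketch of how that pinning might look with two manually started Ollama instances (OLLAMA_HOST simply puts each one on its own port), each model then only touches its own socket's memory controllers:

OLLAMA_HOST=127.0.0.1:11434 numactl --cpunodebind=0 --membind=0 ollama serve
OLLAMA_HOST=127.0.0.1:11435 numactl --cpunodebind=1 --membind=1 ollama serve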
Final Words
The server we are working on is clearly a high-end virtualization server, not a bargain-basement platform. At the same time, many organizations are deploying $350-400K AI servers just to start using this class of large models, and we can do that here with a server that is roughly 1/10th the cost while also delivering value running other applications. Some of those $350-400K servers actually do not have enough GPU memory across eight GPUs to hold a model this large in GPU memory.

This is clearly not the answer if you want to compete in agent-to-agent transactions or with large-scale AI providers. On the other hand, for many of our readers who have servers from Q1 2021 and earlier, which are mostly 1st and 2nd Gen Intel Xeon Scalable in the installed base, this is a really neat way not just to pitch a 7:1 or 12:1 server consolidation project, but also to add AI inference capabilities with the upgrade. It turns out that having more memory and cores is good both for virtualization and for running these large AI models that each use over 1.2TB of memory.
For others, if you have extra server CPU capacity and memory available, especially in the current generation of 12-channel DDR5 servers, this is a “free” way to start working with new large AI models without having to sacrifice quality by running smaller models or heading down the quantization path.
With the stock market crash today, I’d wager that there are lots of us who will only experience high-end AI like this… at least in the near term
Excellence STH
I don’t think you’re grasping just how big this is. I wish AMD did AMX too, but I followed the same steps just running llama3.3 instead and it took exactly 4m 21s. Now I can say I’ve built an AI tool by using our ESXi virt cluster
count me in for the newly minted dozen vms with 70b models going head2head now
good work
Well done guys!
STH is and has been for a long time my absolute favourite Tech Site!
I love hardware so this is my cup of tea!
Run an FP8 model on one CPU and its memory; that would have been interesting to see performance-wise.
For a setup like this one could use QEMU’s ability to map physical characteristics of the underlying CPU, including L3 cache, and to pass the NUMA structure along. With this, the OS in the VM will be able to make sensible decisions about thread placement.
This requires digging into QEMU’s command line, which can be easily done through Proxmox’s VM configuration files (args:).
SMT needs to be correctly mapped by passing in the relevant SMP configuration: -smp 128,sockets=2,cores=64,threads=2,maxcpus=128
The above requires binding each vCPU to the host thread, and it has to be done after the VM starts up. For Proxmox the hookscript: directive can achieve this. Unfortunately it takes quite a bit of digging to get the exact QEMU process ids. Example scripts can be found on Proxmox’s forum.
Topology needs to be translated, together with host CPU physical bits and hugepages, and the host should pass through all of the relevant CPU flags; it is worth verifying in the VM that things like mitigations come through: -cpu host,phys-bits=0,topoext=on,+pdpe1gb
Unfortunately QEMU is not able to directly map NUMA to the guest. Even if presented with a single socket, the underlying configuration might still utilize NUMA via NPS or SNC. For this it’s optimal to hard-reserve memory in each NUMA node and bind it to a range of CPUs, for example binding 128GB for the first 16 vCPUs: -object memory-backend-ram,id=ram-node0,size=131072M,host-nodes=0,policy=bind,prealloc=on -numa node,nodeid=0,cpus=0-15,memdev=ram-node0
The above has to be repeated and adjusted to the host NUMA configuration. If using multiple sockets the distance between NUMA nodes also has to be specified once per direction: -numa dist,src=0,dst=1,val=11
It’s best to just use what numactl -H shows.
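Pieced together, and assuming NPS1 so each socket shows up as a single host node, the args: line ends up looking roughly like this (the vCPU counts, memory sizes, CPU ranges, and distances all need adjusting to whatever numactl -H reports):

args: -smp 128,sockets=2,cores=64,threads=2,maxcpus=128 -cpu host,topoext=on,+pdpe1gb -object memory-backend-ram,id=ram-node0,size=524288M,host-nodes=0,policy=bind,prealloc=on -numa node,nodeid=0,cpus=0-63,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=524288M,host-nodes=1,policy=bind,prealloc=on -numa node,nodeid=1,cpus=64-127,memdev=ram-node1 -numa dist,src=0,dst=1,val=11 -numa dist,src=1,dst=0,val=11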
Sadly that’s a lot of manual tuning, which could theoretically be introduced into Proxmox or QEMU itself (but that is a really hard problem to get right for every configuration). There’s also further advanced tuning, like moving interrupt handlers, kernel threads, and other processes onto dedicated host CPUs that won’t be used by the VM, so as to minimize cache thrashing from uses that are irrelevant from the VM’s perspective. QEMU IO threads should also be bound to the NUMA node containing the underlying storage. If possible, networking should be done with a NIC VF passed directly to the VM, also bound to the VM’s NUMA configuration (via -device pxb for a PCI bridge bound to a VM NUMA node, and -device vfio-pci for the NIC VF attached to said pxb device).
With this I was able to achieve ~95% performance running huge LLMs in a VM, at the price of the VM having a rigid configuration, since scaling resources isn’t trivial and requires recalculating parameters.
Hope this wall of text can be useful for someone to tinker with, or even a more advanced STH article :)
1.9 token/s? Why bother…
DeepSeek is natively trained at 8-bit; the 16- and 32-bit versions of the model are just for compatibility in case your accelerator/tools only support those formats. By running it that way here you just waste half the RAM capacity and memory bandwidth, with no gain in quality like a natively FP16-trained model would have.
I really appreciate the “all CPU” approach, but it’s possible to use LM Studio, FlexLLMGen and Genv with a single consumer graphics card, like a 5090, to get much faster performance.
I’d go with one CPU and 12x96GB RDIMMs (holds the 8-bit model) or 12x128GB (holds the 16-bit model) if you can afford it. Dual CPUs slow down a fundamentally sequential inferencing process substantially. Using a relatively small 48GB GPU for the KV cache / context buffers and letting the CPU and RAM handle the model layers should improve performance a bit.