AMD Ryzen Threadripper PRO 3995WX Performance
For this exercise, we are using our legacy Linux-Bench scripts which help us see cross-platform “least common denominator” results we have been using for years as well as several results from our updated Linux-Bench2 scripts. At this point, our benchmarking sessions take days to run and we are generating well over a thousand data points. We are also running workloads for software companies that want to see how their software works on the latest hardware. As a result, this is a small sample of the data we are collecting and can share publicly. Our position is always that we are happy to provide some free data but we also have services to let companies run their own workloads in our lab, such as with our DemoEval service. What we do provide is an extremely controlled environment where we know every step is exactly the same and each run is done in a real-world data center, not a test bench. We are also going to share some of our workstation benchmarks from our Lenovo ThinkStation P620 review to highlight how the configuration we tested is similar to a traditional dual Intel Xeon Scalable system.
We are going to show off a few results, and highlight a number of interesting data points in this article.
Python Linux 4.4.2 Kernel Compile Benchmark
This is one of the most requested benchmarks for STH over the past few years. The task is simple: we take the Linux 4.4.2 kernel from kernel.org along with a standard configuration file, then run make with the standard auto-generated configuration utilizing every thread in the system. We are expressing results in terms of compiles per hour to make the results easier to read:
Plenty of cores and TDP means we see a solid figure here. We also get a nice benefit from the additional memory bandwidth.
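For readers who want to reproduce the compiles-per-hour metric on their own systems, here is a minimal Python sketch of the conversion. It is not our Linux-Bench script. The helper names `time_command` and `compiles_per_hour` are hypothetical, and the 45-second figure is a made-up input for illustration, not a measured result.

```python
import subprocess
import time


def time_command(cmd, cwd=None):
    """Run cmd (a list of arguments) and return the wall-clock seconds it took."""
    start = time.perf_counter()
    subprocess.run(cmd, cwd=cwd, check=True)
    return time.perf_counter() - start


def compiles_per_hour(seconds_per_compile):
    """Convert the duration of one full compile into compiles per hour."""
    if seconds_per_compile <= 0:
        raise ValueError("compile time must be positive")
    return 3600.0 / seconds_per_compile


# Hypothetical elapsed time, for illustration only (not a measured result).
example_seconds = 45.0
print(compiles_per_hour(example_seconds))  # 80.0
```

To time a real build, one could pass something like `["make", "-j", str(os.cpu_count())]` to `time_command`, with `cwd` pointing at an already configured kernel source tree, and feed the returned seconds into `compiles_per_hour`.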
c-ray 1.1 Performance
We have been using c-ray for our performance testing for years now. It is a ray tracing benchmark that is extremely popular to show differences in processors under multi-threaded workloads. We are going to use our 8K results which work well at this end of the performance spectrum.
This test relies much less on memory bandwidth, and interestingly we see the Threadripper 3990X show a small gain. Logically, this makes sense given we have very similar compute complexes.
7-zip Compression Performance
7-zip is a widely used compression/decompression program that works cross-platform. We started using the program during our early days with Windows testing. It is now part of Linux-Bench.
While the AMD Ryzen Threadripper 2990WX also had the “WX” moniker, we see here a result that shows a bit of why it was not necessarily a clear winner in the workstation market. Perhaps it was an attempt to get OEMs to bite on AMD platforms for their professional workstations that did not work at the time.
NAMD Performance
NAMD is a molecular modeling benchmark developed by the Theoretical and Computational Biophysics Group in the Beckman Institute for Advanced Science and Technology at the University of Illinois at Urbana-Champaign. More information on the benchmark can be found here. Here are the comparison results for the legacy data set:
Here we got absolutely great results. There is not much more we can say about this.
OpenSSL Performance
OpenSSL is widely used to secure communications between servers. This is an important library in many server stacks. We first look at our sign tests:
Here are the verify results:
One may notice we have been adding a number of different dual Intel Xeon offerings in these charts. We also have the single-socket Intel Xeon W-3275, Xeon W-2295, and Xeon W-1290P just to show those levels of single-socket performance. To be clear, in this market many will have per-core software licensing and not necessarily want a 64-core CPU. Still, for the segment that is scaling to higher-core counts, this is impressive.
UnixBench Dhrystone 2 and Whetstone Benchmarks
Some of the longest-running tests at STH are the venerable UnixBench 5.1.3 Dhrystone 2 and Whetstone results. They are certainly aging; however, we constantly get requests for them, and many angry notes when we leave them out. UnixBench is widely used so we are including it in this data set. Here are the Dhrystone 2 results:
Here are the Whetstone results:
We are trying to phase out these results, but since this is effectively a 2019-2020 era chip, we wanted to still include them here for comparison purposes.
Chess Benchmarking
Chess is an interesting use case since it has almost unlimited complexity. Over the years, we have received a number of requests to bring back chess benchmarking. We have been profiling systems and are ready to start sharing results:
Something that we will note is that in the EPYC 7003 “Milan” generation we get a large speedup here from the bmi2 performance. Since these chips are launching around the same time that Milan is arriving at hyper-scalers, we double-checked the results here and confirmed that they are more in line with the “Rome” generation.
STH STFB KVM Virtualization Testing
One of the other workloads we wanted to share is from one of our DemoEval customers. We have permission to publish the results, but the application itself being tested is closed source. This is a KVM virtualization-based workload where our client is testing how many VMs it can have online at a given time while completing work under the target SLA. Each VM is a self-contained worker.
Here there are a few key points. First, the higher memory VMs are being impacted more by memory capacity than CPU performance. This is a workload geared more towards servers, and when we were profiling it we expected 256GB to be a minimum configuration. Still, as we move away from the most memory capacity bound side to the more CPU/memory bandwidth bound side, we get a nice speed-up versus the Threadripper 3990X. There is a lot that goes into the performance, including memory capacity, bandwidth, and CPU performance, and it is nice to see all of those showing impact here.
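To put the memory bandwidth side in rough numbers, here is a small Python sketch of theoretical peak DRAM bandwidth. It assumes DDR4-3200 on both platforms, eight channels on the Threadripper PRO versus four on the Threadripper 3990X, and a 64-bit (8-byte) data path per channel. The function name `peak_bandwidth_gbs` is hypothetical, and these are theoretical ceilings rather than measured figures.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes).

    Assumes a 64-bit (8-byte) data path per channel, as on standard DDR4.
    """
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


# Assumed configurations: DDR4-3200 with 8 channels versus 4 channels.
threadripper_pro = peak_bandwidth_gbs(8, 3200)
threadripper = peak_bandwidth_gbs(4, 3200)

print(threadripper_pro)  # 204.8
print(threadripper)      # 102.4
```

Real workloads will see less than these ceilings, but the 2x gap in theoretical bandwidth is consistent with the uplift appearing as the workload becomes less capacity bound.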
Workstation SPECworkstation 3.0.2
SPECworkstation 3 has been updated to 3.0.2 which measures the 3D graphics performance of systems running under the OpenGL and DirectX application programming interfaces. As a result of the new update, we cannot compare against past version 3 results so we will show the screenshot of the results here and graph them in later reviews.
We just wanted to show the performance of these two workstations. William previously reviewed the Dell Precision T7920, and Dmitrij uses one that we customized for our router/firewall testing. Here we have Platinum 8260s, which were in our test configuration, but as we showed, the Intel Xeon Gold 6240R is probably the right Big 2nd Gen Intel Xeon Scalable Refresh SKU.
A Word on Power Consumption
Taking a quick look at power consumption, having a single CPU helps a lot here:
For most of our readers, overall performance, performance per core, or other metrics will likely take precedence over power consumption, but it is still worth noting. The AMD Threadripper 3995WX is not targeting the Xeon W-3275 as much as it is the dual Xeon workstation market.
On that market note, we are going to discuss the market impact followed by our final words.
Great review.
Got mine at the end of October and have been using it with 512GB of Micron RAM. It easily beats a 2P E5-2670 v3 by a factor of 5 to 10 on my compute task.
Small form factor and hardly any noise or heat, so I can keep it under my desk. Have no complaints about it.
Any comments on the recently sighted Asus and Gigabyte WRX80 motherboards?
The Threadripper Pro 3995WX is arguably AMD’s fastest multicore chip, since it is clearly higher clocked than the Epyc 7H12 while its eight memory channels can give it the edge over the vanilla Threadripper 3990X. We’re probably nearing the Threadripper 5000 series with CES looming, which would leave Threadripper Pro probably arriving a few months later, after Epyc Milan is formally unveiled (Feb?). I would imagine that OEMs are waiting on the next generation before widely adopting this platform.
I do wish the Threadripper Pro offered some form of overclocking support. While stability is important, many of the tasks that fall into the workstation segment (CPU based video editing, rendering) can weigh that trade off. Even tuning where the base clock is increased but the turbo is lowered would be a viable trade off. This keeps clocks within specification but power draw certainly would not be. Similarly, system cooling has to keep up with the added power draw. Though for many use-cases that is a viable trade off. Another area where overclocking would pay off for some users is with memory and being able to climb to DDR4-3600. This not only improves memory bandwidth but also several of the on-die buses run off of this clock for improved performance. Moving to DDR4-4000 invokes a bus ratio change so while raw memory bandwidth increases, overall performance can actually decrease due to lowered clocks elsewhere. If Asus or Gigabyte adopt Threadripper Pro, I’d be curious whether they’ll support higher memory clocks, but it does make sense that an OEM like Lenovo would strictly stick to official specs.
Lastly, one feature I was hoping AMD would enable would be raw Infinity Fabric support over PCIe slots to various Radeon graphics cards. AMD has started to leverage Infinity Fabric links on their high-end GPUs (Vega 20, MI100). This would not only provide more bandwidth between the CPU and GPU for compute focused workloads but fully coherent memory addressability between the two. That is a huge latency benefit which has traditionally been a bottleneck.
With regards to the P620, I do wish more of the PCIe lanes were put to use in the system. The Threadripper Pro platform with the WRX80 chipset has 136 open lanes available (120 from the CPU, 16 from the chipset). All the slots should be 16x as there are lanes to spare in addition to those used by storage. Various peripherals like audio can hang off of the chipset at lower PCIe lane widths and data rates without compromising their functionality. A single Ethernet connector is surprising even if it supports 10 Gbit. Several workstation use-cases I’ve dealt with have them sit on two separated networks.
Kevin G: the chip overclocks itself. It’s called turbo mode. And since this is the highest configuration, I also think AMD put there the highest capable chip. No need ever, ever to tweak it. AMD did that for you already.
Curious to see if ASRock pump out a ‘server’ board for this too.