AMD Milan-X: Performance and Power Consumption
Here we are going to take a look at two different aspects. First, what the technical computing/simulation/HPC community thinks about Milan-X. Second, what we found testing the chips ourselves. Then we are going to bring it together and discuss how you can think about Milan-X for your organization. To me, having seen our data and third-party data, that is actually the more impactful exercise.
AMD was pushing the concept of super-linear scaling. The basic idea is that if you can scale out and keep data warm in cache next to cores, you can get application speedups that exceed what you would expect from simply adding CPUs.
This is something several ISVs have shown, but it is far from something that all applications will show.
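To make the super-linear idea concrete, here is a toy model. It is purely illustrative, not AMD's or any ISV's methodology, and the working-set size and DRAM penalty are made-up assumptions: once each node's slice of the data fits within the 768MB of L3 per socket, the per-node work stops paying a memory penalty, so the total speedup can exceed the node count.

```python
# Toy model of super-linear scaling. Illustrative only; the working set
# size and the DRAM penalty factor are assumptions, not measured values.
L3_PER_SOCKET_MB = 768          # Milan-X L3 per socket
WORKING_SET_MB = 2000           # hypothetical simulation working set
DRAM_PENALTY = 3.0              # assumed slowdown when data spills to DRAM

def runtime(nodes: int) -> float:
    """Relative runtime on `nodes` single-socket nodes."""
    per_node = WORKING_SET_MB / nodes
    # If the per-node slice fits in L3, work runs at full speed;
    # otherwise it pays the (assumed) DRAM penalty.
    factor = 1.0 if per_node <= L3_PER_SOCKET_MB else DRAM_PENALTY
    return factor / nodes

base = runtime(1)
for n in (1, 2, 4):
    print(n, round(base / runtime(n), 2))  # 1 1.0 / 2 2.0 / 4 12.0
```

At two nodes the slice still spills to DRAM, so scaling is merely linear; at four nodes it fits in L3 and the speedup jumps past 4x. Real applications behave far less cleanly, which is why not every code shows this effect.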
Normally we do not go into partner benchmarks, but the target market for Milan-X is one where we do not do a lot of benchmarking, and the applications are very expensive and also require expertise to set up. Since the software vendors are experts in their applications, and these are big applications, we are going to show their thoughts on Milan-X.
Microsoft Azure HPC with AMD Milan-X
Microsoft Azure has a special HPC cloud. Perhaps the most shocking part when we covered AMD Milan-X previously was that Microsoft is taking the existing HBv3 instance hardware and swapping in Milan-X CPUs without changing the instance type. Microsoft saw such huge gains that it is doing a rip-and-replace upgrade of current-generation EPYC 7003 CPUs. That is perhaps the best testimonial to the impact Microsoft and its cloud HPC customers see from the new chips. Also, you are going to see a lot of the F1_racecar_140m model.
Here is Azure’s super-linear scaling due to a higher portion of hot data being cached in L3. This is using that racecar Ansys Fluent simulation.
Here is what the new chips will do to the Azure HBv3 instances. I think Microsoft upgraded its west coast site first, then Europe, then the east coast, and it should all be complete at this launch or soon after, from what we have heard.
Again, up to 1.5GB of L3 cache (across two sockets) per system is awesome. We are in the Gigabyte Era of CPUs.
Siemens Simcenter STAR CCM+ with AMD Milan-X
Siemens Simcenter STAR CCM+ is a multi-physics platform that is used in many engineering shops.
Engineers use it to see how systems will work without having to build physical prototypes.
Here, on Microsoft Azure, Siemens is seeing super-linear scaling. We will quickly note that this is not an example on Microsoft's slide above nor on AMD's; this is net-new.
We are going to see more of these, but Siemens is talking about the AMD EPYC 7V73X here. That is Microsoft Azure’s custom Milan-X SKU and that is why it is not one of the four SKUs we listed above.
Altair AcuSolve and Radioss with AMD Milan-X
Altair is an application provider in the simulation space. AcuSolve is its solver for problems like simulating airflow.
Altair shows that simply flipping the switch to enable 3D V-Cache (it can be disabled in the BIOS) adds 5-40% more performance with AcuSolve.
Bigger gains come with Altair Radioss for crash simulation.
Altair claims 10-80% increased performance.
For those wondering, "Neon" is a lower-complexity model of a Dodge Neon automobile. As a smaller model, more of it fits into the cache, so there is an 80% speedup. The Ford Taurus model is much larger, so less of it is being cached, which is why the additional cache yields only a 10% speedup there.
This is excellent. I'm excited to get an R7525 with these and try them out. I sent this to my boss this morning and he OK'd ordering one so we can do profiling on our VMware servers.
@cedric – make sure you order it with all the connectivity you’ll ever want. Dell has been a bunch of [censored] when we’ve opened cases about bog-standard Intel X710 NICs not working correctly in our 7525s. So much for being an open platform.
Not that I’m bitter.
Why is the 7373X with 16c more expensive than the 7473X with 24c?
flo, it's because it has fewer cores, so it's cheaper to license.
Now that the 7003X is "shipping", perhaps they can get around to shipping the 7003 in bulk. I've got orders nearly 9 months old.
While per-core licensing costs seem to be a consideration for some people, I think this kind of optimisation is only possible because certain proprietary licensing models need updating to account for modern computer hardware. Given the nonlinear scaling between frequency and power consumption, it seems environmentally backwards to base hardware choices on weird software licensing costs rather than performance per watt or some similar metric that ignores arbitrary licensing constraints.
On another note, NOAA open sourced their weather forecasting codes a few years ago and WRF (based on models developed by NCAR) has been open source for much longer. I think the benchmark problems associated with these applications would make for an interesting journalistic comparison between new server CPUs with larger cache sizes.
@Eric – Environmentally backwards, possibly, but so often the hardware platform is the cheapest part of the solution – at least in terms of capital costs. I don’t think it’s necessarily unreasonable to optimize for licensing costs when the software can easily dwarf the hardware costs–sometimes by multiple orders of magnitude. To your point though, yes, the long-term operational expense, including power consumption, should be considered as well.
The move to core-based licensing was largely a response to increasing core counts – per-socket licensing was far more common before cores started reaching the dozen+ level. Hopefully you’re not advocating for a performance/benchmark based licensing model…it’s certainly been done (Oracle).
Lilu Dallas multipass :)
ok, so Taylor Swift has a Cache of 768Mbytes? I never woulda guessed :).
I find the speedups in compilation a bit underwhelming. My hunch is that the tests are performed the usual way – each file as a separate compilation unit. I work on projects with tens of thousands of C++ files and the build system generates files that contain includes for the several hundred cpp files each and then compiles those.
When you have a complicated set of header files, just parsing and analyzing the headers takes most of the compilation time. When you bunch lots of source files together you amortize this cost. I guess in such scenario the huge L3 cache would help more than for a regular file-by-file build.
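The batching approach described in the comment above can be sketched as a small generator script. This is purely illustrative; the file names, batch size, and output layout are assumptions, not the commenter's actual build system:

```python
# Sketch of a "unity build" generator: it groups many .cpp files into a
# few batch files that #include them, so expensive header parsing is
# amortized across each batch. All names here are hypothetical.
import tempfile
from pathlib import Path

def write_unity_batches(sources, batch_size, out_dir):
    """Group source paths into unity_<n>.cpp files, each #including
    up to `batch_size` .cpp files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    batches = []
    for i in range(0, len(sources), batch_size):
        batch = out / f"unity_{i // batch_size}.cpp"
        includes = [f'#include "{src}"' for src in sources[i:i + batch_size]]
        batch.write_text("\n".join(includes) + "\n")
        batches.append(batch)
    return batches

# Example: 10 hypothetical sources in batches of 4 -> 3 unity files.
files = [f"src/file_{n}.cpp" for n in range(10)]
made = write_unity_batches(files, 4, tempfile.mkdtemp())
print([p.name for p in made])  # ['unity_0.cpp', 'unity_1.cpp', 'unity_2.cpp']
```

The compiler is then pointed at the handful of unity_*.cpp files instead of thousands of individual translation units; build systems like CMake offer a similar feature natively.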
Dell submitted an SAP standard application benchmark comparing 2x AMD EPYC 7773X with 2x AMD EPYC 7763 processors. Milan-X is able to show a 14% increase in the benchmark while serving 11,000 more users.
https://www.sap.com/dmc/exp/2018-benchmark-directory/#/sd?filters=v:4a9e824336e2837bf9081e423d579859;v:4a9e824336e2837bf9081e423d58f740;v:4a9e824336e2837bf9081e423d5a63e6
What happened to the Stockfish Chess Benchmark?