AMD Milan-X Delivers AMD EPYC Caches to the GB-era


Market Impact and Our Guide

Perhaps the most interesting part is simply trying to describe the AMD EPYC market these days. The EPYC 3000 series serves a different segment, and the EPYC 7001 series is at its replacement cycle. On the other hand, we have Milan-X, Milan, frequency optimized Milan, and still some EPYC 7002 series “Rome” in the market. We actually purchased Rome-based Lenovo and HPE servers about three months ago, so Rome is certainly still out there.

STH AMD EPYC Buyers Guide Q1 2022 Web

Above, we have our recommendations. Rome will be the lower-cost and lower-power option that we see used in some applications. Milan is faster, but sometimes that is not the optimization point. The mainstream EPYC 7003 parts are likely to be the big volume parts, spanning many core count, power, and price points. The frequency optimized F-SKUs and cache-optimized Milan-X SKUs cost more and use more power, but they are optimized for performance, especially when one can derive software license savings through higher-performance cores. The task is then to match these specialized SKUs to their workloads. For that, AMD has some tools:

AMD Milan X How To Monitor L3 And WSS

We asked AMD, and the uProf tool (pronounced “Micro Prof”) is designed to be AMD-only at this point, so it will not help you profile applications running on Intel machines in your environment. The WSS (working set size) Perl scripts are designed to help understand how large an application’s working set is. This is important for gauging how much a larger L3 cache can hold.
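As a rough illustration of the kind of sizing check these tools inform, here is a minimal back-of-the-envelope sketch in Python. The function names and the 6000x6000 matrix workload are our own hypothetical examples, not part of AMD's tooling; the cache capacities are the 256MB (standard Milan) and 768MB (Milan-X) per-socket L3 figures discussed in this piece.

```python
# Back-of-the-envelope check: does a workload's hot data fit in L3?
# Standard Milan tops out at 256MB of L3 per socket; Milan-X stacks
# 3D V-Cache to reach 768MB (0.75GB) per socket.

MILAN_L3_BYTES = 256 * 1024 * 1024      # 256MB, standard Milan
MILAN_X_L3_BYTES = 768 * 1024 * 1024    # 768MB, Milan-X

def working_set_bytes(rows: int, cols: int, bytes_per_element: int = 8) -> int:
    """Footprint of a dense float64 matrix (hypothetical workload)."""
    return rows * cols * bytes_per_element

def fits_in_l3(ws_bytes: int, l3_bytes: int) -> bool:
    return ws_bytes <= l3_bytes

# A 6000x6000 float64 matrix is 288,000,000 bytes (~275MiB): too big for
# standard Milan's L3, but comfortably cached on a Milan-X socket.
ws = working_set_bytes(6000, 6000)
print(ws, fits_in_l3(ws, MILAN_L3_BYTES), fits_in_l3(ws, MILAN_X_L3_BYTES))
```

In practice, the WSS scripts measure the *actively referenced* pages rather than the total allocation, which is why a tool is more useful than this kind of static arithmetic, but the go/no-go comparison against cache capacity is the same.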

We will likely have more on these tools in the future, but for now, there is some direction and a challenge. Many of the engineering applications we looked at run well on Milan-X, as we would expect from vendors AMD wanted to highlight at its event. There is certainly a trade-off between clocks and cache. As a result, the challenge is that the application, the data, and how they are used in actual workloads all matter when deciding which CPU to recommend. One could simply default to Milan-X, with the worst case likely being a 10-15% performance loss due to lower clock speeds. The real work is figuring out whether you can benefit from the additional cache today, and perhaps more importantly, in the future years of a server’s hardware lifecycle.

Final Words

AMD is doing something really interesting with Milan-X. It is essentially kicking off the mainstream GB-era of server chips: at 0.75GB of L3 cache per chip, and 1.5GB in a dual-socket server, it certainly feels that way. As with many new technologies, the question is whether software will take advantage of the new capabilities. That is a question where AMD has a very different answer.

AMD EPYC 7773X Stack 1

We have tested a number of CPUs with features like the AI inference accelerators Intel is utilizing, or the video codecs on Apple Silicon chips. Accelerators require code changes to utilize. Bigger caches are, well, bigger caches: systems can utilize them with no explicit code changes. It takes the right type of application and data to really pay off, but when it does, the gains can be enormous. At scale, these gains are enough to have a major cloud provider rip out current-generation EPYC 7003 Milan parts in favor of Milan-X variants. As larger cache sizes become the norm, perhaps software will be written to better utilize the caches, but no manual intervention is needed today.
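The point that caches help transparently can be shown with a tiny sketch: the same computation run with a cache-friendly linear walk and a cache-hostile random walk. Nothing in the code addresses the cache explicitly; only the access pattern changes, which is exactly the property that lets a larger L3 speed up unmodified software. All names and sizes here are our own illustrative choices.

```python
# Same computation, two memory-access patterns. No code targets the cache
# explicitly; locality alone decides how much traffic the caches absorb.
import array
import random
import time

N = 1_000_000
data = array.array("q", range(N))        # integer data so both sums are exact

sequential = list(range(N))              # cache-friendly: linear walk
shuffled = sequential.copy()
random.shuffle(shuffled)                 # cache-hostile: random walk

def walk(indices):
    total = 0
    for i in indices:
        total += data[i]
    return total

t0 = time.perf_counter()
s1 = walk(sequential)
t1 = time.perf_counter()
s2 = walk(shuffled)
t2 = time.perf_counter()

# Identical answers; only the access pattern (and thus timing) differs.
print(f"sequential: {t1 - t0:.3f}s  random: {t2 - t1:.3f}s  equal={s1 == s2}")
```

On a real system the random walk loses badly once the array outgrows the caches; the larger the L3, the larger the dataset that still behaves like the sequential case.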

AMD Milan X 3D V Cache Stacked With Background

Milan-X is a jolt for the entire AMD EPYC 7003 series. In some, perhaps many, cases, the standard or frequency optimized parts are going to be a better fit. For HPC workloads that carry high software costs and can utilize the cache effectively, Milan-X can be a game-changer. Other workloads out there will behave the same way, and Milan-X will be the tool that finds them. Most servers deployed in data centers today have tens of megabytes of L3 cache, so discussing servers with 1.5GB of cache seems almost unfathomable. The future is bright for the GB-era of server CPUs.

In 2022, AMD has already promised Genoa at 96 cores, so Milan-X is a mid-cycle expansion of the current line that makes it that much more interesting between now and Genoa. For the record, we never explicitly said in this article that there would be a Genoa-X, but it would be exciting to see AMD continue using its 3D V-Cache technology beyond Zen 3.

12 COMMENTS

  1. This is excellent. I’m excited to get a r7525 with these and try them out. I sent this to my boss this morning and he OK’d ordering one so we can do profiling on our VMware servers

  2. @cedric – make sure you order it with all the connectivity you’ll ever want. Dell has been a bunch of [censored] when we’ve opened cases about bog-standard Intel X710 NICs not working correctly in our 7525s. So much for being an open platform.

    Not that I’m bitter.

  3. Now that the 7003X is “shipping”, perhaps they can get around to shipping the 7003 in bulk. I’ve got orders nearly 9 months old.

  4. While per-core licensing costs seem to be a consideration for some people, I think this kind of optimisation is only possible because certain proprietary licensing models need updating to account for modern computer hardware. Given the nonlinear scaling between frequency and power consumption, it appears environmentally backwards to base hardware choices on weird software licensing costs rather than performance per watt or something similar that neglects arbitrary licensing constraints.

    On another note, NOAA open sourced their weather forecasting codes a few years ago and WRF (based on models developed by NCAR) has been open source for much longer. I think the benchmark problems associated with these applications would make for an interesting journalistic comparison between new server CPUs with larger cache sizes.

  5. @Eric – Environmentally backwards, possibly, but so often the hardware platform is the cheapest part of the solution – at least in terms of capital costs. I don’t think it’s necessarily unreasonable to optimize for licensing costs when the software can easily dwarf the hardware costs–sometimes by multiple orders of magnitude. To your point though, yes, the long-term operational expense, including power consumption, should be considered as well.

    The move to core-based licensing was largely a response to increasing core counts – per-socket licensing was far more common before cores started reaching the dozen+ level. Hopefully you’re not advocating for a performance/benchmark based licensing model…it’s certainly been done (Oracle).

  6. I find the speedups in compilation a bit underwhelming. My hunch is that the tests are performed the usual way – each file as a separate compilation unit. I work on projects with tens of thousands of C++ files and the build system generates files that contain includes for the several hundred cpp files each and then compiles those.

    When you have a complicated set of header files, just parsing and analyzing the headers takes most of the compilation time. When you bunch lots of source files together you amortize this cost. I guess in such scenario the huge L3 cache would help more than for a regular file-by-file build.
