Intel Aurora Passes into the Exa-sphere in the Top500

Intel Xe HPC Ponte Vecchio Node Installed At Argonne National Labs For Aurora

Today, the June 2024 Top500 list was released at ISC 2024. We will have our usual Top500 new systems analysis later this week, but Aurora made some big moves. Its #2 placement on this edition of the list is a big accomplishment and a bit of a head-scratcher at the same time. The headline is that it is now the second HPC cluster on the Top500 to make it into the 1 exaflop club, which we are going to call the Exa-sphere for now.


For a bit of context, we have been following the current iteration of Aurora for several years, starting with a Ponte Vecchio GPU discussion at SC19. Over the years, we have seen the hardware behind Aurora shown in increasing levels of completeness.

HPE Cray Aurora Node Six PVC View Top 1

In fact, we even got to do a great piece on the Intel Xeon Max CPU powering the supercomputer’s CPU compute.

Almost 11 months ago, we covered that the Argonne Aurora Supercomputer Intel Max Blade Installation was Complete. In the 10-11 months since, we have seen more nodes come online and achieve greater levels of performance. The system moved from a 0.585 EFLOPS system on November 2023’s list to a 1.012 EFLOPS system on this list. That is still short of Frontier’s 1.206 EFLOPS, but it is good enough for the #2 spot.
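To put those three figures side by side, here is a minimal sketch of the arithmetic. It uses only the Rmax values quoted above; the helper name `percent_change` is ours and purely illustrative.

```python
# Rmax figures quoted in the article, in EFLOPS.
AURORA_NOV_2023 = 0.585
AURORA_JUN_2024 = 1.012
FRONTIER_JUN_2024 = 1.206


def percent_change(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0


# How much Aurora's submitted result grew between the two lists.
aurora_growth = percent_change(AURORA_NOV_2023, AURORA_JUN_2024)

# Where Aurora's result sits relative to Frontier's (negative means behind).
aurora_vs_frontier = percent_change(FRONTIER_JUN_2024, AURORA_JUN_2024)

print(f"Aurora list-over-list growth: {aurora_growth:.1f}%")
print(f"Aurora relative to Frontier: {aurora_vs_frontier:.1f}%")
```

By these numbers, Aurora's submitted result grew by roughly 73% between lists and still trails Frontier by about 16%.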

While Aurora was #2 in Linpack, it took the top spot in HPL-MxP, a newer mixed-precision benchmark. In HPCG it is #3, and in Graph500 it is #5. None of these runs used the full system; the Top500 Linpack run used only about 87% of the installed nodes.
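As a rough back-of-the-envelope check, one can ask what the Linpack number would look like if the remaining nodes were added. The sketch below assumes perfectly linear scaling with node count, which is our simplifying assumption and not something the Top500 submission claims; real HPL runs rarely scale that cleanly, and further tuning could move the result in either direction. The function name `linear_extrapolation` is hypothetical.

```python
# Figures quoted in the article.
AURORA_RMAX_EFLOPS = 1.012    # Linpack result on the June 2024 list
NODE_FRACTION_USED = 0.87     # "about 87%" of installed nodes
FRONTIER_RMAX_EFLOPS = 1.206


def linear_extrapolation(rmax: float, fraction_used: float) -> float:
    """Naively scale a partial-system result to the full system.

    ASSUMPTION: performance scales linearly with node count. This is an
    illustration only, not a prediction of a real full-system run.
    """
    if not 0.0 < fraction_used <= 1.0:
        raise ValueError("fraction_used must be in (0, 1]")
    return rmax / fraction_used


projected = linear_extrapolation(AURORA_RMAX_EFLOPS, NODE_FRACTION_USED)
print(f"Naive full-system estimate: {projected:.3f} EFLOPS")
print(f"Ahead of Frontier under this assumption: {projected > FRONTIER_RMAX_EFLOPS}")
```

Under that naive assumption, the estimate lands at roughly 1.16 EFLOPS, still below Frontier's 1.206 EFLOPS, so adding nodes alone would not obviously change the ranking.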

Final Words

At some point, we need to get a bit “real” about what is going on here. At SC23 in November 2023, we got the sense that Argonne was trying for #1 on this list. They did not make it. Putting this into a bit more perspective, this is supposed to be a bigger system than the AMD Frontier MI250X system that took the #1 spot two years ago. The MI250X is now AMD’s previous generation part since we saw the AMD Instinct MI300X GPU and MI300A APUs Launched for the AI Era months ago, and the MI300X is on track to be a multi-billion dollar part in 2024.

At the same time, if you worked for a hyperscaler that purchased 63,744 NVIDIA H100 GPUs for an AI cluster, had them installed for 11 months, and they were not running intended AI workloads, it would be hard to call that a win. HPC and AI clusters are different. Then again, Linpack is not exactly a new application.

From a generational perspective, Intel had planned to have Rialto Bridge this year as an H100 competitor, and then Falcon Shores next year as a Grace Blackwell and MI300A-next competitor. Now, Rialto Bridge is canned along with the 2025 CPU+GPU version of Falcon Shores, making Falcon Shores a GPU-only part.

It is starting to feel like two things are true. First, the Argonne-Intel-HPE team (the system uses the Slingshot-11 interconnect) needs to make sure that it is at full performance by November 2024, with El Capitan coming. Second, and perhaps most importantly, it almost feels like Falcon Shores is becoming a “bet the farm” product for Intel’s GPU and AI aspirations.

At the same time, at over 1 EFLOPS, Aurora is a huge system and congratulations are in order just for making it to the Exa-sphere.

10 COMMENTS

  1. It’s been years for Aurora and it’s still not completely functional while Frontier was fully up and running in less than a year from its TDS deployment.
    The efficiency of Aurora is also abysmal when compared to Frontier (which delivered first test nodes just 6 months after Aurora’s), and especially to new NVIDIA ARM designs.
    It’s a catastrophe for Intel’s PR.

  2. It’s much worse than a “headscratcher”. The efficiency is catastrophically bad, which is one of the primary metrics in HPC.
    Look at the Green 500 and you’ll see majority Nvidia GPU based in the top 10, with a majority of those being even Grace powered ones already. A few AMD Instinct after that.
    In HPCG, which is supposed to be a more realistic test, it is beaten by Fugaku, a 4-year-old system that, yes, isn’t as efficient as the newer Nvidia & AMD systems, but somehow uses ~9MW less than Aurora.

    Let’s face the facts. Intel has failed with most of its GPUs & accelerators, and the market is rewarding them for it.

  3. Aurora is terribly inefficient from a power draw perspective compared to Frontier. Aurora is about 16% slower than Frontier (1.012 vs. 1.206 EFLOPS) while using roughly 70% more power. It is possible that performance will go higher with more tuning; however, Aurora is already late, not hitting performance, AND having a lot of problems. Overall this has been a can of worms for Intel the entire time.

  4. You buy Ponte Vecchio in 2024 because you can’t get MI300X and don’t want to pay for the H200.

  5. Delivered. Not fully operational yet.
    They don’t mean the same things.
    Complete installation was always expected to be done in mid-2024.

  6. It seems like you guys have never heard of China. There are at least 3 exascale machines there, but STH seems to think there are only two exascale machines in existence anywhere. Why is this?
