This is Intel Gaudi 3 the New 128GB HBM2e AI Chip in the Wild

5
Intel Gaudi 3 OAM Working Sample Package 1
Intel Gaudi 3 OAM Working Sample Package 1

At Intel Vision 2024, we saw the Intel Gaudi 3 AI accelerator. This is Intel’s newest dedicated AI accelerator in a very hot market. The new Gaudi 3 is a big leap over the previous generation Gaudi 2, which we have seen before, and is set to go into volume production later in 2024.

As a quick note: I attended the Intel Vision 2024 as an “Influencer.” The press was not invited, but analysts were. The event is being held about 25 min from the STH studio in Scottsdale, so that is how I got a badge. If you want the sub-30 second short on our new shorts channel, here you go:

Intel Gaudi 3 128GB HBM2e AI Chip in the Wild

Of course, most folks like to see specs, and the Gaudi 3 is an interesting part. The new part still uses HBM2e but is using eight stacks to achieve 128GB of capacity. Intel has up to 1.835PFLOPS of FP8 compute. It should also be noted that simply supporting FP8 is non-trivial as some accelerators do not support it at this point. The new part has 64 tensor processor cores and 8 matrix math engines.

Intel Gaudi 3 Highlights
Intel Gaudi 3 Highlights

We have been tracking Habana’s Gaudi since 2019, and the company later in 2019 was acquired by Intel when it had its Gaudi 1 generation. That was at the point that Nervana was still around and a lot of the action was happening around Facebook’s choice for AI acceleration. Since 2019’s Intel acquisition, there was a Gaudi 2 generation that Intel started putting effort behind in late 2022 as AI took off. The Gaudi 3 is the successor with huge upgrades in compute and bandwidth, and a shrink from 7nm to 5nm.

Intel Gaudi 2 To Intel Gaudi 3 Specs
Intel Gaudi 2 To Intel Gaudi 3 Specs

Just for some sense of scale as to how much bigger Gaudi 3 is, here is Gaudi 3 next to Gaudi 2.

Intel Gaudi 3 And Gaudi 2 Angle 1
Intel Gaudi 3 And Gaudi 2 Angle 1

Here is Gaudi 2 and Gaudi 3 in hand. It is plain to see when you are standing next to them that the Gaudi 3 silicon package is much larger.

Intel Gaudi 2 And Gaudi 3 In Hand 1
Intel Gaudi 2 And Gaudi 3 In Hand 1

On that silicon package, there are two dies, each with 48MB of SRAM, 16 tensor processing cores, and a media engine. Something that Habana Gaudi 1 did, and we have seen continue, is that Gaudi 3 uses Ethernet to scale up and out. In this generation, that means the 24 network interfaces are 200GbE up from 100GbE in Gaudi 2 and only 10x 100GbE in Gaudi 1. Intel uses Ethernet as the interconnect between AI accelerators in a chassis, but also to scale out to multiple AI accelerators in a data center. Compare that to NVIDIA with NVLink/ NVSwitch, Infiniband, and Ethernet in a HGX H100 platform.

Intel Gaudi 3 AI Architecture Overview
Intel Gaudi 3 AI Architecture Overview

Here is a closer look at the die. The thermal paste is there because this was a working card that was pulled. If you are wondering what is between the 8x 16GB HBM2e packages, we were told that is filler silicon to make the structure of the package.

Intel Gaudi 3 OAM Working Sample Package Close 1
Intel Gaudi 3 OAM Working Sample Package Close 1

One can also see the line between the two pieces of silicon that make up the main compute, SRAM, and networking portion of the AI accelerator.

Intel Gaudi 3 OAM Package In UBB 1
Intel Gaudi 3 OAM Package In UBB 1

Of course, a single AI accelerator today with LLM’s is less exciting. What is more exciting is a collection of AI accelerators. Here is an OCP UBB with the 8x Intel Gaudi 3 OAM accelerators. All are listed at 900W each, but we heard there may be more room with liquid-cooled variants. TSMC has gotten decent voltage frequency scaling and NVIDIA has been taking advantage of that as well. The system below is around 7.2kW+ of accelerators, not taking into account other components.

8x Intel Gaudi 3 OAM In UBB 1
8x Intel Gaudi 3 OAM In UBB 1

Here is the OAM package bottom for the Gaudi 3.

Intel Gaudi 3 OAM Bottom 1
Intel Gaudi 3 OAM Bottom 1

Gaudi 3 using Ethernet to scale was a key concept of Habana. The idea was that network admins fundamentally do not want to manage many types of data fabrics. Ethernet is ubiquitous so the aim was to use that, and this was before the idea of UltraEthernet came about. Of course, in 2024, network switch speeds are so much faster today with a modern 51.2T switch that can handle a ton of 200GbE devices. In 2019, with Gaudi 1, when 32-port 100GbE was still decently high-end, the network bandwidth and topologies were a bit more limited. The OAM package above takes 21 of the 24 lanes and uses 3x 200GbE to connect to each of the other seven OAM packages. The remaining 3x 200GbE links are then fed to OSFP connectors at the rear of the chassis.

Intel Gaudi 3 AI OAM Scale Out
Intel Gaudi 3 AI OAM Scale Out

We saw a Supermicro X14 Gaudi 3 platform at Intel Vision 2024. That platform has six OSFP connections on the top where the Gaudi 3 OAM UBB resides. That should help map to the diagram above.

Supermicro SYS 822GA NGR3 Intel Gaudi 3 8 Way 2
Supermicro SYS 822GA NGR3 Intel Gaudi 3 8 Way 2

Those six OSFP ports can then be connected to switches and scale-out topologies built. Intel color-coded the networks in its materials to help better illustrate this. The benefit here is that building big Ethernet networks is fairly well understood.

Intel Gaudi 3 AI OAM Scale Out To Data Center
Intel Gaudi 3 AI OAM Scale Out To Data Center

Intel says that the new Gaudi 3 is more power efficient, and sometimes faster than a NVIDIA H100 in inferencing. Our sense is that Gaudi 3 is going to find its home in the AI inference market.

Intel Gaudi 3 Inference To NVIDIA H100
Intel Gaudi 3 Inference To NVIDIA H100

At the same time, Intel Gaudi 3 can also offer faster training than the NVIDIA H100. While the NVIDIA H200 is announced, we are still a bit from seeing training numbers for that as it starts ramping in volume production, and then later this year the NVIDIA Blackwell products will see another jump. Gaudi 3 is really competing on price here. The power efficiency message will be more powerful later in 2024 when organizations have to look at the power footprints they have and decide when to keep or retire machines to free up power capacity.

Intel Gaudi 3 Training To NVIDIA H100
Intel Gaudi 3 Training To NVIDIA H100

To that end, Intel also has the Gaudi PCIe CEM, the HL-338. This is an add-in card with a 600W TDP. It takes a lot from chassis to go from the 300-350W PCIe accelerators of just a year or two ago to 600W TDP today.

Intel Gaudi 3 PCIe CEM 600W
Intel Gaudi 3 PCIe CEM 600W

All of these accelerators are sampling in the first half of this year. The air-cooled and liquid-cooled variants will go into production in the second half of this year. That ties to the Granite Rapids-AP Xeon 6 Supermicro platform shown above.

Intel Gaudi 3 Roll Out
Intel Gaudi 3 Roll Out

The important thing to remember is that these will be contemporaries of the NVIDIA H200 and Blackwell, so Intel will likely offer them at substantially lower cost than NVIDIA. Next up, in 2025 is Falcon Shores, which will be a GPU architecture designed for AI as well. Intel says that it will be making the transition easy for users of Gaudi 3.

Final Words

Overall, the Intel Gaudi 3 is a big step up with more memory, more compute, and faster interconnect.

Intel Gaudi 3 OAM Working Sample Package 1
Intel Gaudi 3 OAM Working Sample Package 1

The market needs more compute, so the Intel Gaudi 3 will sell well. If you saw our Touring the Intel AI Playground Inside the Intel Developer Cloud piece, since 2023 Intel has been supply constrained on Gaudi 2. As a result, they are selling the parts and putting them in the Intel Developer Cloud to both try and run production workloads.

Hopefully, later in 2024, we get to check out systems with the new parts.

5 COMMENTS

  1. The fact that there won’t be a Gaudi 4 but just Falcon Shore means no one will invest in making software for it.
    That was honest from Intel to say there won’t be a follow up, but that also ensured no sales.

  2. @Ziple
    Awesome that non computer science people are writing here.
    I encourage everyone to learn about CS. It’s fun to work in the industry.

    Apart from that:
    It is called an API.
    An API is used to abstract from the actual hardware.
    The very basics of computer science is being able to abstract from machine details such as instruction sets.
    Intel’s counterpart to CUDA (which is also just an API also) is OneAPI / Khronos SYCL. Or PyTorch or … or…
    Also there are semi-automatic conversions from CUDA to SYCL.
    Imagine having to support every hardware on this planet by using assembly / machine instructions.
    Yes, that would be bad. I agree
    Apart from that I don’t know if there are customers for Gaudi 3 and whether it will be a success.
    But I highly doubt you can judge that either because you don’t seem to have enough knowledge.

LEAVE A REPLY

Please enter your comment!
Please enter your name here

This site uses Akismet to reduce spam. Learn how your comment data is processed.