Intel Gaudi 3 for AI Training and Inference

Intel Gaudi 3 OAM Working Sample Package 1

Intel’s main AI chip until Falcon Shores is the Intel Gaudi 3, and we got some new details at Hot Chips 2024. We have been covering it for some time (e.g., in April 2024), and it is supposed to be transitioning from sampling to production through 2024.

Since this is being done live, please excuse typos. By 5PM fingers are feeling rough.

Intel Gaudi 3 for AI Training and Inference

This is the third generation of Gaudi since 2019 or so. This generation adds more compute, more memory bandwidth, and more memory capacity.

Intel Gaudi 3 Hot Chips 2024_Page_02

Here is the OAM module. The two interconnected compute dies are mirror images of each other.

Intel Gaudi 3 Hot Chips 2024_Page_03

Here is the block diagram. One interesting detail is that there are 14 decoders for HEVC, H.264, JPEG, and VP9, which is important for video inference. We also get a lot of speeds and feeds.

Intel Gaudi 3 Hot Chips 2024_Page_04

Each die has two DCOREs, or deep learning cores. Each DCORE has a pair of matrix multiplication engines (MMEs) and sixteen tensor processor cores (TPCs), along with 24MB of cache.

Intel Gaudi 3 Hot Chips 2024_Page_05
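
The per-DCORE figures multiply out to package-level totals. Here is a quick sketch of that arithmetic using only the counts quoted in this article. It assumes the MME, TPC, and cache figures are per DCORE, and the variable names are ours, not Intel's.

```python
# Counts quoted in the article; names are illustrative, not Intel's.
# Assumption: the MME, TPC, and cache figures are per DCORE.
DIES_PER_PACKAGE = 2
DCORES_PER_DIE = 2
MMES_PER_DCORE = 2
TPCS_PER_DCORE = 16
CACHE_MB_PER_DCORE = 24

dcores = DIES_PER_PACKAGE * DCORES_PER_DIE
totals = {
    "dcores": dcores,
    "mmes": dcores * MMES_PER_DCORE,
    "tpcs": dcores * TPCS_PER_DCORE,
    "cache_mb": dcores * CACHE_MB_PER_DCORE,
}
print(totals)  # {'dcores': 4, 'mmes': 8, 'tpcs': 64, 'cache_mb': 96}
```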

The matrix multiplication engine is the big matrix compute engine of the Gaudi 3 accelerator.

Intel Gaudi 3 Hot Chips 2024_Page_06

The tensor processor cores handle the non-matmul compute.

Intel Gaudi 3 Hot Chips 2024_Page_07

The L2, L3, and HBM are all in a unified memory space. A memory context ID allows tagging cache lines that are being shared. There is also a near-memory compute capability that spares the TPC some work.

Intel Gaudi 3 Hot Chips 2024_Page_08

The Gaudi 3 also has a unique control path and a runtime driver.

Intel Gaudi 3 Hot Chips 2024_Page_09

Here is a bit on the Intel Gaudi software suite. I somewhat wish that Intel had gone a step further and talked about the path from the Gaudi software suite to Falcon Shores. If Falcon Shores is a 2025 product, it feels like that should be part of the discussion.

Intel Gaudi 3 Hot Chips 2024_Page_10

The graph compiler orchestrates how work is split across the accelerator's engines. The NoC bandwidth was designed to support parallel MME and TPC work.

Intel Gaudi 3 Hot Chips 2024_Page_11
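
To see why running the MME and TPC in parallel matters, here is a toy timing model. It is not Intel's graph compiler, and the tile count and per-tile times are made-up numbers. It simply compares running a matmul and its follow-on elementwise op back to back against overlapping them as a two-stage pipeline.

```python
def serial_time(n_tiles, mme_t, tpc_t):
    """Each tile runs its matmul, then its elementwise op, with no overlap."""
    return n_tiles * (mme_t + tpc_t)


def pipelined_time(n_tiles, mme_t, tpc_t):
    """Two-stage pipeline: the TPC works on tile i while the MME starts tile i+1."""
    mme_done = 0.0
    tpc_done = 0.0
    for _ in range(n_tiles):
        mme_done += mme_t  # MME finishes its matmul for this tile
        # TPC starts once its input is ready and its previous tile is finished
        tpc_done = max(tpc_done, mme_done) + tpc_t
    return tpc_done


# Hypothetical workload: 8 tiles, 3 time units of MME work and 2 of TPC work each.
print(serial_time(8, 3.0, 2.0))     # 40.0
print(pipelined_time(8, 3.0, 2.0))  # 26.0
```

In this simplified model the pipelined run is bounded by the slower engine, which is why the on-chip network has to be able to feed both engines at once.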

One of the cool things that Habana Labs has done since we saw them at Hot Chips 31 in 2019, when Hot Chips was last in the Stanford Memorial Theater, is the networking. Habana uses RDMA Ethernet networking from the accelerators to connect each accelerator to every other one, and then to larger topologies.

Intel Gaudi 3 Hot Chips 2024_Page_12
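
Connecting every accelerator directly to every other one is a full mesh. Here is a small sketch of how the number of direct pairs grows. The node size of 8 is an assumption taken from the 8-way Supermicro system mentioned in this article, and the function name is ours. It says nothing about how many physical ports serve each pair.

```python
from itertools import combinations


def full_mesh_pairs(n_accelerators):
    """Return every unordered pair of accelerators that needs a direct connection."""
    return list(combinations(range(n_accelerators), 2))


# Assumed node size: 8 accelerators, as in an 8-way system.
pairs = full_mesh_pairs(8)
print(len(pairs))  # 28 direct pairs
print(8 - 1)       # 7 peers per accelerator
```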

Here are some performance benchmarks. Scaling is happening, but it looks like Llama3-8B is still being optimized.

Intel Gaudi 3 Hot Chips 2024_Page_13

With the Ethernet network, Gaudi 3 is designed to be scaled out easily using standard networking.

Intel Gaudi 3 Hot Chips 2024_Page_14

At the same time, the question is whether it really works at “any scale,” or whether Intel has tested it in practice at 65,000 or 100,000+ accelerators, as in a high-end system.

Final Words

This is a chip that is ramping production, so we should see more of them soon. We first showed the Gaudi 3 UBB earlier this year after showing the Gaudi 2 in the Intel Developer Cloud last year.

We also showed the Supermicro Gaudi 3 box in April 2024.

Supermicro SYS 822GA NGR3 Intel Gaudi 3 8 Way 2

There is a lot here. Now we want to see these deployed at scale.
