Cerebras makes a wafer-sized compute chip, and builds the infrastructure around it, to create something much bigger than NVIDIA's GPUs. At Hot Chips 2024, we will learn more about the company's move into the AI inference space. I got a sneak preview of the performance, and it makes H100 inference look almost silly. Instead of having to scale to multiple GPUs, or even just go off-chip to HBM memory, Cerebras simply puts the entire model into the giant chip's SRAM.
Please excuse typos. This is being written live at Hot Chips.
Cerebras Enters AI Inference, Blows Away Tiny NVIDIA H100 GPUs
As a recap, Cerebras has a giant chip with 44GB of SRAM and tons of cores. This is the biggest square chip you can make from a round wafer.
Here is the scale again versus a typical GPU. While GPUs cut the large wafer into small chips and then try to stitch them back together, Cerebras just leaves the wafer assembled as one giant chip.
Here is what the box looks like.
Outside the box there are many servers. Cerebras started with a cluster in Santa Clara that we covered.
Here is the second cluster up in Stockton, CA of a similar size.
The third cluster, in Dallas, Texas, was five times as large.
The new cluster, in Minnesota, is eight times larger than the first.
The chip was designed to train large models.
And it does so every day.
By using on-chip SRAM instead of HBM, Cerebras can scale memory bandwidth beyond what HBM allows.
Cerebras says it is 20x faster on Llama3.1-8B versus cloud offerings using NVIDIA H100s at places like Microsoft Azure.
Google generative AI search is slow.
So, having faster inference is important for a good user experience.
Here is a demo where, running Llama3.1-70B, Cerebras just dominates the DGX H100 solution.
Here is the benchmark.
There is a difference between generating the first token and generating each subsequent token: every subsequent token requires streaming the model weights again, which is one of the reasons that going out to HBM is too slow.
Wafer scale gives massive on-chip SRAM (44GB), so Cerebras does not need to go out to slow HBM memory.
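To put rough numbers on that, here is a back-of-the-envelope sketch (mine, not from the talk) of why token generation is capped by memory bandwidth. It assumes fp16 weights and that each generated token streams all of the weights once; the ~3.35 TB/s H100 HBM3 figure and the ~21 PB/s on-chip SRAM bandwidth Cerebras quotes for WSE-3 are public ballpark specs.

```python
# Back-of-the-envelope only: single-stream decode speed is roughly capped at
# (memory bandwidth) / (bytes of model weights), since each new token streams
# essentially all of the weights once. fp16 weights assumed.

def decode_ceiling_tok_s(params_billion: float, mem_bw_gb_s: float,
                         bytes_per_param: float = 2.0) -> float:
    """Bandwidth-bound upper limit on batch-size-1 token generation."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return mem_bw_gb_s * 1e9 / weight_bytes

# Llama3.1-8B (~16 GB of fp16 weights):
print(decode_ceiling_tok_s(8, 3_350))       # one H100 HBM3 (~3.35 TB/s)    -> ~200 tok/s ceiling
print(decode_ceiling_tok_s(8, 21_000_000))  # WSE-3 SRAM (~21 PB/s claimed) -> far above that
```

Real systems land well below these ceilings, but the gap between the two memory systems is the point.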
Here is a look at the WSE-3 core with SRAM.
That small core is replicated across dies and the entire wafer.
From an interconnect perspective, this is all on-chip; there is no need to go off the chip to another package.
Instead of having to go over an HBM memory interface, with Cerebras everything stays on the chip.
Aggregating memory bandwidth across H100s is possible.
But even within a DGX H100 8x NVIDIA H100 GPU system, that requires a lot of serial interfaces that suck power.
By not going off-die, Cerebras can get more memory bandwidth at lower power.
By not going off-die, Cerebras does not need to go through high-speed serial links, PCBs, switch chips, and so forth. Instead, it just moves data through the silicon.
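As a rough illustration of the power side (my assumptions, not Cerebras's numbers): commonly cited figures put off-package serial links at a few picojoules per bit, versus a small fraction of a picojoule per bit for on-die wires, so every terabyte per second moved off the die costs real watts before any compute happens.

```python
# Illustrative only: power to move 1 TB/s of data, assuming ~5 pJ/bit for an
# off-package serial link path vs ~0.2 pJ/bit for on-die wires. Both energy
# figures are rough assumptions for the sake of the comparison.

def link_power_watts(bandwidth_tb_s: float, pj_per_bit: float) -> float:
    bits_per_s = bandwidth_tb_s * 1e12 * 8
    return bits_per_s * pj_per_bit * 1e-12

print(link_power_watts(1.0, 5.0))   # off-die path: ~40 W per TB/s moved
print(link_power_watts(1.0, 0.2))   # on-die path:  ~1.6 W per TB/s moved
```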
We will let you read this one on why DGX H100 scaling is a challenge.
Cerebras put the peak memory bandwidth of the NVIDIA DGX H100 on a chart alongside how much bandwidth cloud providers are actually achieving on Llama3.1-70B inference.
Here is how Cerebras does it on a single chip.
It can place an entire layer on a portion of the wafer. Placing the layers adjacent to one another, again, means minimal data movement.
As a result of the memory bandwidth, it can run at batch size 1 instead of needing larger batch sizes.
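A minimal sketch of why batching matters so much on HBM systems (illustrative numbers, not from the talk): one pass over the weights can serve every user in a batch, so aggregate throughput climbs with batch size while per-user speed stays pinned near the bandwidth ceiling. That is why GPU serving leans on large batches, and why batch size 1 only makes sense when the bandwidth ceiling is orders of magnitude higher.

```python
# Toy model: each decode step streams all weights once and emits one token per
# user in the batch. Illustrative numbers: 70B fp16 ~= 140 GB of weights,
# ~26.8 TB/s aggregate HBM across an 8x H100 box, perfect utilization assumed.

HBM_BW_GB_S = 8 * 3_350
WEIGHT_GB = 140

steps_per_s = HBM_BW_GB_S / WEIGHT_GB      # decode steps per second
for batch in (1, 8, 64, 128):
    total = steps_per_s * batch            # one token per user per step
    print(f"batch {batch:>3}: ~{total:7,.0f} total tok/s, "
          f"~{steps_per_s:.0f} tok/s per user (unchanged)")
```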
Here is the idea of token generation happening across layers.
Here it is moving again.
And again (sorry, doing this live.)
Once it is done, it can move to the next token.
While Llama3.1-8B runs on a single WSE-3 chip, larger models like Llama3.1-70B need to scale across four wafers.
The hops between wafers only involve activations, so this is not a huge loss of performance.
With this scale-out methodology, Cerebras can reach a latency/throughput region where GPUs cannot play.
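To see why the inter-wafer hops are cheap, compare what actually crosses the boundary. With the model split layer-wise across wafers, only the activations for the token in flight move between wafers; the weights stay put. A rough sketch, assuming Llama3.1-70B's hidden size of 8,192 and fp16 activations (my illustration, not Cerebras's published traffic numbers):

```python
# Per-token traffic across a wafer boundary is just the hidden-state activations,
# which are tiny compared to the weights that never have to move.

HIDDEN_SIZE = 8192        # Llama3.1-70B hidden dimension
BYTES_PER_VALUE = 2       # fp16 activations
WEIGHT_BYTES = 70e9 * 2   # ~140 GB of fp16 weights spread across the wafers

activation_bytes_per_hop = HIDDEN_SIZE * BYTES_PER_VALUE
print(f"~{activation_bytes_per_hop / 1024:.0f} KB of activations per token per hop")
print(f"vs ~{WEIGHT_BYTES / 1e9:.0f} GB of weights that stay resident")
```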
Here is why this works on Cerebras. A single user only uses a portion of the chip's bandwidth.
As a result, multiple users can be run on the same chip concurrently.
Here is an image with three users.
Here are four users.
Also, one can process prompt tokens in parallel.
Here is a slide showing this.
Here is a single user doing more than one operation.
And these build slides continue.
Here are three users working on multiple prompts with different layers on a single chip at the same time.
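Here is a toy way to picture that sharing (purely schematic, not Cerebras's actual scheduler): the layers laid out across the wafer form a pipeline, so while one user's token is in layer k, another user's token can occupy layer k-1, and so on.

```python
# Schematic only: three users staggered one layer apart across a 4-layer
# pipeline, showing different users occupying different layers in the same step.

NUM_LAYERS = 4
USERS = ["A", "B", "C"]

for step in range(NUM_LAYERS + len(USERS) - 1):
    slots = []
    for offset, user in enumerate(USERS):
        layer = step - offset
        if 0 <= layer < NUM_LAYERS:
            slots.append(f"user {user} -> layer {layer}")
    print(f"step {step}: " + (", ".join(slots) if slots else "idle"))
```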
Going back to the chart.
You need to zoom out 10x to see where Cerebras is.
Cerebras says this is just the beginning. The company thinks it can get more speed and more throughput.
You can use this today using the Cerebras Inference Service.
Here is how to try this. We do not control the QR code so, of course, be cautious there. Something exciting is that Llama-405B is coming as well as other large models.
That is really cool.
Final Words
I had the opportunity to sit with Andrew Feldman (CEO of Cerebras) before the talk and he showed me the demos live. It is obscenely fast.
The reason this matters is not just for human-to-prompt interaction. Think instead of a world of agents, where computer AI agents talk to several other computer AI agents. Imagine if it takes seconds for each agent to produce its output, and there are multiple steps in that pipeline. If you think about automated AI agent pipelines, then you need fast inference to reduce the time for the entire chain.
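A quick sketch of how that compounds (illustrative numbers of my own): per-token speed multiplies through every step of an agent chain, so the difference between typical cloud GPU serving rates and the roughly 1,800 tokens/s Cerebras has quoted for Llama3.1-8B turns a pipeline that takes most of a minute into one that finishes in a couple of seconds.

```python
# Illustrative only: end-to-end latency of a 5-step agent chain where each step
# generates 500 tokens, at two different per-user generation rates.

def chain_latency_s(steps: int, tokens_per_step: int, tokens_per_s: float) -> float:
    return steps * tokens_per_step / tokens_per_s

print(chain_latency_s(5, 500, 50))     # ~50 tok/s, typical cloud GPU serving   -> ~50 s
print(chain_latency_s(5, 500, 1800))   # ~1,800 tok/s (Cerebras's quoted 8B rate) -> ~1.4 s
```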
Their Llama3.1-8B is insanely fast. It’s just wild.
Wafers seem to kick ass. With everything happening on-chip, it just eliminates all the distant copies (even with the latest NVLink fabric there's a performance hit, it's power hungry, etc.). But how does this work economically (i.e., yield, does this wafer cost millions of dollars)? Calling wafer experts.
Yeah… Calling partial BS on that one. From what I understand, they report results here only for small batch sizes (did they even report anything beyond batch size 1?). But for any cloud service with users (the desirable case, since it means you have customers), batch size 1 is not relevant. Pretty sure their advantage disappears at high batch sizes (~128 or more). Same as Groq.
The IPO is coming as soon as October! First things first; then worry about yield rate. If this tech really works and could be packaged and sold the way NVIDIA sells GPUs, it would make a fortune. No need, no reason, to be a service provider.