Detail of the Giant Cerebras Wafer-Scale Cluster

Cerebras Wafer Scale Cluster HC35_Page_32

Among AI startups, Cerebras has been our front-runner to make it to the next stage for years. Now, it seems to have pulled away from the gaggle of startups by scaling its giant wafer-scale engine to AI supercomputer scale, and for revenue. At Hot Chips 2023, the company is detailing the new cluster that it plans to use to dwarf what NVIDIA is building.

We are doing this live, so please excuse typos.

Detail of the NVIDIA-Dwarfing Cerebras Wafer-Scale Cluster

Cerebras started the presentation with a company update and the observation that AI/ML models are getting bigger, roughly 40,000x in five years, or about 8x per year. It also discussed some of the history of ML acceleration.

Cerebras Wafer Scale Cluster HC35_Page_05

Process technology has given gains over time.

Cerebras Wafer Scale Cluster HC35_Page_06

Architectural techniques, such as moving calculations from FP32 to bfloat16, INT8, or other lower-precision formats, have also given huge gains.
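To make the format point concrete, here is a minimal sketch of that trade-off, using PyTorch purely for illustration (not from the talk): moving from FP32 to bfloat16 or INT8 cuts bytes per element while giving up mantissa precision.

```python
import torch

x = torch.randn(1024, 1024)                  # FP32 baseline weights/activations

bf16 = x.to(torch.bfloat16)                  # same exponent range as FP32, shorter mantissa
scale = float(x.abs().max()) / 127           # simple symmetric INT8 quantization scale
i8 = torch.quantize_per_tensor(x, scale=scale, zero_point=0, dtype=torch.qint8)

print(x.element_size(), bf16.element_size(), i8.element_size())  # 4, 2, 1 bytes per element
print((x - bf16.float()).abs().max())        # the precision given up for the density win
```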

Cerebras Wafer Scale Cluster HC35_Page_07

Still, which models are practical to use depends on the ability to get gains not just at the chip level, but also at the cluster level.

Cerebras Wafer Scale Cluster HC35_Page_08

One of the challenges of current scale-out is simply the communication needed to keep data moving between many smaller compute and memory nodes.

Cerebras Wafer Scale Cluster HC35_Page_09

Cerebras built a giant chip to get an order-of-magnitude improvement, but it also needs to scale out to clusters since one chip is not enough.

Cerebras Wafer Scale Cluster HC35_Page_10

Traditional scale-out has challenges because it is trying to split a problem, data, and compute across so many devices.

Cerebras Wafer Scale Cluster HC35_Page_12

On GPUs, that means using different types of parallelism to scale out to more compute and memory devices.
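As a rough illustration of what those parallelism types mean (a NumPy sketch of our own, not anyone's production code): data parallelism splits the batch, tensor parallelism splits a weight matrix, and pipeline parallelism splits the layers.

```python
import numpy as np

n_devices = 4
batch, d_in, d_out = 32, 1024, 1024
X = np.random.randn(batch, d_in)             # activations
W = np.random.randn(d_in, d_out)             # one layer's weights

# Data parallel: replicate W, split the batch; gradients get all-reduced later.
X_shards = np.split(X, n_devices, axis=0)

# Tensor parallel: split W's columns; each device computes a slice of the output.
W_shards = np.split(W, n_devices, axis=1)
Y = np.concatenate([X @ w for w in W_shards], axis=1)
assert np.allclose(Y, X @ W)                 # the shards reassemble the full result

# Pipeline parallel: each device owns a contiguous run of layers,
# and microbatches flow through the stages.
layers = [np.random.randn(d_out, d_out) for _ in range(8)]
stages = [layers[2 * i:2 * (i + 1)] for i in range(n_devices)]
```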

Cerebras Wafer Scale Cluster HC35_Page_13

Cerebras is looking to scale cluster-level memory and cluster-level compute independently, decoupling compute and memory scaling rather than coupling them as is seen on GPUs.

Cerebras Wafer Scale Cluster HC35_Page_14

Cerebras has 850,000 cores on the WSE-2 for its base. When will we get a 5nm WSE-3? Sounds like not today.

Cerebras Wafer Scale Cluster HC35_Page_15

Cerebras houses the WSE-2 in a CS-2 system and then connects it to MemoryX, which can then stream data to the big chip.

Cerebras Wafer Scale Cluster HC35_Page_17

It then has the SwarmX interconnect, which handles the data-parallel scaling.

Cerebras Wafer Scale Cluster HC35_Page_18

Weights are never stored on the wafer. They are just streamed in.

Cerebras Wafer Scale Cluster HC35_Page_19

The SwarmX fabric broadcasts weights out to the CS-2s and reduces gradients on the return path.
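Here is a toy sketch of that dataflow as we understand it. This is our own simplification (the gradient math is a stand-in, not real backprop): weights live off-wafer in MemoryX, SwarmX broadcasts each layer to every CS-2, and gradients are reduced on the way back.

```python
import numpy as np

# Toy weight streaming loop: weights live off-wafer ("MemoryX"), are broadcast
# to each CS-2 ("SwarmX"), and gradients are reduced on the return path.
memoryx = [np.random.randn(64, 64) * 0.01 for _ in range(4)]  # one entry per layer
n_cs2 = 4                                                     # data-parallel systems
batches = [np.random.randn(8, 64) for _ in range(n_cs2)]      # one local batch each

for layer_idx, W in enumerate(memoryx):
    # Broadcast: every CS-2 receives the same layer weights (never stored on wafer).
    acts = [b @ W for b in batches]                           # forward on local data
    local_grads = [b.T @ a for b, a in zip(batches, acts)]    # stand-in for real grads
    # Reduce: sum gradients across systems on the way back to MemoryX.
    grad = sum(local_grads) / n_cs2
    memoryx[layer_idx] = W - 1e-3 * grad                      # update happens off-wafer
    batches = acts                                            # feed the next layer
```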

Cerebras Wafer Scale Cluster HC35_Page_20

Each MemoryX unit has 12x MemoryX nodes. State is stored in DRAM and in flash, with up to 1TB of DRAM and 500TB of flash. Interestingly, the CPUs are only 32-core parts.

Cerebras Wafer Scale Cluster HC35_Page_22

Finally, it is connected to the cluster using 100GbE. One port goes to the CS-2 and one to other MemoryX modules.

MemoryX has to shard the weights in a thoughtful way to make this work. Ordering the streaming allows it to perform an almost-free transpose.
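A tiny sketch of the idea, as our own illustration rather than Cerebras' implementation: if MemoryX emits a weight shard column-by-column instead of row-by-row, the consumer receives the transpose without an explicit transpose step.

```python
import numpy as np

W = np.arange(12).reshape(3, 4)        # a tiny weight shard held in MemoryX

def stream(matrix, transpose=False):
    """Yield the matrix one vector at a time. Emitting columns instead of
    rows delivers the transpose with no extra copy of the stored shard."""
    view = matrix.T if transpose else matrix   # .T is a view, not a copy
    for vec in view:
        yield vec

forward = np.stack(list(stream(W)))                   # W, for the forward pass
backward = np.stack(list(stream(W, transpose=True)))  # W.T, for the backward pass
assert (backward == W.T).all()
```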

Cerebras Wafer Scale Cluster HC35_Page_23

MemoryX runs a high-performance runtime to transfer data and perform computations.
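One natural computation for that runtime is the optimizer step, applied to a weight shard as its reduced gradient arrives. A minimal Adam-style sketch, assuming (as the weight streaming model suggests) that weights and optimizer state stay resident in MemoryX:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Standard Adam update for one weight shard; m and v are the
    optimizer state kept alongside the weights in MemoryX DRAM/flash."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w = np.random.randn(64, 64)                 # a weight shard
m, v = np.zeros_like(w), np.zeros_like(w)   # optimizer state
for t in range(1, 4):                       # three incoming reduced gradients
    g = np.random.randn(64, 64) * 0.01
    w, m, v = adam_step(w, g, m, v, t)
```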

Cerebras Wafer Scale Cluster HC35_Page_24

The SwarmX fabric uses 100GbE and RoCE RDMA to provide connectivity, with the broadcast-reduce operations happening on CPUs.

Cerebras Wafer Scale Cluster HC35_Page_25

Every broadcast-reduce unit has 12 nodes, each with 6x 100GbE links. Five of those links are used for a 1:4 broadcast plus a redundant link. That adds up to 150Tbps of broadcast-reduce bandwidth.
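Here is our back-of-the-envelope check of those numbers. This is our reading of the slide, and the exact accounting may differ:

```python
nodes_per_unit = 12      # nodes in one broadcast-reduce unit
links_per_node = 6       # 100GbE links per node
link_gbps = 100

unit_tbps = nodes_per_unit * links_per_node * link_gbps / 1000
print(unit_tbps)         # 7.2 Tbps per 12-node unit
print(150 / unit_tbps)   # ~21 such units would be needed to hit the 150Tbps headline
```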

Cerebras Wafer Scale Cluster HC35_Page_26

100GbE is interesting since it is now a very commoditized interconnect compared to NVLink/NVSwitch and InfiniBand.

Cerebras is doing these operations off of the CS-2/WSE, and that is helping this scale.

Cerebras Wafer Scale Cluster HC35_Page_27

This is the SwarmX topology.

Cerebras Wafer Scale Cluster HC35_Page_28

The flexibility in the fabric can be used to effectively provision work across the cluster while supporting sub-cluster partitioning.

Cerebras Wafer Scale Cluster HC35_Page_29

Here is the Cerebras WSE-2 with me at ISC 2022:

Patrick With Cerebras WSE 2 Hamburg ISC 2022

That goes into an engine block that looks like this:

Cerebras CS 2 WSE 2 Heart At SC22 4

That goes into the Cerebras CS-2.

Cerebras Wafer Scale Cluster HC35_Page_31

Those were built into racks.

Cerebras Wafer Scale Cluster HC35_Page_32

We can say hello to the Supermicro 1U servers above the CS-2s.

Then the CS-2s went into larger clusters.

Cerebras Wafer Scale Cluster HC35_Page_33

Now bigger clusters.

Cerebras Wafer Scale Cluster HC35_Page_34

This is the older Andromeda wafer scale cluster.

Cerebras Wafer Scale Cluster HC35_Page_35

Cerebras was training large models on Andromeda quickly with 16x CS-2s.

Cerebras Wafer Scale Cluster HC35_Page_36

It found that a job programmed for a single CS-2 scaled out to 16x CS-2s.

Cerebras Wafer Scale Cluster HC35_Page_37

Then Cerebras got bigger with the Condor Galaxy-1 Wafer Scale Cluster that we covered in: 100M USD Cerebras AI Cluster Makes it the Post-Legacy Silicon AI Winner.

Cerebras Wafer Scale Cluster HC35_Page_38

Cerebras trained BTLM on that cluster, which is the top 3B parameter model right now.

Cerebras Wafer Scale Cluster HC35_Page_39

Next, Cerebras is scaling to even larger clusters.

Cerebras Wafer Scale Cluster HC35_Page_40

Final Words

I fell pretty far behind covering this talk. Still, Cerebras is big game hunting, which is important in the era of big models. Having customers buy huge amounts of hardware to get these clusters online is a big vote of confidence for the company. It is a very different approach to scaling than that of NVIDIA and the companies trying to duplicate NVIDIA.
