SambaNova SN40L RDU for Trillion Parameter AI Models

SambaNova SN40L Large

At Hot Chips 2024, AI is clearly the theme. The SambaNova SN40L RDU is the company’s first design for the trillion-parameter-scale AI model era. I think the SN40L is on display outside, but there was a ton of people around it earlier today. Hopefully, I will be able to grab a snapshot later.

Please note that we are doing these live at Hot Chips 2024 this week, so please excuse typos.

SambaNova SN40L RDU for Trillion Parameter AI Models

This is the new SambaNova SN40L “Cerulean” architecture, a 5nm TSMC chip with three tiers of memory, which is really neat. It is also a dataflow architecture designed to be both a training and inference chip.

Sambanova SN40L Hot Chips 2024_Page_02

The three tiers of memory are 520MB of on-chip SRAM, 64GB of HBM, and additional DDR memory as a capacity tier. SambaNova is showing a 16-socket system here to get aggregate figures like 8GB of on-chip SRAM and 1TB of HBM.
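For context on those 16-socket aggregate figures, here is a quick back-of-the-envelope sketch (the unit conventions are my assumption; the slide figures are the per-chip 520MB SRAM and 64GB HBM):

```python
# Aggregate memory across a 16-socket SN40L system, from the per-chip
# figures in the talk: 520MB on-chip SRAM and 64GB HBM per chip.
sockets = 16
sram_mb_per_chip = 520
hbm_gb_per_chip = 64

sram_gb = sockets * sram_mb_per_chip / 1000  # ~8.3GB (the slide rounds to 8GB)
hbm_tb = sockets * hbm_gb_per_chip / 1024    # exactly 1TB in binary units

print(f"Aggregate SRAM: {sram_gb:.2f} GB, aggregate HBM: {hbm_tb:.0f} TB")
```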

Sambanova SN40L Hot Chips 2024_Page_03

Here is a look at the 1040 compute and memory units with their mesh switches in the SambaNova tile.

Sambanova SN40L Hot Chips 2024_Page_04

Here is the compute unit (PCU). Instead of having a traditional fetch/decode execution pipeline, it has a sequence of static stages. The PCU can operate as a streaming unit (data flows from left to right), with the blue showing a cross-lane reduction tree. For a matrix compute operation, it can be used as a systolic array.
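To make those two operating modes concrete, here is a minimal sketch in plain Python (not SambaNova code, just an illustration of the dataflow): a streaming pipeline of fixed stages, and an output-stationary systolic-style matrix multiply where each cell accumulates partial products as operands stream past.

```python
def streaming_pipeline(data, stages):
    """Streaming mode: data flows left to right through fixed stages."""
    for stage in stages:
        data = [stage(x) for x in data]
    return data

def systolic_matmul(a, b):
    """Systolic-style matmul: each 'cell' (i, j) holds an accumulator
    and sums partial products over k streaming steps."""
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]  # one accumulator per array cell
    for step in range(k):              # operands stream in over k steps
        for i in range(n):
            for j in range(m):
                acc[i][j] += a[i][step] * b[step][j]
    return acc
```

For example, `streaming_pipeline([1, 2, 3], [lambda x: x * 2, lambda x: x + 1])` pushes the data through two static stages in order.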

Sambanova SN40L Hot Chips 2024_Page_05

Here is the high-level memory unit block diagram. These are software-managed scratchpads instead of traditional caches.

Sambanova SN40L Hot Chips 2024_Page_06

The chip also has a mesh network. There are three physical networks: vector, scalar, and control.

Sambanova SN40L Hot Chips 2024_Page_07

The AGCU is used to access off-chip memory (HBM and DDR), while the PCU is designed to access the on-chip SRAM scratchpads.

Sambanova SN40L Hot Chips 2024_Page_08

Here is the top-level interconnect.

Sambanova SN40L Hot Chips 2024_Page_09

Here is an example of how a Softmax is captured by the compiler and then mapped to the hardware.
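As a rough illustration of what the compiler is working with, Softmax decomposes into a handful of primitive operators (max-reduce, subtract, exp, sum-reduce, divide) that a dataflow compiler can map onto compute and memory units with data streaming between them. This is my sketch, not SambaNova's actual operator graph:

```python
import math

def softmax(xs):
    """Softmax written as the primitive ops a dataflow compiler sees."""
    m = max(xs)                           # max-reduce (numerical stability)
    exps = [math.exp(x - m) for x in xs]  # elementwise subtract + exp
    s = sum(exps)                         # sum-reduce
    return [e / s for e in exps]          # elementwise divide
```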

Sambanova SN40L Hot Chips 2024_Page_10

Here is how this maps to transformer models for LLMs and GenAI. Looking inside the decoder, there are many different operations.

Sambanova SN40L Hot Chips 2024_Page_12

Here is a zoom-in on the decoder. Each box is an operator. Usually, you are running multiple operators at the same time, and you keep data on-chip for reuse.

Sambanova SN40L Hot Chips 2024_Page_13

Here is SambaNova’s guess on how operators are fused on GPUs. They noted this may not be accurate.

Sambanova SN40L Hot Chips 2024_Page_14

In the RDU, the entire decoder is one kernel call. The compiler is what does this mapping.

Sambanova SN40L Hot Chips 2024_Page_15

This slide is just a “cool picture” of the decoder mapping.

Sambanova SN40L Hot Chips 2024_Page_16

Returning to the transformer’s structure, here are the different function calls for the decoder. One thing you can see is that each function call has launch overhead.

Sambanova SN40L Hot Chips 2024_Page_17

Instead of 32 calls, it is written as one call.

Sambanova SN40L Hot Chips 2024_Page_18

Translating that, it means far less call overhead is incurred since there is one call instead of many. As a result, you increase the time the chip spends doing useful work on the data.
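A toy model makes the math obvious: with a fixed per-call launch overhead, 32 separate operator launches pay that overhead 32 times, while one fused call pays it once for the same amount of useful work. The overhead and work figures below are hypothetical, purely for illustration:

```python
LAUNCH_OVERHEAD_US = 5.0  # hypothetical fixed cost per kernel launch

def total_time_us(num_calls, work_us_per_call):
    """Total time = per-call launch overhead plus the actual work."""
    return num_calls * (LAUNCH_OVERHEAD_US + work_us_per_call)

unfused = total_time_us(32, 10.0)      # 32 operator calls, 10us work each
fused = total_time_us(1, 32 * 10.0)    # same total work, one call
print(f"unfused: {unfused}us, fused: {fused}us")  # 480.0us vs 325.0us
```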

Sambanova SN40L Hot Chips 2024_Page_19

Here is SambaNova’s performance on Llama 3.1. There is a QR code here. It is SambaNova’s, but we suggest not scanning it since… well, who knows.

Sambanova SN40L Hot Chips 2024_Page_20

Here is scaling versus batch size, with another QR code. The same warning applies. As you increase batch size, you ideally get throughput scaling in line with the batch size. SambaNova gets close.

Sambanova SN40L Hot Chips 2024_Page_21

As a result, SambaNova says it has a compelling inference product. The DDR is being used for mixture-of-experts model checkpoints. The onboard DDR means that SambaNova does not need to go to a host CPU to get that data. The alternative would be needing more GPUs to hold all of the checkpoints for the expert models. That DDR helps a lot on the model-switching side.

Sambanova SN40L Hot Chips 2024_Page_22

Here is the slide on training.

Sambanova SN40L Hot Chips 2024_Page_23

It looks like we are running a bit late, so this is ending quickly.

Final Words

Overall, this is cool stuff. It is great to see the company’s accelerator in this level of detail. I will update this piece if I get another close-up photo of the chip at a break.
