FuriosaAI RNGD Processor for Sustainable AI Compute

FuriosaAI RNGD Without Cooler_1

We are hearing more about sustainable AI compute, and FuriosaAI has its solution with RNGD. This is almost the opposite of many of the AI compute platforms we have covered today. Instead of going for maximum performance at maximum power, this is a lower-power compute solution.

This is the last talk of the day after over a dozen, and this is being done live, so please excuse typos.

FuriosaAI RNGD Processor for Sustainable AI Compute

Here are the specs on the card. This is specifically not designed to be the fastest AI chip on the market.

FuriosaAI RNGD Hot Chips 2024_Page_05

Here is a look at the card with its cooler.

FuriosaAI RNGD Without Cooler And With Cooler

The target TDP is only 150W for air-cooled data centers.

FuriosaAI RNGD Hot Chips 2024_Page_06

The construction uses 12-layer HBM3 stacks, TSMC CoWoS-S packaging, and a 5nm process.

FuriosaAI RNGD Hot Chips 2024_Page_07

Instead of focusing on the H100 or B100, FuriosaAI is targeting the NVIDIA L40S. We did a big piece on the L40S some time ago. The goal is not just to provide similar performance, but to provide that performance at lower power.

FuriosaAI RNGD Hot Chips 2024_Page_08

The efficiency comes from hardware, software, and algorithms.

FuriosaAI RNGD Hot Chips 2024_Page_09

One of the challenges FuriosaAI has been working on is the abstraction layer between hardware and software.

FuriosaAI RNGD Hot Chips 2024_Page_11

Tensor Contraction is one of the big operations that FuriosaAI focuses on. In BERT, this was over 99% of the FLOPS.

FuriosaAI RNGD Hot Chips 2024_Page_12

Usually, we have matrix multiplication as a primitive instead of the Tensor Contraction.

FuriosaAI RNGD Hot Chips 2024_Page_13

Instead, the abstraction is at the Tensor Contraction level.

FuriosaAI RNGD Hot Chips 2024_Page_14

Furiosa adds a low-level einsum as its primitive.

FuriosaAI RNGD Hot Chips 2024_Page_15

Here, matrices A and B are multiplied to produce C.
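To make the einsum idea concrete, here is a small NumPy sketch (our illustration, not Furiosa's actual API) showing the same A x B = C matrix multiply written as a contraction over the shared index, and how the notation generalizes to higher-rank tensors:

```python
import numpy as np

# A matrix multiply C = A @ B expressed as an einsum:
# contract over the shared index j, keeping free indices i and k.
A = np.arange(6, dtype=np.float32).reshape(2, 3)
B = np.arange(12, dtype=np.float32).reshape(3, 4)
C = np.einsum("ij,jk->ik", A, B)
assert np.allclose(C, A @ B)

# The same notation covers higher-rank contractions directly, e.g.
# an attention-style score computation with batch and head indices,
# without first lowering everything to 2D matrix multiplies.
Q = np.random.rand(8, 4, 16, 64).astype(np.float32)  # (batch, head, seq, dim)
K = np.random.rand(8, 4, 16, 64).astype(np.float32)
scores = np.einsum("bhqd,bhkd->bhqk", Q, K)
```

Keeping the contraction as a single primitive, rather than a chain of matmuls and reshapes, is what gives the scheduler room to pick tilings and loop orders for the hardware.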

FuriosaAI RNGD Hot Chips 2024_Page_16

Furiosa then takes this, and schedules it on the actual architecture with memory and compute units.

FuriosaAI RNGD Hot Chips 2024_Page_17

From here, a whole tensor contraction can be a primitive.

FuriosaAI RNGD Hot Chips 2024_Page_18

By taking into account spatial and temporal orchestration, they can boost efficiency and utilization.

FuriosaAI RNGD Hot Chips 2024_Page_19

Furiosa says that it has flexible reconfigurability, which is important to keep performance high as batch sizes vary.

FuriosaAI RNGD Hot Chips 2024_Page_20

Here is a look at the RNGD implementation.

FuriosaAI RNGD Hot Chips 2024_Page_21

Here are the interconnect networks, including to get to the scratchpad memory.

FuriosaAI RNGD Hot Chips 2024_Page_22

Furiosa is using PCIe Gen5 x16 for chip-to-chip communication. It is also using P2P over a PCIe switch for direct card-to-card communication. That is why, if XConn can get its PCIe Gen5 switches right, it has an awesome product.

FuriosaAI RNGD Hot Chips 2024_Page_23

Furiosa supports SR-IOV for virtualization.

FuriosaAI RNGD Hot Chips 2024_Page_24

The company has done work on signal and power integrity for reliability.

FuriosaAI RNGD Hot Chips 2024_Page_25

Here is how Furiosa LLM works in flow chart form.

FuriosaAI RNGD Hot Chips 2024_Page_27

The compiler compiles each partition of the model, which is then mapped across multiple devices.
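As a toy sketch of the partitioning step (our illustration, not Furiosa's compiler), splitting a model's layers into contiguous partitions and assigning one per device might look like this:

```python
def partition_layers(num_layers: int, num_devices: int) -> dict:
    """Evenly split layer indices into contiguous partitions, one per
    device; earlier devices absorb the remainder when the split is uneven."""
    base, extra = divmod(num_layers, num_devices)
    mapping, start = {}, 0
    for dev in range(num_devices):
        size = base + (1 if dev < extra else 0)
        mapping[dev] = list(range(start, start + size))
        start += size
    return mapping

# 10 layers across 4 devices: devices 0-1 get 3 layers, devices 2-3 get 2.
assignment = partition_layers(10, 4)
```

A real compiler would weight partitions by compute and memory cost rather than layer count, but the shape of the problem is the same.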

FuriosaAI RNGD Hot Chips 2024_Page_28

The compiler optimizes the model for performance gains and energy efficiency.

FuriosaAI RNGD Hot Chips 2024_Page_29

The serving framework does things like continuous batching to increase utilization.
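The idea behind continuous batching can be sketched in a few lines (a toy model, not the Furiosa serving framework): instead of waiting for an entire batch to finish, new requests backfill slots as soon as earlier requests complete, so the compute units stay busy.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy continuous-batching loop. `requests` is a list of
    (request_id, tokens_to_generate) pairs; each decode step produces
    one token per in-flight request, and freed slots are refilled
    immediately rather than after the whole batch drains."""
    pending = deque(requests)
    in_flight = {}   # request_id -> tokens remaining
    completed = []
    steps = 0
    while pending or in_flight:
        # Backfill the batch up to max_batch with waiting requests.
        while pending and len(in_flight) < max_batch:
            rid, n = pending.popleft()
            in_flight[rid] = n
        # One decode step for every in-flight request.
        for rid in list(in_flight):
            in_flight[rid] -= 1
            if in_flight[rid] == 0:
                del in_flight[rid]
                completed.append(rid)
        steps += 1
    return completed, steps

done, steps = continuous_batching(
    [("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 2)]
)
```

With static batching, request "e" would wait for the first batch of four to finish entirely; here it joins as soon as "c" completes.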

FuriosaAI RNGD Hot Chips 2024_Page_30

The company has a graph-based automation tool to help with quantization. Furiosa can support a number of different formats including FP8 and INT4.
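For a sense of what INT4 quantization involves (a generic sketch, not Furiosa's graph-based tool), here is symmetric per-tensor quantization into the signed 4-bit range with NumPy:

```python
import numpy as np

def quantize_int4_symmetric(x: np.ndarray):
    """Symmetric per-tensor INT4 quantization: map floats into the
    signed 4-bit range [-8, 7] with a single scale factor."""
    max_abs = float(np.abs(x).max())
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int4_symmetric(w)
w_hat = dequantize(q, s)
# Round-to-nearest keeps the reconstruction error within half a step.
err = float(np.abs(w - w_hat).max())
```

Production flows typically quantize per-channel or per-group and calibrate scales against real activations, which is where a graph-based automation tool earns its keep.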

FuriosaAI RNGD Hot Chips 2024_Page_31

Here is the company’s development methodology.

FuriosaAI RNGD Hot Chips 2024_Page_32

Final Words

There was a lot here. The quick summary is that the company is using its compilers and software to map AI inference workloads onto its lower-power SoC, with the goal of delivering AI inference at lower power.
