Tenstorrent Blackhole and Metalium For Standalone AI Processing

Tenstorrent Hot Chips 2024_Page_13

Tenstorrent shared more details on its Blackhole silicon at Hot Chips 2024. Tenstorrent is one of those companies that is doing something different, and it has a lot of investment as well as actual design wins, making it interesting to see what they are building.

Please note that we are doing these live at Hot Chips 2024 this week, so please excuse typos.

Here is the Tenstorrent AI silicon roadmap. Blackhole is the 2023-and-later chip, and it is the big update over the previous-generation Grayskull and Wormhole.

Tenstorrent Hot Chips 2024_Page_03

Blackhole is a standalone AI computer based on Ethernet.

The sixteen RISC-V cores are in four clusters of four. The Tensix cores are in the middle with Ethernet on top.

Tenstorrent Hot Chips 2024_Page_04

The chip has 10x 400Gbps Ethernet and 512GB/s of bandwidth.
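As a quick sanity check on the Ethernet figure, here is a minimal sketch of the arithmetic. It only covers the 10x 400Gbps number quoted above and assumes raw line rate in one direction with no encoding or protocol overhead; the function names are made up for illustration.

```python
# Back-of-the-envelope aggregate Ethernet bandwidth from the figures quoted above.
PORTS = 10            # 10x Ethernet ports (from the article)
GBPS_PER_PORT = 400   # 400Gbps per port (from the article)


def aggregate_gbps(ports: int, gbps_per_port: int) -> int:
    """Total raw line rate in gigabits per second, one direction."""
    return ports * gbps_per_port


def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert gigabits/s to gigabytes/s (8 bits per byte, overhead ignored)."""
    return gbps / 8


total_gbps = aggregate_gbps(PORTS, GBPS_PER_PORT)
total_gb_per_s = gbps_to_gigabytes_per_s(total_gbps)
print(f"{total_gbps} Gbps aggregate, about {total_gb_per_s:.0f} GB/s raw per direction")
```

Real-world throughput would be lower once framing and protocol overhead are included.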

Tenstorrent Hot Chips 2024_Page_05

The sixteen big RISC-V cores can run Linux. The other 752 RISC-V cores are called “baby” cores. They are programmable using C kernels, but they do not run Linux.

The baby RISC-V cores are programmable and are used for compute, moving data, and storage.

Tenstorrent Hot Chips 2024_Page_07

This is a look at the Tensix core with its five baby RISC-V cores.

Tenstorrent Hot Chips 2024_Page_08

There are also two routers that connect to the NOC.

Tenstorrent Hot Chips 2024_Page_09

The user can write a compute kernel and two data movement kernels on each Tensix core.
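To make the one-compute-plus-two-data-movement split concrete, here is a rough conceptual sketch in Python. It is not TT-Metalium code. It assumes one data movement kernel reads tiles in and the other writes results out, with the compute kernel in between, and it models the three kernels as threads joined by small bounded queues. All names (reader, compute, writer, run_pipeline) are hypothetical.

```python
import queue
import threading

SENTINEL = None  # marks the end of the tile stream


def reader(src, to_compute):
    """Hypothetical data movement kernel #1: brings tiles in."""
    for tile in src:
        to_compute.put(tile)  # blocks while the small buffer is full
    to_compute.put(SENTINEL)


def compute(to_compute, to_writer):
    """Hypothetical compute kernel: here it just doubles every element."""
    while (tile := to_compute.get()) is not SENTINEL:
        to_writer.put([2 * x for x in tile])
    to_writer.put(SENTINEL)


def writer(to_writer, dst):
    """Hypothetical data movement kernel #2: sends results out."""
    while (tile := to_writer.get()) is not SENTINEL:
        dst.append(tile)


def run_pipeline(tiles, depth=2):
    """Run the three stages concurrently with buffers of `depth` tiles."""
    to_compute = queue.Queue(maxsize=depth)
    to_writer = queue.Queue(maxsize=depth)
    out = []
    threads = [
        threading.Thread(target=reader, args=(tiles, to_compute)),
        threading.Thread(target=compute, args=(to_compute, to_writer)),
        threading.Thread(target=writer, args=(to_writer, out)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


result = run_pipeline([[1, 2], [3, 4], [5, 6]])
print(result)
```

The point of the sketch is that data movement and math overlap: while the compute stage works on one tile, the reader can already be fetching the next.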

Tenstorrent Hot Chips 2024_Page_10

Here is a bit more on the data movement kernels.

Tenstorrent Hot Chips 2024_Page_11

Zooming into the routing, the NOC is statically scheduled. The routers move data up and to the left, or down and to the right.

Tenstorrent Hot Chips 2024_Page_12

The cores can be used to do simple or complex operations depending on what is required.

Tenstorrent Hot Chips 2024_Page_13

Moving on to the compute engine, there is a tile math engine and a vector math engine.

Tenstorrent Hot Chips 2024_Page_14

The Tile engine operates on a 32×32 tile.
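For a feel of what working in 32×32 tiles means, here is a small pure-Python sketch. It assumes matrix multiplication as the example operation (the slide text above does not specify one) and pads partial tiles by rounding up. The helper names are invented for illustration and say nothing about how the hardware actually schedules tiles.

```python
TILE = 32  # tile edge length quoted above


def tiles_needed(rows, cols, tile=TILE):
    """Number of tile rows and tile columns, rounding up for partial tiles."""
    return (-(-rows // tile), -(-cols // tile))


def tiled_matmul(a, b, tile=TILE):
    """Multiply a (n x k) by b (k x m), visiting one tile-sized block at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # tile row of the output
        for j0 in range(0, m, tile):      # tile column of the output
            for k0 in range(0, k, tile):  # accumulate over tiles along k
                for i in range(i0, min(i0 + tile, n)):
                    for kk in range(k0, min(k0 + tile, k)):
                        a_ik = a[i][kk]
                        for j in range(j0, min(j0 + tile, m)):
                            c[i][j] += a_ik * b[kk][j]
    return c


size = 64
matrix = [[(i * 3 + j) % 7 for j in range(size)] for i in range(size)]
identity = [[1 if i == j else 0 for j in range(size)] for i in range(size)]
product = tiled_matmul(matrix, identity)
print(tiles_needed(size, size), product == matrix)
```

A 64×64 matrix splits into a 2×2 grid of tiles, so each output tile is built from two tile-by-tile products.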

Tenstorrent Hot Chips 2024_Page_15

Here is more on the vector math engine:

Tenstorrent Hot Chips 2024_Page_16

One user compute kernel is automatically compiled to 3 RISC-V threads.

Tenstorrent Hot Chips 2024_Page_17

Here is the kernel synchronization. There is hardware-enabled flow control to help synchronize kernels.
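Flow control between kernels can be pictured as a fixed-size buffer where the producer stalls when it is full and the consumer stalls when it is empty. The toy model below is a software-only sketch of that idea, stepped one tick at a time. The class, the method names, and the "consumer is slower" scenario are assumptions for illustration, not Tenstorrent's hardware mechanism or API.

```python
from collections import deque


class BoundedBuffer:
    """Toy fixed-capacity buffer shared by a producer and a consumer kernel."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = deque()

    def can_push(self):
        return len(self.slots) < self.capacity

    def can_pop(self):
        return len(self.slots) > 0

    def push(self, item):
        if not self.can_push():
            raise RuntimeError("producer must stall: buffer full")
        self.slots.append(item)

    def pop(self):
        if not self.can_pop():
            raise RuntimeError("consumer must stall: buffer empty")
        return self.slots.popleft()


def simulate(items, capacity, consumer_period):
    """Step a fast producer and a slower consumer; count producer stalls."""
    buf = BoundedBuffer(capacity)
    pending = deque(items)
    received, stalls, step = [], 0, 0
    while pending or buf.can_pop():
        # The producer tries to push one item every step.
        if pending:
            if buf.can_push():
                buf.push(pending.popleft())
            else:
                stalls += 1  # back-pressure: wait for a free slot
        # The consumer only pops once every `consumer_period` steps.
        if step % consumer_period == consumer_period - 1 and buf.can_pop():
            received.append(buf.pop())
        step += 1
    return received, stalls


received, stalls = simulate(range(6), capacity=2, consumer_period=2)
print(received, stalls)
```

Data always arrives in order, and a deeper buffer simply absorbs more of the rate mismatch before the producer has to wait.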

Tenstorrent Hot Chips 2024_Page_18

Here is how data moves to and from the off-chip DRAM. Overall though, the idea is to keep data local in SRAM and to use the external DRAM as little as possible.

Tenstorrent Hot Chips 2024_Page_19

Ethernet is a big deal within the Tenstorrent architecture.

Tenstorrent Hot Chips 2024_Page_20

One of the key ideas is that Blackhole will use Ethernet to scale out. Ethernet has the advantage of having regular performance updates, and just about everyone uses it in the industry at some level of their architecture. This is how Tenstorrent is getting a lot of scaling without designing something like NVLink or InfiniBand.

Tenstorrent Hot Chips 2024_Page_22

The above shows a 2×2 mesh of Blackholes. AI has a lot of data locality, so these meshes are efficient.

The Blackhole Galaxy will have 32 chips in a 4 x 8 mesh topology.
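Here is a small sketch of what a 4 x 8 mesh implies for neighbors and hop counts. It assumes a plain 2D mesh with no wraparound links and one link per adjacent pair of chips; whether the real Galaxy wiring matches that is not stated here, and the function names are hypothetical.

```python
ROWS, COLS = 4, 8  # 4 x 8 mesh from the article, 32 chips total


def neighbors(r, c, rows=ROWS, cols=COLS):
    """Chips directly linked to (r, c), assuming no wraparound links."""
    out = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            out.append((nr, nc))
    return out


def link_count(rows=ROWS, cols=COLS):
    """Chip-to-chip links, counting each adjacent pair once."""
    total = sum(len(neighbors(r, c, rows, cols))
                for r in range(rows) for c in range(cols))
    return total // 2  # every link was counted from both ends


def hops(a, b):
    """Minimum hop count between two chips in a mesh (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


print(ROWS * COLS, "chips,", link_count(), "links,",
      hops((0, 0), (ROWS - 1, COLS - 1)), "hops corner to corner")
```

Under these assumptions, corner chips have two neighbors, interior chips have four, and the worst-case path crosses ten links.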

Tenstorrent Hot Chips 2024_Page_23

The idea is that one can scale out by adding more boxes to the network.

Tenstorrent Hot Chips 2024_Page_24

TT-Metalium is part of the company’s low-level programming model to turn hardware into something useful for running AI.

Tenstorrent Hot Chips 2024_Page_26

Here is a bit on Tenstorrent open source software.

Tenstorrent Hot Chips 2024_Page_27

Here is another one on the integrations.

Tenstorrent Hot Chips 2024_Page_28

All cool stuff here.

Final Words

If you want to try out Tenstorrent hardware, we just covered Tenstorrent Wormhole Developer Kits Launched. It seems that Blackhole is not as simple to buy, but the systems that these go in are much larger.

Between RISC-V and the usage of Ethernet, Tenstorrent is going the way of open systems for AI acceleration, which is cool to see. If you want to know why we were so excited to see the Inside a Marvell Teralynx 10 51.2T 64-port 800GbE Switch piece and video, this is one reason that 51.2T Ethernet is going to be a big deal in the industry. AI chips like Blackhole are using high-end Ethernet to scale out.
