The IBM Z Telum processor is IBM's new processor for mainframes. These chips do not compete directly with traditional x86 processors because they are designed for very specific markets such as transaction processing and financial services. This is being done live, so please excuse typos.
IBM Z Telum Mainframe Processor at Hot Chips 33
The IBM Z Telum processor is designed to accelerate many of the ultra-high-value workloads run at places like financial institutions processing credit card and banking transactions. As a result, customers are willing to pay quite a bit not just for performance, but also for low latency, reliability, and security.
Like we are seeing on the x86 side, we have encrypted memory and trusted execution. This trend is reaching all of the server markets and IBM has been focused on these capabilities for some time.
There is new error correction and sparing so that the system keeps running even when an entire L2 cache array has an issue, driving beyond 99.99999% availability (seven 9s). This is a 7nm Samsung chip with 22.5 billion transistors and a base clock over 5GHz. To get some sense of the achievement, IBM has a huge chip running at speeds that enthusiast overclockers would be happy with, and running it reliably.
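For a sense of what seven 9s means in practice, here is a quick back-of-the-envelope calculation (the availability target is from IBM's talk; the downtime figures are just simple arithmetic):

```python
# Rough downtime budget implied by a given availability target.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def downtime_per_year(availability: float) -> float:
    """Seconds of allowed downtime per year at the given availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR

print(f"Five 9s  (99.999%):   {downtime_per_year(0.99999):7.1f} s/year")   # ~315.6 s
print(f"Seven 9s (99.99999%): {downtime_per_year(0.9999999):7.2f} s/year") # ~3.16 s
```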
A lot of the effort is around per-core performance improvements. The core uses SMT2 and runs at over 5GHz. There is integrated L1 and L2 branch prediction. Dynamic BTB entry reconfiguration is designed to provide a large number of branch entries. The L2 cache is quadrupled versus z15 to 32MB. Four pipelines go to the L2 to allow overlapping traffic.
The virtual L3 and L4 caches are built from the L2 caches. All of the L2 caches are interconnected with a ring infrastructure. When a cache line is evicted, an L2 cache can evict it to another L2 cache on the chip, forming a virtual L3 cache. This can then scale across up to 8 chips for a virtual L4 cache.
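To make the idea concrete, here is a minimal sketch of the eviction flow described above: an evicted L2 line is offered to a peer L2 on the same chip (virtual L3), and failing that to an L2 on another chip (virtual L4), before falling back to memory. The class names, capacities, and peer-selection policy are illustrative assumptions, not IBM's actual implementation.

```python
class L2Cache:
    """Toy model of a per-core L2 that can also host lines evicted by peers."""
    def __init__(self, core_id: int, chip_id: int, capacity_lines: int = 4):
        self.core_id = core_id
        self.chip_id = chip_id
        self.capacity = capacity_lines
        self.lines = []

    def accept(self, line: str) -> bool:
        if len(self.lines) < self.capacity:
            self.lines.append(line)
            return True
        return False

def evict(line: str, owner: L2Cache, all_l2s: list) -> str:
    """Place an evicted line in a peer L2: same chip first (virtual L3),
    then another chip (virtual L4), otherwise fall back to memory."""
    same_chip = [c for c in all_l2s if c.chip_id == owner.chip_id and c is not owner]
    other_chip = [c for c in all_l2s if c.chip_id != owner.chip_id]
    for peer in same_chip + other_chip:
        if peer.accept(line):
            level = "virtual L3" if peer.chip_id == owner.chip_id else "virtual L4"
            return f"{line} -> core {peer.core_id} on chip {peer.chip_id} ({level})"
    return f"{line} -> main memory"

# Two chips with two cores each; core 0's eviction lands in a peer L2.
l2s = [L2Cache(core, chip) for chip in range(2) for core in range(2)]
print(evict("line A", l2s[0], l2s))
```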
IBM Telum can scale up to 32 interconnected chips. The basic building blocks are modules with one or two chips. These can be placed into a four-socket drawer, and four drawers can then be combined.
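For a quick sanity check on the scale-out math (the eight-cores-per-chip figure is inferred from 256MB of L2 per chip at 32MB per core, and the virtual L4 reach follows from the eight chips in a drawer):

```python
# Scale-out arithmetic for a maximal Telum configuration.
chips_per_module = 2        # dual-chip module
sockets_per_drawer = 4
drawers_per_system = 4
l2_per_chip_mb = 256        # 8 cores x 32MB L2 each

chips_per_drawer = chips_per_module * sockets_per_drawer
total_chips = chips_per_drawer * drawers_per_system
total_cores = total_chips * (l2_per_chip_mb // 32)

print(f"{total_chips} chips / {total_cores} cores in a full system")
print(f"Virtual L4 reach per drawer: {chips_per_drawer * l2_per_chip_mb / 1024:.0f}GB")
print(f"Total L2 across the system:  {total_chips * l2_per_chip_mb / 1024:.0f}GB")
```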
Here is a look at a 2-chip package with two of these 256MB cache chips for a total of 512MB. IBM is heading toward the gigabytes-of-onboard-cache era in the future.
By removing the separate cache chip from the drawer and using the virtual L4 cache instead, fewer hops are required, but that fabric and interface had to move on-chip.
The net result of the improvements is a 40% per-socket performance increase.
Each IBM Z generation tends to be a big jump in capability and performance simply because of the cadence of the market.
IBM Z Telum AI Inference Accelerator
IBM now has an AI inference accelerator directly on the chip. This is for high-performance, low-latency inference. The CPU cores do not have AI accelerators inside the cores themselves. As a result, each core can access the full accelerator directly instead of only having access to the AI inference performance within that core. For example, on an Intel Xeon, each core only has access to its own VNNI performance; to get more VNNI performance, more cores need to be active.
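A toy model of the distinction may help: with per-core vector units, available AI throughput scales with how many cores are running AI work, while a shared on-chip accelerator exposes its full throughput even to a single core. The numbers below are placeholders for illustration; the shared-accelerator figure roughly matches the per-chip number IBM quotes below, and the per-core figure is invented.

```python
# Hypothetical figures purely for illustration (not measured for either chip).
PER_CORE_VECTOR_TFLOPS = 0.2
SHARED_ACCEL_TFLOPS = 6.0

def per_core_design(active_ai_cores: int) -> float:
    """Aggregate AI throughput when each core only has its own vector unit."""
    return active_ai_cores * PER_CORE_VECTOR_TFLOPS

def shared_accel_design(active_ai_cores: int) -> float:
    """Even a single core can drive the full shared on-chip accelerator."""
    return SHARED_ACCEL_TFLOPS if active_ai_cores > 0 else 0.0

for cores in (1, 4, 8):
    print(f"{cores} core(s) running AI: per-core units {per_core_design(cores):.1f} TFLOPS, "
          f"shared accelerator {shared_accel_design(cores):.1f} TFLOPS")
```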
One of the important aspects here is that having the AI accelerator on-chip means data does not have to go off-chip. In the mainframe industry, security is important, so having all of the firmware and encryption protections around the processor apply to the AI accelerator is important as well.
The performance of the AI accelerator is over 6 TFLOPS per chip, or up to roughly 200 TFLOPS in a 32-chip system.
Part of this is also ensuring IBM can get enough bandwidth to the AI inference array.
IBM says one can train on NVIDIA GPUs or elsewhere, export to ONNX as a more open model format, and then use the IBM Deep Learning accelerator to embed inference into applications on the IBM Z processor.
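As a sketch of the first half of that workflow, the snippet below shows the PyTorch-to-ONNX export step; the subsequent compilation and deployment onto the Telum accelerator uses IBM's own tooling and is not shown here. The model architecture and file name are placeholders.

```python
import torch
import torch.nn as nn

# Placeholder fraud-scoring model; in practice this would be trained
# elsewhere (e.g. on NVIDIA GPUs) before export.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
    nn.Sigmoid(),
)
model.eval()

# Export to ONNX, the open interchange format IBM points to for bringing
# models onto IBM Z. Compilation for the on-chip accelerator is handled by
# IBM's stack and is not shown here.
dummy_input = torch.randn(1, 16)
torch.onnx.export(
    model,
    dummy_input,
    "fraud_score.onnx",
    input_names=["features"],
    output_names=["score"],
    dynamic_axes={"features": {0: "batch"}, "score": {0: "batch"}},
)
```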
The basic idea is that AI can be embedded at scale. For example, one can do credit card fraud scoring directly on the chip at very low latency. This means real-time inference at scale so it can be embedded into transactions.
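To illustrate the idea of inference embedded in the transaction path, here is a minimal sketch that scores the exported model above with onnxruntime. On a real Telum system the scoring would be dispatched to the on-chip accelerator through IBM's own runtime rather than a generic ONNX runtime, and the feature layout and threshold are invented for the example.

```python
import numpy as np
import onnxruntime as ort

# Load the exported model once at application start-up.
session = ort.InferenceSession("fraud_score.onnx")

def process_transaction(features: np.ndarray) -> str:
    """Score a transaction inline and approve or flag it before committing."""
    score = session.run(None, {"features": features})[0].item()
    return "FLAG_FOR_REVIEW" if score > 0.9 else "APPROVE"

# One incoming transaction, described by 16 placeholder features.
txn_features = np.random.rand(1, 16).astype(np.float32)
print(process_transaction(txn_features))
```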
This is a key use case for IBM Z, so we can see why IBM is showcasing it: it has a huge dollar value to its customers.
Final Words
IBM also has a huge number of reliability-related capabilities so that if parts of the chips and systems fail, they continue to run. We discussed issues in the cache, and basic ECC memory is a given, but IBM is able to recover from additional types of failures well beyond what x86 systems can. High performance and high reliability (along with legacy code compatibility) are exactly what those keeping the financial industry running want.
IBM has a Telum video available: https://www.youtube.com/watch?v=TRBgBbvYiQw
BTW: Telum means javelin in Latin.