NVIDIA EGX A100 Launched: Tesla Plus Mellanox Vision

NVIDIA EGX A100 Converged Accelerator Cover

One of the coolest cards we saw at the NVIDIA GTC 2020 keynote was not a pure GPU. Instead, it is the NVIDIA EGX A100. This is a marriage of a GPU and NIC all in a single PCIe card. The card features an NVIDIA A100 Ampere-based GPU package along with a Mellanox ConnectX-6 Dx NIC. That means one can get 200Gbps of networking plus a GPU on a single card.

NVIDIA EGX A100

The NVIDIA EGX A100 is a product NVIDIA needed to show at this GTC. With the recent Mellanox acquisition, NVIDIA needed to show it has a vision for combining fabrics and GPUs. That is exactly what we see with the EGX A100.

NVIDIA EGX A100 Overview

One can see the traditional PCIe (Gen4) card. One will quickly notice that there is a rounded edge, likely for a cooling solution. There are also two QSFP28 100Gbps ports. Using ConnectX-6 Dx VPI IP, the company gets various networking and security offloads. One can also use the card to connect to either InfiniBand or Ethernet fabrics.
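For a rough sense of scale, here is a back-of-the-envelope sketch comparing the card's aggregate network line rate with the raw bandwidth of its host slot. The per-port rate and port count come from the description above. The x16 slot width is our assumption, since only "PCIe (Gen4)" is stated, and the function and constant names are purely illustrative.

```python
# Back-of-the-envelope bandwidth comparison (illustrative figures only).

PORT_GBPS = 100   # per-port line rate quoted in the article
NUM_PORTS = 2     # two QSFP28 ports on the card

# PCIe Gen4 runs at 16 GT/s per lane with 128b/130b encoding.
# ASSUMPTION: a x16 slot; the article only says "PCIe (Gen4)".
PCIE_GEN4_GT_PER_LANE = 16
PCIE_LANES = 16
ENCODING_EFFICIENCY = 128 / 130


def network_gbps(ports: int = NUM_PORTS, per_port: float = PORT_GBPS) -> float:
    """Aggregate network line rate in Gbps."""
    return ports * per_port


def pcie_gbps(lanes: int = PCIE_LANES,
              gt_per_lane: float = PCIE_GEN4_GT_PER_LANE) -> float:
    """Raw one-direction PCIe bandwidth in Gbps, before protocol overhead."""
    return lanes * gt_per_lane * ENCODING_EFFICIENCY


if __name__ == "__main__":
    print(f"Network:         {network_gbps():.0f} Gbps")
    print(f"PCIe Gen4 x{PCIE_LANES}:  {pcie_gbps():.1f} Gbps")
```

Under those assumptions, the two ports together come close to what a Gen4 x16 slot can carry in one direction, which helps explain why keeping NIC-to-GPU traffic on the card is attractive.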

NVIDIA EGX A100 Ecosystem

NVIDIA is already touting a large ecosystem for its EGX platform. In some ways, this is what Mellanox and NVIDIA were trying to accomplish with existing products. This is a new level of integration, so hopefully we will see new classes of solutions arise from this type of device.

The Impact

The impact of the NVIDIA EGX A100 is not saving a PCIe slot. Instead, it is NVIDIA moving in the direction of CPU offload. With its SmartNIC capabilities, the EGX A100 can be connected to the network via InfiniBand or Ethernet. Another option is to use InfiniBand for GPU-to-GPU communication and Ethernet to get data from NVMe-oF storage. That data can then be securely moved to the onboard NVIDIA A100 GPU. The GPU can do the processing it needs and then send data back out over the network, without host CPU intervention.

If one looks at what NVIDIA is doing with this product, it is essentially the first step in disaggregating GPU compute from x86-based CPU servers. While these cards are still likely to be used in PCIe slots in standard servers, the EGX A100 gives NVIDIA an opportunity to show real bypass of the host system. As we discussed over a year ago, when NVIDIA moved to purchase Mellanox, in “NVIDIA to Acquire Mellanox a Potential Prelude to Servers,” the next step is a BlueField version of this device.

Assuming NVIDIA is relentless in moving to this model, the potential cost savings are enormous. NVIDIA is already working on full pipeline offload for large application areas such as Apache Spark 3.0. The next step is adding network-attached GPUs to existing clusters to greatly speed up workloads without adding new x86 servers built on CPUs from competitors such as AMD or Intel.

While the Ampere-generation A100 is a big deal, the EGX A100 may prove to be the most impactful part of this announcement when we look back five years from now.

Note: We got late word that this is now the NVIDIA A100 without the “Tesla” branding.

7 COMMENTS

  1. Is the NIC connected directly to the GPU? I don’t see a PCIe switch, very interesting technology

  2. @Thomas
    It may not need any if the nvLink remains based upon PCIe. The A100 reportedly has 12 nvLink ports so it has plenty to spare on the PCIe card form factor even after the two nvLink ports on the top of the card and the PCIe 16x connector.

  3. Hmm, could we then see a multi-chiplets design with GPU chiplets, DPU (bluefield) chiplets (Jensen likes to call them Data Processing Units), and custom ARM CPU chiplets on a single interposer? Now you don’t need an x86 host at all.

  4. @ReaktorField I think the A100 is already at the interposer size limit. But better interconnects are coming, and Nvidia will undoubtedly be one of the first to use them.

    And in a way, Jetson is that kind of product, even though it's a monolithic SoC.

  5. This is designed for high frequency trading shops where latency must be super low at all cost. This goes way beyond RDMA. Awesome tech.
