Graphcore Celebrates a Stunning Loss at MLPerf Training v1.0


Pricing and How NVIDIA Sells Data Center GPUs: The Quick Primer

Central to Graphcore’s comparison is its value proposition. Even comparing an optimized Open result to the standardized Closed division NVIDIA result, Graphcore’s IPU-POD16 is slower than NVIDIA. Where it shows a benefit, and what it focuses on in its blog, is price. To understand what is going on here, one needs to get into pricing and how the NVIDIA A100 is sold.

Unlike in the era of our How to Install NVIDIA Tesla SXM2 GPUs in DeepLearning12 piece from the NVIDIA Tesla P100 generation, NVIDIA now sells GPUs as part of three primary assemblies: PCIe, Redstone, and Delta.

The first and easiest to understand is the PCIe form factor. Here is a shot from our ASUS RS720A-E11-RS24U review. We will also have these in the GPU-accelerated Dell EMC PowerEdge R750xa review that will likely go live in July. This form factor is perhaps the most familiar to folks.

ASUS RS720A E11 RS24U 4x NVIDIA A100 Stack 3

As a quick note, depending on the system layout, the NVIDIA A100 PCIe cards can utilize NVLink bridges. That means there is a GPU-to-GPU communication path that does not go over PCIe, even with the A100 PCIe cards. Some in the market assume that the NVIDIA A100 PCIe can only utilize PCIe for GPU-to-GPU communication, but that is not entirely true.

2x NVIDIA A100 PCIe With NVLink Bridges Installed

The second way that the NVIDIA A100 is sold these days is in what is codenamed the “Redstone” platform, or the HGX A100 4 GPU. This places four NVIDIA A100 GPUs on a PCB with direct NVLink between them, and it is then sold as an HGX platform for OEMs to integrate.

Supermicro HGX A100 4x GPU Board Redstone Example

Above is a Supermicro rendering, but one can see these below in an image from the upcoming Dell EMC PowerEdge XE8545 review. The key change with Redstone is that NVIDIA is selling the four GPU solution pre-assembled. If you saw our video on installing SXM2 GPUs, you will know how delicate they are. One major OEM applied too thick a layer of thermal paste to its V100 heatsinks and managed to crack a large number of GPUs as a result. With Redstone, you do not buy SXM4 GPUs and install them on a vendor PCB. Instead, an OEM purchases the NVIDIA Redstone platform and integrates it into a server, feeding it data, power, and cooling. The Dell EMC PowerEdge XE8545 below is a feat of engineering as it can air cool the four 500W A100s on the Redstone platform so long as data center ambient temperatures are on the chilly side.

Dell EMC PowerEdge XE8545 EPYC 7003 To NVIDIA A100 Redstone

Then comes the “Delta” platform, which is known as the HGX A100 platform (both names and markings are used on the actual part from NVIDIA). Readers of STH will recognize this from our look at the previous generation.

We will have the next-generation Inspur A100 server review live (hopefully) next week. Effectively this takes 8x NVIDIA A100 SXM4 packages along with NVSwitches, power delivery components, and cooling and puts them onto a single PCB. The assembly is then packaged as the HGX A100 platform and OEMs integrate it into their custom chassis.

NVIDIA HGX A100 GPUs Small

Just to impress upon readers how standard the HGX A100 8 GPU “Delta” platform is, even OEM systems sold under a non-NVIDIA brand have the PCB clearly marked with NVIDIA branding along with the A100 heatsinks. One can see the smaller label (there are others on the PCB) which says “Delta” and “HGX A100 8 GPU.”

NVIDIA Delta Label And HGX A100 8 GPU Label

NVIDIA does not sell the majority of its SXM4 A100s in its own DGX systems. Instead, DGX A100 platforms are often sold in the window before OEMs can get the HGX A100, which helps NVIDIA boost margins. Also, the DGX A100 is a set configuration. OEMs offer different customized configurations along with all of the rack integration, service, support, and management integrations that are part of their value add.

This is actually very important. NVIDIA is selling these Delta platforms to OEMs as pre-assembled units ready to integrate. Therefore, NVIDIA can extract value from its IP by providing this board. OEMs provide value by putting a server platform around it. As a great example, the aforementioned PowerEdge XE8545 uses effectively the same motherboard found in the Dell EMC PowerEdge R7525 we reviewed. We are going to go into the Graphcore packaging later, but while OEMs build their own chassis around the HGX A100 8 GPU Delta board, with Graphcore they can use standard off-the-shelf servers.

Circling back to the results we saw above, this is why the Supermicro (1.0-1085) and NVIDIA (1.0-1059) results are so close. The main HGX A100 8 GPU assembly is effectively a standardized unit, in some ways conceptually similar to the A100 PCIe in terms of integration into a specific OEM’s platform, just scaled way up. That is one reason why Jensen sometimes calling the 8x GPU HGX A100 a single GPU makes some sense if you understand the go-to-market model. Supermicro has some customizations with Intel CPUs and cooling, but running the same software stack that primarily stresses the same HGX A100 GPU assembly gives us effectively the same result as the NVIDIA DGX A100.

Graphcore MLPerf Training V1.0 Closed Division Image Classification ResNet Results

This is key to the Graphcore discussion as Graphcore could not show a performance advantage. Instead, it focused on a price/performance advantage, but it picked a relatively niche comparison point for pricing by using the low-volume DGX A100 instead of OEM platforms based on the same HGX A100.

NVIDIA HGX A100 Pricing

Graphcore used the NVIDIA-branded DGX A100 640GB (with 8x 80GB 400W GPUs), to which it assigned a $299,999 price tag. Since 8x A100 systems are usually $100K+ systems, and they are typically purchased as clusters, not single units, pricing is generally negotiated. Companies are not going to Amazon’s storefront, putting in a credit card, and purchasing multi-million dollar clusters of GPU systems at list price without support. Pricing is negotiated and is usually at the cluster level, not the GPU server level.

Pricing, of course, varies. Word on the street is that, at a cluster level, moving from PCIe GPUs to SXM4 GPUs in the compute nodes is a ~10-20% price uplift for the same number of GPUs. The benefit is the higher performance, which is why SXM4 GPUs are very popular and why scaling power in the A100 is important. Offering higher-power A100s in SXM4 form pushes buyers from PCIe systems to systems that use HGX board assemblies where NVIDIA has more value add. One has to remember that when these AI training clusters are installed there is also networking (InfiniBand or Ethernet) involved with all of the switches, cabling, and so forth. There is also, at minimum, a high-speed tier of storage to feed the GPU compute systems. Changes in GPU pricing are relatively small in the context of complete clusters.
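To put a rough number on that dilution, below is a minimal back-of-envelope sketch in Python. Every figure in it (the node count, node price, and the fabric and storage share) is a hypothetical placeholder chosen only to illustrate the mechanism, not a quoted price.

```python
# Illustrative only: how a node-level GPU price uplift dilutes at the
# cluster level. Every number below is a hypothetical placeholder.
nodes = 16
node_price = 150_000  # hypothetical SXM4 compute node price
fabric_and_storage = 0.35 * nodes * node_price  # hypothetical network + storage tier

baseline = nodes * node_price + fabric_and_storage
with_uplift = nodes * node_price * 1.15 + fabric_and_storage  # +15% on nodes only

print(f"Cluster-level impact of a 15% node uplift: {with_uplift / baseline - 1:.1%}")
# Prints ~11.1%: a mid-range SXM4 uplift shrinks once the rest of the cluster is counted.
```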

On the topic of NVIDIA DGX A100 pricing, Graphcore is using $299,999 in its charts and has this to say about how it came up with that price:

The NVIDIA DGX-A100 640GB used in MLPerf is a 6U box with a list price of around $300,000 (based on market intelligence and published reseller pricing) and eight DGX A100 chips. (Source: Graphcore Blog)

Let us just get to it. HGX A100 platforms, when they are sold as single servers, are generally in the $130K-$180K range, even leaving a very healthy margin for OEMs/resellers. One can generally upgrade from the 40GB to 80GB Delta HGX A100 (320GB to 640GB) for around $16K, potentially more or less depending on the channel and discount structures. Graphcore calls the 40GB to 80GB premium “1.5x”, but it is actually well under 20% for an OEM HGX A100 320GB v. 640GB platform. If you have lower-end CPUs, memory, storage, and networking, and the 320GB HGX A100, you can be on the lower end of that scale. If you are using the 640GB HGX A100 and higher-end components, as used in the MLPerf configurations, you are likely on the higher end of the scale. Of course, we are assuming a very healthy margin and including on-site service in those numbers. A low-margin reseller passing along NVIDIA rebates, or big deal pricing, will drop pricing from there.
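As a quick sanity check on that premium, here is a small sketch using the figures above. The ~$16K upgrade cost and the $130K-$180K street-list range are this article’s channel estimates, not quoted prices.

```python
# Quick arithmetic on the 320GB -> 640GB (40GB -> 80GB GPU) HGX A100 premium,
# using this article's channel estimates rather than quoted prices.
upgrade_cost = 16_000              # approximate 320GB -> 640GB uplift
street_list = [130_000, 180_000]   # OEM HGX A100 single-server range

for base in street_list:
    print(f"${base:,} platform: 640GB premium is about {upgrade_cost / base:.0%}")
# Prints roughly 12% and 9% -- "well under 20%", nowhere near the +50%
# that the "1.5x" DGX-based comparison implies.
```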

It is hard to discuss deal-specific discounts, but that $130K-$180K range is very reasonable for a single HGX A100 system at list configured pricing and without deal discounts. One can configure these systems online and get that list price range quoted without talking to a sales rep about further discounts. Instead of doing a basic “check the list price on a web configurator for a spec listed in the results” bit of market research, Graphcore decided to use $299,999. Over the lifecycle of a box like this (once GPUs are available outside of just DGX systems), customers either migrate to OEM HGX A100 systems if they want price/performance, or they are getting something else as an effective discount to the DGX A100 pricing.

Bringing it together, we saw the Supermicro (1.0-1085) and NVIDIA (1.0-1059) results being close in terms of performance. The street list price for an NVIDIA HGX A100-based system is about half of what Graphcore is using in its price/performance comparison. While NVIDIA may not publish list prices, the DGX A100 needs to be price competitive with OEM HGX A100 systems that offer more bundling, service, and customization options. “Market intelligence” should have guided Graphcore to a $150K-$180K price point for a comparable HGX A100-based server, again excluding deal discounts and leaving a healthy margin for channel partners. That may have been conservatively looking at pricing in Graphcore’s favor, but it would have been at least reasonable.

Graphcore Blog MLPerf Training V1.0 RESNET 50

Even if we use the ultra-conservative high end of this price range at $180K, which makes me cringe as being too high, Graphcore’s comparison point above would show significantly worse price/performance for the IPU-POD16 than NVIDIA A100-based systems.
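For those who want to check that claim, here is a rough price multiplied by time-to-train figure of merit (lower is better), using the MLPerf ResNet-50 times discussed in this article: 37.12 minutes for the IPU-POD16 and 28.77 minutes for the DGX A100 (result 1.0-1059). The $149,995 IPU-POD16 price is an assumption for illustration only, since Graphcore’s own figure is not quoted in this section; the NVIDIA prices are the estimates above.

```python
# A crude price x time-to-train metric (lower is better). Prices are the
# article's estimates; the IPU-POD16 price is an assumed illustration.

def dollar_minutes(price_usd: float, train_minutes: float) -> float:
    """'Cost times time' figure of merit for a single ResNet-50 training run."""
    return price_usd * train_minutes

systems = {
    "DGX A100 @ Graphcore's $299,999": dollar_minutes(299_999, 28.77),
    "OEM HGX A100 @ $130K": dollar_minutes(130_000, 28.77),
    "OEM HGX A100 @ $180K": dollar_minutes(180_000, 28.77),
    "IPU-POD16 @ $149,995 (assumed)": dollar_minutes(149_995, 37.12),
}

for name, dm in systems.items():
    print(f"{name}: {dm / 1e6:.2f}M dollar-minutes")
# Even at the $180K high end (~5.18M), the HGX A100 system comes in below the
# assumed IPU-POD16 (~5.57M); at $130K (~3.74M) it is far below.
```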

Now that we understand that the Graphcore IPU-POD16 is slower, how NVIDIA’s go-to-market model works, and why pricing is roughly equivalent between Graphcore and competitive NVIDIA-based offerings, we still need to look at other factors. One factor is power consumption, which we will explore next.

19 COMMENTS

  1. This is why STH is great.
    – Vendor “Yo we’ve got some BS to peddle.”
    – STH “OK but here’s how it works and some actual data.”
    You’ve been critical of NVIDIA before so it is good to see you’re being consistent.

  2. We buy hundreds or thousands of GPU servers a year. You’re correct that we’re paying half of the $300K that Graphcore is using. Calling that “fantasy” is correct if you’re comparing real world price performance. We get a discount over what you’re using for the HGX A100’s but it isn’t that far off.

  3. You could’ve stopped with just people don’t pay 300K for a system with 8 11-15k GPUs because that’d be stupid. Great analysis tho. Graphcore sells to AI shops that have smart engineers. I don’t see why they’ve used 300K since that’s not real. We also have A100 systems and they weren’t 300K.

  4. You’re missing that the point of the article is not just about Graphcore.

    It’s really a cautionary article for other AI startups looking to submit to MLPerf.

    I’d only add that Google competes well in MLPerf but others shouldn’t even try.

  5. I agree that comparing list price of high-end equipment is of little use because the procurement policy at most organizations for such expensive stuff involves a special contract negotiated by a purchasing department.

    Thanks for including the energy analysis at the end of the article. From a technology point of view, performance per watt is much more interesting. It may also be more relevant in terms of total cost of ownership.

    In the future it would be nice to measure total energy and time needed for each system to complete the same task for a number of different machine learning related activities.

  6. You are right that Graphcore shouldn’t have participated in MLPerf.

    My comment is you should make the PCIe-RedStone-Delta its own article so it is easy to reference. That and the DELTA label pic are golden.

  7. Patrick, do you have any similar real-world pricing for the Graphcore solutions? NVIDIA is sold at 50% of MSRP but if Graphcore is similarly discounted then their comparison still holds.

  8. ListVLower, if you read the article, he’s saying the LIST price of the HGX A100 is in the 150-180k range before discounts. This is a very educated view of pricing because all of this is priced at the cluster level, not at the node.

  9. It’s a little silly to make the price comparison because any organization that is buying a specialized training computer is spending a multiple of what it costs on developers, “data scientists”, labeling and such.

    The devs already know something about the NVIDIA ecosystem and won’t feel like they’re wasting time learning more. (e.g. as it is widely used in ML, image synthesis, etc.) NVIDIA also has some nice tools that would be hard for GraphCore to match. Dev(s) can even develop with laptop & desktop cards w/ NVIDIA but w/ Graphcore Dev(s) will have to fight for access to the server and put up with a slower training cycle so add it all up and they will be less productive and you’ll pay back whatever you saved on the machine — if you really saved on the machine.

    In inference there is more of a case that new tech can be transformative, but not if we are talking about a factor of 1.2x; more if there is a factor of 12x. (Incredible if memory bandwidth is anywhere near being the bottleneck)

  10. An alternative takeaway is that this is the first time we get real numbers out of these startups, and they’re not good. “Focusing on customers” is not going to do much if your platform is not competitive.

    I think saying that all these startups chose to not participate is being generous. I guarantee that everyone who has working hardware ran the suite. The absence of good results should be taken as a signal that they aren’t competitive, not that they didn’t bother trying.

    —–

    As always, great in-depth reporting!

  11. The text under the “ML Perf training: ResNet-50” image states
    “Graphcore highlights its 32.12 minutes to complete the benchmark while the NVIDIA DGX A100 takes 28.77 minutes which is the result 1.0-1059.”
    The time shown in that image for Graphcore is 37.12, not 32.12

  12. There seems to be typos in:
    “One major OEM had too think of thermal paste”
    and
    “Instead, and OEM purchases the NVIDIA Redstone platform and integrates it”
    You might like to fix. Thanks :-)

  13. This is one where STH shows why its #1 in server analysis. Usually the next platform does good stuff but they swung and missed on this one. The HGX A100 8 GPU bits, power, and the “cluster” view where the training systems are only part of the deployed solution are relevant and this is an insane level of depth.

    Just know we appreciate it.

  14. “We will have a review of the Inspur NF5488A5 next week, but the 1.0-1035 result is using the 500W variant so we should exclude it here. They have been offered by most OEMs for some time, but are not listed on NVIDIA’s current A100 data sheet which only lists 500W models.”

    only lists.. 400W models?

  15. Amazing stuff! For years I was getting the impression that those gazillions of startups would be much more competitive than they now actually seem.
    That NVIDIA is holding up so well with GPU-derived silicon against supposedly hyper-optimized silicon exclusively designed for AI is stunning.
    Just a naive view from an AI novice.
    Again, really amazing article!
