With the latest set of MLPerf results, there is a clear trend: NVIDIA and Google dominate. Between them, the NVIDIA Tesla V100-based solutions and the Google TPU solutions took the majority of benchmark wins.
MLPerf v0.6 Results
Improvements in MLPerf v0.6 include the following:
- Raises quality targets for image classification (ResNet) to 75.9%, light-weight object detection (SSD) to 23% mAP, and recurrent translation (GNMT) to 24 Sacre BLEU. These changes better align the quality targets with the state of the art for these models and datasets.
- Allows use of the LARS optimizer for ResNet, enabling additional scaling (a minimal sketch of the LARS update appears after this list).
- Experimentally allows a slightly larger set of hyperparameters to be tuned, enabling faster performance and some additional scaling.
- Changes timing to start the first time the application accesses the training dataset, thereby excluding startup overhead. This change was made because the large-scale systems measured are typically used with much larger datasets than those in MLPerf, and hence normally amortize the startup overhead over much greater training time (see the timing sketch after this list).
- Improves the MiniGo benchmark in two ways. First, it now uses a standard C++ engine for the non-ML compute, which is substantially faster than the prior Python engine. Second, it now assesses quality by comparing to a known-good checkpoint, which is more reliable than the previous very small set of game data.
- Suspends the Recommendation benchmark while a larger dataset and model are being created. (Source: MLPerf.org)
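For context on the LARS change above: LARS (Layer-wise Adaptive Rate Scaling) scales each layer's learning rate by the ratio of that layer's weight norm to its gradient norm, which keeps updates stable at the very large batch sizes used in scaling runs. Below is a minimal sketch of the core update with momentum omitted; the function name and constants are our own illustration, not MLPerf reference code.

```python
import numpy as np

def lars_step(w, grad, base_lr=1.0, trust_coef=0.001, weight_decay=1e-4):
    """One LARS update for a single layer's weights (momentum omitted)."""
    g = grad + weight_decay * w  # gradient with L2 regularization folded in
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(g)
    # Layer-wise trust ratio: scale the step to this layer's weight magnitude
    local_lr = trust_coef * w_norm / (g_norm + 1e-9) if w_norm > 0 else 1.0
    return w - base_lr * local_lr * g

# Example: each layer's step is sized relative to its own weight norm
w = np.random.randn(256, 128).astype(np.float32)
grad = np.random.randn(256, 128).astype(np.float32)
w = lars_step(w, grad)
```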
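The timing change is also worth illustrating. Under the v0.6 rule, the clock starts the first time the training dataset is touched rather than at process launch. A toy wrapper like the following shows the idea; the class and its structure are our own illustration, not the actual MLPerf harness.

```python
import time

class TimedDataset:
    """Toy wrapper: the benchmark clock starts on first data access."""
    def __init__(self, dataset):
        self._dataset = dataset
        self.start_time = None  # set lazily, per the v0.6 timing rule

    def __getitem__(self, idx):
        if self.start_time is None:
            # Timing begins the first time training data is touched, so
            # container spin-up and framework initialization are excluded.
            self.start_time = time.time()
        return self._dataset[idx]

    def __len__(self):
        return len(self._dataset)
```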
You can check out the results, including Google TPU v3s, NVIDIA Tesla V100s in the company's DGX systems, and Intel Cascade Lake generation results, here. There are not many results, so we will let our readers browse them directly.
Final Words
For AI hardware startups, as well as both AMD and Intel, this should be concerning. NVIDIA is demonstrating scaling to supercomputer levels, pushed by Google's TPUs. Software has major implications for scale-out training performance, and Google's cloud experience plus NVIDIA's supercomputer experience, aided by the announced Mellanox deal, are paying dividends. We covered the NVIDIA DGX Superpod 96 Node Top500 Supercomputer recently. NVIDIA is setting these records using 2017-era GPUs. With the Superpod sitting in the #22 spot on the Top500 list, NVIDIA is showing its parts are not just AI-enabled, but can handle simulation alongside AI. When your investment is $50M+ in hardware, keeping it filled with work is important, and having more types of jobs the machine is good at helps ensure it does not sit idle.
From an MLPerf perspective, these results are impressive but also problematic. MLPerf is heading in a direction where it takes tens of millions of dollars of hardware to produce headline benchmark results. For those buying 1-, 2-, 4-, 8-, or 10-GPU systems not based on the Tesla V100, or a few racks of lower-cost systems, these results are not useful. Likewise, Intel's two-node dual Platinum 9200 series result showcases a server that OEMs are not even supporting.
MLPerf needs to get results that are indicative of more hardware configurations to help support a broader ecosystem.