HPE is well known as one of the largest server vendors in the world, pushing boundaries across different corners of the market, from small businesses to the largest corporations. Greg Schmidt, Product Manager, Apollo HPC and AI at HPE, was gracious enough to sit down and answer questions on behalf of HPE for our Deep Learning and AI Q3 2019 Interview Series. As HPE is one of the longtime leaders in the HPC space, and with the recent convergence of HPC and AI architectures, we wanted to get a sense of the trends HPE sees in the market.
HPE Talks Deep Learning and AI Q3 2019
In this series, we sent major server vendors several questions and gave them carte blanche to interpret and answer them as they saw fit. Our goal is simple: provide our readers with unique perspectives from the industry. Each person in this series is shaped by their background, company, customer interactions, and unique experiences. The value of the series lies both in the individual answers and in what they collectively say about how the industry views its future.
Training
Who are the hot companies for training applications that you are seeing in the market? What challenges do they face in taking on NVIDIA’s dominance in the space?
From a hardware standpoint, there are many startups, and quite a few large and established market competitors, trying to design better ASICs for deep learning (DL) training. NVIDIA was both lucky and smart early in the DL explosion. Lucky in that GPUs are naturally designed to do massive vector and matrix math, and DL requires exactly that; smart in that they recognized that DL was a fundamentally better approach to AI than those used in the past, and they invested early and significantly in building competent products for DL.
Deep learning models did not require “subject matter experts” in bulk to work with coders and build machine learning (ML) code. Instead, the DL codes/models would figure out what mattered and what the key patterns were, and solve the problem. From there, a problem solved for one effort was usually easily transferable to another similar effort.
For example, a model designed to convert U.S. English speech to text could be retrained for Chinese with relatively minor changes, mainly increasing the size of the output layers to account for Chinese characters being whole words, not just letters.
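To make that transfer idea concrete, here is a minimal sketch of swapping only the output layer of a pretrained model to cover a larger character vocabulary. It is written in PyTorch purely for illustration; the model structure, layer sizes, vocabulary counts, and file names are assumptions for the sketch, not HPE’s or any vendor’s actual pipeline.

```python
import torch
import torch.nn as nn

# Hypothetical pretrained English speech-to-text model: an encoder producing
# per-frame features, followed by a linear output layer over ~29 symbols
# (letters plus a few special tokens). All sizes are illustrative.
class SpeechToText(nn.Module):
    def __init__(self, feature_dim=512, vocab_size=29):
        super().__init__()
        self.encoder = nn.LSTM(input_size=80, hidden_size=feature_dim,
                               num_layers=4, batch_first=True)
        self.output = nn.Linear(feature_dim, vocab_size)

    def forward(self, x):
        features, _ = self.encoder(x)
        return self.output(features)  # per-frame logits over the vocabulary

model = SpeechToText()
# model.load_state_dict(torch.load("english_s2t.pt"))  # hypothetical English weights

# Transfer to Chinese: reuse the encoder, replace only the output layer so it
# spans a much larger character vocabulary (each character is roughly a word).
chinese_vocab_size = 8000  # illustrative; real systems choose their own inventory
model.output = nn.Linear(model.output.in_features, chinese_vocab_size)

# Optionally freeze the encoder at first and train just the new output layer.
for p in model.encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(model.output.parameters(), lr=1e-3)
```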
How are form factors going to change for training clusters? Are you looking at OCP’s OAM form factor for future designs or something different?
The most well-known example here is NVIDIA’s own NVLink-enabled SXM GPUs, but we also see various startups with a range of custom designs, and proposals from major players (OCP OAM, Intel, etc.) to bring out custom form factors. The benefits from the hardware side for the ASIC vendor are fairly obvious: more freedom to design high-speed communications between the ASICs, more freedom in the power envelope, and so on. These designs are likely to make “the ultimate” performance package for an ASIC.
What is usually missed is the downside of departing from uniform standards. From the customer’s standpoint, PCIe is a recognized standard and provides upgrade flexibility for years. Conversely, from a server manufacturer’s perspective, PCIe was “stuck” at Gen3 for a while, which arguably was not sufficient for all accelerator workloads. In fact, for a while it looked like deep learning would really benefit from vastly faster communication between GPUs, as DL models performed a tremendous amount of comparison between information held in GPU-resident memory.
In the race to build more accurate models that exceed human capability, DL models grew each year, with the top models each year being roughly 10x more computationally complex than the year before. Today, top models are largely compute-bound, and the fraction of time spent in communication is low enough that it seldom matters for overall performance. Simple models still benefit immensely from greater communication bandwidth, but they are “simple” and do not take much time to train in any case.
You could argue that for the majority of the customers, PCIe is “good enough.”
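One way to sanity-check the “good enough” argument for a specific workload is a back-of-envelope comparison of per-step compute time against gradient all-reduce time. The helper below is only a rough sketch: it ignores overlap of communication with the backward pass, and every number in the example call is a placeholder to be replaced with figures for your own model and interconnect.

```python
def step_time_estimate(params, flops_per_sample, per_gpu_batch,
                       gpu_flops, link_bandwidth, bytes_per_grad=4):
    """Rough per-step compute vs. gradient-exchange time for data-parallel training.

    params           -- number of model parameters
    flops_per_sample -- forward+backward FLOPs for one sample
    per_gpu_batch    -- samples processed per GPU per step
    gpu_flops        -- sustained FLOP/s of one GPU
    link_bandwidth   -- bytes/s of the slowest inter-GPU link (PCIe, NVLink, ...)
    """
    compute_s = flops_per_sample * per_gpu_batch / gpu_flops
    # A ring all-reduce moves roughly 2x the gradient volume over each link.
    comm_s = 2 * params * bytes_per_grad / link_bandwidth
    return compute_s, comm_s

# Placeholder numbers only; substitute your own model and hardware figures.
compute_s, comm_s = step_time_estimate(
    params=1e8, flops_per_sample=3e10, per_gpu_batch=32,
    gpu_flops=6e13, link_bandwidth=1.2e10,
)
print(f"compute ~{compute_s*1e3:.1f} ms, gradient exchange ~{comm_s*1e3:.1f} ms")
```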
PCIe is now out of an almost decade-long rut and is racing down a path to Gen4, and even Gen5, in a few years. New form factors are tough to plan for in servers. With a CPU, you have a pretty good idea of what is coming, and when, 2-3 years ahead of time, and it is largely “iterating” on past success, so you can focus on getting the supporting elements “right.” Custom form factors, for relatively small markets, with far less advance notice, are harder.
What kind of storage back-ends are you seeing as popular for your deep learning training customers? What are some of the lessons learned from your customers that STH readers can take advantage of?
A few years ago, deep learning at the cluster level was predominantly using high-performance computing (HPC) technologies and techniques such as MPI, Lustre, etc. As we move into a bigger market opportunity, two interesting things have occurred, adding to the techniques we see deployed:
- We see more folks trying to leverage “what you have” as enterprise customers enter DL and try to use existing data stores, whether they were originally built for SAP, Hadoop, etc.
- There is a need for specialized high-speed storage for big deep learning problems that demand both high capacity and fast access, such as WekaIO Matrix, a high-performance, scalable, parallel file system that is ideal for AI, technical computing, and mixed workloads.
What storage and networking solutions are you seeing as the predominant trends for your AI customers? What will the next generation AI storage and networking infrastructure look like?
A few years back, the solutions were “HPC”: all InfiniBand and Lustre, or InfiniBand and custom file systems built out by the large service providers. Similar to the storage discussion above, we see more and more use of “enterprise” technologies for the network. The broad support of RDMA over Converged Ethernet (RoCE) has clearly enabled customers to leverage and expand using technologies they are more familiar with, to provide robust network configurations in an enterprise data center.
Over the next 2-3 years, what are trends in power delivery and cooling that your customers demand?
Power and cooling might be the leading constraint for many customers wishing to deploy powerful deep learning training clusters. As an example, today’s HPE Apollo 6500 Gen10 systems are an ideal infrastructure for HPC and deep learning, with support for up to eight NVLink-enabled NVIDIA GPUs, but each system consumes around 3-3.5kW. The trend from all accelerator vendors is to keep going up in power at the node level. On the flip side, most data centers simply do not have the power and cooling capacity to support more than a handful of servers at this level. So how do customers respond?
Customers who are more advanced in DL, or who have been deploying for years, are upgrading their data centers to support these higher-power systems, with either higher-capacity air cooling plans or even direct liquid cooling in their future. For immediate needs, many customers are turning to solutions like the HPE Adaptive Rack Cooling System, which uses facility water to increase computing power without adding to the heat load in the data center. The HPE Adaptive Rack Cooling System supports up to 150 kW and 10,000 CFM of cooling capacity to allow full rack densities.
What should STH readers keep in mind as they plan their 2019 AI clusters?
Don’t wait to get everything perfect! It is far, far more important to begin gaining your expertise today, develop some successful solutions, and get going. I had a customer a few years back whom I spoke to at a show for 10-15 minutes with this message, and he took it to heart. The next year he sought me out and nearly hugged me: they had started out small, got a single HPE Apollo 6500 system, found their data, cleaned it, and run a few pilots. The pilots were hugely successful, helped them improve customer satisfaction, and at the same time reduced time to revenue for their business. Deep learning moved from “experimentation” to “top tool” in only one year! The AI revolution may be a “winner take all” game in some industries, so the sooner you get started, the better.
Inferencing
Are you seeing a lot of demand for new inferencing solutions based on chips like the NVIDIA Tesla T4? Are your customers demanding more FPGAs in their infrastructures?
Deep learning algorithms, once developed, must be used and deployed for business value. Most enterprise customers still find that they can deploy the inference part of the work on traditional CPUs, leveraging their existing infrastructure efficiently. Those who “need more,” whether it is higher-throughput inferencing (for video analysis, as an example) or lower latency (many financial institutions, for example), often turn to T4 GPUs and FPGAs. We definitely see both expanding, and HPE is working closely with NVIDIA for GPUs, and with Intel and others for FPGAs.
An FPGA offers high I/O bandwidth, plus a fine-grained, flexible and custom parallelism, allowing it to be programmed for many different types of workloads, including Big Data analytics, financial services, and deep learning. If a GPU is something like a hammer, an FPGA is like Doctor Who’s sonic screwdriver, an adaptable tool that can be used to solve many different types of problems.
Ultimately, it is a big and diverse market. Implementing the right deep learning solution can be complex. Which system and accelerators are best? Where can you get advice? Which ISV should you use? HPE understands AI. Our proven, practical approach, validated solutions and partners, AI-optimized infrastructure, and AI software platform reduce complexity and help you realize the value of your data faster, helping you gain a competitive edge. HPE is ensuring our customers have a broad set of options, whatever they might need.
Who are the big accelerator companies that you are working with in the AI inferencing space?
HPE is actively working with NVIDIA, Intel, AMD, and a host of companies focused on accelerator technologies, from FPGAs to custom ASICs.
Are there certain form factors that you are focusing on to enable in your server portfolio? For example, Facebook is leaning heavily on M.2 for inferencing designs.
The top ten to twenty-five internet companies are far down the path of using inference in their business, whether it is to respond to searches, display relevant products at the right time, or even to manage their infrastructure. They are moving fast and experimenting with many novel approaches to inference, and they often have enough scale that it makes sense for them to customize and develop on their own. The challenge is that these solutions are often so “new” that getting software support, certifications, and global support can be impossible.
An enterprise customer needs the same performance, but also requires “more” than simply performance: they need to meet their corporate data center requirements and certifications. HPE works with vendors like NVIDIA, Intel, AMD, and others to provide high-performance solutions with those “extras” a corporate buyer must have, whether it is support for the broadest range of operating systems, the “right” networking card or Fibre Channel adapter, or simply reliable support tailored to the customer wherever and whenever they need it. We have this nailed!
HPE Apollo 6500 Gen10 systems have enough capacity to individually solve immense problems, like updating a DL model overnight to recognize all the passengers flying through a major international airport.
Customers can also scale out quickly, simply, and efficiently with our new HPE BlueData GPUaaS and build specialized distributed deep learning clusters, which also allow them to carve up the hardware and allocate users 2, 4, or 8 GPUs.
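How HPE BlueData implements its GPU allocation is its own topic, but as a generic illustration of carving a node’s GPUs into per-user slices, the sketch below simply restricts which devices each job can see via CUDA_VISIBLE_DEVICES, which CUDA-based frameworks honor. The job scripts and device counts are hypothetical, and this is not the BlueData mechanism itself.

```python
import os
import subprocess

# Hypothetical per-node inventory: 8 GPUs, handed out in slices of 2, 4, or 8.
TOTAL_GPUS = 8

def launch_job(command, gpu_ids):
    """Run a training/inference job that can only see the listed GPUs.

    CUDA honors CUDA_VISIBLE_DEVICES, so frameworks such as PyTorch or
    TensorFlow inside the job enumerate only these devices as 0..N-1.
    """
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return subprocess.Popen(command, env=env)

# Example: give user A a 2-GPU slice and user B a 4-GPU slice of the same node.
# The scripts are placeholders for whatever workload each user runs.
job_a = launch_job(["python", "train_user_a.py"], gpu_ids=[0, 1])
job_b = launch_job(["python", "train_user_b.py"], gpu_ids=[2, 3, 4, 5])
```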
What percentage of your customers today are looking to deploy inferencing in their server clusters? Are they doing so with dedicated hardware or are they looking at technologies like 2nd Generation Intel Xeon Scalable VNNI as “good enough” solutions?
Most of our customers are using inferencing today, and they might not even recognize it. AI is pervasive in many solutions for network security, anti-virus, etc. And, by and large, it is done on Intel Xeon. We have customers running huge in-memory databases on our Superdome Flex servers, simultaneously doing high-volume, low-latency inference on transactions and recursively training the model, all in one server with up to 16 accelerators.
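The “good enough” CPU path usually means INT8. As a rough illustration of getting a model into an integer form that CPU instructions such as VNNI can accelerate, here is a sketch using PyTorch’s dynamic quantization as a stand-in; the toy model and sizes are assumptions, and this is not tied to any HPE or Intel toolchain.

```python
import torch
import torch.nn as nn

# A small stand-in model; in practice this would be a trained network.
model = nn.Sequential(
    nn.Linear(256, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
).eval()

# Dynamic quantization converts the Linear layers' weights to INT8 so that
# inference can use integer kernels on the CPU (backends such as FBGEMM can
# use AVX-512 VNNI where the processor supports it).
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    batch = torch.randn(32, 256)   # placeholder inference batch
    logits = quantized(batch)      # runs on the CPU, no accelerator needed
    print(logits.shape)            # torch.Size([32, 10])
```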
What should STH readers keep in mind as they plan their 2019 server purchases when it comes to AI inferencing?
Get started on training, find success in POCs and pilots, and leverage your success to expand. For learning resources, visit Deep Learning for Dummies.
Application
How are you using AI and ML to make your servers and storage solutions better?
HPE InfoSight uses cloud-based machine learning to provide global insights into the status and health of infrastructure, all in one location. It predicts and prevents problems across the infrastructure stack and ensures optimal performance and efficient resource use. Similarly, Aruba Networks, an HPE company, applies machine learning to a broad range of networking and security challenges, including RF optimization, network assurance, IoT visibility, and advanced attack detection. By selecting the right problems to address, having access to sufficient data, knowing what data is most relevant, and then curating the results in the most demanding of real-world environments, Aruba sets the standard in delivering AI-powered secure, autonomous networks.
Where and when do you expect an IT admin will see AI-based automation take over a task that is now so big that they will have a “wow” moment?
Today, with HPE InfoSight. HPE InfoSight constantly analyzes millions of sensor data points every minute, allowing all customers to benefit as their systems get smarter and more reliable.
Final Words
Thank you to Greg for taking the time to answer our Q&A. STH has been around for over ten years, and I checked: this was our first Doctor Who reference. If you skimmed the section about getting started sooner rather than waiting for perfection, I highly suggest going back to Greg’s “Don’t wait to get everything perfect!” anecdote. These are exactly the kinds of great tidbits this series is designed to bring to our readers.
Great interview, thank you!