NVIDIA Quadro RTX 8000 GPU Review


NVIDIA Quadro RTX 8000 Deep Learning Benchmarks

As we continue to innovate on our review format, we are now adding deep learning benchmarks. In future reviews, we will add more results to this data set. At this point, we have a fairly nice data set to work with.

ResNet-50 Inferencing Using Tensor Cores

ImageNet is an image classification database launched in 2007 designed for use in visual object recognition research. Organized by the WordNet hierarchy, hundreds of image examples represent each node (or category of specific nouns).

For our inferencing benchmarks, we run a ResNet-50 model trained in Caffe using the following command line.

nvidia-docker run --shm-size=1g --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm -v ~/Downloads/models/:/models -w /opt/tensorrt/bin nvcr.io/nvidia/tensorrt:18.11-py3 giexec --deploy=/models/ResNet-50-deploy.prototxt --model=/models/ResNet-50-model.caffemodel --output=prob --batch=16 --iterations=500 --fp16

Options are:
--deploy: Path to the Caffe deploy (.prototxt) file used for training the model
--model: Path to the model (.caffemodel)
--output: Output blob name
--batch: Batch size to use for inferencing
--iterations: The number of iterations to run
--int8: Use INT8 precision
--fp16: Use FP16 precision (for Volta or Turing GPUs); if no precision flag is given, FP32 is used

We vary the batch size across 16, 32, 64, and 128, and the precision across INT8, FP16, and FP32.
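
As a rough illustration of that test matrix, the sketch below builds the giexec argument list for every batch size and precision combination. It is only a sketch: the helper names `giexec_args` and `sweep` are made up for this example, the paths are copied from the command above, and the surrounding nvidia-docker invocation is left out.

```python
from itertools import product

BATCH_SIZES = [16, 32, 64, 128]
# Precision name -> giexec flag. FP32 needs no flag because it is the default.
PRECISION_FLAGS = {"int8": "--int8", "fp16": "--fp16", "fp32": None}


def giexec_args(batch, precision, iterations=500):
    """Build the giexec argument list for one benchmark configuration."""
    args = [
        "giexec",
        "--deploy=/models/ResNet-50-deploy.prototxt",
        "--model=/models/ResNet-50-model.caffemodel",
        "--output=prob",
        f"--batch={batch}",
        f"--iterations={iterations}",
    ]
    flag = PRECISION_FLAGS[precision]
    if flag is not None:
        args.append(flag)
    return args


def sweep():
    """Return every (batch, precision, args) combination in the test matrix."""
    return [
        (batch, precision, giexec_args(batch, precision))
        for precision, batch in product(PRECISION_FLAGS, BATCH_SIZES)
    ]


if __name__ == "__main__":
    for batch, precision, args in sweep():
        print(precision, batch, " ".join(args))
```

Four batch sizes and three precisions give twelve runs per GPU.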

The results are reported as inference latency in seconds. Dividing the batch size by the latency gives the throughput in images/sec, which is what we plot on our charts.
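
The conversion from latency to throughput is simple enough to sketch in a few lines. The function name and the sample numbers below are illustrative assumptions, not measurements from this review.

```python
def throughput(batch_size, latency_seconds):
    """Images per second for one batch that took latency_seconds to process."""
    if latency_seconds <= 0:
        raise ValueError("latency must be positive")
    return batch_size / latency_seconds


# Illustrative numbers only: a batch of 64 images finishing in half a second.
print(throughput(batch_size=64, latency_seconds=0.5))  # 128.0
```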

We also found that this benchmark does not use two GPUs; it only runs on a single GPU. You can, however, run a separate instance on each GPU using commands like these:
```
NV_GPU=0 nvidia-docker run ... &
NV_GPU=1 nvidia-docker run ... &
```

With these commands, a user can scale workloads across many GPUs. Our graphs show combined totals.
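
A combined total here simply means adding up the throughput of the independent instances. A minimal sketch, assuming each instance reports its own latency for the same batch size (the function name and the latencies are placeholders, not measured values):

```python
def combined_throughput(batch_size, latencies_seconds):
    """Sum images/sec over independent instances, one latency per GPU."""
    return sum(batch_size / latency for latency in latencies_seconds)


# Placeholder latencies for two GPUs running one instance each.
print(combined_throughput(batch_size=64, latencies_seconds=[0.5, 0.25]))  # 384.0
```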

We start with Turing’s new INT8 mode, which is one of the benefits of using the NVIDIA RTX cards.

NVIDIA Quadro RTX 8000 ResNet50 Inferencing INT8 Precision

INT8 is by far the fastest inferencing precision; if at all possible, converting code to INT8 will yield faster runs. Installed memory has one of the largest impacts on these benchmarks. Inferencing on NVIDIA RTX graphics cards does not tax the GPUs a great deal, but additional memory allows for larger batch sizes, and the NVIDIA Quadro RTX 8000 could easily handle larger batch sizes than the ones we ran.

Let us look at FP16 and FP32 results.

NVIDIA Quadro RTX 8000 ResNet50 Inferencing FP16 Precision
NVIDIA Quadro RTX 8000 ResNet50 Inferencing FP32 Precision

As we expected, the Quadro RTX 8000 and Titan RTX are very close here. We could likely use larger batch sizes; however, we are starting to see diminishing returns even at 128.

ResNet-50 Training using Tensor Cores and TensorFlow

We also wanted to train the venerable ResNet-50 using TensorFlow. During training, the neural network learns features of images (e.g. objects, animals, etc.) and determines which features are important. Periodically (every 1000 iterations), the neural network tests itself against the test set to determine training loss, which affects the accuracy of training the network. Accuracy can be increased through repetition, that is, by running a higher number of epochs.

The command line we will use is:

nvidia-docker run --shm-size=1g --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v ~/Downloads/imagenet12tf:/imagenet --rm -w /workspace/nvidia-examples/cnn/ nvcr.io/nvidia/tensorflow:18.11-py3 python resnet.py --data_dir=/imagenet --layers=50 --batch_size=128 --iter_unit=batch --num_iter=500 --display_every=20 --precision=fp16

Parameters for resnet.py:
--layers: The number of neural network layers to use, e.g. 50.
--batch_size or -b: The number of ImageNet sample images to use for training the network per iteration. Increasing the batch size will typically increase training performance.
--iter_unit or -u: Specify whether to run batches or epochs.
--num_iter or -i: The number of batches or iterations to run, e.g. 500.
--display_every: How frequently training performance will be displayed, e.g. every 20 batches.
--precision: Specify FP32 or FP16 precision; FP16 also enables Tensor Core math on Volta and Turing GPUs.

While this TensorFlow script has no option for selecting individual GPUs, they can be chosen by running export CUDA_VISIBLE_DEVICES= followed by a comma-separated list of GPU indices (e.g. 0,1,2,3) within the Docker container workspace.
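
The same variable can also be set from Python before the framework initializes CUDA. This is a minimal sketch rather than part of resnet.py; the helper name `set_visible_gpus` is an assumption made for this example.

```python
import os


def set_visible_gpus(gpu_ids):
    """Restrict CUDA to the given GPU indices for this process and its children.

    This must run before the framework initializes CUDA, otherwise it has
    no effect on the already-created device list.
    """
    value = ",".join(str(gpu_id) for gpu_id in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Expose only the first two GPUs to whatever is launched next.
set_visible_gpus([0, 1])
```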

We will run batch sizes of 16, 32, 64, 128 and change from FP16 to FP32. Our graphs show combined totals.

NVIDIA Quadro RTX 8000 ResNet50 Training FP16 Precision
NVIDIA Quadro RTX 8000 ResNet50 Training FP32 Precision

Some GPUs, including the new Super cards (GeForce RTX 2060 Super, RTX 2070 Super, and RTX 2080 Super) as well as the RTX 2080 Ti, will not complete higher batch size runs because of limited memory. This is another example where large memory sizes pay off. ResNet-50 is not exactly groundbreaking today. For researchers who simply need lots of memory to keep models on the GPU instead of traversing the PCIe complex to main memory, the Quadro RTX 8000 should be something to look at. It has 50% more memory than a Tesla V100S and costs less.

Deep Learning Training Using OpenSeq2Seq (GNMT)

While ResNet-50 is a Convolutional Neural Network (CNN) that is typically used for image classification, Recurrent Neural Networks (RNN) such as Google Neural Machine Translation (GNMT) are used for applications such as real-time language translation.

The command line we use for OpenSeq2Seq (GNMT) is as follows.

nvidia-docker run -it --shm-size=1g --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v ~/Downloads/OpenSeq2Seq/wmt16_de_en:/opt/tensorflow/nvidia-examples/OpenSeq2Seq/wmt16_de_en -w /workspace/nvidia-examples/OpenSeq2Seq/ nvcr.io/nvidia/tensorflow:18.11-py3

We then open en-de-gnmt-like-4GPUs.py and edit our variables.

vi example_configs/text2text/en-de/en-de-gnmt-like-4GPUs.py

First, edit data_root to point to the below path:
data_root = "/opt/tensorflow/nvidia-examples/OpenSeq2Seq/wmt16_de_en/"

Additionally, edit the num_gpus, max_steps, and batch_size_per_gpu parameters under base_params to set the number of GPUs, run a lower number of steps (e.g. 500) for benchmarking, and set the batch size:
base_params = {
...
"num_gpus": 1,
"max_steps": 500,
"batch_size_per_gpu": 128,
...
},
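
For repeated runs, the same three values could be applied programmatically instead of by hand. The helper below is a hypothetical sketch, not part of OpenSeq2Seq; it only assumes that base_params is an ordinary Python dictionary, as it appears in the config file.

```python
import copy


def benchmark_overrides(base_params, num_gpus=1, max_steps=500,
                        batch_size_per_gpu=128):
    """Return a copy of base_params with the benchmark settings applied.

    The original dictionary is left untouched so that several batch sizes
    can be derived from the same starting config.
    """
    params = copy.deepcopy(base_params)
    params.update({
        "num_gpus": num_gpus,
        "max_steps": max_steps,
        "batch_size_per_gpu": batch_size_per_gpu,
    })
    return params


# Derive one config per batch size from a placeholder starting point.
starting_point = {"num_gpus": 4}
configs = [benchmark_overrides(starting_point, batch_size_per_gpu=b)
           for b in (64, 128)]
```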

We also edit lines 44 and below as shown to enable FP16 precision:

#"dtype": tf.float32, # to enable mixed precision, comment this line and uncomment two below lines
"dtype": "mixed",
"loss_scaling": "Backoff",

We then run the benchmarks as follows.

python run.py --config_file example_configs/text2text/en-de/en-de-gnmt-like-4GPUs.py --mode train

The result is reported as Avg. Objects per second trained, which is what we plot.

We should note that other GPUs we have tested, like the RTX 2060, RTX 2070, RTX 2080, and RTX 2080 Ti, could not complete this benchmark due to a lack of memory. To get this benchmark to finish on these GPUs, one might need to lower the batch size to smaller values like 32, 16, or 8. We tried this but had no luck. A batch size of 4 could be run, but we decided that this was not a very usable size.
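
That trial-and-error search can be sketched as a loop that steps down through candidate batch sizes until one completes. Everything here is an assumption for illustration: `run_benchmark` is a hypothetical callable, and `MemoryError` stands in for whatever out-of-memory failure the real training run reports.

```python
def largest_working_batch(run_benchmark, candidates=(128, 64, 32, 16, 8, 4)):
    """Try batch sizes from largest to smallest; return the first that completes.

    run_benchmark is a hypothetical callable that raises MemoryError when the
    GPU runs out of memory. Returns None if no candidate fits.
    """
    for batch_size in sorted(candidates, reverse=True):
        try:
            run_benchmark(batch_size)
        except MemoryError:
            continue
        return batch_size
    return None


def fake_run(batch_size):
    """Stand-in for a real run: pretend anything above 4 exhausts memory."""
    if batch_size > 4:
        raise MemoryError(f"batch size {batch_size} does not fit")


print(largest_working_batch(fake_run))  # 4
```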

NVIDIA Quadro RTX 8000 OpenSeq2Seq Training FP16 Mixed Precision

The NVIDIA Quadro RTX 8000 has 48GB of installed memory, double that of the Titan RTX. It easily equals the Titan RTX in performance and allows larger batch sizes on a single GPU.

NVIDIA Quadro RTX 8000 OpenSeq2Seq Training FP32 Mixed Precision

While the GeForce RTX 2080 Ti cannot complete these benchmarks with 11GB of memory, the NVIDIA Tesla T4 can run batch sizes up to 64 using FP32. The Quadro RTX 8000 can utilize ECC memory like the T4, yet its larger onboard memory allows a batch size of 192, where its performance is close to 3x that of the T4.

Next, we are going to look at the NVIDIA Quadro RTX 8000 power and temperature tests and then give our final words.

8 COMMENTS

  1. I disagree with the closing statement, the real competition to this card when you don’t need the memory capacity but do need additional performance is a dual RTX5000 setup and not a single RTX6000. Other than that great article and overview of the RTX8000

  2. For deep learning, or (most?) any machine learning, ECC RAM is unnecessary. 48GB is great though. The more the better.

  3. Something appears to have gone wrong with the Octane benchmark. There is typically a really small difference between the 2080ti and the RTX 8000. Are we sure those are the correct results? :)

  4. @Nejc The scene probably needs to address some out-of-core memory on 11GB VRAM cards which is not needed on RTX Titan & RTX 6000/8000

  5. Quadros would blow the doors off all these cards if they weren’t built for wait for it…. REDUNDANCY. Meaning ECC capabilities. Stop comparing Professional grade cards with basic consumer-grade cards. If you cannot comprehend what redundancy is or why it is needed on a Professional basis then you morons just need to learn to shut up. Quadros are not “gaming” cards. Idiots.
