NVIDIA AI INFERENCE PLATFORM
AI INFERENCE IS EXPLODING
Creating a $20 Billion Opportunity in the Next 5 Years

- LIVE VIDEO: 1 billion videos watched per day (Facebook)
- SPEECH: 1 billion voice searches per day (Google, Bing, etc.)
- RECOMMENDATIONS: 1 trillion ads/rankings served per day (impressions)
GPU INFERENCE ADOPTION IS ACCELERATING
Rapid Adoption of Tesla P4 and TensorRT

Inference use cases: visual search, video analysis, advertising, image maps, NLP, search, speech

- 60X latency improvement for real-time search
- 12X faster inference for live video analysis
- 40X higher performance for real-time brand-impact video
A CAMBRIAN EXPLOSION OF DL MODELS

- Convolutional networks: encoder/decoder, ReLU, concat, dropout, BatchNorm, pooling
- Recurrent networks: LSTM, GRU, beam search, WaveNet, CTC, attention
- Generative adversarial networks: 3D-GAN, Coupled GAN, MedGAN, Conditional GAN, Speech Enhancement GAN
- Reinforcement learning: DQN, DDPG, simulation
- New species: Capsule Nets, Mixture of Experts, Neural Collaborative Filtering, Block-Sparse LSTM
NEURAL NETWORK COMPLEXITY IS EXPLODING
Bigger and More Compute Intensive

[Chart: model growth, 2011-2016. Speech models (DeepSpeech, DeepSpeech 2, DeepSpeech 3) and image models (GoogleNet, ResNet50, U-Net, MaskRCNN, spanning classification, segmentation, and enhancement), with model sizes reaching 1.9 GB and 480 MB.]
INEFFICIENCY LIMITS INNOVATION
Difficulties with Deploying Data Center Inference (e.g., ASR, NLP, recommenders)

- Single model only: some systems are overused while others are underutilized
- Single framework only: solutions can only support models from one framework
- Custom development: developers need to reinvent the plumbing for every application
ANNOUNCING
TENSORRT HYPERSCALE INFERENCE PLATFORM

- World's most advanced inference GPU
- Integrated into frameworks & ONNX support
- TensorRT Inference Server
ANNOUNCING TESLA T4
World's Most Advanced Inference GPU
Universal Inference Acceleration

- 320 Turing Tensor Cores
- 2,560 CUDA cores
- 65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS
- 16 GB | 320 GB/s
NEW TURING TENSOR CORE
Multi-Precision for AI Inference

65 TFLOPS FP16 | 130 TOPS INT8 | 260 TOPS INT4
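The INT8 and INT4 modes rely on quantizing FP32/FP16 values down to small integers. A minimal sketch of symmetric INT8 quantization, the idea behind reduced-precision inference (illustrative only; TensorRT's calibrator chooses scales by minimizing information loss on calibration data, not simply from the max value):

```python
# Symmetric per-tensor INT8 quantization sketch (illustrative, not
# TensorRT's implementation).

def quantize_int8(values):
    """Map floats to int8 range [-127, 127] using a per-tensor scale."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from quantized integers."""
    return [v * scale for v in q]

activations = [0.02, -1.27, 0.5, 1.0, -0.33]
q, scale = quantize_int8(activations)
restored = dequantize(q, scale)
# Round-trip error is bounded by half a quantization step (scale / 2).
```

Integer math at this width is what lets the Tensor Cores double throughput again from FP16 to INT8, at the cost of this bounded rounding error.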
ANNOUNCING
NVIDIA TensorRT 5
Fastest Deep Learning Inference Platform

TensorRT optimizer and runtime: from frameworks to GPU platforms across data center, embedded & automotive (Tesla V100, Tesla P4, Jetson TX2, DRIVE PX 2, NVIDIA DLA)

- In-framework support for TensorFlow
- Support for all other frameworks and ONNX
- Containerized inference serving engine with Docker and Kubernetes integration
- New layers and APIs*
- New OS support for Windows and CentOS*

Optimizations: layer & tensor fusion, precision calibration, kernel auto-tuning, dynamic tensor memory

*New in TRT5
developer.nvidia.com/tensorrt
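Layer & tensor fusion can be illustrated by folding a BatchNorm into the preceding fully connected layer, so inference runs one op instead of two. A toy sketch of the idea (not TensorRT's actual code, which fuses at the kernel level):

```python
import math

# Toy layer-fusion example: fold BatchNorm parameters into a preceding
# linear layer. After folding, one matrix multiply reproduces the output
# of linear-then-batchnorm exactly.

def linear(x, w, b):
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    return [g * (yi - m) / math.sqrt(v + eps) + bt
            for yi, g, bt, m, v in zip(y, gamma, beta, mean, var)]

def fold_bn_into_linear(w, b, gamma, beta, mean, var, eps=1e-5):
    """Return (w', b') with linear(x, w', b') == batchnorm(linear(x, w, b))."""
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    w_f = [[s * wij for wij in row] for s, row in zip(scale, w)]
    b_f = [s * (bi - m) + bt
           for s, bi, m, bt in zip(scale, b, mean, beta)]
    return w_f, b_f
```

The same algebra applies per output channel of a convolution, which is why frozen BatchNorm layers are essentially free at inference time.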
ANNOUNCING NVIDIA TENSORRT INFERENCE SERVER
Containerized Microservice for Data Center Inference

- New inference serving engine for DNN models
- Runs on Kubernetes and Docker (NV Docker, NV DL SDK) on NVIDIA GPUs
- Serves multiple model types and frameworks concurrently
- Maximizes data center throughput and utilization
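The serving idea above can be sketched as one process hosting models from different frameworks and dispatching requests to them concurrently. The model names and handlers below are made up for illustration; the real TensorRT Inference Server loads engines from a model repository and schedules them onto GPUs:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy multi-model dispatcher (hypothetical models, illustrative only).
MODELS = {
    "resnet50": lambda x: f"class_{len(x) % 10}",   # stand-in for vision
    "deepspeech2": lambda x: x.upper(),             # stand-in for ASR
}

def infer(model_name, payload):
    """Route a request to the named model."""
    return MODELS[model_name](payload)

def serve(requests):
    """Handle (model, payload) requests concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda r: infer(*r), requests))
```

Sharing one serving process across heterogeneous models is what lets the GPU stay busy instead of being siloed per model, which is the utilization point the slide makes.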
WORLD’S MOST PERFORMANT INFERENCE PLATFORM
Up To 36X Faster Than CPUs | Accelerates All AI Workloads

[Charts: peak performance (TFLOPS/TOPS) and speedup vs. CPU server, Tesla P4 and Tesla T4]
- Peak performance: Tesla T4 65 TFLOPS (FP16) | 130 TOPS (INT8) | 260 TOPS (INT4); Tesla P4 5.5 TFLOPS (FP32) | 22 TOPS (INT8)
- Speech inference, DeepSpeech 2: Tesla T4 21X faster than CPU server
- Video inference, ResNet-50 (7 ms latency limit): Tesla T4 27X faster
- NLP inference, GNMT: Tesla T4 36X faster
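The peak numbers follow a simple pattern: each halving of operand width roughly doubles Tensor Core throughput. A quick arithmetic check of the chart's figures:

```python
# T4 peak throughput scales with precision: FP16 -> INT8 -> INT4
# halves the operand width and doubles the operations per second.
fp16_tflops = 65
int8_tops = fp16_tflops * 2
int4_tops = int8_tops * 2
```

This is why the quoted workload speedups depend on which precision a given model can tolerate after calibration.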
DRAMATIC SAVINGS FOR CUSTOMERS
Game-Changing Inference Performance

Inference workload (speech, NLP, and video):
- 200 CPU servers, 60 kW
- vs. 1 T4-accelerated server, 2 kW
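The consolidation claimed on the slide works out as follows:

```python
# Savings implied by the slide: one T4 server replaces 200 CPU servers
# for the same speech/NLP/video inference workload.
cpu_servers, cpu_kw = 200, 60
t4_servers, t4_kw = 1, 2

server_reduction = cpu_servers / t4_servers   # 200x fewer servers
power_reduction = cpu_kw / t4_kw              # 30x less power
```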
NVIDIA INFERENCE MOMENTUM
Customer use cases: image tagging, video analysis, finding music, sports performance, advertising impact, video captioning, cybersecurity, visual search, customer service, industrial inspection, voice recognition