
NVIDIA AI INFERENCE PLATFORM


AI INFERENCE IS EXPLODING
Creating a $20 Billion Opportunity in Next 5 Years

Live Video: 1 billion videos watched per day (Facebook)
Speech: 1 billion voice searches per day (Google, Bing, etc.)
Recommendations: 1 trillion ads/rankings (impressions) per day


GPU INFERENCE ADOPTION IS ACCELERATING
Tesla P4, TensorRT: Rapid Adoption

Inference use cases: image maps, NLP, search, speech

Visual Search: 60X latency improvement for real-time search
Video Analysis: 12X faster inference for live video analysis
Advertising: 40X higher performance for real-time brand impact



A CAMBRIAN EXPLOSION OF DL MODELS

Convolutional Networks: Encoder/Decoder, ReLU, Concat, Dropout, BatchNorm, Pooling
Recurrent Networks: LSTM, GRU, Beam Search, WaveNet, CTC, Attention
Generative Adversarial Networks: 3D-GAN, Coupled GAN, MedGAN, Conditional GAN, Speech Enhancement GAN
Reinforcement Learning: DQN, Simulation, DDPG
New Species: Capsule Nets, Mixture of Experts, Neural Collaborative Filtering, Block Sparse LSTM


NEURAL NETWORK COMPLEXITY IS EXPLODING
Bigger and More Compute Intensive

[Charts: model size growth, 2011–2016. Speech: DeepSpeech, DeepSpeech 2, DeepSpeech 3 (up to 1.9 GB). Images (classification, segmentation, enhancement): GoogleNet, ResNet50, MaskRCNN, U-Net (up to 480 MB).]


INEFFICIENCY LIMITS INNOVATION
Difficulties with Deploying Data Center Inference

Single Model Only: some systems (ASR, NLP, recommender) are overused while others are underutilized
Single Framework Only: solutions can only support models from one framework
Custom Development: developers need to reinvent the plumbing for every application


ANNOUNCING
TENSORRT HYPERSCALE INFERENCE PLATFORM

World's Most Advanced Inference GPU
Integrated into Frameworks & ONNX Support
TensorRT Inference Server


ANNOUNCING TESLA T4
World's Most Advanced Inference GPU
Universal Inference Acceleration

320 Turing Tensor Cores
2,560 CUDA cores
65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS
16 GB | 320 GB/s


NEW TURING TENSOR CORE
MULTI-PRECISION FOR AI INFERENCE

65 TFLOPS FP16 | 130 TeraOPS INT8 | 260 TeraOPS INT4
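The three peak figures are not independent: Tensor Core throughput doubles each time the operand width is halved from 16 bits. A quick sketch of that relationship (`peak_tops` is an illustrative helper, not an NVIDIA API):

```python
# Tesla T4 peak FP16 rate, taken from the slide above.
FP16_TFLOPS = 65.0

def peak_tops(bits: int, fp16_tflops: float = FP16_TFLOPS) -> float:
    """Peak tera-ops/s at a given operand width, assuming throughput
    doubles each time the width halves from 16 bits."""
    return fp16_tflops * (16.0 / bits)

print(peak_tops(16))  # FP16 TFLOPS
print(peak_tops(8))   # INT8 TOPS
print(peak_tops(4))   # INT4 TOPS
```

Narrower operands trade numeric range for throughput, which is why INT8 and INT4 are paired with calibration rather than used blindly.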


ANNOUNCING
NVIDIA TensorRT 5
Fastest Deep Learning Inference Platform

Data Center, Embedded & Automotive
In-framework support for TensorFlow
Support for all other frameworks and ONNX
Containerized Inference Serving Engine
Docker and Kubernetes integration
New Layers and APIs
New OS Support for Windows and CentOS

TensorRT Optimizer & Runtime: Layer & Tensor Fusion, Precision Calibration, Kernel Auto-Tuning, Dynamic Tensor Memory
Frameworks → TensorRT → GPU Platforms: Tesla V100, Tesla P4, Jetson TX2, DRIVE PX 2, NVIDIA DLA

*New in TRT5

developer.nvidia.com/tensorrt
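Precision calibration maps FP32 activations onto the narrow INT8 range using a per-tensor scale. The sketch below shows the simplest max-abs variant in plain Python; TensorRT itself uses an entropy-based calibrator, and `calibrate_scale`/`quantize_int8` are illustrative names, not TensorRT APIs:

```python
def calibrate_scale(activations):
    """Max-abs calibration: choose the scale so the largest observed
    magnitude maps to 127, the top of the symmetric INT8 range."""
    return max(abs(a) for a in activations) / 127.0

def quantize_int8(x, scale):
    """Symmetric quantization: real value ~= scale * int8 value."""
    q = round(x / scale)
    return max(-127, min(127, q))

# Pretend these are activations observed while running calibration data.
acts = [-2.0, -0.5, 0.25, 1.0, 3.175]
scale = calibrate_scale(acts)
quantized = [quantize_int8(a, scale) for a in acts]
dequantized = [q * scale for q in quantized]  # reconstruct to check error
```

The dequantized values land within half a quantization step of the originals, which is the error the calibrator is trying to minimize over representative input data.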


ANNOUNCING NVIDIA TENSORRT HYPERSCALE
Containerized Microservice for Data Center Inference

New Inference Serving Engine
Kubernetes and Docker on NVIDIA GPUs
Multiple Model Types and Frameworks Concurrently
Maximize Data Center Throughput and Utilization

Stack: DNN Models → TensorRT Inference Server (NV DL SDK, NV Docker) → Kubernetes


WORLD’S MOST PERFORMANT INFERENCE PLATFORM
Up To 36X Faster Than CPUs | Accelerates All AI Workloads

[Charts: Tesla T4 vs. CPU server and Tesla P4.
Peak Performance (TFLOPS/TOPS): 65 Float | 130 INT8 | 260 INT4.
Speech Inference: 21X faster than CPU server (DeepSpeech 2).
Video Inference: 27X faster than CPU server (ResNet-50, 7 ms latency limit).
Natural Language Processing Inference: 36X faster than CPU server (GNMT).]


DRAMATIC SAVINGS FOR CUSTOMERS
Game-Changing Inference Performance

Inference workload: Speech, NLP and Video
200 CPU Servers, 60 kW → 1 T4-Accelerated Server, 2 kW
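The savings claim is easy to sanity-check. A worked version of the arithmetic, with server counts and power figures taken from the slide (the per-server wattage is derived, not stated):

```python
# Same speech/NLP/video workload on the two setups from the slide.
cpu_servers, cpu_total_kw = 200, 60.0
t4_servers, t4_total_kw = 1, 2.0

power_reduction = cpu_total_kw / t4_total_kw              # x-fold less power
server_reduction = cpu_servers / t4_servers               # x-fold fewer servers
watts_per_cpu_server = cpu_total_kw * 1000 / cpu_servers  # implied W per CPU box

print(f"{power_reduction:.0f}x less power, {server_reduction:.0f}x fewer servers")
```

So the slide's figures imply a 30x power reduction and a 200x reduction in server count, assuming the single T4 server sustains the same throughput.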


NVIDIA INFERENCE MOMENTUM

Image Tagging, Video Analysis, Finding Music, Sports Performance, Advertising Impact, Video Captioning, Cybersecurity, Visual Search, Customer Service, Visual Search, Industrial Inspection, Voice Recognition


