NEURAL TURING MACHINE
Nguyen Hung Son
Based on the PhD thesis of
Dr Karol Kurach
Warsaw University/Google
Agenda
1. Introduction to Deep Neural Architectures
2. Neural Random-Access Machines
3. Hierarchical Attentive Memory
4. Applications: Smart Reply
A primer on
Deep Learning
How intelligent are Neural Networks?
Two main criticisms

CRITICISM 1: Neural networks with fixed-size inputs are seemingly unable to solve problems with variable-size inputs.
SOLUTION: Recurrent Neural Networks (RNN), e.g. for
- translating a sentence, or
- recognizing handwritten text

CRITICISM 2: Neural networks seem unable to bind values to specific locations in data structures. This ability of writing to and reading from memory is critical in the two information-processing systems we have available to study: brains and computers.
SOLUTION: Neural Turing Machine (NTM):
- giving a neural network an external memory and
- the capacity to learn how to use it
Deep Learning

Big Data + Big Deep Model = Success Guaranteed

State of the art in:
● computer vision,
● speech recognition,
● machine translation, …

– New techniques (e.g., initialization, pretraining)
– Computing power (GPU, FPGA, TPU, …)
– Big datasets
Recurrent Neural Networks
➢ Neural networks with cycles
➢ Process inputs of variable length
➢ Preserve state between timesteps
Recurrent Neural Networks

Figure 2.3: A Recurrent Neural Network is a very deep feedforward neural network that has a layer for each timestep. Its weights are shared across time.

The Recurrent Neural Network (RNN) is a dynamical system that maps sequences to sequences. It is parameterized by three weight matrices and three bias vectors [W_hv, W_hh, W_oh, b_h, b_o, h_0], whose concatenation θ completely describes the RNN (fig. 2.3). Given an input sequence (v_1, ..., v_T) (which we denote by v_1^T), the RNN computes a sequence of hidden states h_1^T and a sequence of outputs z_1^T.
Recurrent Neural Networks

Figure 2.3: A Recurrent Neural Network is a very deep feedforward neural network that has a layer for each timestep. Its weights are shared across time.

Given an input sequence v_1^T, the RNN computes the sequence of hidden states h_1^T and the sequence of outputs z_1^T by the following algorithm:

1: for t from 1 to T do
2:   u_t ← W_hv v_t + W_hh h_{t-1} + b_h
3:   h_t ← e(u_t)
4:   o_t ← W_oh h_t + b_o
5:   z_t ← g(o_t)
6: end for
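A minimal NumPy sketch of this forward pass (variable names follow the slide; tanh and the identity stand in for the nonlinearities e(·) and g(·), which the algorithm leaves unspecified):

```python
import numpy as np

def rnn_forward(v_seq, W_hv, W_hh, W_oh, b_h, b_o, h0):
    """Vanilla RNN forward pass following the algorithm above.

    v_seq: sequence of input vectors v_1..v_T.
    Returns the hidden states h_1..h_T and outputs z_1..z_T.
    """
    h = h0
    hs, zs = [], []
    for v_t in v_seq:
        u_t = W_hv @ v_t + W_hh @ h + b_h   # line 2
        h = np.tanh(u_t)                    # line 3: h_t = e(u_t), tanh assumed
        o_t = W_oh @ h + b_o                # line 4
        z_t = o_t                           # line 5: z_t = g(o_t), identity assumed
        hs.append(h)
        zs.append(z_t)
    return hs, zs
```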
Vanilla RNN
➢ Basic version of RNN
➢ State: vector h
Learning RNN: BPTT

L(z, y) = Σ_{t=1}^{T} L(z_t; y_t)

The derivatives of the RNN are easily computed with the backpropagation-through-time algorithm (BPTT; Werbos, 1990; Rumelhart et al., 1986):

1: for t from T downto 1 do
2:   do_t ← g'(o_t) · dL(z_t; y_t)/dz_t
3:   db_o ← db_o + do_t
4:   dW_oh ← dW_oh + do_t h_t^T
5:   dh_t ← dh_t + W_oh^T do_t
6:   dz_t ← e'(z_t) · dh_t
7:   dW_hv ← dW_hv + dz_t v_t^T
8:   db_h ← db_h + dz_t
9:   dW_hh ← dW_hh + dz_t h_{t-1}^T
10:  dh_{t-1} ← W_hh^T dz_t
11: end for
12: Return dθ = [dW_hv, dW_hh, dW_oh, db_h, db_o, dh_0].
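A minimal NumPy sketch of this backward pass, under the same assumptions as the forward pass above (tanh for e(·), identity for g(·)) and a squared loss L(z_t; y_t) = ½‖z_t − y_t‖², so that dL/dz_t = z_t − y_t:

```python
import numpy as np

def rnn_bptt(v_seq, y_seq, hs, zs, W_hv, W_hh, W_oh, h0):
    """Backpropagation through time for the vanilla RNN forward pass above."""
    dW_hv, dW_hh, dW_oh = (np.zeros_like(W) for W in (W_hv, W_hh, W_oh))
    db_h = np.zeros(W_hh.shape[0])
    db_o = np.zeros(W_oh.shape[0])
    dh_next = np.zeros(W_hh.shape[0])        # gradient flowing into h_t from step t+1

    for t in reversed(range(len(v_seq))):
        do_t = zs[t] - y_seq[t]              # g'(o_t) * dL/dz_t, with g = identity
        db_o += do_t
        dW_oh += np.outer(do_t, hs[t])
        dh_t = dh_next + W_oh.T @ do_t
        du_t = (1.0 - hs[t] ** 2) * dh_t     # tanh derivative: e'(u_t) = 1 - h_t^2
        dW_hv += np.outer(du_t, v_seq[t])
        db_h += du_t
        h_prev = hs[t - 1] if t > 0 else h0
        dW_hh += np.outer(du_t, h_prev)
        dh_next = W_hh.T @ du_t              # becomes dh_{t-1}
    return dW_hv, dW_hh, dW_oh, db_h, db_o, dh_next  # dh_next is now dh_0
```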
Vanilla RNN - problems
➢ Exploding gradient (idea: use gradient clipping; see the sketch below)
➢ Vanishing gradient (idea: use ReLU and/or LSTM)
These problems make it difficult to optimize RNNs on sequences with long-range temporal dependencies, and are possible causes for the abandonment of RNNs by machine learning researchers.
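A common form of gradient clipping rescales the whole gradient so that its global norm never exceeds a threshold; a minimal sketch (the threshold value is illustrative):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads
```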
LSTM: Long Short-Term Memory
(Hochreiter and Schmidhuber, 1997)

➢ Better at learning long-range dependencies
➢ Avoids the vanishing gradient problem
➢ State: a pair of vectors (c, h)

LSTM Cell

LSTM variants slightly differ in connectivity structure and activation functions; below we describe the definitions of the input, output and forget gates that we used. Let h_t ∈ R^n be a hidden state, c_t ∈ R^n be a vector of the memory cells of the network, and let x_t be the input at time step t. Let W_i, W_f, W_u, W_o be matrices and b_i, b_f, b_u, b_o the respective bias terms. We define LSTM as a transformation that takes 3 inputs (h_{t-1}, c_{t-1}, x_t) and produces 2 outputs (h_t and c_t). In all equations below, ⊙ is element-wise multiplication and [h_{t-1}, x_t] is an operation that aggregates h_{t-1} and x_t (we used a plain sum, but concatenation of vectors is also commonly used).

The forget gate, which decides how much of the information should be removed from the cell, is defined as:

f_t = sigm(W_f · [h_{t-1}, x_t] + b_f)

The input modulation gate value i_t and the cell update u_t are defined as:

i_t = sigm(W_i · [h_{t-1}, x_t] + b_i)
u_t = tanh(W_u · [h_{t-1}, x_t] + b_u)

Intuitively, the input modulation gate decides how much of u_t should be added to the memory at step t. For example, if x_t can be ignored, i_t will be close to 0. Knowing the values above, the new cell value is computed as:

c_t = f_t ⊙ c_{t-1} + i_t ⊙ u_t

The last step is to compute h_t, the output passed to the next LSTM time step. It is controlled by the output gate o_t:

o_t = sigm(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(c_t)

LSTM networks have been successfully applied to many real-world problems, including language modeling [18], handwriting [7] and speech [6] recognition, and machine translation.

[Figure: overview of an example architecture. A core component is an LSTM unrolled for N = 24 time steps that processes M per-hour measurements. After processing N time steps, the last hidden state h_N ∈ R^50 of the LSTM encodes information about all per-hour measurements; it is then concatenated with a vector s ∈ R^12 of per-record characteristics and a vector e ∈ R^10 representing the working site id embedding.]
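A minimal NumPy sketch of one LSTM cell step following these equations, using concatenation to aggregate h_{t-1} and x_t (one of the two aggregation options mentioned above):

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, x_t, W_f, W_i, W_u, W_o, b_f, b_i, b_u, b_o):
    """One LSTM time step: (h_{t-1}, c_{t-1}, x_t) -> (h_t, c_t).

    Each W_* has shape (n_hidden, n_hidden + n_input); [h, x] is concatenation here.
    """
    hx = np.concatenate([h_prev, x_t])
    f_t = sigm(W_f @ hx + b_f)        # forget gate
    i_t = sigm(W_i @ hx + b_i)        # input modulation gate
    u_t = np.tanh(W_u @ hx + b_u)     # cell update
    c_t = f_t * c_prev + i_t * u_t    # new memory cell
    o_t = sigm(W_o @ hx + b_o)        # output gate
    h_t = o_t * np.tanh(c_t)          # new hidden state
    return h_t, c_t
```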
Sequence-to-Sequence
& Attention
Sequence-to-sequence model
Sutskever et al, NIPS 2014
Sequence-to-sequence model
encoder
decoder
Sequence-to-sequence model
Ingests incoming message
Generates reply message
Reading a sequence of words into
an RNN
How
Reading a sequence of words into
an RNN
How
are
Reading a sequence of words into
an RNN
How
are
you
Reading a sequence of words into
an RNN
How
are
you
?
Encoder ingests the incoming
message
Internal state is a fixed-length encoding of the message
How
are
you
?
Decoder is initialized with final
state of encoder
How
are
you
?
__
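A minimal sketch of the encode-then-decode loop shown on these slides, written with a plain RNN cell and greedy decoding for brevity (the embedding matrix, parameter layout, and stopping rule are illustrative assumptions, not the actual Smart Reply implementation):

```python
import numpy as np

def rnn_cell(h, x, W_h, W_x, b):
    """One vanilla RNN step; in practice an LSTM cell would be used."""
    return np.tanh(W_h @ h + W_x @ x + b)

def seq2seq_reply(src_ids, embed, enc, dec, W_out, b_out, bos_id, eos_id, max_len=20):
    """Encode a token-id sequence, then greedily decode a reply.

    embed: (vocab_size, d) embedding matrix shared by encoder and decoder (an assumption).
    enc / dec: dicts holding the cell parameters "W_h", "W_x", "b".
    """
    h = np.zeros(enc["W_h"].shape[0])
    for tok in src_ids:                           # encoder ingests the incoming message
        h = rnn_cell(h, embed[tok], enc["W_h"], enc["W_x"], enc["b"])

    reply, tok = [], bos_id                       # decoder starts from the encoder's final state
    for _ in range(max_len):
        h = rnn_cell(h, embed[tok], dec["W_h"], dec["W_x"], dec["b"])
        tok = int(np.argmax(W_out @ h + b_out))   # greedy choice of the next word
        if tok == eos_id:
            break
        reply.append(tok)
    return reply
```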