
Advanced Optimization Lecture Notes - Chapter 9: Stochastic gradient descent


Stochastic Gradient Descent



Hoàng Nam Dũng



Last time: proximal gradient descent


Consider the problem

$$\min_x \; g(x) + h(x)$$

with $g, h$ convex, $g$ differentiable, and $h$ "simple" in so much as

$$\operatorname{prox}_t(x) = \operatorname*{argmin}_z \; \frac{1}{2t}\|x - z\|_2^2 + h(z)$$

is computable.


Proximal gradient descent: let $x^{(0)} \in \mathbb{R}^n$, repeat:

$$x^{(k)} = \operatorname{prox}_{t_k}\big(x^{(k-1)} - t_k \nabla g(x^{(k-1)})\big), \quad k = 1, 2, 3, \ldots$$


Step sizes $t_k$ chosen to be fixed and small, or via backtracking.

If $\nabla g$ is Lipschitz with constant $L$, then this has convergence rate $O(1/k)$, i.e., $f(x^{(k)}) - f^\star = O(1/k)$ for $f = g + h$ with a suitable fixed step size.
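As a concrete illustration (not from the original slides), here is a minimal Python sketch of proximal gradient descent for the lasso, where $h(x) = \lambda\|x\|_1$ and the prox is soft-thresholding; the data A, b, the regularization level, and the step size are made-up assumptions for the example.

import numpy as np

def soft_threshold(z, tau):
    # Prox of tau*||.||_1: elementwise soft-thresholding
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def proximal_gradient_lasso(A, b, lam, t, num_iters=500):
    # Minimize g(x) + h(x) with g(x) = 0.5*||Ax - b||^2 and h(x) = lam*||x||_1,
    # using a fixed step size t (t <= 1/L with L = ||A||_2^2 guarantees convergence)
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad_g = A.T @ (A @ x - b)                    # gradient of the smooth part
        x = soft_threshold(x - t * grad_g, t * lam)   # prox step
    return x

# Tiny synthetic example
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)
t = 1.0 / np.linalg.norm(A, 2) ** 2
x_hat = proximal_gradient_lasso(A, b, lam=0.1, t=t)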



Outline


Today:


- Stochastic gradient descent
- Convergence rates



Stochastic gradient descent


Consider minimizing an average of functions

$$\min_x \; \frac{1}{m}\sum_{i=1}^m f_i(x).$$

As $\nabla \sum_{i=1}^m f_i(x) = \sum_{i=1}^m \nabla f_i(x)$, gradient descent would repeat



$$x^{(k)} = x^{(k-1)} - t_k \cdot \frac{1}{m}\sum_{i=1}^m \nabla f_i(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$

In comparison, stochastic gradient descent, or SGD (or incremental gradient descent), repeats:

$$x^{(k)} = x^{(k-1)} - t_k \cdot \nabla f_{i_k}(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$

where $i_k \in \{1, \ldots, m\}$ is some index chosen at iteration $k$.
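For intuition, here is a small Python sketch (mine, not from the slides) of the two update rules, written for a generic list of component gradients; the quadratic components at the bottom are invented purely to make it runnable.

import numpy as np

def gd_step(x, grads, t):
    # Full (batch) gradient step: average all m component gradients
    return x - t * np.mean([g(x) for g in grads], axis=0)

def sgd_step(x, grads, t, i):
    # Stochastic gradient step: use only the single component i
    return x - t * grads[i](x)

# Illustrative components f_i(x) = 0.5*||x - a_i||^2, so grad f_i(x) = x - a_i
rng = np.random.default_rng(0)
anchors = rng.standard_normal((5, 3))
grads = [lambda x, a=a: x - a for a in anchors]

x = np.zeros(3)
x = gd_step(x, grads, t=0.1)                        # one batch update
x = sgd_step(x, grads, t=0.1, i=rng.integers(5))    # one stochastic update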



Stochastic gradient descent


Two rules for choosing index $i_k$ at iteration $k$:

- Randomized rule: choose $i_k \in \{1, \ldots, m\}$ uniformly at random.
- Cyclic rule: choose $i_k = 1, 2, \ldots, m, 1, 2, \ldots, m, \ldots$


The randomized rule is more common in practice. For the randomized rule, note that

$$\mathbb{E}[\nabla f_{i_k}(x)] = \frac{1}{m}\sum_{i=1}^m \nabla f_i(x) = \nabla f(x),$$

so we can view SGD as using an unbiased estimate of the gradient at each step.
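A quick numerical sanity check of this unbiasedness (a throwaway sketch with invented quadratic components, not part of the lecture): averaging many uniformly sampled component gradients recovers the full gradient.

import numpy as np

rng = np.random.default_rng(1)
m, p = 8, 3
anchors = rng.standard_normal((m, p))
grad_i = lambda x, i: x - anchors[i]             # gradient of f_i(x) = 0.5*||x - a_i||^2
full_grad = lambda x: x - anchors.mean(axis=0)   # gradient of f(x) = (1/m) * sum_i f_i(x)

x = rng.standard_normal(p)
samples = [grad_i(x, rng.integers(m)) for _ in range(100_000)]   # randomized rule
print(np.allclose(np.mean(samples, axis=0), full_grad(x), atol=1e-2))   # typically True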


Main appeal of SGD: the cost of one iteration is independent of $m$, since only a single component gradient $\nabla f_{i_k}$ is evaluated per update.



Example: stochastic logistic regression


Given $(x_i, y_i) \in \mathbb{R}^p \times \{0, 1\}$, $i = 1, \ldots, n$, recall logistic regression

$$\min_\beta \; f(\beta) = \frac{1}{n}\sum_{i=1}^n \underbrace{\Big(-y_i x_i^T \beta + \log\big(1 + \exp(x_i^T \beta)\big)\Big)}_{f_i(\beta)}.$$


Gradient computation $\nabla f(\beta) = \frac{1}{n}\sum_{i=1}^n \big(p_i(\beta) - y_i\big)x_i$, with $p_i(\beta) = \exp(x_i^T\beta)/\big(1 + \exp(x_i^T\beta)\big)$, is doable when $n$ is moderate, but not when $n$ is huge.


Full gradient (also called batch) versus stochastic gradient:

- One batch update costs $O(np)$.
- One stochastic update costs $O(p)$.
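A minimal Python sketch of the two kinds of update for this objective (the data, step sizes, and iteration count below are synthetic assumptions, not from the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_update(beta, X, y, t):
    # One full-gradient step: touches all n rows, cost O(np)
    grad = X.T @ (sigmoid(X @ beta) - y) / len(y)
    return beta - t * grad

def stochastic_update(beta, X, y, t, i):
    # One SGD step: touches a single row i, cost O(p)
    return beta - t * (sigmoid(X[i] @ beta) - y[i]) * X[i]

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
n, p = 1000, 10
X = rng.standard_normal((n, p))
y = (rng.random(n) < sigmoid(X @ rng.standard_normal(p))).astype(float)

beta = np.zeros(p)
for k in range(1, 5001):
    beta = stochastic_update(beta, X, y, t=1.0 / k, i=rng.integers(n))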



Batch vs. stochastic gradient descent


Small example with $n = 10$, $p = 2$ to show the "classic picture" for batch versus stochastic methods:




[Figure: iterate paths of batch versus stochastic gradient descent on the small example, plotted on axes from -20 to 20; legend: Batch, Random.]


Blue: batch steps, $O(np)$
Red: stochastic steps, $O(p)$

Rule of thumb for stochastic methods:

- generally thrive far from optimum
- generally struggle close to optimum
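The picture can be reproduced qualitatively with a few lines of Python (my own sketch; the least-squares data below are synthetic, not the data behind the original plot):

import numpy as np

# Least squares with n = 10, p = 2: f_i(beta) = 0.5*(y_i - x_i^T beta)^2
rng = np.random.default_rng(0)
n, p = 10, 2
X = rng.standard_normal((n, p))
y = X @ np.array([10.0, -5.0]) + rng.standard_normal(n)
grad_i = lambda beta, i: (X[i] @ beta - y[i]) * X[i]

beta_batch = beta_sgd = np.array([-15.0, -15.0])
path_batch, path_sgd = [beta_batch], [beta_sgd]
for k in range(100):
    beta_batch = beta_batch - 0.05 * np.mean([grad_i(beta_batch, i) for i in range(n)], axis=0)
    beta_sgd = beta_sgd - 0.05 * grad_i(beta_sgd, rng.integers(n))
    path_batch.append(beta_batch)
    path_sgd.append(beta_sgd)
# Plotting the two paths shows both making fast progress early on,
# while the stochastic path keeps jittering once it gets near the optimum.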





Step sizes


Standard in SGD is to use diminishing step sizes, e.g., $t_k = 1/k$ for $k = 1, 2, 3, \ldots$


Why not fixed step sizes? Here’s some intuition.


Suppose we take the cyclic rule for simplicity. Setting $t_k = t$ for $m$ updates in a row, we get

$$x^{(k+m)} = x^{(k)} - t\sum_{i=1}^m \nabla f_i(x^{(k+i-1)}).$$

Meanwhile, full gradient with step size $t$ would give

$$x^{(k+1)} = x^{(k)} - t\sum_{i=1}^m \nabla f_i(x^{(k)}).$$
The difference here is $t\sum_{i=1}^m \big[\nabla f_i(x^{(k+i-1)}) - \nabla f_i(x^{(k)})\big]$, and with a fixed $t$ this discrepancy does not generally vanish, which is why diminishing step sizes are needed.
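A toy illustration of this effect (my own example, not from the slides): with the one-dimensional quadratic components below, cyclic updates with a fixed step size settle at a point away from the minimizer of the average, while $t_k = 1/k$ homes in on it.

import numpy as np

# Components f_i(x) = 0.5*(x - a_i)^2; the average is minimized at x_star = mean(a)
a = np.array([-2.0, -1.0, 0.0, 1.0, 7.0])
m, x_star = len(a), a.mean()

def run_cyclic(step_fn, num_passes=200):
    x, k = 0.0, 0
    for _ in range(num_passes):
        for i in range(m):                 # cyclic rule: i = 1, 2, ..., m, 1, 2, ...
            k += 1
            x -= step_fn(k) * (x - a[i])   # grad f_i(x) = x - a_i
    return x

x_fixed = run_cyclic(lambda k: 0.1)        # fixed step size
x_dimin = run_cyclic(lambda k: 1.0 / k)    # diminishing step size t_k = 1/k
print(abs(x_fixed - x_star), abs(x_dimin - x_star))   # fixed: ~0.4 off, diminishing: ~0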



Convergence rates


Recall: for convex $f$, (sub)gradient descent with diminishing step sizes satisfies

$$f(x^{(k)}) - f^\star = O(1/\sqrt{k}).$$

When $f$ is differentiable with Lipschitz gradient, gradient descent with suitable fixed step sizes satisfies

$$f(x^{(k)}) - f^\star = O(1/k).$$

What about SGD? For convex $f$, SGD with diminishing step sizes satisfies¹

$$\mathbb{E}[f(x^{(k)})] - f^\star = O(1/\sqrt{k}).$$

Unfortunately this does not improve when we further assume $f$ has Lipschitz gradient.


¹ E.g., Nemirovski et al. (2009), "Robust stochastic approximation approach to stochastic programming".



Convergence rates


Even worse is the following discrepancy!


When $f$ is strongly convex and has a Lipschitz gradient, gradient descent satisfies

$$f(x^{(k)}) - f^\star = O(c^k)$$

where $c < 1$. But under the same conditions, SGD gives us²

$$\mathbb{E}[f(x^{(k)})] - f^\star = O(1/k).$$


So stochastic methods do not enjoy the linear convergence rate of gradient descent under strong convexity.
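A small experiment (my own, with synthetic one-dimensional data) that makes the discrepancy visible: on a strongly convex average of quadratics, gradient descent reaches machine precision quickly, while SGD's suboptimality decays only like $1/k$.

import numpy as np

# f(x) = (1/m) * sum_i 0.5*(x - a_i)^2, strongly convex, minimized at mean(a)
rng = np.random.default_rng(0)
m = 50
a = rng.standard_normal(m)
x_star = a.mean()
f = lambda x: 0.5 * np.mean((x - a) ** 2)

x_gd = x_sgd = 5.0
for k in range(1, 2001):
    x_gd -= 0.5 * (x_gd - x_star)                        # gradient step, fixed t; grad f(x) = x - mean(a)
    x_sgd -= (1.0 / k) * (x_sgd - a[rng.integers(m)])    # SGD step, diminishing t_k = 1/k

print(f(x_gd) - f(x_star), f(x_sgd) - f(x_star))  # first gap ~ 0, second gap roughly of order 1/k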



What can we do to improve SGD?




