
Lecture notes on Advanced Optimization: Chapter 8 - Hoàng Nam Dũng


Proximal Gradient Descent (and Acceleration)

Hoàng Nam Dũng
Faculty of Mathematics, Mechanics and Informatics, VNU University of Science, Vietnam National University, Hanoi


Last time: subgradient method
Consider the problem

    min_x f(x)

with f convex and dom(f) = R^n.

Subgradient method: choose an initial x^(0) ∈ R^n, and repeat:

    x^(k) = x^(k−1) − t_k · g^(k−1),    k = 1, 2, 3, . . .

where g^(k−1) ∈ ∂f(x^(k−1)). We use pre-set rules for the step sizes (e.g., the diminishing step size rule).

If f is Lipschitz, then the subgradient method has convergence rate O(1/ε²).

Upside: very generic. Downside: can be slow; this is addressed today.


Outline

Today:
- Proximal gradient descent
- Convergence analysis
- ISTA, matrix completion
- Special cases
- Acceleration


Decomposable functions
Suppose

    f(x) = g(x) + h(x)

where
- g is convex, differentiable, dom(g) = R^n,
- h is convex, not necessarily differentiable.

If f were differentiable, then the gradient descent update would be

    x+ = x − t · ∇f(x).

Recall the motivation: minimize the quadratic approximation f̃_t to f around x, replacing ∇²f(x) by (1/t)I:

    x+ = argmin_z f̃_t(z) = argmin_z f(x) + ∇f(x)^T(z − x) + (1/2t)||z − x||_2^2.


Decomposable functions
In our case f is not differentiable, but f = g + h with g differentiable. Why don't we make a quadratic approximation to g and leave h alone? I.e., update

    x+ = argmin_z g̃_t(z) + h(z)
       = argmin_z g(x) + ∇g(x)^T(z − x) + (1/2t)||z − x||_2^2 + h(z)
       = argmin_z (1/2t)||z − (x − t∇g(x))||_2^2 + h(z).

The first term keeps z close to the gradient update for g; the term h(z) also makes h small.



Proximal mapping
The proximal mapping (or prox-operator) of a convex function h is defined as

    prox_h(x) = argmin_z (1/2)||x − z||_2^2 + h(z).

Examples:
- h(x) = 0: prox_h(x) = x.
- h(x) is the indicator function of a closed convex set C: prox_h is the projection onto C,

      prox_h(x) = argmin_{z ∈ C} (1/2)||x − z||_2^2 = P_C(x).

- h(x) = ||x||_1: prox_h is the 'soft-threshold' (shrinkage) operation,

      prox_h(x)_i = x_i − 1 if x_i ≥ 1;  0 if |x_i| ≤ 1;  x_i + 1 if x_i ≤ −1.
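The soft-thresholding example above is one line of NumPy; a minimal sketch (the function name prox_l1 is ours, not from the slides):

```python
import numpy as np

def prox_l1(x, lam=1.0):
    """Proximal mapping of h(x) = lam * ||x||_1 (soft-thresholding).

    Each coordinate is shrunk toward zero by lam; coordinates with
    |x_i| <= lam are set exactly to zero.
    """
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([2.5, -0.4, 1.0, -3.0])
print(prox_l1(x, lam=1.0))
```

With lam = 1, the coordinates 2.5 and −3.0 shrink to 1.5 and −2.0, while the two coordinates inside [−1, 1] are zeroed out.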



Proximal mapping
Theorem
If h is convex and closed (has closed epigraph), then

    prox_h(x) = argmin_z (1/2)||x − z||_2^2 + h(z)

exists and is unique for all x.

Proof.
See .../proxop.pdf. Uniqueness follows since the objective function is strictly convex.

Optimality condition:

    z = prox_h(x) ⇔ x − z ∈ ∂h(z)
                  ⇔ h(u) ≥ h(z) + (x − z)^T(u − z) for all u.




Properties of proximal mapping
Theorem
Proximal mappings are firmly nonexpansive (co-coercive with constant 1):

    (prox_h(x) − prox_h(y))^T(x − y) ≥ ||prox_h(x) − prox_h(y)||_2^2.

Proof.
With u = prox_h(x) and v = prox_h(y) we have x − u ∈ ∂h(u) and y − v ∈ ∂h(v). From the monotonicity of the subdifferential we get

    ((x − u) − (y − v))^T(u − v) ≥ 0,

and rearranging gives the claimed inequality.

From firm nonexpansiveness and the Cauchy-Schwarz inequality we get nonexpansiveness (Lipschitz continuity with constant 1):

    ||prox_h(x) − prox_h(y)||_2 ≤ ||x − y||_2.




Proximal gradient descent

Proximal gradient descent: choose an initial x^(0), repeat:

    x^(k) = prox_{t_k h}( x^(k−1) − t_k · ∇g(x^(k−1)) ),    k = 1, 2, 3, . . .

To make this update step look familiar, we can rewrite it as

    x^(k) = x^(k−1) − t_k · G_{t_k}(x^(k−1)),

where G_t is the generalized gradient of f,

    G_t(x) = ( x − prox_{th}(x − t∇g(x)) ) / t.

For h = 0 it reduces to gradient descent.
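The update above fits in a short generic loop; a minimal sketch in Python/NumPy, where grad_g and prox_h are caller-supplied callables (all names are ours, not from the slides):

```python
import numpy as np

def proximal_gradient(x0, grad_g, prox_h, t, n_iters=100):
    """Proximal gradient descent with fixed step size t.

    Repeats  x^(k) = prox_{t h}( x^(k-1) - t * grad g(x^(k-1)) ).
    prox_h(v, t) should return the prox of t*h at the point v.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = prox_h(x - t * grad_g(x), t)
    return x

# Toy check: g(x) = 0.5 ||x - c||_2^2 and h = 0, so prox_h is the
# identity and the iterates contract toward c.
c = np.array([1.0, -2.0])
x = proximal_gradient(np.zeros(2),
                      grad_g=lambda x: x - c,
                      prox_h=lambda v, t: v,  # h = 0
                      t=0.5, n_iters=50)
```

With h = 0 each step is x ← x − t(x − c), a plain gradient step, so after 50 iterations x is numerically equal to c.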


Examples

    minimize g(x) + h(x)

Gradient method: special case with h(x) = 0,

    x+ = x − t∇g(x).

Gradient projection method: special case with h(x) = δ_C(x) (indicator of C),

    x+ = P_C(x − t∇g(x)).

[Figure: the gradient step x − t∇g(x) leaves the set C; the proximal gradient method projects it back to x+ ∈ C.]
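For the gradient projection special case, prox_h is just P_C. When C is a coordinate box, the projection is a componentwise clip; a small illustrative sketch (function names are ours):

```python
import numpy as np

def project_box(x, lo, hi):
    """P_C for the box C = {x : lo <= x <= hi}; this is the prox of
    the indicator function delta_C."""
    return np.clip(x, lo, hi)

def projected_gradient_step(x, grad_g, t, lo, hi):
    """One gradient projection update: x+ = P_C(x - t * grad g(x))."""
    return project_box(x - t * grad_g(x), lo, hi)

# Minimize g(x) = 0.5 ||x||_2^2 over the box [1, 2]^2.
# The unconstrained minimizer 0 lies outside C, so the constrained
# solution is the nearest box corner (1, 1).
x = np.array([5.0, 5.0])
for _ in range(100):
    x = projected_gradient_step(x, lambda z: z, t=0.5, lo=1.0, hi=2.0)
```

Each iterate is halved by the gradient step and then clipped back into the box, so the loop settles at (1, 1).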


What good did this do?
You have a right to be suspicious... it may look like we just swapped one minimization problem for another.

The key point is that prox_h(·) can be computed analytically for a lot of important functions h.¹

Note:
- The mapping prox_h(·) doesn't depend on g at all, only on h.
- The smooth part g can be complicated; we only need to compute its gradients.

Convergence analysis will be in terms of the number of iterations of the algorithm. Each iteration evaluates prox_h(·) once, and this can be cheap or expensive depending on h.

¹ see .../


Example: ISTA (Iterative Shrinkage-Thresholding Algorithm)
Given y ∈ R^n, X ∈ R^{n×p}, recall the lasso criterion

    f(β) = (1/2)||y − Xβ||_2^2 + λ||β||_1,

with g(β) = (1/2)||y − Xβ||_2^2 and h(β) = λ||β||_1.

The proximal mapping is now

    prox_{th}(β) = argmin_z (1/2t)||β − z||_2^2 + λ||z||_1 = S_{λt}(β),

where S_λ(β) is the soft-thresholding operator

    [S_λ(β)]_i = β_i − λ if β_i > λ;  0 if −λ ≤ β_i ≤ λ;  β_i + λ if β_i < −λ,    i = 1, . . . , p.


Example: ISTA (Iterative Shrinkage-Thresholding Algorithm)
Recall ∇g(β) = −X^T(y − Xβ), hence the proximal gradient update is:

    β+ = S_{λt}( β + tX^T(y − Xβ) ).

Often called the iterative soft-thresholding algorithm (ISTA).² Very simple algorithm.

[Figure: f − f* versus iteration k, comparing convergence rates of the subgradient method and proximal gradient (ISTA).]

² Beck and Teboulle (2008), "A fast iterative shrinkage-thresholding algorithm for linear inverse problems"
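As a sanity check, the ISTA update can be run on a small synthetic lasso instance; a sketch in Python/NumPy with fixed step size t = 1/L, where L = σ_max(X)² is the Lipschitz constant of ∇g (all names are ours):

```python
import numpy as np

def ista(X, y, lam, n_iters=500):
    """ISTA for the lasso f(beta) = 0.5||y - X beta||_2^2 + lam ||beta||_1."""
    n, p = X.shape
    t = 1.0 / np.linalg.norm(X, 2) ** 2      # step size 1/L, L = sigma_max(X)^2
    beta = np.zeros(p)
    for _ in range(n_iters):
        grad = -X.T @ (y - X @ beta)          # grad g(beta)
        z = beta - t * grad                   # gradient step on g
        # soft-thresholding S_{lam * t}: prox of t * lam ||.||_1
        beta = np.sign(z) * np.maximum(np.abs(z) - lam * t, 0.0)
    return beta

# Synthetic sparse problem: 3 active coefficients out of 20.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
beta_true = np.zeros(20)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.01 * rng.standard_normal(50)
beta_hat = ista(X, y, lam=0.5)
```

The recovered beta_hat keeps the three large coefficients (slightly shrunk) and sets most of the remaining coordinates exactly to zero, which is the point of the soft-thresholding step.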


Backtracking line search

Backtracking for proximal gradient descent works similarly to before (in gradient descent), but operates on g and not f.

Choose a parameter 0 < β < 1. At each iteration, start at t = t_init, and while

    g(x − tG_t(x)) > g(x) − t∇g(x)^T G_t(x) + (t/2)||G_t(x)||_2^2,

shrink t = βt. Else perform the proximal gradient update.

(Alternative formulations exist that require less computation, i.e., fewer calls to prox.)
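The shrinking loop above can be written as a small helper that returns the accepted step size and update; a sketch assuming g, grad_g and prox_h are supplied as callables (names are ours):

```python
import numpy as np

def backtracking_step(x, g, grad_g, prox_h, t_init=1.0, beta=0.5):
    """Backtracking for proximal gradient descent.

    Shrinks t by the factor beta until the sufficient decrease condition
        g(x - t G_t(x)) <= g(x) - t grad_g(x)^T G_t(x) + (t/2)||G_t(x)||_2^2
    holds, then returns the accepted t and the proximal gradient update.
    """
    t = t_init
    grad = grad_g(x)
    while True:
        x_new = prox_h(x - t * grad, t)       # candidate update
        G = (x - x_new) / t                   # generalized gradient G_t(x)
        if g(x_new) <= g(x) - t * grad @ G + (t / 2) * (G @ G):
            return t, x_new
        t *= beta

# Toy check with g(x) = 0.5||x||_2^2 (so L = 1) and h = 0: the
# condition already holds at t = t_init = 1.
t, x_new = backtracking_step(np.array([4.0, -2.0]),
                             g=lambda z: 0.5 * z @ z,
                             grad_g=lambda z: z,
                             prox_h=lambda v, t: v)
```

Note the condition is checked on g only, matching the slide; h enters solely through the prox in the candidate update.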




Convergence analysis
For criterion f(x) = g(x) + h(x), we assume:
- g is convex, differentiable, dom(g) = R^n, and ∇g is Lipschitz continuous with constant L > 0,
- h is convex, and prox_{th}(x) = argmin_z { ||x − z||_2^2/(2t) + h(z) } can be evaluated.

Theorem
Proximal gradient descent with fixed step size t ≤ 1/L satisfies

    f(x^(k)) − f* ≤ ||x^(0) − x*||_2^2 / (2tk),

and the same result holds for backtracking with t replaced by β/L.

Proximal gradient descent has convergence rate O(1/k), i.e., O(1/ε) iterations to reach accuracy ε. Same as gradient descent! (But remember, the prox cost matters...)

Proof: see .../lectures/proxgrad.pdf

