Subgradients
Hoàng Nam Dũng
Khoa Toán - Cơ - Tin học, Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội
Last time: gradient descent
Consider the problem
    min_x f(x)
for f convex and differentiable, dom(f) = R^n.
Gradient descent: choose initial x (0) ∈ Rn , repeat
    x^(k) = x^(k−1) − t_k · ∇f(x^(k−1)),   k = 1, 2, 3, . . .
Step sizes t_k chosen to be fixed and small, or by backtracking line search.
If ∇f Lipschitz, gradient descent has convergence rate O(1/ε)
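As a quick recap in code, a minimal Python sketch of this update; the quadratic objective and the fixed step size below are illustrative assumptions, not from the slides:

import numpy as np

# Illustrative smooth convex objective: f(x) = 0.5 * ||A x - b||_2^2 (an assumption).
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])

def grad_f(x):
    return A.T @ (A @ x - b)

x = np.zeros(2)   # initial point x^(0)
t = 0.1           # fixed, small step size t_k
for k in range(200):
    x = x - t * grad_f(x)   # x^(k) = x^(k-1) - t_k * grad f(x^(k-1))
# x is now close to the minimizer A^{-1} b = (0.5, -1)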
Downsides:
Requires f differentiable
Can be slow to converge
Outline
Today:
Subgradients
Examples
Properties
Optimality characterizations
Basic inequality
Recall the basic inequality for convex and differentiable f:

    f(y) ≥ f(x) + ∇f(x)^T (y − x),   ∀x, y ∈ dom(f).

[Figure: f and its first-order approximation at (x, f(x)), with slope ∇f(x).]

The first-order approximation of f at x is a global lower bound.

∇f(x) defines a non-vertical supporting hyperplane to epi(f) at (x, f(x)):

    (∇f(x), −1)^T ((y, t) − (x, f(x))) ≤ 0,   ∀(y, t) ∈ epi(f).
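As a quick numerical sanity check of the lower bound, a short Python sketch; the choice f(x) = x^2 (so ∇f(x) = 2x) and the test grid are assumptions for illustration:

import numpy as np

# First-order lower bound check for the illustrative choice f(x) = x^2.
f = lambda x: x**2
grad_f = lambda x: 2.0 * x

x = 1.0
ys = np.linspace(-3.0, 3.0, 61)
lower = f(x) + grad_f(x) * (ys - x)    # first-order approximation at x
assert np.all(f(ys) >= lower - 1e-12)  # global lower bound, since f is convex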
Subgradients
A subgradient of a convex function f at x is any g ∈ Rn such that
f (y ) ≥ f (x) + g T (y − x), ∀y ∈ dom(f ).
Always exists (on the relative interior of dom(f ))
If f differentiable at x, then g = ∇f (x) uniquely
Same definition works for nonconvex f (however, subgradients
need not exist).
[Figure: a convex f with affine lower bounds f(x1) + g1^T(y − x1), f(x1) + g2^T(y − x1), and f(x2) + g3^T(y − x2).]

g1 and g2 are subgradients at x1; g3 is a subgradient at x2.
Examples of subgradients
Consider f : R → R, f(x) = |x|.

[Figure: plot of f(x) = |x| for x ∈ [−2, 2].]

For x ≠ 0, unique subgradient g = sign(x).
For x = 0, subgradient g is any element of [−1, 1].
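A quick numerical check of the definition at x = 0, in Python (the candidate values of g and the test grid are arbitrary illustrative choices):

import numpy as np

# f(x) = |x|: any g in [-1, 1] is a subgradient at x = 0.
f = lambda x: np.abs(x)
x0 = 0.0
ys = np.linspace(-2.0, 2.0, 81)
for g in (-1.0, -0.3, 0.0, 0.7, 1.0):              # candidate subgradients at 0
    assert np.all(f(ys) >= f(x0) + g * (ys - x0))  # f(y) >= f(x0) + g (y - x0)
# g = 1.5 would violate the inequality, e.g. at y = 1.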
Examples of subgradients
Consider f : R^n → R, f(x) = ‖x‖_2.

[Figure: surface plot of f(x) = ‖x‖_2 over (x1, x2).]

For x ≠ 0, unique subgradient g = x/‖x‖_2.
For x = 0, subgradient g is any element of {z : ‖z‖_2 ≤ 1}.
Examples of subgradients
Consider f : R^n → R, f(x) = ‖x‖_1.

[Figure: surface plot of f(x) = ‖x‖_1 over (x1, x2).]

For x_i ≠ 0, unique ith component g_i = sign(x_i).
For x_i = 0, ith component g_i is any element of [−1, 1].
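In code, one valid subgradient of the ℓ1 norm can be picked componentwise as above (taking g_i = 0 ∈ [−1, 1] where x_i = 0); the function name below is hypothetical, for illustration only:

import numpy as np

def subgrad_l1(x):
    # One subgradient of f(x) = ||x||_1: sign(x_i) where x_i != 0, and 0 where x_i = 0.
    return np.sign(x)

x = np.array([1.5, 0.0, -2.0])
g = subgrad_l1(x)                 # g = [1, 0, -1]
y = np.array([0.3, -1.0, 0.5])
assert np.sum(np.abs(y)) >= np.sum(np.abs(x)) + g @ (y - x)  # subgradient inequality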
Examples of subgradients
Consider f(x) = max{f1(x), f2(x)}, for f1, f2 : R^n → R convex and differentiable.

[Figure: plot of f = max{f1, f2} on [−2, 2], with a kink where f1 = f2.]

For f1(x) > f2(x), unique subgradient g = ∇f1(x).
For f2(x) > f1(x), unique subgradient g = ∇f2(x).
For f1(x) = f2(x), subgradient g is any point on the line segment between ∇f1(x) and ∇f2(x).
Subdifferential
Set of all subgradients of convex f is called the subdifferential:
∂f (x) = {g ∈ Rn : g is a subgradient of f at x}.
Properties:
Nonempty for convex f at x ∈ int(dom(f))
∂f (x) is closed and convex (even for nonconvex f )
If f is differentiable at x, then ∂f (x) = {∇f (x)}
If ∂f (x) = {g }, then f is differentiable at x and ∇f (x) = g .
Proof: See />lectures/subgradients.pdf
Monotonicity
Theorem
The subdifferential of a convex function f is a monotone operator
(u − v )T (x − y ) ≥ 0, ∀u ∈ ∂f (x), v ∈ ∂f (y ).
Proof.
By definition we have
f (y ) ≥ f (x) + u T (y − x) and f (x) ≥ f (y ) + v T (x − y ).
Adding the two inequalities gives (u − v)^T (x − y) ≥ 0, i.e., monotonicity.
Question: what does monotonicity say for a differentiable convex function? It reads

    (∇f(x) − ∇f(y))^T (x − y) ≥ 0,

which follows directly from the first-order characterization of convex functions.
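For instance, with f(x) = |x| and x, y ≠ 0, the subgradients are u = sign(x) and v = sign(y), and (sign(x) − sign(y))(x − y) is 0 when the signs agree and 2|x − y| > 0 when they differ, so monotonicity holds.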
Examples of non-subdifferentiable functions
The following functions are not subdifferentiable at x = 0
f : R → R, dom(f) = R_+,

    f(x) = 1 if x = 0,
           0 if x > 0.

f : R → R, dom(f) = R_+,

    f(x) = −√x.
The only supporting hyperplane to epi(f ) at (0, f (0)) is vertical.
Connection to convex geometry
Convex set C ⊆ R^n, consider the indicator function I_C : R^n → R,

    I_C(x) = I{x ∈ C} = 0 if x ∈ C,
                        ∞ if x ∉ C.

For x ∈ C, ∂I_C(x) = N_C(x), the normal cone of C at x; recall

    N_C(x) = {g ∈ R^n : g^T x ≥ g^T y for any y ∈ C}.

Why? By definition of subgradient g,

    I_C(y) ≥ I_C(x) + g^T (y − x) for all y.

For y ∉ C, I_C(y) = ∞.
For y ∈ C, this means 0 ≥ g^T (y − x).
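For example, take C = [0, 1] ⊂ R. At an interior point x ∈ (0, 1), the condition g·x ≥ g·y for all y ∈ [0, 1] forces g = 0, so ∂I_C(x) = N_C(x) = {0}; at x = 1 it gives N_C(1) = [0, ∞), and at x = 0 it gives N_C(0) = (−∞, 0].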
Subgradient calculus
Basic rules for convex functions:
Scaling: ∂(af ) = a · ∂f provided a > 0.
Addition: ∂(f1 + f2 ) = ∂f1 + ∂f2 .
Affine composition: if g (x) = f (Ax + b), then
∂g (x) = AT ∂f (Ax + b).
Finite pointwise maximum: if f(x) = max_{i=1,...,m} f_i(x), then

    ∂f(x) = conv( ∪_{i : f_i(x) = f(x)} ∂f_i(x) ),

the convex hull of the union of subdifferentials of active functions at x.
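As a small illustration of the affine composition and finite pointwise maximum rules, a Python sketch computing one valid subgradient of f(x) = max_i (a_i^T x + b_i); here ∂f_i(x) = {a_i}, so any active a_i is a subgradient (the function name and the data A, b are made up for illustration):

import numpy as np

def subgrad_max_affine(A, b, x):
    # f(x) = max_i (a_i^T x + b_i); each f_i is affine with gradient a_i.
    # Any a_i attaining the max lies in the convex hull of active gradients,
    # hence is a subgradient of f at x.
    vals = A @ x + b
    i = int(np.argmax(vals))   # index of one active function
    return A[i]

A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # rows a_i (illustrative)
b = np.array([0.0, 0.0, 1.0])
x = np.array([2.0, -1.0])
g = subgrad_max_affine(A, b, x)   # f_1 is active here, so g = a_1 = [1, 0]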
Subgradient calculus
General pointwise maximum: if f(x) = max_{s ∈ S} f_s(x), then

    ∂f(x) ⊇ cl conv( ∪_{s : f_s(x) = f(x)} ∂f_s(x) ).

Under some regularity conditions (on S, f_s), we get equality.

Norms: important special case, f(x) = ‖x‖_p. Let q be such that 1/p + 1/q = 1; then

    ‖x‖_p = max_{‖z‖_q ≤ 1} z^T x,

and

    ∂f(x) = argmax_{‖z‖_q ≤ 1} z^T x.
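For instance, with p = 1 and q = ∞ this gives ∂‖x‖_1 = argmax_{‖z‖_∞ ≤ 1} z^T x = {z : z_i = sign(x_i) if x_i ≠ 0, z_i ∈ [−1, 1] if x_i = 0}, matching the earlier example.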
Why subgradients?
Subgradients are important for two reasons:
Convex analysis: optimality characterization via subgradients,
monotonicity, relationship to duality.
Convex optimization: if you can compute subgradients, then
you can minimize any convex function.
Optimality condition
Subgradient optimality condition: For any f (convex or not),
    f(x∗) = min_x f(x) ⇐⇒ 0 ∈ ∂f(x∗),
i.e., x ∗ is a minimizer if and only if 0 is a subgradient of f at x ∗ .
Why? Easy: g = 0 being a subgradient means that for all y
f (y ) ≥ f (x ∗ ) + 0T (y − x ∗ ) = f (x ∗ ).
Note the implication for a convex and differentiable function f, for which ∂f(x) = {∇f(x)}: the condition reduces to ∇f(x∗) = 0.
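As a concrete instance from the earlier example, f(x) = |x| has ∂f(0) = [−1, 1], which contains 0, so x∗ = 0 is a minimizer even though f is not differentiable there.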
Derivation of first-order optimality
Example of the power of subgradients: we can use what we have
learned so far to derive the first-order optimality condition.
Theorem
For f convex and differentiable and C convex
    min_x f(x) subject to x ∈ C
is solved at x if and only if
    ∇f(x)^T (y − x) ≥ 0 for all y ∈ C.
(For a direct proof see, e.g., />Teaching/ORF523/S16/ORF523_S16_Lec7_gh.pdf; a proof using subgradients is on the next slide.)
Intuitively: f cannot decrease, to first order, as we move from x toward any feasible point y ∈ C.
Note that for C = R^n (the unconstrained case) it reduces to ∇f(x) = 0.
Derivation of first-order optimality
Proof.
First recast the problem as

    min_x f(x) + I_C(x).
Now apply subgradient optimality: 0 ∈ ∂(f (x) + IC (x)).
Observe
0 ∈ ∂(f (x) + IC (x)) ⇔ 0 ∈ {∇f (x)} + NC (x)
⇔ −∇f (x) ∈ NC (x)
⇔ −∇f (x)T x ≥ −∇f (x)T y for all y ∈ C
⇔ ∇f (x)T (y − x) ≥ 0 for all y ∈ C
as desired.
Note: the condition 0 ∈ ∂f (x) + NC (x) is a fully general condition
for optimality in convex problems. But it’s not always easy to work
with (KKT conditions, later, are easier).