
Comput Optim Appl (2013) 55:75–111
DOI 10.1007/s10589-012-9515-6

Combining Lagrangian decomposition and excessive gap smoothing technique for solving large-scale separable convex optimization problems
Quoc Tran Dinh · Carlo Savorgnan · Moritz Diehl

Received: 22 June 2011 / Published online: 18 November 2012
© Springer Science+Business Media New York 2012

Abstract A new algorithm for solving large-scale convex optimization problems with a separable objective function is proposed. The basic idea is to combine three techniques: Lagrangian dual decomposition, excessive gap and smoothing. The main advantage of this algorithm is that it automatically and simultaneously updates the smoothness parameters, which significantly improves its performance. The convergence of the algorithm is proved under weak conditions imposed on the original problem. The rate of convergence is O(1/k), where k is the iteration counter. In the second part of the paper, the proposed algorithm is coupled with a dual scheme to construct a switching variant in a dual decomposition framework. We discuss implementation issues and make a theoretical comparison. Numerical examples confirm the theoretical results.
Keywords Excessive gap · Smoothing technique · Lagrangian decomposition ·
Proximal mappings · Large-scale problem · Separable convex optimization ·
Distributed optimization

Q. Tran Dinh (✉) · C. Savorgnan · M. Diehl
Department of Electrical Engineering (ESAT-SCD) and Optimization in Engineering Center (OPTEC), KU Leuven, Kasteelpark Arenberg 10, 3001 Heverlee-Leuven, Belgium

Q. Tran Dinh
Vietnam National University, Hanoi, Vietnam



1 Introduction
Large-scale convex optimization problems appear in many areas of science such as graph theory, networks, transportation, distributed model predictive control, distributed estimation and multistage stochastic optimization [16, 20, 30, 35–37, 39]. Solving large-scale optimization problems is still a challenge in many applications [4]. Over the years, thanks to the development of parallel and distributed computer systems, the possibilities for solving large-scale problems have increased. However, methods and algorithms for solving this type of problem are limited [1, 4].
Convex minimization problems with a separable objective function form a class of problems which is relevant in many applications. This class is also known as separable convex minimization problems, see, e.g., [1]. Without loss of generality, a separable convex optimization problem can be written in the form of a convex program with a separable objective function and coupled linear constraints [1]. In addition, decoupled convex constraints may also be considered. Mathematically, this problem can be formulated in the following form:
$$
\min_{x\in\mathbb{R}^n}\ \phi(x) := \sum_{i=1}^{M}\phi_i(x_i)
\quad\text{s.t.}\quad x_i \in X_i,\ i=1,\dots,M,\qquad \sum_{i=1}^{M} A_i x_i = b,
\tag{1}
$$
where $\phi_i : \mathbb{R}^{n_i} \to \mathbb{R}$ is convex, $X_i \subseteq \mathbb{R}^{n_i}$ is a nonempty, closed convex set, $A_i \in \mathbb{R}^{m\times n_i}$, $b \in \mathbb{R}^m$ for all $i = 1, \dots, M$, and $n_1 + n_2 + \cdots + n_M = n$. The last constraint is called the coupling linear constraint.
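To make the separable structure of (1) concrete, the following minimal Python sketch (not from the paper) builds a toy two-block instance with strongly convex quadratic terms $\phi_i(x_i) = \tfrac{1}{2}x_i^TQ_ix_i + c_i^Tx_i$ and $X_i = \mathbb{R}^{n_i}$; all names (Q1, c1, A1, ...) are illustrative assumptions, and this instance is reused in the later sketches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-block instance of (1):
#   minimize   0.5*x1'Q1 x1 + c1'x1 + 0.5*x2'Q2 x2 + c2'x2
#   subject to A1 x1 + A2 x2 = b,   x1 in R^n1, x2 in R^n2.
n1, n2, m = 4, 3, 2
Q1 = np.eye(n1)                 # phi_1 strongly convex with parameter 1
Q2 = 2.0 * np.eye(n2)           # phi_2 strongly convex with parameter 2
c1, c2 = rng.standard_normal(n1), rng.standard_normal(n2)
A1, A2 = rng.standard_normal((m, n1)), rng.standard_normal((m, n2))
A_full = np.hstack([A1, A2])    # A = [A1, A2]
b = rng.standard_normal(m)

blocks = [(Q1, c1, A1), (Q2, c2, A2)]

def phi(x1, x2):
    # Separable objective phi(x) = phi_1(x_1) + phi_2(x_2).
    return 0.5 * x1 @ Q1 @ x1 + c1 @ x1 + 0.5 * x2 @ Q2 @ x2 + c2 @ x2
```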
In the literature, several solution approaches have been proposed for solving problem (1). For example, (augmented) Lagrangian relaxation and subgradient methods of multipliers [1, 10, 29, 36], Fenchel's dual decomposition [11], alternating direction methods [2, 9, 13, 15], proximal point-type methods [3, 33], splitting methods [7, 8], interior point methods [18, 32, 39], mean value cross decomposition [17] and the partial inverse method [31] have been studied, among many others. One of the classical approaches for solving (1) is Lagrangian dual decomposition. The main idea of this approach is to solve the dual problem by means of a subgradient method. It has been recognized in practice that subgradient methods are usually slow [25] and numerically sensitive to the choice of step sizes. In the special case of a strongly convex objective function, the dual function is differentiable. Consequently, gradient schemes can be applied to solve the dual problem.
Recently, Nesterov [25] developed smoothing techniques for solving nonsmooth convex optimization problems based on the fast gradient scheme which was introduced in his early work [24]. The fast gradient schemes have been used in numerous applications including image processing, compressed sensing, networks and system identification, see e.g. [9, 12, 28]. Exploiting Nesterov's idea in [26], Necoara and Suykens [22] applied a smoothing technique to the dual problem in the framework of Lagrangian dual decomposition and then used Nesterov's fast gradient scheme to maximize the smoothed function of the dual problem. This resulted in a new variant of dual decomposition algorithms for solving separable convex optimization. The authors proved that the rate of convergence of their algorithm is O(1/k), which is much better than the O(1/√k) rate of the subgradient methods of multipliers [6, 23], where k is the iteration counter. A main disadvantage of this scheme is that the smoothness parameter must be given a priori. Moreover, this parameter crucially depends on a given desired accuracy. Since the Lipschitz constant of the gradient of the objective function in the dual problem is inversely proportional to the smoothness parameter, the algorithm usually generates short steps towards a solution of the dual problem although the rate of convergence is O(1/k).
To overcome this drawback, in this paper we propose a new algorithm which combines three techniques: smoothing [26, 27], excessive gap [27] and Lagrangian dual decomposition [1]. Although the convergence rate is still O(1/k), the algorithms developed in this paper have some advantages compared to the one in [22]. First, instead of fixing the smoothness parameters, we update them dynamically at every iteration. Second, our algorithm is a primal-dual method which not only gives us a dual approximate solution but also a primal approximate solution of (1). Note that the computational cost of the proposed algorithms remains almost the same as in the proximal-center-based decomposition algorithm proposed in [22, Algorithm 3.2]. (Algorithm 3.2 in [22] requires one to compute an additional dual step.) This algorithm is called dual decomposition with two primal steps (Algorithm 1). Alternatively, we apply the switching strategy of [27] to obtain a decomposition algorithm with switching primal-dual steps for solving problem (1). This algorithm differs from the one in [27] in two points. First, the smoothness parameter is dynamically updated with an exact formula. Second, proximal-based mappings are used to handle the nonsmoothness of the objective function. The second point is more significant since, in practice, estimating the Lipschitz constants is not an easy task even if the objective function is differentiable. We note that the proximal-based mappings proposed in this paper only play a role in handling the nonsmoothness of the objective function. Therefore, the algorithms developed in this paper do not belong to any proximal-point algorithm class considered in the literature. The approach presented in this paper also differs from the splitting and alternating methods considered in the literature, see, e.g., [2, 7, 13, 15], in the sense that it solves the convex subproblems of each component simultaneously without transforming the original problem into any equivalent form. Moreover, all algorithms are first-order methods which can be implemented in a highly parallel and distributed manner.
Contribution The contribution of this paper is the following:
1. We apply the Lagrangian relaxation, smoothing and excessive gap techniques to large-scale separable convex optimization problems which are not necessarily smooth. Note that the excessive gap condition that we use in this paper is different from the one in [27]: not only the duality gap is measured but also the feasibility gap is used in the framework of constrained optimization, see Lemma 3.
2. We propose two algorithms for solving general separable convex optimization problems. The first algorithm is new, while the second one is a new variant of the first algorithm proposed by Nesterov in [27, Algorithm 1] applied to Lagrangian dual decomposition. Both algorithms allow us to obtain the primal and dual approximate solutions simultaneously. Moreover, all the algorithm parameters are updated automatically without any tuning procedure. As a special case of these algorithms, a new method for solving problem (1) with a strongly convex objective function is studied. All the algorithms are highly parallelizable and distributed.
3. The convergence of the algorithms is proved and the convergence rate is estimated. In the first two algorithms, this convergence rate is O(1/k), which is much better than the O(1/√k) rate of subgradient methods [6, 23], where k is the iteration counter. In the last algorithm, the convergence rate is O(1/k²).
The rest of the paper is organized as follows. In the next section, we briefly describe the Lagrangian dual decomposition method [1] for separable convex optimization, the smoothing technique via prox-functions, as well as the excessive gap technique [27]. We also provide several technical lemmas which will be used in the sequel. Section 3 presents a new algorithm called decomposition algorithm with two primal steps and estimates its worst-case complexity. Section 4 combines the two primal steps and the two dual steps schemes into a decomposition algorithm with switching primal-dual steps. Section 5 applies the two dual steps scheme (53) to solve problem (1) with a strongly convex objective function. We also discuss the implementation issues of the proposed algorithms and a theoretical comparison of Algorithms 1 and 2 in Sect. 6. Numerical examples are presented in Sect. 7 to examine the performance of the proposed algorithms and to compare different methods.
Notation Throughout the paper, we shall consider the Euclidean space $\mathbb{R}^n$ endowed with the inner product $x^T y$ for $x, y \in \mathbb{R}^n$ and the norm $\|x\| := \sqrt{x^T x}$. The notation $x := (x_1, \dots, x_M)$ represents a column vector in $\mathbb{R}^n$, where $x_i$ is a subvector in $\mathbb{R}^{n_i}$, $i = 1, \dots, M$, and $n_1 + \cdots + n_M = n$.

2 Lagrangian dual decomposition and excessive gap smoothing technique

A classical technique to address coupling constraints in optimization is Lagrangian relaxation [1]. However, this technique often leads to a nonsmooth optimization problem in the dual form. To overcome this, we combine the Lagrangian dual decomposition and the smoothing technique in [26, 27] to obtain a smooth approximation of the dual problem.

For simplicity of discussion, we consider problem (1) with M = 2. However, the methods presented in the next sections can be directly applied to the case M > 2 (see Sect. 6). Problem (1) with M = 2 can be rewritten as follows:


$$
\phi^* := \begin{cases}
\displaystyle\min_{x:=(x_1,x_2)} \ \phi(x) := \phi_1(x_1) + \phi_2(x_2) \\[2pt]
\ \text{s.t.}\quad A_1 x_1 + A_2 x_2 = b,\quad x \in X_1\times X_2 := X,
\end{cases}
\tag{2}
$$



where $\phi_i$, $X_i$ and $A_i$ are defined as in (1) for $i = 1, 2$ and $b \in \mathbb{R}^m$. Problem (2) is said to satisfy the Slater constraint qualification condition if $\mathrm{ri}(X) \cap \{x = (x_1, x_2) \mid A_1 x_1 + A_2 x_2 = b\} \neq \emptyset$, where $\mathrm{ri}(X)$ is the relative interior of the convex set $X$. Let us denote by $X^*$ the solution set of this problem. We make the following assumption.

Assumption 1 The solution set $X^*$ is nonempty and either the Slater qualification condition for problem (2) holds or $X_i$ is polyhedral. The function $\phi_i$ is proper, lower semicontinuous and convex in $\mathbb{R}^{n_i}$, $i = 1, 2$.

Note that the objective function $\phi$ is not necessarily smooth. For example, $\phi(x) = \|x\|_1 = \sum_{i=1}^{n}|x^{(i)}|$ is nonsmooth and separable.

2.1 Decomposition via Lagrangian relaxation

Let us first define the Lagrange function of problem (2) as:
$$ L(x, y) := \phi_1(x_1) + \phi_2(x_2) + y^T (A_1 x_1 + A_2 x_2 - b), \tag{3} $$
where $y \in \mathbb{R}^m$ is the multiplier associated with the coupling constraint $A_1 x_1 + A_2 x_2 = b$. Then, the dual problem of (2) can be written as:
$$ d^* := \max_{y\in\mathbb{R}^m} d(y), \tag{4} $$
where
$$ d(y) := \min_{x\in X} L(x, y) = \min_{x\in X}\big\{\phi_1(x_1) + \phi_2(x_2) + y^T (A_1 x_1 + A_2 x_2 - b)\big\} \tag{5} $$
is the dual function.

Let $A = [A_1, A_2]$. Due to Assumption 1, strong duality holds and we have:
$$ d^* = \max_{y\in\mathbb{R}^m} d(y) \overset{\text{strong duality}}{=} \min_{x\in X}\{\phi(x) \mid Ax = b\} = \phi^*. \tag{6} $$


Let us denote by $Y^*$ the solution set of the dual problem (4). It is well known that $Y^*$ is bounded due to Assumption 1.

Finally, we note that the dual function $d$ defined by (5) can be computed separately as:
$$ d(y) = d_1(y) + d_2(y), \tag{7} $$
where
$$ d_i(y) := \min_{x_i\in X_i}\big\{\phi_i(x_i) + y^T A_i x_i\big\} - \tfrac{1}{2} b^T y, \quad i = 1, 2. \tag{8} $$
We denote by $x_i^*(y)$ a solution of the minimization problem in (8) ($i = 1, 2$) and $x^*(y) := (x_1^*(y), x_2^*(y))$. The representation (7)–(8) is called a dual decomposition of the dual function $d$. It is obvious that, in general, the dual function $d$ is concave and nonsmooth.
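As a small illustration of the dual decomposition (7)–(8), the sketch below evaluates $d(y)$ block-wise for the toy quadratic instance above; the closed-form minimizer $x_i^*(y) = -Q_i^{-1}(c_i + A_i^Ty)$ is a consequence of the assumed quadratic $\phi_i$ and unconstrained $X_i$, not of the general setting.

```python
import numpy as np

def x_star_block(Q, c, A, y):
    # Minimizer of phi_i(x_i) + y' A_i x_i over X_i = R^{n_i} for quadratic phi_i.
    return np.linalg.solve(Q, -(c + A.T @ y))

def dual_value(y, blocks, b):
    # d(y) = sum_i [ phi_i(x_i*(y)) + y' A_i x_i*(y) - (1/2) b'y ],  cf. (7)-(8);
    # the two -(1/2) b'y shares add up to the single -b'y term below.
    val = -y @ b
    for (Q, c, A) in blocks:
        xi = x_star_block(Q, c, A, y)
        val += 0.5 * xi @ Q @ xi + c @ xi + y @ (A @ xi)
    return val

# Example: evaluate the dual function at y = 0 for the toy instance.
# print(dual_value(np.zeros(2), blocks, b))
```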



2.2 Smoothing via prox-functions

Let us recall the definition of a proximity function. A function $p_X$ is called a proximity function (prox-function) of a given nonempty, closed and convex set $X \subset \mathbb{R}^{n_x}$ if $p_X$ is continuous, strongly convex with a convexity parameter $\sigma_X > 0$ and $X \subseteq \mathrm{dom}(p_X)$. Let $x^c$ be the prox-center of $X$, which is defined as:
$$ x^c = \arg\min_{x\in X} p_X(x). \tag{9} $$
Without loss of generality, we can assume that $p_X(x^c) = 0$. Otherwise, we consider the function $\hat{p}_X(x) := p_X(x) - p_X(x^c)$. Let:
$$ D_X := \max_{x\in X} p_X(x) \ge 0. \tag{10} $$

We make the following assumption.

Assumption 2 Each feasible set $X_i$ is endowed with a prox-function $p_i$ which has a convexity parameter $\sigma_i > 0$. Moreover, $0 \le D_i := \max_{x_i\in X_i} p_i(x_i) < +\infty$ for $i = 1, 2$.

In particular, if $X_i$ is bounded then Assumption 2 is satisfied. Throughout the paper, we assume that Assumptions 1 and 2 are satisfied.
Now, we consider the following functions:
$$ d_i(y; \beta_1) := \min_{x_i\in X_i}\big\{\phi_i(x_i) + y^T A_i x_i + \beta_1 p_i(x_i)\big\} - \tfrac{1}{2} b^T y, \quad i = 1, 2, \tag{11} $$
$$ d(y; \beta_1) := d_1(y; \beta_1) + d_2(y; \beta_1). \tag{12} $$
Here, $\beta_1 > 0$ is a given parameter called the smoothness parameter. We denote by $x_i^*(y; \beta_1)$ the solution of (11), i.e.:
$$ x_i^*(y; \beta_1) := \arg\min_{x_i\in X_i}\big\{\phi_i(x_i) + y^T A_i x_i + \beta_1 p_i(x_i)\big\}, \quad i = 1, 2. \tag{13} $$
Note that we can use different parameters $\beta_1^i$ in (11) ($i = 1, 2$).
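A minimal sketch of the smoothed block solution $x_i^*(y;\beta_1)$ in (13) for the toy quadratic instance, with the assumed prox-function $p_i(x_i) = \tfrac{1}{2}\|x_i - x_i^c\|^2$ (so $\sigma_i = 1$); the closed form below holds only for this quadratic, unconstrained choice.

```python
import numpy as np

def x_star_smoothed(Q, c, A, y, beta1, xc):
    # argmin_x { 0.5*x'Qx + c'x + y'Ax + beta1 * 0.5*||x - xc||^2 }   (cf. (13))
    n = Q.shape[0]
    return np.linalg.solve(Q + beta1 * np.eye(n), beta1 * xc - c - A.T @ y)

def d_smoothed(y, blocks, centers, b, beta1):
    # d(y; beta1) = d_1(y; beta1) + d_2(y; beta1),  cf. (11)-(12).
    val = -y @ b
    for (Q, c, A), xc in zip(blocks, centers):
        xi = x_star_smoothed(Q, c, A, y, beta1, xc)
        val += (0.5 * xi @ Q @ xi + c @ xi + y @ (A @ xi)
                + 0.5 * beta1 * np.dot(xi - xc, xi - xc))
    return val
```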
The following lemma shows the main properties of $d(\cdot; \beta_1)$; its proof can be found, e.g., in [22, 27].

Lemma 1 For any $\beta_1 > 0$, the function $d_i(\cdot; \beta_1)$ defined by (11) is well-defined, concave and continuously differentiable on $\mathbb{R}^m$. The gradient $\nabla_y d_i(y; \beta_1) = A_i x_i^*(y; \beta_1) - \tfrac{1}{2}b$ is Lipschitz continuous with a Lipschitz constant $L^{d_i}(\beta_1) = \frac{\|A_i\|^2}{\beta_1\sigma_i}$ ($i = 1, 2$). Consequently, the function $d(\cdot; \beta_1)$ defined by (12) is concave and differentiable. Its gradient is given by $\nabla_y d(y; \beta_1) := Ax^*(y; \beta_1) - b$, which is Lipschitz continuous with a Lipschitz constant $L^d(\beta_1) := \frac{1}{\beta_1}\sum_{i=1}^{2}\frac{\|A_i\|^2}{\sigma_i}$. Moreover, it holds that:
$$ d(y; \beta_1) - \beta_1(D_1 + D_2) \le d(y) \le d(y; \beta_1), \tag{14} $$
and $d(y; \beta_1) \to d(y)$ as $\beta_1 \downarrow 0^+$ for any $y \in \mathbb{R}^m$.



Remark 1 Even without the boundedness of $X$, if the solution set $X^*$ of (2) is bounded then, in principle, we can bound the feasible set $X$ by a large compact set which contains all the sampling points generated by the algorithms (see Sect. 4 below). However, the following algorithms do not use $D_i$, $i = 1, 2$ (defined by (10)) in any computational step. These constants only appear in the theoretical complexity estimates.
Next, for a given $\beta_2 > 0$, we define a mapping $\psi(\cdot; \beta_2)$ from $X$ to $\mathbb{R}$ by:
$$ \psi(x; \beta_2) := \max_{y\in\mathbb{R}^m}\Big\{(Ax - b)^T y - \frac{\beta_2}{2}\|y\|^2\Big\}. \tag{15} $$
This function can be considered as a smoothed version of $\psi(x) := \max_{y\in\mathbb{R}^m}\{(Ax - b)^T y\}$ via the prox-function $p(y) := \tfrac{1}{2}\|y\|^2$. It is easy to show that the unique solution of the maximization problem in (15) is given explicitly as $y^*(x; \beta_2) = \frac{1}{\beta_2}(Ax - b)$ and $\psi(x; \beta_2) = \frac{1}{2\beta_2}\|Ax - b\|^2$. Therefore, $\psi(\cdot; \beta_2)$ is well-defined and differentiable on $X$. Let:
$$ f(x; \beta_2) := \phi(x) + \psi(x; \beta_2) = \phi(x) + \frac{1}{2\beta_2}\|Ax - b\|^2. \tag{16} $$

The next lemma summarizes the properties of $\psi(\cdot; \beta_2)$ and $f(\cdot; \beta_2)$.

Lemma 2 For any $\beta_2 > 0$, the function $\psi(\cdot; \beta_2)$ defined by (15) is a quadratic function of the form $\psi(x; \beta_2) = \frac{1}{2\beta_2}\|Ax - b\|^2$ on $X$. Its gradient vector is given by:
$$ \nabla_x \psi(x; \beta_2) = \frac{1}{\beta_2} A^T (Ax - b), \tag{17} $$
which is Lipschitz continuous with a Lipschitz constant $L^{\psi}(\beta_2) := \frac{1}{\beta_2}(\|A_1\|^2 + \|A_2\|^2)$. Moreover, the following estimate holds for all $x, \hat{x} \in X$:
$$ \psi(x; \beta_2) \le \psi(\hat{x}; \beta_2) + \nabla_{x_1}\psi(\hat{x}; \beta_2)^T (x_1 - \hat{x}_1) + \nabla_{x_2}\psi(\hat{x}; \beta_2)^T (x_2 - \hat{x}_2) + \frac{L_1^{\psi}(\beta_2)}{2}\|x_1 - \hat{x}_1\|^2 + \frac{L_2^{\psi}(\beta_2)}{2}\|x_2 - \hat{x}_2\|^2, \tag{18} $$
and
$$ f(x; \beta_2) - \frac{1}{2\beta_2}\|Ax - b\|^2 = \phi(x) \le f(x; \beta_2), \tag{19} $$
where $L_1^{\psi}(\beta_2) := \frac{2}{\beta_2}\|A_1\|^2$ and $L_2^{\psi}(\beta_2) := \frac{2}{\beta_2}\|A_2\|^2$.

Proof It is sufficient to prove (18) only. Since $\psi(x; \beta_2) = \frac{1}{2\beta_2}\|A_1 x_1 + A_2 x_2 - b\|^2$, we have:
$$ \psi(x; \beta_2) - \psi(\hat{x}; \beta_2) - \nabla_x\psi(\hat{x}; \beta_2)^T (x - \hat{x}) = \frac{1}{2\beta_2}\|A_1(x_1 - \hat{x}_1) + A_2(x_2 - \hat{x}_2)\|^2 \le \frac{1}{\beta_2}\|A_1\|^2\|x_1 - \hat{x}_1\|^2 + \frac{1}{\beta_2}\|A_2\|^2\|x_2 - \hat{x}_2\|^2. \tag{20} $$
This inequality is indeed (18). The inequality (19) follows directly from (16). $\square$
2.3 Excessive gap technique

The duality gap of the primal and dual problems (2)–(4) is measured by $g(x, y) := \phi(x) - d(y)$; if this gap is equal to zero for some feasible point $(x, y)$, then this point is an optimal solution of (2)–(4). In this section, we apply a technique called excessive gap, proposed by Nesterov in [27], to the Lagrangian dual decomposition framework. First, we recall the following definition.

Definition 1 We say that a point $(\bar{x}, \bar{y}) \in X \times \mathbb{R}^m$ satisfies the excessive gap condition with respect to two smoothness parameters $\beta_1 > 0$ and $\beta_2 > 0$ if:
$$ f(\bar{x}; \beta_2) \le d(\bar{y}; \beta_1), \tag{21} $$
where $f(\cdot; \beta_2)$ and $d(\cdot; \beta_1)$ are defined by (16) and (12), respectively.
The following lemma provides an upper bound estimate for the duality gap and the feasibility gap of problem (2).

Lemma 3 Suppose that $(\bar{x}, \bar{y}) \in X \times \mathbb{R}^m$ satisfies the excessive gap condition (21). Then for any $y^* \in Y^*$, we have:
$$ -\|y^*\|\,\|A\bar{x} - b\| \le \phi(\bar{x}) - d(\bar{y}) \le \beta_1(D_1 + D_2) - \frac{1}{2\beta_2}\|A\bar{x} - b\|^2 \le \beta_1(D_1 + D_2), \tag{22} $$
and
$$ \|A\bar{x} - b\| \le \beta_2\Big[\|y^*\| + \Big(\|y^*\|^2 + \frac{2\beta_1}{\beta_2}(D_1 + D_2)\Big)^{1/2}\Big]. \tag{23} $$

Proof Suppose that $\bar{x}$ and $\bar{y}$ satisfy the condition (21). For a given $y^* \in Y^*$, one has:
$$ d(\bar{y}) \le d(y^*) = \min_{x\in X}\big\{\phi(x) + (Ax - b)^T y^*\big\} \le \phi(\bar{x}) + (A\bar{x} - b)^T y^* \le \phi(\bar{x}) + \|A\bar{x} - b\|\,\|y^*\|, $$
which implies the first inequality of (22). By using Lemma 1 and (16) we have:
$$ \phi(\bar{x}) - d(\bar{y}) \overset{(14)+(19)}{\le} f(\bar{x}; \beta_2) - d(\bar{y}; \beta_1) + \beta_1(D_1 + D_2) - \frac{1}{2\beta_2}\|A\bar{x} - b\|^2. $$
Now, by substituting the condition (21) into this inequality, we obtain the second inequality of (22). Let $\eta := \|A\bar{x} - b\|$. It follows from (22) that $\eta^2 - 2\beta_2\|y^*\|\eta - 2\beta_1\beta_2(D_1 + D_2) \le 0$. The estimate (23) follows from this inequality after a few simple calculations. $\square$



3 New decomposition algorithm

In this section, we derive an iterative decomposition algorithm for solving (2) based on the excessive gap technique. This method is called a decomposition algorithm with two primal steps. The aim is to generate a point $(\bar{x}, \bar{y}) \in X \times \mathbb{R}^m$ at each iteration such that this point maintains the excessive gap condition (21) while the algorithm drives the parameters $\beta_1$ and $\beta_2$ to zero.
3.1 Finding a starting point

As assumed earlier, the function $\phi_i$ is convex but not necessarily differentiable. Therefore, we cannot use the gradient information of these functions. We consider the following mappings ($i = 1, 2$):
$$ P_i(\hat{x}; \beta_2) := \arg\min_{x_i\in X_i}\Big\{\phi_i(x_i) + y^*(\hat{x}; \beta_2)^T A_i(x_i - \hat{x}_i) + \frac{L_i^{\psi}(\beta_2)}{2}\|x_i - \hat{x}_i\|^2\Big\}, \tag{24} $$
where $y^*(\hat{x}; \beta_2) := \frac{1}{\beta_2}(A\hat{x} - b)$. Since $L_i^{\psi}(\beta_2)$ defined in Lemma 2 is positive, $P_i(\cdot; \beta_2)$ is well-defined. This mapping is called a proximal operator [3]. Let $P(\cdot; \beta_2) = (P_1(\cdot; \beta_2), P_2(\cdot; \beta_2))$.
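The sketch below evaluates the proximal mapping $P_i(\hat{x};\beta_2)$ of (24) for the toy quadratic blocks over $X_i=\mathbb{R}^{n_i}$; for general $\phi_i$ and $X_i$ one would instead call a block-specific solver or proximal oracle. The constant $L_i^\psi(\beta_2) = 2\|A_i\|^2/\beta_2$ comes from Lemma 2.

```python
import numpy as np

def prox_map_block(Q, c, A, x_hat_i, y_star, beta2):
    # P_i(x_hat; beta2) = argmin_x { phi_i(x) + y*(x_hat;beta2)' A_i (x - x_hat_i)
    #                                + (L_i^psi(beta2)/2) ||x - x_hat_i||^2 },   cf. (24).
    L = 2.0 * np.linalg.norm(A, 2) ** 2 / beta2   # L_i^psi(beta2) from Lemma 2
    g = A.T @ y_star                              # gradient of the linearized coupling term
    n = Q.shape[0]
    return np.linalg.solve(Q + L * np.eye(n), L * x_hat_i - c - g)

def prox_map(blocks, x_hat, A_full, b, beta2):
    # P(x_hat; beta2) = (P_1, P_2) with y*(x_hat; beta2) = (A x_hat - b)/beta2.
    y_star = (A_full @ np.concatenate(x_hat) - b) / beta2
    return [prox_map_block(Q, c, A, xh, y_star, beta2)
            for (Q, c, A), xh in zip(blocks, x_hat)]
```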
First, we show in the following lemma that there exists a point $(\bar{x}, \bar{y})$ satisfying the excessive gap condition (21). The proof of this lemma can be found in the Appendix.

Lemma 4 Suppose that $x^c$ is the prox-center of $X$. For a given $\beta_2 > 0$, let:
$$ \bar{y} := \beta_2^{-1}(Ax^c - b) \quad\text{and}\quad \bar{x} := P(x^c; \beta_2). \tag{25} $$
If the parameter $\beta_1$ is chosen such that:
$$ \beta_1\beta_2 \ge 2\max_{1\le i\le 2}\frac{\|A_i\|^2}{\sigma_i}, \tag{26} $$
then $(\bar{x}, \bar{y})$ satisfies the excessive gap condition (21).
3.2 Main iteration scheme

Suppose that $(\bar{x}, \bar{y}) \in X \times \mathbb{R}^m$ satisfies the excessive gap condition (21). We generate a new point $(\bar{x}^+, \bar{y}^+) \in X \times \mathbb{R}^m$ by applying the following update scheme:
$$ (\bar{x}^+, \bar{y}^+) := \mathcal{A}_m^p(\bar{x}, \bar{y}; \beta_1, \beta_2^+, \tau) \iff
\begin{cases}
\hat{x} := (1 - \tau)\bar{x} + \tau x^*(\bar{y}; \beta_1), \\
\bar{y}^+ := (1 - \tau)\bar{y} + \tau y^*(\hat{x}; \beta_2^+), \\
\bar{x}^+ := P(\hat{x}; \beta_2^+),
\end{cases} \tag{27} $$
$$ \beta_1^+ := (1 - \tau)\beta_1 \quad\text{and}\quad \beta_2^+ := (1 - \tau)\beta_2, \tag{28} $$
where $P(\cdot; \beta_2^+) = (P_1(\cdot; \beta_2^+), P_2(\cdot; \beta_2^+))$ and $\tau \in (0, 1)$ will be chosen appropriately.


Remark 2 In the scheme (27), the points $x^*(\bar{y}; \beta_1) = (x_1^*(\bar{y}; \beta_1), x_2^*(\bar{y}; \beta_1))$, $\hat{x} = (\hat{x}_1, \hat{x}_2)$ and $\bar{x}^+ = (\bar{x}_1^+, \bar{x}_2^+)$ can be computed in parallel. To compute $x^*(\bar{y}; \beta_1)$ and $\bar{x}^+$ we need to solve two corresponding convex programs in $\mathbb{R}^{n_1}$ and $\mathbb{R}^{n_2}$, respectively.

The following theorem shows that the scheme (27)–(28) maintains the excessive gap condition (21).

Theorem 1 Suppose that $(\bar{x}, \bar{y}) \in X \times \mathbb{R}^m$ satisfies (21) with respect to two values $\beta_1 > 0$ and $\beta_2 > 0$. Then if the parameter $\tau$ is chosen such that $\tau \in (0, 1)$ and:
$$ \beta_1\beta_2 \ge \frac{2\tau^2}{(1 - \tau)^2}\max_{1\le i\le 2}\frac{\|A_i\|^2}{\sigma_i}, \tag{29} $$
then the new point $(\bar{x}^+, \bar{y}^+)$ generated by the scheme (27)–(28) is in $X \times \mathbb{R}^m$ and maintains the excessive gap condition (21) with respect to the two new values $\beta_1^+$ and $\beta_2^+$.
Proof The last line of (27) shows that $\bar{x}^+ \in X$. Let us denote $\hat{y} := y^*(\hat{x}; \beta_2^+)$. Then, by using the definition of $d(\cdot; \beta_1^+)$, the second line of (27) and $\beta_1^+ = (1 - \tau)\beta_1$, we have:
$$
\begin{aligned}
d(\bar{y}^+; \beta_1^+) &= \min_{x\in X}\big\{\phi(x) + (Ax - b)^T\bar{y}^+ + \beta_1^+\big(p_1(x_1) + p_2(x_2)\big)\big\}\\
&\overset{\text{line 2 of (27)}}{=} \min_{x\in X}\big\{\phi(x) + (1 - \tau)(Ax - b)^T\bar{y} + \tau(Ax - b)^T\hat{y} + (1 - \tau)\beta_1\big(p_1(x_1) + p_2(x_2)\big)\big\}\\
&= \min_{x\in X}\Big\{(1 - \tau)\underbrace{\big[\phi(x) + (Ax - b)^T\bar{y} + \beta_1\big(p_1(x_1) + p_2(x_2)\big)\big]}_{[\cdot]_1} + \tau\underbrace{\big[\phi(x) + (Ax - b)^T\hat{y}\big]}_{[\cdot]_2}\Big\}.
\end{aligned}
\tag{30}
$$
Now, we estimate the first term $[\cdot]_1$ in the last line of (30). Since $\beta_2^+ = (1 - \tau)\beta_2$, one has:
$$ \psi(\bar{x}; \beta_2) = \frac{1}{2\beta_2}\|A\bar{x} - b\|^2 = (1 - \tau)\frac{1}{2\beta_2^+}\|A\bar{x} - b\|^2 = (1 - \tau)\psi(\bar{x}; \beta_2^+). \tag{31} $$
Moreover, if we denote $x^1 := x^*(\bar{y}; \beta_1)$ then, by the strong convexity of $p_1$ and $p_2$, (31) and $f(\bar{x}; \beta_2) \le d(\bar{y}; \beta_1)$, we have:
$$
\begin{aligned}
[\cdot]_1 &= \phi(x) + (Ax - b)^T\bar{y} + \beta_1\big(p_1(x_1) + p_2(x_2)\big)\\
&\ge \min_{x\in X}\big\{\phi(x) + (Ax - b)^T\bar{y} + \beta_1\big(p_1(x_1) + p_2(x_2)\big)\big\} + \frac{\beta_1}{2}\big(\sigma_1\|x_1 - x_1^1\|^2 + \sigma_2\|x_2 - x_2^1\|^2\big)\\
&= d(\bar{y}; \beta_1) + \frac{\beta_1}{2}\big(\sigma_1\|x_1 - x_1^1\|^2 + \sigma_2\|x_2 - x_2^1\|^2\big)\\
&\overset{(21)}{\ge} f(\bar{x}; \beta_2) + \frac{\beta_1}{2}\big(\sigma_1\|x_1 - x_1^1\|^2 + \sigma_2\|x_2 - x_2^1\|^2\big)\\
&\overset{\text{def. } f(\cdot;\beta_2)}{=} \phi(\bar{x}) + \psi(\bar{x}; \beta_2) + \frac{\beta_1}{2}\big(\sigma_1\|x_1 - x_1^1\|^2 + \sigma_2\|x_2 - x_2^1\|^2\big)\\
&\overset{(31)}{=} \phi(\bar{x}) + \psi(\bar{x}; \beta_2^+) + \frac{\beta_1}{2}\big(\sigma_1\|x_1 - x_1^1\|^2 + \sigma_2\|x_2 - x_2^1\|^2\big) - \tau\psi(\bar{x}; \beta_2^+)\\
&\overset{(20)}{=} \phi(\bar{x}) + \psi(\hat{x}; \beta_2^+) + \nabla_x\psi(\hat{x}; \beta_2^+)^T(\bar{x} - \hat{x}) + \frac{1}{2\beta_2^+}\|A(\bar{x} - \hat{x})\|^2\\
&\qquad + \frac{\beta_1}{2}\big(\sigma_1\|x_1 - x_1^1\|^2 + \sigma_2\|x_2 - x_2^1\|^2\big) - \tau\psi(\bar{x}; \beta_2^+).
\end{aligned}
\tag{32}
$$
For the second term $[\cdot]_2$ in the last line of (30), we use the fact that $\hat{y} = \frac{1}{\beta_2^+}(A\hat{x} - b)$ and $\nabla_x\psi(\hat{x}; \beta_2^+) = A^T\hat{y}$ to obtain:
$$
\begin{aligned}
[\cdot]_2 &:= \phi(x) + (Ax - b)^T\hat{y} = \phi(x) + \hat{y}^T A(x - \hat{x}) + (A\hat{x} - b)^T\hat{y}\\
&\overset{\text{def. } \hat{y}+(17)}{=} \phi(x) + \nabla_x\psi(\hat{x}; \beta_2^+)^T(x - \hat{x}) + \frac{1}{\beta_2^+}\|A\hat{x} - b\|^2\\
&\overset{\text{def. } \psi}{=} \phi(x) + \psi(\hat{x}; \beta_2^+) + \nabla_x\psi(\hat{x}; \beta_2^+)^T(x - \hat{x}) + \psi(\hat{x}; \beta_2^+).
\end{aligned}
\tag{33}
$$
Substituting (32) and (33) into (30) and noting that $(1 - \tau)(\bar{x} - \hat{x}) + \tau(x - \hat{x}) = \tau(x - x^1)$ due to the first line of (27), we obtain:
$$
\begin{aligned}
d(\bar{y}^+; \beta_1^+) &\overset{(32)+(33)}{\ge} \min_{x\in X}\big\{(1 - \tau)[\cdot]_1 + \tau[\cdot]_2\big\}\\
&\ge \min_{x\in X}\Big\{(1 - \tau)\phi(\bar{x}) + \tau\phi(x) + \psi(\hat{x}; \beta_2^+) + \tau\nabla_x\psi(\hat{x}; \beta_2^+)^T(x - x^1)\\
&\qquad\qquad + \frac{(1 - \tau)\beta_1}{2}\big(\sigma_1\|x_1 - x_1^1\|^2 + \sigma_2\|x_2 - x_2^1\|^2\big)\Big\} + T_1,
\end{aligned}
\tag{34}
$$
where $T_1 := \frac{(1 - \tau)}{2\beta_2^+}\|A(\bar{x} - \hat{x})\|^2 + \tau\psi(\hat{x}; \beta_2^+) - \tau(1 - \tau)\psi(\bar{x}; \beta_2^+)$. Next, we note that the condition (29) is equivalent to:
$$ (1 - \tau)\beta_1\sigma_i \ge \frac{2\tau^2}{(1 - \tau)\beta_2}\|A_i\|^2 = L_i^{\psi}(\beta_2^+)\tau^2, \quad i = 1, 2. \tag{35} $$
Moreover, if we denote $u := \bar{x} + \tau(x - \bar{x})$ then:
$$ u - \hat{x} = \bar{x} + \tau(x - \bar{x}) - (1 - \tau)\bar{x} - \tau x^1 = \tau(x - x^1). \tag{36} $$
Now, by using Lemma 2, the convexity of $\phi$ (i.e. $(1 - \tau)\phi(\bar{x}) + \tau\phi(x) \ge \phi(u)$), the condition (35) and (36), the estimate (34) becomes:
$$
\begin{aligned}
d(\bar{y}^+; \beta_1^+) - T_1 &\overset{(36)}{\ge} \min_{u\in \bar{x} + \tau(X - \bar{x})}\Big\{\phi(u) + \psi(\hat{x}; \beta_2^+) + \nabla_x\psi(\hat{x}; \beta_2^+)^T(u - \hat{x})\\
&\qquad\qquad + \frac{(1 - \tau)\beta_1\sigma_1}{2\tau^2}\|u_1 - \hat{x}_1\|^2 + \frac{(1 - \tau)\beta_1\sigma_2}{2\tau^2}\|u_2 - \hat{x}_2\|^2\Big\}\\
&\overset{(35)}{\ge} \min_{u\in X}\Big\{\phi(u) + \psi(\hat{x}; \beta_2^+) + \nabla_x\psi(\hat{x}; \beta_2^+)^T(u - \hat{x})\\
&\qquad\qquad + \frac{L_1^{\psi}(\beta_2^+)}{2}\|u_1 - \hat{x}_1\|^2 + \frac{L_2^{\psi}(\beta_2^+)}{2}\|u_2 - \hat{x}_2\|^2\Big\}\\
&\overset{\text{line 3 of (27)}}{=} \phi(\bar{x}^+) + \psi(\hat{x}; \beta_2^+) + \nabla_x\psi(\hat{x}; \beta_2^+)^T(\bar{x}^+ - \hat{x})\\
&\qquad\qquad + \frac{L_1^{\psi}(\beta_2^+)}{2}\|\bar{x}_1^+ - \hat{x}_1\|^2 + \frac{L_2^{\psi}(\beta_2^+)}{2}\|\bar{x}_2^+ - \hat{x}_2\|^2\\
&\overset{(18)}{\ge} \phi(\bar{x}^+) + \psi(\bar{x}^+; \beta_2^+) = f(\bar{x}^+; \beta_2^+),
\end{aligned}
\tag{37}
$$
where we have also used $\bar{x} + \tau(X - \bar{x}) \subseteq X$ in the second inequality.

To complete the proof, we show that $T_1 \ge 0$. Indeed, let us define $\hat{u} := A\hat{x} - b$ and $\bar{u} := A\bar{x} - b$; then $\hat{u} - \bar{u} = A(\hat{x} - \bar{x})$. We have:
$$
\begin{aligned}
T_1 &\overset{\text{def. }\psi(\cdot;\beta_2)}{=} \frac{\tau}{2\beta_2^+}\|A\hat{x} - b\|^2 - \frac{\tau(1 - \tau)}{2\beta_2^+}\|A\bar{x} - b\|^2 + \frac{(1 - \tau)}{2\beta_2^+}\|A(\hat{x} - \bar{x})\|^2\\
&= \frac{1}{2\beta_2^+}\big[\tau\|\hat{u}\|^2 - \tau(1 - \tau)\|\bar{u}\|^2 + (1 - \tau)\|\hat{u} - \bar{u}\|^2\big]\\
&= \frac{1}{2\beta_2^+}\big[\tau\|\hat{u}\|^2 - \tau(1 - \tau)\|\bar{u}\|^2 + (1 - \tau)\|\hat{u}\|^2 + (1 - \tau)\|\bar{u}\|^2 - 2(1 - \tau)\hat{u}^T\bar{u}\big]\\
&= \frac{1}{2\beta_2^+}\big[\|\hat{u}\|^2 + (1 - \tau)^2\|\bar{u}\|^2 - 2(1 - \tau)\hat{u}^T\bar{u}\big]\\
&= \frac{1}{2\beta_2^+}\|\hat{u} - (1 - \tau)\bar{u}\|^2 \ge 0.
\end{aligned}
\tag{38}
$$
Substituting (38) into (37), we obtain the inequality $d(\bar{y}^+; \beta_1^+) \ge f(\bar{x}^+; \beta_2^+)$. $\square$
Remark 3 If $\phi_i$ is convex and differentiable and its gradient is Lipschitz continuous with a Lipschitz constant $L^{\phi_i} \ge 0$ for some $i \in \{1, 2\}$, then instead of using the proximal mapping $P_i(\cdot; \beta_2)$ in (27) we can use the following mapping:
$$ G_i(\hat{x}; \beta_2^+) := \arg\min_{x_i\in X_i}\Big\{\nabla\phi_i(\hat{x}_i)^T(x_i - \hat{x}_i) + y^*(\hat{x}; \beta_2)^T A_i(x_i - \hat{x}_i) + \frac{\hat{L}_i^{\psi}(\beta_2^+)}{2}\|x_i - \hat{x}_i\|^2\Big\}, \tag{39} $$
where $\hat{L}_i^{\psi}(\beta_2^+) := L^{\phi_i} + \frac{2\|A_i\|^2}{\beta_2^+}$. Indeed, let us prove the condition $d(\bar{y}^+; \beta_1^+) \ge f(\hat{\bar{x}}^+; \beta_2^+)$, where $G(\hat{x}; \beta_2) := (G_1(\hat{x}; \beta_2), G_2(\hat{x}; \beta_2))$ and $\hat{\bar{x}}^+ := G(\hat{x}; \beta_2^+)$. First, by using the convexity of $\phi_i$ and the Lipschitz continuity of its gradient, we have:
$$ \phi_i(\hat{x}_i) + \nabla\phi_i(\hat{x}_i)^T(u_i - \hat{x}_i) \le \phi_i(u_i) \le \phi_i(\hat{x}_i) + \nabla\phi_i(\hat{x}_i)^T(u_i - \hat{x}_i) + \frac{L^{\phi_i}}{2}\|u_i - \hat{x}_i\|^2. \tag{40} $$
Next, by summing up the second inequality for $i = 1, 2$ and adding it to (18), we have:
$$ \phi(u) + \psi(u; \beta_2^+) \le \phi(\hat{x}) + \psi(\hat{x}; \beta_2^+) + \big[\nabla\phi(\hat{x}) + \nabla_x\psi(\hat{x}; \beta_2^+)\big]^T(u - \hat{x}) + \frac{\hat{L}_1^{\psi}(\beta_2^+)}{2}\|u_1 - \hat{x}_1\|^2 + \frac{\hat{L}_2^{\psi}(\beta_2^+)}{2}\|u_2 - \hat{x}_2\|^2. \tag{41} $$
Finally, from the second inequality of (37) we have:
$$
\begin{aligned}
d(\bar{y}^+; \beta_1^+) - T_1 &\overset{(35)}{\ge} \min_{u\in X}\Big\{\phi(u) + \psi(\hat{x}; \beta_2^+) + \nabla_x\psi(\hat{x}; \beta_2^+)^T(u - \hat{x})\\
&\qquad\qquad + \frac{(1 - \tau)\beta_1\sigma_1}{2\tau^2}\|u_1 - \hat{x}_1\|^2 + \frac{(1 - \tau)\beta_1\sigma_2}{2\tau^2}\|u_2 - \hat{x}_2\|^2\Big\}\\
&\overset{\phi\text{-convex}+(41)}{\ge} \min_{u\in X}\Big\{\phi(\hat{x}) + \nabla\phi(\hat{x})^T(u - \hat{x}) + \psi(\hat{x}; \beta_2^+) + \nabla_x\psi(\hat{x}; \beta_2^+)^T(u - \hat{x})\\
&\qquad\qquad + \frac{\hat{L}_1^{\psi}(\beta_2^+)}{2}\|u_1 - \hat{x}_1\|^2 + \frac{\hat{L}_2^{\psi}(\beta_2^+)}{2}\|u_2 - \hat{x}_2\|^2\Big\}\\
&\overset{(39)}{=} \phi(\hat{x}) + \psi(\hat{x}; \beta_2^+) + \big[\nabla\phi(\hat{x}) + \nabla_x\psi(\hat{x}; \beta_2^+)\big]^T(\hat{\bar{x}}^+ - \hat{x})\\
&\qquad\qquad + \frac{\hat{L}_1^{\psi}(\beta_2^+)}{2}\|\hat{\bar{x}}_1^+ - \hat{x}_1\|^2 + \frac{\hat{L}_2^{\psi}(\beta_2^+)}{2}\|\hat{\bar{x}}_2^+ - \hat{x}_2\|^2\\
&\overset{(41)}{\ge} \phi(\hat{\bar{x}}^+) + \psi(\hat{\bar{x}}^+; \beta_2^+) = f(\hat{\bar{x}}^+; \beta_2^+).
\end{aligned}
$$
In this case, the conclusion of Theorem 1 is still valid for the substitution $\hat{\bar{x}}^+ := G(\hat{x}; \beta_2^+)$ provided that:
$$ \frac{(1 - \tau)}{\tau^2}\beta_1\sigma_i \ge L^{\phi_i} + \frac{2\|A_i\|^2}{(1 - \tau)\beta_2}, \quad i = 1, 2. \tag{42} $$
If $X_i$ is polytopic then problem (39) becomes a convex quadratic program.
3.3 The step size update rule

Next we show how to update the parameter $\tau$ such that the conditions (26) and (29) hold. From the update rule (28) we have $\beta_1^+\beta_2^+ = (1 - \tau)^2\beta_1\beta_2$. Suppose that $\beta_1$ and $\beta_2$ satisfy the condition (29), i.e.:
$$ \beta_1\beta_2 \ge \frac{\tau^2}{(1 - \tau)^2}\bar{L}^2, \quad\text{where } \bar{L} := \Big[2\max_{1\le i\le 2}\frac{\|A_i\|^2}{\sigma_i}\Big]^{1/2}. $$
If we substitute $\beta_1$ and $\beta_2$ by $\beta_1^+$ and $\beta_2^+$, respectively, in this inequality then we have $\beta_1^+\beta_2^+ \ge \frac{\tau_+^2}{(1 - \tau_+)^2}\bar{L}^2$. However, since $\beta_1^+\beta_2^+ = (1 - \tau)^2\beta_1\beta_2$, it suffices that $\beta_1\beta_2 \ge \frac{\tau_+^2}{(1 - \tau)^2(1 - \tau_+)^2}\bar{L}^2$. Therefore, if $\frac{\tau^2}{(1 - \tau)^2} \ge \frac{\tau_+^2}{(1 - \tau)^2(1 - \tau_+)^2}$ then $\beta_1^+$ and $\beta_2^+$ satisfy (29). This condition leads to $\tau \ge \frac{\tau_+}{1 - \tau_+}$. Since $\tau, \tau_+ \in (0, 1)$, the last inequality implies:
$$ 0 < \tau_+ \le \frac{\tau}{\tau + 1} < 1. \tag{43} $$
Hence, (27)–(28) are well-defined. At the first iteration $k = 0$, both conditions (26) and (29) need to be satisfied. This leads to $0 < \tau_0 \le 0.5$.

Now, we define a rule to update the step size parameter $\tau$.



Lemma 5 Suppose that $\tau_0$ is arbitrarily chosen in $(0, \tfrac{1}{2}]$. Then the sequence $\{\tau_k\}_{k\ge 0}$ generated by:
$$ \tau_{k+1} := \frac{\tau_k}{\tau_k + 1} \tag{44} $$
satisfies the following equality:
$$ \tau_k = \frac{\tau_0}{1 + \tau_0 k}, \quad \forall k \ge 0. \tag{45} $$
Moreover, the sequence $\{\beta_k\}_{k\ge 0}$ generated by $\beta_{k+1} = (1 - \tau_k)\beta_k$ for fixed $\beta_0 > 0$ satisfies:
$$ \beta_k = \frac{\beta_0}{\tau_0 k + 1}, \quad \forall k \ge 0. \tag{46} $$

Proof If we denote $t := \frac{1}{\tau}$ and consider the function $\xi(t) := t + 1$, then the sequence $\{t_k\}_{k\ge 0}$ generated by the rule $t_{k+1} := \xi(t_k) = t_k + 1$ satisfies $t_k = t_0 + k$ for all $k \ge 0$. Hence $\tau_k = \frac{1}{t_k} = \frac{1}{t_0 + k} = \frac{\tau_0}{1 + \tau_0 k}$ for all $k \ge 0$. To prove (46), we observe that $\beta_{k+1} = \beta_0\prod_{i=0}^{k}(1 - \tau_i)$. Hence, by substituting (45) into the last equality and carrying out some simple calculations, we get (46). $\square$

Remark 4 Since $\tau_0 \in (0, 0.5]$, from Lemma 5 we see that with $\tau_0 := 0.5$ the right-hand side estimate of (46) is minimized. In this case, the update rule of $\tau_k$ is simplified to $\tau_k := \frac{1}{k+2}$ for $k \ge 0$.
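A quick numerical check (not from the paper) that iterating the rule (44) from $\tau_0$ reproduces the closed form (45):

```python
tau0 = 0.5
tau = tau0
for k in range(20):
    assert abs(tau - tau0 / (1.0 + tau0 * k)) < 1e-12   # closed form (45)
    tau = tau / (tau + 1.0)                             # update rule (44)
```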
3.4 The algorithm and its worst case complexity

Now, we combine the results of Lemma 4, Theorem 1 and Lemma 5 in order to build the following algorithm.

Algorithm 1 (Decomposition algorithm with two primal steps)

Initialization: Perform the following steps:
1. Set $\tau_0 := 0.5$. Choose $\beta_1^0 > 0$ and $\beta_2^0 > 0$ as follows:
$$ \beta_1^0 = \beta_2^0 := \Big[2\max_{1\le i\le 2}\frac{\|A_i\|^2}{\sigma_i}\Big]^{1/2}. $$
2. Compute $\bar{x}^0$ and $\bar{y}^0$ from (25) as:
$$ \bar{y}^0 := (\beta_2^0)^{-1}(Ax^c - b) \quad\text{and}\quad \bar{x}^0 := P(x^c; \beta_2^0). $$

Iteration: For $k = 0, 1, \dots$, perform the following steps:
1. If a given stopping criterion is satisfied then terminate.
2. Update the smoothness parameter $\beta_2^{k+1} := (1 - \tau_k)\beta_2^k$.
3. Compute $\bar{x}_i^{k+1}$ in parallel for $i = 1, 2$ and $\bar{y}^{k+1}$ by the scheme (27):
$$ (\bar{x}^{k+1}, \bar{y}^{k+1}) := \mathcal{A}_m^p(\bar{x}^k, \bar{y}^k; \beta_1^k, \beta_2^{k+1}, \tau_k). $$
4. Update the smoothness parameter $\beta_1^{k+1} := (1 - \tau_k)\beta_1^k$.
5. Update the step size parameter $\tau_k$ by $\tau_{k+1} := \frac{1}{k+3}$.

End.
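A minimal end-to-end sketch of Algorithm 1 for the toy quadratic instance, assuming the Euclidean prox-functions $p_i(x_i) = \tfrac{1}{2}\|x_i - x_i^c\|^2$ (so $\sigma_i = 1$) and reusing the hypothetical helpers prox_map and two_primal_steps from the earlier sketches; the stopping test here is a simplified feasibility check rather than the full criterion of Sect. 6.3.

```python
import numpy as np

def algorithm1(blocks, A_full, b, centers, iters=500, feas_tol=1e-6):
    # Initialization: tau0 = 0.5 and beta1^0 = beta2^0 = [2 max_i ||A_i||^2/sigma_i]^{1/2}.
    L_bar = np.sqrt(2.0 * max(np.linalg.norm(A, 2) ** 2 for (_, _, A) in blocks))
    beta1 = beta2 = L_bar
    tau = 0.5
    # Starting point (25): y^0 = (A x^c - b)/beta2^0,  x^0 = P(x^c; beta2^0).
    y_bar = (A_full @ np.concatenate(centers) - b) / beta2
    x_bar = prox_map(blocks, centers, A_full, b, beta2)
    for k in range(iters):
        # Steps 2-4: beta2 update, scheme (27), beta1 update (folded into two_primal_steps).
        x_bar, y_bar, beta1, beta2 = two_primal_steps(
            x_bar, y_bar, beta1, beta2, tau, blocks, centers, A_full, b)
        tau = 1.0 / (k + 3.0)      # Step 5: tau_{k+1} = 1/(k+3), i.e. tau_k = 1/(k+2)
        if np.linalg.norm(A_full @ np.concatenate(x_bar) - b) < feas_tol:
            break
    return x_bar, y_bar

# Example call on the toy instance (centers = prox-centers x_i^c, here the origin):
# centers = [np.zeros(4), np.zeros(3)]
# x_opt, y_opt = algorithm1(blocks, A_full, b, centers)
```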
As mentioned in Remark 2, there are two steps in the scheme $\mathcal{A}_m^p$ of Algorithm 1 that can be parallelized. The first step is finding $x^*(\bar{y}^k; \beta_1)$ and the second one is computing $\bar{x}^{k+1}$. In general, both steps require solving two convex programming subproblems in parallel. The stopping criterion of Algorithm 1 will be discussed in Sect. 6.

Since the dual solution set $Y^*$ is compact due to Assumption 1 and the conclusions of Lemma 3 hold for any $y^* \in Y^*$, we can define the following quantities:
$$ D_X := D_1 + D_2, \qquad D_{Y^*} := \min_{y^*\in Y^*}\|y^*\| \qquad\text{and}\qquad D_Y := \min_{y^*\in Y^*}\Big[\|y^*\| + \big(\|y^*\|^2 + 2(D_1 + D_2)\big)^{1/2}\Big]. \tag{47} $$
The next theorem provides the worst-case complexity estimate for Algorithm 1.

Theorem 2 Let $\{(\bar{x}^k, \bar{y}^k)\}$ be a sequence generated by Algorithm 1. Then the following duality and feasibility gap estimates hold:
$$ -D_{Y^*}\|A\bar{x}^k - b\| \le \phi(\bar{x}^k) - d(\bar{y}^k) \le \frac{2\bar{L}D_X}{k+2}, \tag{48} $$
and
$$ \|A\bar{x}^k - b\| \le \frac{2\bar{L}D_Y}{k+2}, \tag{49} $$
where $\bar{L} := \big[2\max_{1\le i\le 2}\frac{\|A_i\|^2}{\sigma_i}\big]^{1/2}$ and $D_X$, $D_{Y^*}$ and $D_Y$ are defined in (47).

Proof By the choice of $\beta_1^0 = \beta_2^0 = \bar{L}$ and Step 1 in the initialization phase of Algorithm 1, we see that $\beta_1^k = \beta_2^k$ for all $k \ge 0$. Moreover, since $\tau_0 = 0.5$, by Lemma 5 we have $\beta_1^k = \beta_2^k = \frac{\beta_1^0}{\tau_0 k + 1} = \frac{\bar{L}}{0.5k + 1}$. Now, by applying Lemma 3 with $\beta_1$ and $\beta_2$ equal to $\beta_1^k$ and $\beta_2^k$, respectively, we obtain the estimates (48) and (49). $\square$

Remark 5 The worst-case complexity of Algorithm 1 is $O(\frac{1}{\varepsilon})$. However, the constants in the estimates (48) and (49) also depend on the choices of $\beta_1^0$ and $\beta_2^0$, which satisfy the condition (26). The values of $\beta_1^0$ and $\beta_2^0$ will affect the accuracy of the duality and feasibility gaps.

4 Switching decomposition algorithm

In this section, we apply a switching strategy to obtain a new variant of the first algorithm proposed in [27, Algorithm 1] for solving problem (2). This scheme alternately switches between a two primal steps scheme and a two dual steps scheme depending on whether the iteration counter k is even or odd.
4.1 The gradient mapping of the smoothed dual function

The smoothed dual function $d(\cdot; \beta_1)$ is Lipschitz continuously differentiable on $\mathbb{R}^m$ (see Lemma 1). We define the following mapping:
$$ G(\hat{y}; \beta_1) := \arg\max_{y\in\mathbb{R}^m}\Big\{\nabla_y d(\hat{y}; \beta_1)^T(y - \hat{y}) - \frac{L^d(\beta_1)}{2}\|y - \hat{y}\|^2\Big\}, \tag{50} $$
where $L^d(\beta_1) := \frac{\|A_1\|^2}{\beta_1\sigma_1} + \frac{\|A_2\|^2}{\beta_1\sigma_2}$ and $\nabla_y d(\hat{y}; \beta_1) = A_1 x_1^*(\hat{y}; \beta_1) + A_2 x_2^*(\hat{y}; \beta_1) - b$. This problem can be solved explicitly to obtain the unique solution:
$$ G(\hat{y}; \beta_1) = \hat{y} + L^d(\beta_1)^{-1}\big(Ax^*(\hat{y}; \beta_1) - b\big). \tag{51} $$

4.2 A decomposition scheme with switching primal-dual steps

First, we adapt the scheme (27)–(28) to the switching primal-dual framework. Suppose that the pair $(\bar{x}, \bar{y}) \in X \times \mathbb{R}^m$ satisfies the excessive gap condition (21). The two primal steps scheme computes $(\bar{x}^+, \bar{y}^+)$ as follows:
$$ (\bar{x}^+, \bar{y}^+) := \mathcal{A}^p(\bar{x}, \bar{y}; \beta_1, \beta_2, \tau) \iff
\begin{cases}
\hat{x} := (1 - \tau)\bar{x} + \tau x^*(\bar{y}; \beta_1), \\
\bar{y}^+ := (1 - \tau)\bar{y} + \tau y^*(\hat{x}; \beta_2), \\
\bar{x}^+ := P(\hat{x}; \beta_2),
\end{cases} \tag{52} $$
and then updates $\beta_1^+ := (1 - \tau)\beta_1$, where $\tau \in (0, 1)$ and $P(\cdot; \beta_2)$ is defined in (24). The difference between the schemes $\mathcal{A}_m^p$ and $\mathcal{A}^p$ is that the parameter $\beta_2$ is fixed in $\mathcal{A}^p$.

Symmetrically, the two dual steps scheme computes $(\bar{x}^+, \bar{y}^+)$ as:
$$ (\bar{x}^+, \bar{y}^+) := \mathcal{A}^d(\bar{x}, \bar{y}; \beta_1, \beta_2, \tau) \iff
\begin{cases}
\hat{y} := (1 - \tau)\bar{y} + \tau y^*(\bar{x}; \beta_2), \\
\bar{x}^+ := (1 - \tau)\bar{x} + \tau x^*(\hat{y}; \beta_1), \\
\bar{y}^+ := G(\hat{y}; \beta_1),
\end{cases} \tag{53} $$
where $\tau \in (0, 1)$. The parameter $\beta_1$ is kept unchanged, while $\beta_2$ is updated by $\beta_2^+ := (1 - \tau)\beta_2$.

The following result shows that $(\bar{x}^+, \bar{y}^+)$ generated either by $\mathcal{A}^p$ or by $\mathcal{A}^d$ maintains the excessive gap condition (21).



Lemma 6 Suppose that $(\bar{x}, \bar{y}) \in X \times \mathbb{R}^m$ satisfies (21) with respect to two values $\beta_1$ and $\beta_2$. Then if the parameter $\tau$ is chosen such that $\tau \in (0, 1)$ and:
$$ \beta_1\beta_2 \ge \frac{2\tau^2}{1 - \tau}\max_{1\le i\le 2}\frac{\|A_i\|^2}{\sigma_i}, \tag{54} $$
then the new point $(\bar{x}^+, \bar{y}^+)$ generated either by the scheme $\mathcal{A}^p$ or by $\mathcal{A}^d$ is in $X \times \mathbb{R}^m$ and maintains the excessive gap condition (21) with respect to either the two new values $\beta_1^+$ and $\beta_2$ or $\beta_1$ and $\beta_2^+$.

The proof of this lemma is quite similar to that of [27, Theorem 4.2], so we omit it here.
Remark 6 Given $\beta_1 > 0$, we can choose $\beta_2 > 0$ such that the condition (26) holds. We compute a point $(\bar{x}^0, \bar{y}^0)$ as:
$$ \bar{x}^0 := x^*(y^c; \beta_1) \quad\text{and}\quad \bar{y}^0 := G(y^c; \beta_1) = L^d(\beta_1)^{-1}(A\bar{x}^0 - b). \tag{55} $$
Then, similarly to (25), the point $(\bar{x}^0, \bar{y}^0)$ satisfies (21). Therefore, we can use this point as a starting point for Algorithm 2 below.

In Algorithm 2 below, we apply either the scheme $\mathcal{A}^p$ or $\mathcal{A}^d$ by using the following rule:

Rule A If the iteration counter $k$ is even then apply $\mathcal{A}^p$. Otherwise, $\mathcal{A}^d$ is used.

Now, we provide an update rule to generate a sequence $\{\tau_k\}$ such that the condition (54) holds. Let $\bar{L}^2 := 2\max_{1\le i\le 2}\frac{\|A_i\|^2}{\sigma_i}$. Suppose that at iteration $k$ the condition (54) holds, i.e.:
$$ \beta_1^k\beta_2^k \ge \frac{\tau_k^2}{1 - \tau_k}\bar{L}^2. \tag{56} $$
Note that, at iteration $k + 1$, we either update $\beta_1^k$ or $\beta_2^k$. Thus we have $\beta_1^{k+1}\beta_2^{k+1} = (1 - \tau_k)\beta_1^k\beta_2^k$. However, since the condition (56) holds, we have $(1 - \tau_k)\beta_1^k\beta_2^k \ge \tau_k^2\bar{L}^2$. Now, we require that the condition (54) be satisfied with $\beta_1^{k+1}$ and $\beta_2^{k+1}$, i.e.:
$$ \beta_1^{k+1}\beta_2^{k+1} \ge \frac{\tau_{k+1}^2}{1 - \tau_{k+1}}\bar{L}^2. $$
This condition holds if $\tau_k^2\bar{L}^2 \ge \frac{\tau_{k+1}^2}{1 - \tau_{k+1}}\bar{L}^2$, which leads to $\tau_{k+1}^2 + \tau_k^2\tau_{k+1} - \tau_k^2 \le 0$. Since $\tau_k, \tau_{k+1} \in (0, 1)$, we obtain $0 < \tau_{k+1} \le \frac{\tau_k}{2}\big[\sqrt{\tau_k^2 + 4} - \tau_k\big] < \tau_k$. The tightest rule for updating $\tau_k$ is:
$$ \tau_{k+1} := \frac{\tau_k}{2}\Big[\sqrt{\tau_k^2 + 4} - \tau_k\Big], \tag{57} $$


for all $k \ge 0$ and $\tau_0 \in (0, 1)$ given. Associated with $\{\tau_k\}$, we generate two sequences $\{\beta_1^k\}$ and $\{\beta_2^k\}$ as:
$$ \beta_1^{k+1} := \begin{cases} (1 - \tau_k)\beta_1^k & \text{if } k \text{ is even},\\ \beta_1^k & \text{otherwise}, \end{cases}
\qquad\text{and}\qquad
\beta_2^{k+1} := \begin{cases} \beta_2^k & \text{if } k \text{ is even},\\ (1 - \tau_k)\beta_2^k & \text{otherwise}, \end{cases} \tag{58} $$
where $\beta_1^0 = \beta_2^0 > 0$ are fixed.
Lemma 7 Let $\{\tau_k\}$, $\{\beta_1^k\}$ and $\{\beta_2^k\}$ be the three sequences generated by (57) and (58), respectively. Then:
$$ \frac{\beta_2^0\sqrt{1 - \tau_0}}{2\tau_0 k + 1} < \beta_1^k < \frac{(1 - \tau_0)\beta_1^0}{\tau_0 k}, \quad\text{and}\quad \frac{2\beta_1^0\sqrt{1 - \tau_0}}{2\tau_0 k + 1} < \beta_2^k < \frac{2\beta_2^0}{\tau_0 k}, \tag{59} $$
for all $k \ge 1$, provided that $\beta_1^0 = \beta_2^0 > 0$.
The proof of this lemma can be found in the Appendix.
Remark 7 We can see that the function $\eta_k(\tau_0)$ appearing on the right-hand side of (59) is decreasing on $(0, 1)$ for $k \ge 1$. Therefore, we can choose $\tau_0$ as large as possible in order to minimize $\eta_k(\cdot)$ on $(0, 1)$. By using the conditions (26) and (54), we can derive the optimal value of $\tau_0$ as $\tau_0 := \frac{\sqrt{5} - 1}{2} \approx 0.618$.
optimal value of τ0 as τ0 := 2 ≈ 0.618.

4.3 The algorithm and its worst-case complexity

Now, we combine the results of Remark 6, Lemmas 6 and 7 to build the following algorithm.

Algorithm 2 (Decomposition algorithm with switching primal-dual steps)

Initialization: Perform the following steps:
1. Choose $\tau_0 := 0.5(\sqrt{5} - 1)$ and set $\beta_1^0 = \beta_2^0 := \big[2\max_{1\le i\le 2}\frac{\|A_i\|^2}{\sigma_i}\big]^{1/2}$.
2. Compute $\bar{x}^0$ and $\bar{y}^0$ as:
$$ \bar{x}^0 := x^*(y^c; \beta_1^0) \quad\text{and}\quad \bar{y}^0 := L^d(\beta_1^0)^{-1}(A\bar{x}^0 - b). $$

Iteration: For $k = 0, 1, \dots$, perform the following steps:
1. If a given stopping criterion is satisfied then terminate.
2. If $k$ is even then:
   2a. Compute $(\bar{x}^{k+1}, \bar{y}^{k+1})$ as: $(\bar{x}^{k+1}, \bar{y}^{k+1}) := \mathcal{A}^p(\bar{x}^k, \bar{y}^k; \beta_1^k, \beta_2^k, \tau_k)$.
   2b. Update the smoothness parameter $\beta_1^k$ as $\beta_1^{k+1} := (1 - \tau_k)\beta_1^k$.
3. Otherwise, i.e. if $k$ is odd then:
   3a. Compute $(\bar{x}^{k+1}, \bar{y}^{k+1})$ as: $(\bar{x}^{k+1}, \bar{y}^{k+1}) := \mathcal{A}^d(\bar{x}^k, \bar{y}^k; \beta_1^k, \beta_2^k, \tau_k)$.
   3b. Update the smoothness parameter $\beta_2^k$ as $\beta_2^{k+1} := (1 - \tau_k)\beta_2^k$.
4. Update the step size parameter $\tau_k$ as: $\tau_{k+1} := \frac{\tau_k}{2}\big[(\tau_k^2 + 4)^{1/2} - \tau_k\big]$.

End.
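A sketch of the switching loop of Algorithm 2 for the toy instance, again assuming Euclidean prox-functions ($\sigma_i = 1$, $y^c = 0$) and reusing the hypothetical helpers x_star_smoothed and prox_map; even iterations perform the two primal steps scheme $\mathcal{A}^p$ and update $\beta_1$, odd iterations perform the two dual steps scheme $\mathcal{A}^d$ and update $\beta_2$.

```python
import numpy as np

def algorithm2(blocks, A_full, b, centers, iters=500):
    norms_sq = [np.linalg.norm(A, 2) ** 2 for (_, _, A) in blocks]
    L_bar = np.sqrt(2.0 * max(norms_sq))
    beta1 = beta2 = L_bar
    tau = 0.5 * (np.sqrt(5.0) - 1.0)
    # Starting point (55) with y^c = 0: x^0 = x*(y^c; beta1), y^0 = G(y^c; beta1).
    yc = np.zeros_like(b)
    x_bar = [x_star_smoothed(Q, c, A, yc, beta1, xc)
             for (Q, c, A), xc in zip(blocks, centers)]
    y_bar = yc + (A_full @ np.concatenate(x_bar) - b) / (sum(norms_sq) / beta1)
    for k in range(iters):
        if k % 2 == 0:
            # Scheme A^p (52): x_hat, y_bar^+, x_bar^+ with beta2 fixed, then shrink beta1.
            x_hat = [(1 - tau) * xb + tau * x_star_smoothed(Q, c, A, y_bar, beta1, xc)
                     for (Q, c, A), xb, xc in zip(blocks, x_bar, centers)]
            y_bar = (1 - tau) * y_bar + tau * (A_full @ np.concatenate(x_hat) - b) / beta2
            x_bar = prox_map(blocks, x_hat, A_full, b, beta2)
            beta1 *= (1 - tau)
        else:
            # Scheme A^d (53): y_hat, x_bar^+, y_bar^+ = G(y_hat; beta1), then shrink beta2.
            y_hat = (1 - tau) * y_bar + tau * (A_full @ np.concatenate(x_bar) - b) / beta2
            x_new = [x_star_smoothed(Q, c, A, y_hat, beta1, xc)
                     for (Q, c, A), xc in zip(blocks, centers)]
            x_bar = [(1 - tau) * xb + tau * xn for xb, xn in zip(x_bar, x_new)]
            Ld = sum(norms_sq) / beta1       # L^d(beta1) with sigma_i = 1
            y_bar = y_hat + (A_full @ np.concatenate(x_new) - b) / Ld
            beta2 *= (1 - tau)
        tau = 0.5 * tau * (np.sqrt(tau ** 2 + 4.0) - tau)   # rule (57)
    return x_bar, y_bar
```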
The main steps of Algorithm 2 are Steps 2a and 3a, which require us to compute either the two primal steps scheme $\mathcal{A}^p$ or the two dual steps scheme $\mathcal{A}^d$. In $\mathcal{A}^p$, we need to solve two convex subproblem pairs in parallel, while $\mathcal{A}^d$ only requires solving one convex subproblem pair in parallel. The following theorem shows the convergence of this algorithm.
Theorem 3 Let the sequence $\{(\bar{x}^k, \bar{y}^k)\}_{k\ge 0}$ be generated by Algorithm 2. Then the duality and feasibility gaps satisfy:
$$ -D_{Y^*}\|A\bar{x}^{k+1} - b\| \le \phi(\bar{x}^{k+1}) - d(\bar{y}^{k+1}) \le \frac{2\bar{L}D_X}{k+1}, \tag{60} $$
and
$$ \|A\bar{x}^{k+1} - b\| \le \frac{(\sqrt{5} + 1)\bar{L}D_Y}{k+1}, \tag{61} $$
where $\bar{L} := \big[2\max_{1\le i\le 2}\frac{\|A_i\|^2}{\sigma_i}\big]^{1/2}$ and $D_X$, $D_{Y^*}$ and $D_Y$ are defined in (47).

Proof The conclusion of this theorem follows directly from Lemmas 3 and 7, the choice $\tau_0 = \frac{\sqrt{5}-1}{2}$, $\beta_1^0 = \beta_2^0 = \bar{L}$, and the fact that $\beta_1^k \le \beta_2^k$. $\square$
Remark 8 Note that the worst-case complexity of Algorithm 2 is still $O(\frac{1}{\varepsilon})$. However, the constant factor in the complexity estimate (48) is the same as the one in (60), while the constant factor in (49) is smaller than the one in (61). As we discuss in Sect. 6 below, the rate of decrease of $\tau_k$ in Algorithm 2 is smaller than two times the rate of $\tau_k$ in Algorithm 1. Consequently, the sequences $\{\beta_1^k\}$ and $\{\beta_2^k\}$ generated by Algorithm 1 approach zero faster than the ones generated by Algorithm 2.
Remark 9 Note that the roles of the schemes $\mathcal{A}^p$ and $\mathcal{A}^d$ in Algorithm 2 can be exchanged. Therefore, Algorithm 2 can be modified as follows to obtain a symmetric variant:
1. At Step 2 of the initialization phase, we use (25) to compute $\bar{x}^0$ and $\bar{y}^0$ instead of (55).
2. At Step 2a, we apply $\mathcal{A}^p$ if the iteration counter $k$ is odd. Otherwise, we use $\mathcal{A}^d$ at Step 3a.
3. At Step 2b, $\beta_2^k$ is updated if $k$ is odd. Otherwise, $\beta_1^k$ is updated at Step 3b.
5 Application to strongly convex programming problems

If $\phi_i$ ($i = 1, 2$) in (2) is strongly convex, then the convergence rate of the dual scheme (53) can be accelerated up to $O(\frac{1}{k^2})$.

Suppose that $\phi_i$ is strongly convex with a convexity parameter $\sigma_i > 0$ ($i = 1, 2$). Then the function $d$ defined by (5) is well-defined, concave and differentiable. Moreover, its gradient is given by:
$$ \nabla d(y) = A_1 x_1^*(y) + A_2 x_2^*(y) - b, \tag{62} $$
which is Lipschitz continuous with a Lipschitz constant $L^d := \frac{\|A_1\|^2}{\sigma_1} + \frac{\|A_2\|^2}{\sigma_2}$. The excessive gap condition (21) in this case becomes:
$$ f(\bar{x}; \beta_2) \le d(\bar{y}), \tag{63} $$
for given $\bar{x} \in X$, $\bar{y} \in \mathbb{R}^m$ and $\beta_2 > 0$. From Lemma 3 we conclude that if the point $(\bar{x}, \bar{y})$ satisfies (63) then, for a given $y^* \in Y^*$, the following estimates hold:
$$ -2\beta_2\|y^*\|^2 \le \phi(\bar{x}) - d(\bar{y}) \le 0, \tag{64} $$
and
$$ \|A\bar{x} - b\| \le 2\beta_2\|y^*\|. \tag{65} $$


We now adapt the scheme (53) to this special case. Suppose that $(\bar{x}, \bar{y}) \in X \times \mathbb{R}^m$ satisfies (63); we generate a new pair $(\bar{x}^+, \bar{y}^+)$ as:
$$ (\bar{x}^+, \bar{y}^+) := \mathcal{A}^d_s(\bar{x}, \bar{y}; \beta_2, \tau) \iff
\begin{cases}
\hat{y} := (1 - \tau)\bar{y} + \tau y^*(\bar{x}; \beta_2), \\
\bar{x}^+ := (1 - \tau)\bar{x} + \tau x^*(\hat{y}), \\
\bar{y}^+ := \hat{y} + \frac{1}{L^d}\big(Ax^*(\hat{y}) - b\big),
\end{cases} \tag{66} $$
where $y^*(\bar{x}; \beta_2) = \frac{1}{\beta_2}(A\bar{x} - b)$, and $x^*(y) := (x_1^*(y), x_2^*(y))$ is the solution of the minimization problem in (5). The parameter $\beta_2$ is updated by $\beta_2^+ := (1 - \tau)\beta_2$ and $\tau \in (0, 1)$ will be chosen appropriately.

The following lemma shows that $(\bar{x}^+, \bar{y}^+)$ generated by (66) satisfies (63); its proof can be found in [27].
Lemma 8 Suppose that the point $(\bar{x}, \bar{y}) \in X \times \mathbb{R}^m$ satisfies the excessive gap condition (63) with the value $\beta_2 > 0$. Then if the parameter $\tau$ is chosen such that $\tau \in (0, 1)$ and:
$$ \beta_2 \ge \frac{\tau^2 L^d}{1 - \tau}, \tag{67} $$
then the new point $(\bar{x}^+, \bar{y}^+)$ computed by (66) is in $X \times \mathbb{R}^m$ and also satisfies (63) with the new parameter value $\beta_2^+$.
Now, let us derive the rule to update the parameter $\tau$. Suppose that $\beta_2$ satisfies (67). Since $\beta_2^+ = (1 - \tau)\beta_2$, the condition (67) holds for $\beta_2^+$ if $\tau^2 \ge \frac{\tau_+^2}{1 - \tau_+}$. Therefore, similarly to Algorithm 2, we update the parameter $\tau$ by using the rule (57). The conclusion of Lemma 7 still holds for this case.

Before presenting the algorithm, it is necessary to find a starting point $(\bar{x}^0, \bar{y}^0)$ which satisfies (63). Let $\beta_2 := L^d$. We compute $(\bar{x}^0, \bar{y}^0)$ as:
$$ \bar{x}^0 := x^*(y^c) \quad\text{and}\quad \bar{y}^0 := \frac{1}{L^d}(A\bar{x}^0 - b). \tag{68} $$
It follows from [27, Lemma 7.4] that $(\bar{x}^0, \bar{y}^0)$ satisfies the excessive gap condition (63).
Finally, the decomposition algorithm for solving the strongly convex programming problem of the form (2) is described in detail as follows.

Algorithm 3 (Decomposition algorithm for strongly convex objective functions)

Initialization: Perform the following steps:
1. Choose $\tau_0 := 0.5(\sqrt{5} - 1)$. Set $\beta_2^0 := \frac{\|A_1\|^2}{\sigma_1} + \frac{\|A_2\|^2}{\sigma_2}$.
2. Compute $\bar{x}^0$ and $\bar{y}^0$ as:
$$ \bar{x}^0 := x^*(y^c) \quad\text{and}\quad \bar{y}^0 := \frac{1}{L^d}(A\bar{x}^0 - b). $$

Iteration: For $k = 0, 1, \dots$, perform the following steps:
1. If a given stopping criterion is satisfied then terminate.
2. Compute $(\bar{x}^{k+1}, \bar{y}^{k+1})$ using scheme (66): $(\bar{x}^{k+1}, \bar{y}^{k+1}) := \mathcal{A}^d_s(\bar{x}^k, \bar{y}^k; \beta_2^k, \tau_k)$.
3. Update the smoothness parameter as: $\beta_2^{k+1} := (1 - \tau_k)\beta_2^k$.
4. Update the step size parameter $\tau_k$ as: $\tau_{k+1} := \frac{\tau_k}{2}\big[(\tau_k^2 + 4)^{1/2} - \tau_k\big]$.

End.
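A sketch of Algorithm 3 for the toy instance, where the strong convexity of the quadratic $\phi_i$ is exploited directly (no prox-term is added); it reuses the hypothetical helper x_star_block from the dual-decomposition sketch, and sigmas holds the strong convexity parameters $\sigma_i$ of $\phi_i$ (for the toy blocks, the smallest eigenvalues of $Q_i$).

```python
import numpy as np

def algorithm3(blocks, A_full, b, sigmas, iters=500):
    # L^d = sum_i ||A_i||^2 / sigma_i  (Lipschitz constant of grad d, cf. Sect. 5).
    Ld = sum(np.linalg.norm(A, 2) ** 2 / s for (_, _, A), s in zip(blocks, sigmas))
    beta2 = Ld
    tau = 0.5 * (np.sqrt(5.0) - 1.0)
    # Starting point (68) with y^c = 0.
    yc = np.zeros_like(b)
    x_bar = [x_star_block(Q, c, A, yc) for (Q, c, A) in blocks]
    y_bar = (A_full @ np.concatenate(x_bar) - b) / Ld
    for k in range(iters):
        # Scheme (66): two dual steps with the fixed constant L^d.
        y_hat = (1 - tau) * y_bar + tau * (A_full @ np.concatenate(x_bar) - b) / beta2
        x_new = [x_star_block(Q, c, A, y_hat) for (Q, c, A) in blocks]
        x_bar = [(1 - tau) * xb + tau * xn for xb, xn in zip(x_bar, x_new)]
        y_bar = y_hat + (A_full @ np.concatenate(x_new) - b) / Ld
        beta2 *= (1 - tau)
        tau = 0.5 * tau * (np.sqrt(tau ** 2 + 4.0) - tau)   # rule (57)
    return x_bar, y_bar

# Example call on the toy instance: algorithm3(blocks, A_full, b, sigmas=[1.0, 2.0])
```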
The convergence of Algorithm 3 is stated in Theorem 4 below.

Theorem 4 Let $\{(\bar{x}^k, \bar{y}^k)\}_{k\ge 0}$ be a sequence generated by Algorithm 3. Then the following duality and feasibility gap estimates hold:
$$ -\frac{4L^d D_{Y^*}^2}{(k+2)^2} \le \phi(\bar{x}^k) - d(\bar{y}^k) \le 0, \tag{69} $$
and
$$ \|A\bar{x}^k - b\| \le \frac{4L^d D_{Y^*}}{(k+2)^2}, \tag{70} $$
where $L^d := \frac{\|A_1\|^2}{\sigma_1} + \frac{\|A_2\|^2}{\sigma_2}$ and $D_{Y^*}$ is defined in (47).

Proof From the update rule of $\tau_k$, we have $(1 - \tau_{k+1}) = \frac{\tau_{k+1}^2}{\tau_k^2}$. Moreover, since $\beta_2^{k+1} = (1 - \tau_k)\beta_2^k$, this implies that
$$ \beta_2^{k+1} = \beta_2^0\prod_{i=0}^{k}(1 - \tau_i) = \frac{\beta_2^0(1 - \tau_0)}{\tau_0^2}\tau_k^2. $$
By using the inequalities (78) and $\beta_2^0 = L^d$, we have $\beta_2^{k+1} < \frac{4L^d(1 - \tau_0)}{(\tau_0 k + 2)^2}$. With $\tau_0 = 0.5(\sqrt{5} - 1)$, one has $\beta_2^k < \frac{4L^d}{(k+2)^2}$. By substituting this inequality into (64) and (65), we obtain (69) and (70), respectively. $\square$

Theorem 4 shows that the worst-case complexity of Algorithm 3 is $O\big(\sqrt{4L^d D_{Y^*}^2/\varepsilon}\big)$. Moreover, at each iteration of this algorithm, only two convex problems need to be solved in parallel.
6 Discussion on implementation and theoretical comparison

In order to apply Algorithm 1, 2 or 3 to solve problem (1), we need to choose a prox-function for each feasible set $X_i$, $i = 1, \dots, M$. The simplest prox-function is $p_i(x_i) := \frac{1}{2}\|x_i - x_i^c\|^2$ for a given $x_i^c \in X_i$. However, in some applications, we can choose an appropriate prox-function that captures the structure of the feasible set $X_i$. In (24), we have used the Euclidean distance to construct the proximal terms. In principle, we can use a generalized Bregman distance instead of the Euclidean distance; see [26] for more details.
6.1 Extension to a multi-component separable objective function

The algorithms developed in the previous sections can be directly applied to solve problem (1) in the case $M > 2$. First, we provide the following formulas to compute the parameters of Algorithms 1–3.
1. The constant $\bar{L}$ in Theorems 2 and 3 is replaced by $\bar{L}_M = \big[M\max_{1\le i\le M}\frac{\|A_i\|^2}{\sigma_i}\big]^{1/2}$.
2. The initial values of $\beta_1^0$ and $\beta_2^0$ in Algorithms 1 and 2 are $\beta_1^0 = \beta_2^0 = \bar{L}_M$.
3. The Lipschitz constant $L_i^{\psi}(\beta_2)$ in Lemma 2 becomes $L_i^{\psi}(\beta_2) = \beta_2^{-1} M\|A_i\|^2$ ($i = 1, \dots, M$).
4. The Lipschitz constant $L^d(\beta_1)$ in Lemma 1 is $L^d(\beta_1) := \frac{1}{\beta_1}\sum_{i=1}^{M}\frac{\|A_i\|^2}{\sigma_i}$.
5. The Lipschitz constant $L^d$ in Algorithm 3 is $L^d := \sum_{i=1}^{M}\frac{\|A_i\|^2}{\sigma_i}$.

Note that these constants depend on the number of components $M$ and on the structure of the matrices $A_i$ ($i = 1, \dots, M$).
Next, we rewrite the smoothed dual function $d(y; \beta_1)$ defined by (12) for the case $M > 2$ as follows:
$$ d(y; \beta_1) = \sum_{i=1}^{M} d_i(y; \beta_1), $$
where the function values $d_i(y; \beta_1)$ can be computed in parallel as:
$$ d_i(y; \beta_1) = -M^{-1} b^T y + \min_{x_i\in X_i}\big\{\phi_i(x_i) + y^T A_i x_i + \beta_1 p_i(x_i)\big\}, \quad \forall i = 1, \dots, M. $$
The quantities $\hat{y}$ and $y^+ := G(\hat{y}; \beta_1)$ defined in (52) and (53) can respectively be expressed as:
$$ \hat{y} := (1 - \tau)\bar{y} + \tau\sum_{i=1}^{M}\frac{1}{\beta_2}\Big(A_i\bar{x}_i - \frac{1}{M}b\Big), \quad\text{and}\quad y^+ := \hat{y} + \sum_{i=1}^{M}\frac{1}{L^d(\beta_1)}\Big(A_i x_i^*(\hat{y}; \beta_1) - \frac{1}{M}b\Big). $$
These formulas show that each component of $\hat{y}$ and $y^+$ can be computed by using only local information and information from its neighborhood. Therefore, both algorithms are highly distributed.
Finally, we note that if a component $\phi_i$ of the objective function $\phi$ is Lipschitz continuously differentiable, then the mapping $G_i(\hat{x}; \beta_2)$ defined by (39) can be used for the primal convex subproblem of this component instead of the proximity mapping $P_i(\hat{x}; \beta_2)$ defined by (24). This modification can reduce the computational cost of the algorithms. The sequence $\{\tau_k\}_{k\ge 0}$ generated by the rule (44) still maintains the condition (42) in Remark 3.
6.2 Theoretical comparison

Firstly, we compare Algorithms 1 and 2. From Lemma 3 and the proofs of Theorems 2 and 3 we see that the rate of convergence of both algorithms is the same as that of $\beta_1^k$ and $\beta_2^k$. At each iteration, Algorithm 1 updates both $\beta_1^k$ and $\beta_2^k$ simultaneously by using the same value of $\tau_k$, while Algorithm 2 updates only one parameter. Therefore, to update both parameters $\beta_1^k$ and $\beta_2^k$, Algorithm 2 needs two iterations. We now analyze the update rule of $\tau_k$ in Algorithms 1 and 2 to compare the rates of convergence of both algorithms.


Let us define
$$ \xi_1(\tau) := \frac{\tau}{\tau + 1} \quad\text{and}\quad \xi_2(\tau) := \frac{\tau}{2}\big[\sqrt{\tau^2 + 4} - \tau\big]. $$
The function $\xi_2$ can be rewritten as $\xi_2(\tau) = \frac{\tau}{\sqrt{(\tau/2)^2 + 1} + \tau/2}$. Therefore, we can easily show that:
$$ \xi_1(\tau) < \xi_2(\tau) < 2\xi_1(\tau). $$
If we denote by $\{\tau_k^{A1}\}_{k\ge 0}$ and $\{\tau_k^{A2}\}_{k\ge 0}$ the two sequences generated by Algorithms 1 and 2, respectively, then we have $\tau_k^{A1} < \tau_k^{A2} < 2\tau_k^{A1}$ for all $k$, provided that $2\tau_0^{A1} \ge \tau_0^{A2}$. Recall that Algorithm 1 updates $\beta_1^k$ and $\beta_2^k$ simultaneously, while Algorithm 2 updates only one of them at each iteration. If we choose $\tau_0^{A1} = 0.5$ and $\tau_0^{A2} = 0.5(\sqrt{5} - 1)$ in Algorithms 1 and 2, respectively, then, by directly computing the values of $\tau_k^{A1}$ and $\tau_k^{A2}$, we can see that $2\tau_k^{A1} > \tau_k^{A2}$ for all $k \ge 0$. Consequently, the sequences $\{\beta_1^k\}$ and $\{\beta_2^k\}$ in Algorithm 1 converge to zero faster than in Algorithm 2. In other words, Algorithm 1 is faster than Algorithm 2.
Now, we compare Algorithm 1, Algorithm 2 and Algorithm 3.2 in [22] (see also [35]). Note that the smoothness parameter $\beta_1$ is fixed in [22, Algorithm 3.2]. Moreover, this parameter is proportional to the given desired accuracy $\varepsilon$, i.e. $\beta_1 := \frac{\varepsilon}{D_X}$, which is often very small. Thus, the Lipschitz constant $L^d(\beta_1)$ is very large. Consequently, [22, Algorithm 3.2] makes slow progress towards a solution at the very early iterations. In Algorithms 1 and 2, the parameters $\beta_1$ and $\beta_2$ are dynamically updated starting from given values. Besides, the cost per iteration of [22, Algorithm 3.2] is more expensive than that of Algorithms 1 and 2, since it requires solving two convex subproblem pairs in parallel and two dual steps.
6.3 Stopping criterion

In order to terminate the above algorithms, we can use the smoothed function $d(\cdot; \beta_1)$ to measure the stopping criterion. It is clear that if $\beta_1$ is small, $d(\cdot; \beta_1)$ is an approximation of the dual function $d$ due to Lemma 1. Therefore, we can estimate the duality gap $\phi(x) - d(y)$ by $\phi(x) - d(y; \beta_1)$ and use this quantity in the stopping criterion. More precisely, we terminate the algorithms if:
$$ r_{\mathrm{pfgap}} := \frac{\|A\bar{x}^k - b\|}{\max\{r^0, 1\}} \le \varepsilon_{\mathrm{feas}}, \tag{71} $$
and either the approximate duality gap satisfies:
$$ f(\bar{x}^k; \beta_2^k) - d(\bar{y}^k; \beta_1^k) \le \varepsilon_{\mathrm{fun}}\max\big\{1.0, |d(\bar{y}^k; \beta_1^k)|, |f(\bar{x}^k; \beta_2^k)|\big\}, \tag{72} $$
or the value $\phi(\bar{x}^k)$ does not significantly change in $j_{\max}$ successive iterations, i.e.:
$$ \frac{|\phi(\bar{x}^k) - \phi(\bar{x}^{k-j})|}{\max\{1.0, |\phi(\bar{x}^k)|\}} \le \varepsilon_{\mathrm{obj}} \quad\text{for } j = 1, \dots, j_{\max}, \tag{73} $$
where $r^0 := \|A\bar{x}^0 - b\|$ and $\varepsilon_{\mathrm{feas}}$, $\varepsilon_{\mathrm{fun}}$ and $\varepsilon_{\mathrm{obj}}$ are given tolerances.
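A sketch of the stopping test (71)–(73); the function name and the bookkeeping of past objective values (phi_hist) are illustrative assumptions, and the caller is expected to supply $f(\bar{x}^k;\beta_2^k)$, $d(\bar{y}^k;\beta_1^k)$ and $r^0$ computed as in the text.

```python
import numpy as np

def should_stop(A_full, b, x_vec, f_val, d_val, phi_hist, r0,
                eps_feas=1e-3, eps_fun=1e-3, eps_obj=1e-6, jmax=3):
    # (71): relative primal feasibility gap.
    if np.linalg.norm(A_full @ x_vec - b) / max(r0, 1.0) > eps_feas:
        return False
    # (72): approximate duality gap f(x^k; beta2^k) - d(y^k; beta1^k).
    gap_ok = (f_val - d_val) <= eps_fun * max(1.0, abs(d_val), abs(f_val))
    # (73): phi(x^k) has not changed significantly over the last jmax iterations.
    phi_k = phi_hist[-1]
    stalled = (len(phi_hist) > jmax and
               all(abs(phi_k - phi_hist[-1 - j]) <= eps_obj * max(1.0, abs(phi_k))
                   for j in range(1, jmax + 1)))
    return gap_ok or stalled
```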

