
prox_gradient

General Description:

The proximal gradient method for solving problems of the form

min { f(x) + lambda*g(x) }

Assumptions:
  • all functions are convex
  • f   - smooth
  • g  - proper closed and proximable
  • lambda - positive scalar
Available Oracles:
  • function value of f
  • gradient of f 
  • function value of g
  • prox of a positive constant times g
(the function values of f and g are not required in the economical setting)

Usage: 
out               = prox_gradient(Ffun,Ffun_grad,Gfun,Gfun_prox,lambda,startx,[par])
[out,fmin]        = prox_gradient(Ffun,Ffun_grad,Gfun,Gfun_prox,lambda,startx,[par])
[out,fmin,parout] = prox_gradient(Ffun,Ffun_grad,Gfun,Gfun_prox,lambda,startx,[par])

Input: 
Ffun       - function handle for f
Ffun_grad  - function handle for the gradient of f
Gfun       - function handle for g (if g is extended real-valued, it is enough to specify the operation of g on its domain)
Gfun_prox  - function handle for the proximal mapping of g times a positive constant
lambda     - positive scalar
startx     - starting vector
par        - struct containing different values required for the operation of prox_gradient

field of par   possible values    default   description
max_iter       positive integer   1000      maximal number of iterations
eco_flag       true/false         false     true - economic version in which function values are not computed
print_flag     true/false         true      true - internal printing is enabled
Lstart         positive real      1         an initial value for the Lipschitz constant
const_flag     true/false         false     true - a constant stepsize is used; false - backtracking is employed
regret_flag    true/false         false     true - the Lipschitz constant is divided by eta at the beginning of the next iteration; false - otherwise
eta            real > 1           2         multiplicative constant used when regretting or backtracking
eps            positive real      1e-5      stopping criterion tolerance (the method terminates when norm(x^k - x^(k-1)) <= eps, with x^k being the kth iterate vector)
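
For instance, to run with a constant stepsize rather than backtracking, one could set the relevant fields along the following lines (a sketch only; the value 256 for Lstart is just an illustrative Lipschitz estimate):

>> clear par
>> par.const_flag=true;   % constant stepsize t_k=1/par.Lstart throughout
>> par.Lstart=256;        % illustrative estimate of the Lipschitz constant of the gradient of f
>> par.max_iter=100;      % cap the number of iterations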

Output: 
out     - optimal solution (up to a tolerance)
fmin    - optimal value (up to a tolerance)
parout  - a struct containing additional information related to the convergence. The fields of parout are:
          iterNum   - number of performed iterations
          funValVec - vector of all function values generated by the method
          LvalVec   - vector of all used Lipschitz constants (relevant when par.const_flag=false)


Method Description: 

Employs the proximal gradient method for solving

min { f(x) + lambda*g(x) }

using the following update scheme:

x^(k+1) = prox_{t_k*lambda*g}( x^k - t_k*grad f(x^k) ),

where the stepsize t_k is given by t_k=1/L_k. L_k is either constant (if const_flag is true) or chosen by backtracking (default). The backtracking procedure either keeps L_k as it is or increases it, each time by a multiplicative factor of eta. If regret_flag is true, then at the beginning of each iteration L_k is divided by eta.

The method stops when at least one of the following conditions holds: (1) the number of iterations exceeds max_iter; (2) the norm of the difference between two consecutive iterates is smaller than eps.
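
To illustrate the scheme, the following is a minimal MATLAB sketch of a single iteration with backtracking (not the package implementation); it assumes the oracle handles Ffun, Ffun_grad and Gfun_prox described above and a standard sufficient decrease test for accepting the Lipschitz estimate:

% one proximal gradient iteration with backtracking (sketch)
% x - current iterate, L - current Lipschitz estimate, eta>1, lambda>0
% L = L/eta;                                          % (only if regret_flag is true)
accepted = false;
while ~accepted
    t = 1/L;                                          % stepsize t_k = 1/L_k
    x_new = Gfun_prox(x - t*Ffun_grad(x), t*lambda);  % prox of (t*lambda)*g at the gradient step
    d = x_new - x;
    if Ffun(x_new) <= Ffun(x) + Ffun_grad(x)'*d + (L/2)*norm(d)^2
        accepted = true;                              % estimate accepted, keep L_k
    else
        L = eta*L;                                    % increase the estimate and retry
    end
end
x = x_new;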


Example 1:

Consider the problem

min { 0.5*||Ax - b||_2^2 + 2*||x||_1 },

where A and b are generated by the following commands:

>> randn('seed',315);
>> A=randn(80,100);
>> b=randn(80,1);

We solve the problem using 100 iterations of prox_gradient by taking f(x)=0.5*norm(A*x-b,2)^2, lambda=2 and g(x)=norm(x,1). We use the fact that A'*(A*x-b) is the gradient of f, as well as the function prox_l1, which is part of the package (for a list of available prox functions, go here).

>> clear par
>> par.max_iter=100;
>> [out,fmin,parout] =prox_gradient(@(x)0.5*norm(A*x-b,2)^2,@(x)A'*(A*x-b),...
@(x)norm(x,1),@(x,a)prox_l1(x,a),2,zeros(100,1),par);

*********************
prox_gradient
*********************
#iter       fun. val.     L val.
     1       44.647100  256.000000
     2       23.720870  256.000000
     3       20.469023  256.000000
     4       19.039811  256.000000
     5       18.151359  256.000000
     6       17.535138  256.000000
     :              :               :
    97       14.990022  256.000000
    98       14.989947  256.000000
    99       14.989876  256.000000
   100       14.989808  256.000000
----------------------------------
Optimal value =       14.989744 

Note that all the Lipschitz estimates were chosen as 256, meaning that the backtracking procedure had an effect only at the first iteration (in which the initial Lipschitz estimate 1 was increased to 256). If we allow regretting, then the Lipschitz estimates are different.

>> par.regret_flag=true;
>> out=prox_gradient(@(x)0.5*norm(A*x-b,2)^2,@(x)A'*(A*x-b),@(x)norm(x,1),...
@(x,a)prox_l1(x,a),2,zeros(100,1),par);
*********************
prox_gradient
*********************
#iter       fun. val.     L val.
     1       44.647100  256.000000
     2       23.720870  128.000000
     3       18.879081  128.000000
     4       17.456805  128.000000
     5       16.738304  128.000000
     6       16.282443  256.000000
     7       16.012172  128.000000
     :                :             :
    97       14.988546   64.000000
    98       14.988546  256.000000
    99       14.988546  128.000000
   100       14.988546   64.000000
----------------------------------
Optimal value =       14.988546 

Note that the obtained function value is better than the one reached by the version without regretting. In fact, in this example, running prox_gradient with 1000 iterations stops after only 102 iterations, since the stopping criterion on the change between consecutive iterates is met.

>> par.max_iter=1000;
>> out=prox_gradient(@(x)0.5*norm(A*x-b,2)^2,@(x)A'*(A*x-b),@(x)norm(x,1),...
@(x,a)prox_l1(x,a),2,zeros(100,1),par);
*********************
prox_gradient
*********************
#iter       fun. val.     L val.
     1       44.647100  256.000000
     2       23.720870  128.000000
     3       18.879081  128.000000
     4       17.456805  128.000000
     :              :               :
    97       14.988546   64.000000
    98       14.988546  256.000000
    99       14.988546  128.000000
   100       14.988546   64.000000
   101       14.988546   64.000000
   102       14.988546  256.000000
Stopping because the norm of the difference between consecutive iterates is too small
----------------------------------
Optimal value =       14.988546 

In the economical setting, we are not required to provide any function values (only the gradient of f and the prox of g), and the method does not employ any backtracking/regretting procedure; thus a constant stepsize is employed. Following is an example in which the method is run with a constant L=256 for 100 iterations in the economical mode.

>> clear par
>> par.eco_flag=true;
>> par.max_iter=100;
>> par.Lstart=256;
>> out =prox_gradient([],@(x)A'*(A*x-b),[],@(x,a)prox_l1(x,a),2,zeros(100,1),par);

Printing is suppressed in the economical mode, but we can compute the obtained function value "manually":

>> 0.5*norm(A*out-b)^2+2*norm(out,1)

ans =

   14.9897


Example 2:

Consider the problem

min { log( sum_{i=1}^{20} exp(A_i*x + b_i) ) : ||x - e||_1 <= 1 },

where A_1,A_2,...,A_20 are the rows of A, b_1,b_2,...,b_20 are the components of b, and e is the vector of all ones. A and b are generated as follows.

>> randn('seed',314);
>> A=randn(20,30);
>> b=randn(20,1);

We use prox_gradient to solve the problem. Here g is taken as the indicator function of the feasible set, so Gfun is specified as the zero function (its value on the domain of g) and Gfun_prox is the corresponding orthogonal projection.

>> out =prox_gradient(@(x)log_sum_exp(A*x+b),@(x)A'*exp(A*x+b)/sum(exp(A*x+b)),...
@(x)0,@(x,a)proj_l1_ball(x-1)+1,1,zeros(30,1));
*********************
prox_gradient
*********************
#iter       fun. val.     L val.
     1        8.266494    1.000000
     2        7.487822    4.000000
     3        7.443980    4.000000
     4        7.431942    4.000000
     5        7.423753    4.000000
     :               :               :
    28        7.349317    4.000000
    29        7.349317    4.000000
    30        7.349317    4.000000
    31        7.349317    4.000000
Stopping because the norm of the difference between consecutive iterates is too small
----------------------------------
Optimal value =        7.349317 
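
Note that the handle @(x,a)proj_l1_ball(x-1)+1 realizes the projection onto the shifted set by translating by the vector of ones, projecting onto the standard l1 ball, and translating back; the scaling argument a is ignored since the prox of an indicator function is the projection regardless of the constant. Assuming proj_l1_ball with a single argument projects onto the l1 ball of radius 1, the feasibility of the returned point can be checked by (output not reproduced here)

>> norm(out-1,1)   % should not exceed 1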


Example 3:

The prox_gradient solver does not require the function f to be convex. In cases where f is nonconvex (but still differentiable), the method is not guaranteed to converge to an optimal solution, but only to a stationary point. As an illustration, consider the problem

min { x'*A*x : ||x||_2 <= 1 },

where

>> A=[1,1,4;1,1,4;4,4,-2]

A =

     1     1     4
     1     1     4
     4     4    -2

A is not positive semidefinite, and thus the problem is nonconvex. prox_gradient is guaranteed to converge to one of the stationary points, which in this case are either the zeros vector or a normalized eigenvector corresponding to a nonpositive eigenvalue. Computing the spectral decomposition of A

>> [U,D]=eig(A)

U =

   -0.4082    0.7071    0.5774
   -0.4082   -0.7071    0.5774
    0.8165         0    0.5774


D =

    -6     0     0
     0     0     0
     0     0     6

Thus, there are five stationary points: the zeros vector and the first two columns of U along with their negatives. Starting with the initial vector [0;-1;0] actually results in the optimal solution.

>>  out =prox_gradient(@(x)x'*A*x,@(x)2*A*x,@(x)0,@(x,a)proj_Euclidean_ball(x),1,[0;-1;0]);
*********************
prox_gradient
*********************
#iter       fun. val.     L val.
     1        1.000000    8.000000
     2       -3.538462    8.000000
     3       -5.537778    8.000000
     4       -5.925659    8.000000
     5       -5.988165    8.000000
     6       -5.998111    8.000000
     7       -5.999698    8.000000
     8       -5.999952    8.000000
     9       -5.999992    8.000000
    10       -5.999999    8.000000
    11       -6.000000    8.000000
    12       -6.000000    8.000000
    13       -6.000000    8.000000
    14       -6.000000    8.000000
Stopping because the norm of the difference between consecutive iterates is too small
----------------------------------
Optimal value =       -6.000000 
>> out

out =

   -0.4082
   -0.4083
    0.8165

Starting from [1;1;1] produces the zeros vector, which is merely a stationary point and not a global minimizer.

>>   out =prox_gradient(@(x)x'*A*x,@(x)2*A*x,@(x)0,@(x,a)proj_Euclidean_ball(x),1,[1;1;1]);
*********************
prox_gradient
*********************
#iter       fun. val.     L val.
     1        6.000000   16.000000
     2        0.375000   16.000000
     3        0.023437   16.000000
     4        0.001465   16.000000
     5        0.000092   16.000000
     6        0.000006   16.000000
     7        0.000000   16.000000
     8        0.000000   16.000000
     9        0.000000   16.000000
    10        0.000000   16.000000
Stopping because the norm of the difference between consecutive iterates is too small
----------------------------------
Optimal value =        0.000000 
>> out

out =

   1.0e-06 *

    0.5506
    0.5506
    0.5506