
prox_gradient

General Description:

The proximal gradient method for solving problems of the form

min { f(x) + lambda*g(x) }

Assumptions:
  • all functions are convex
  • f   - smooth
  • g  - proper closed and proximable
  • lambda - positive scalar
Available Oracles:
  • function value of f
  • gradient of f 
  • function value of g
  • prox of a positive constant times g
(the function values of f and g are not required in the economical setting)

Usage: 
out               = prox_gradient(Ffun,Ffun_grad,Gfun,Gfun_prox,lambda,startx,[par])
[out,fmin]        = prox_gradient(Ffun,Ffun_grad,Gfun,Gfun_prox,lambda,startx,[par])
[out,fmin,parout] = prox_gradient(Ffun,Ffun_grad,Gfun,Gfun_prox,lambda,startx,[par])

Input: 
Ffun       - function handle for f
Ffun_grad  - function handle for the gradient of f
Gfun       - function handle for g (if g is extended real-valued, it is enough to specify the operation of g on its domain)
Gfun_prox  - function handle for the proximal mapping of g times a positive constant
lambda     - positive scalar
startx     - starting vector
par        - struct containing different values required for the operation of prox_gradient

field of par   possible values    default   description
max_iter       positive integer   1000      maximal number of iterations
eco_flag       true/false         false     true - economic version in which function values are not computed
print_flag     true/false         true      true - internal printing is enabled
Lstart         positive real      1         an initial value for the Lipschitz constant
const_flag     true/false         false     true - a constant stepsize is used; false - backtracking is employed
regret_flag    true/false         false     true - the Lipschitz constant is divided by eta at the beginning of the next iteration; false - otherwise
eta            real > 1           2         multiplicative constant used when regretting or backtracking
eps            positive real      1e-5      stopping criterion tolerance (the method terminates when norm(x^k - x^(k-1)) <= eps, with x^k being the kth iterate vector)
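
For instance, to run with a constant stepsize rather than backtracking, one could set the relevant fields along the following lines (a sketch only; the value 256 for Lstart is just an illustrative Lipschitz estimate):

>> clear par
>> par.const_flag=true;   % constant stepsize t_k=1/par.Lstart throughout
>> par.Lstart=256;        % illustrative estimate of the Lipschitz constant of the gradient of f
>> par.max_iter=100;      % cap the number of iterations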

Output: 
out     - optimal solution (up to a tolerance)
fmin    - optimal value (up to a tolerance)
parout  - a struct containing additional information related to the convergence. The fields of parout are:
          iterNum   - number of performed iterations
          funValVec - vector of all function values generated by the method
          LvalVec   - vector of all used Lipschitz constants (relevant when par.const_flag=false)


Method Description: 

Employs the proximal gradient method for solving

min { f(x) + lambda*g(x) }

using the following update scheme:

x^(k+1) = prox_{t_k*lambda*g}( x^k - t_k*grad f(x^k) ),

where the stepsize t_k is given by t_k=1/L_k. L_k is either constant (if const_flag is true) or chosen by backtracking (default). The backtracking procedure either keeps L_k as it is or increases it, each time by a multiplicative factor of eta. If regret_flag is true, then at the beginning of each iteration L_k is divided by eta.

The method stops when at least one of the following conditions holds: (1) the number of iterations exceeds max_iter; (2) the norm of the difference between two consecutive iterates is smaller than eps.
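
To illustrate the scheme, the following is a minimal MATLAB sketch of a single iteration with backtracking (not the package implementation); it assumes the oracle handles Ffun, Ffun_grad and Gfun_prox described above and a standard sufficient decrease test for accepting the Lipschitz estimate:

% one proximal gradient iteration with backtracking (sketch)
% x - current iterate, L - current Lipschitz estimate, eta>1, lambda>0
% L = L/eta;                                          % (only if regret_flag is true)
accepted = false;
while ~accepted
    t = 1/L;                                          % stepsize t_k = 1/L_k
    x_new = Gfun_prox(x - t*Ffun_grad(x), t*lambda);  % prox of (t*lambda)*g at the gradient step
    d = x_new - x;
    if Ffun(x_new) <= Ffun(x) + Ffun_grad(x)'*d + (L/2)*norm(d)^2
        accepted = true;                              % estimate accepted, keep L_k
    else
        L = eta*L;                                    % increase the estimate and retry
    end
end
x = x_new;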


Example 1:

Consider the problem

min { 0.5*||Ax - b||_2^2 + 2*||x||_1 },

where A and b are generated by the following commands:

>> randn('seed',315);
>> A=randn(80,100);
>> b=randn(80,1);

We solve the problem using 100 iterations of prox_gradient by taking f(x)=0.5*norm(A*x-b,2)^2, lambda=2 and g(x)=norm(x,1). We use the fact that A'*(A*x-b) is the gradient of f, as well as the function prox_l1, which is part of the package (for a list of available prox functions, go here).

>> clear par
>> par.max_iter=100;
>> [out,fmin,parout] =prox_gradient(@(x)0.5*norm(A*x-b,2)^2,@(x)A'*(A*x-b),...
@(x)norm(x,1),@(x,a)prox_l1(x,a),2,zeros(100,1),par);

*********************
prox_gradient
*********************
#iter       fun. val.     L val.
     1       44.647100  256.000000
     2       23.720870  256.000000
     3       20.469023  256.000000
     4       19.039811  256.000000
     5       18.151359  256.000000
     6       17.535138  256.000000
     :              :               :
    97       14.990022  256.000000
    98       14.989947  256.000000
    99       14.989876  256.000000
   100       14.989808  256.000000
----------------------------------
Optimal value =       14.989744 

Note that all the Lipschitz estimates were chosen as 256, meaning that the backtracking procedure had an effect only at the first iteration (in which the initial Lipschitz estimate 1 was increased to 256). If we allow regretting, then the Lipschitz estimates are different.

>> par.regret_flag=true;
>> out=prox_gradient(@(x)0.5*norm(A*x-b,2)^2,@(x)A'*(A*x-b),@(x)norm(x,1),...
@(x,a)prox_l1(x,a),2,zeros(100,1),par);
*********************
prox_gradient
*********************
#iter       fun. val.     L val.
     1       44.647100  256.000000
     2       23.720870  128.000000
     3       18.879081  128.000000
     4       17.456805  128.000000
     5       16.738304  128.000000
     6       16.282443  256.000000
     7       16.012172  128.000000
     :                :             :
    97       14.988546   64.000000
    98       14.988546  256.000000
    99       14.988546  128.000000
   100       14.988546   64.000000
----------------------------------
Optimal value =       14.988546 

Note that the obtained function value is better than the one reached by the version without regretting. In fact, in this example, running prox_gradient with 1000 iterations stops after only 102 iterations, since the stopping criterion on the change between consecutive iterates is met.

>> par.max_iter=1000;
>> out=prox_gradient(@(x)0.5*norm(A*x-b,2)^2,@(x)A'*(A*x-b),@(x)norm(x,1),...
@(x,a)prox_l1(x,a),2,zeros(100,1),par);
*********************
prox_gradient
*********************
#iter       fun. val.     L val.
     1       44.647100  256.000000
     2       23.720870  128.000000
     3       18.879081  128.000000
     4       17.456805  128.000000
     :              :               :
    97       14.988546   64.000000
    98       14.988546  256.000000
    99       14.988546  128.000000
   100       14.988546   64.000000
   101       14.988546   64.000000
   102       14.988546  256.000000
Stopping because the norm of the difference between consecutive iterates is too small
----------------------------------
Optimal value =       14.988546 

In the economical setting, we are not required to provide any function values (only the gradient of f and the prox of g), and the method does not employ any backtracking/regretting procedure; thus a constant stepsize is employed. Following is an example in which the method is run with a constant L=256 for 100 iterations in the economical mode.

>> clear par
>> par.eco_flag=true;
>> par.max_iter=100;
>> par.Lstart=256;
>> out =prox_gradient([],@(x)A'*(A*x-b),[],@(x,a)prox_l1(x,a),2,zeros(100,1),par);

Printing is suppressed in the economical mode, but we can compute the obtained function value "manually":

>> 0.5*norm(A*out-b)^2+2*norm(out,1)

ans =

   14.9897


Example 2:

Consider the problem

min { log( sum_{i=1}^{20} exp(A_i*x + b_i) ) : ||x - e||_1 <= 1 },

where A_1,A_2,...,A_20 are the rows of A, b_1,b_2,...,b_20 are the components of b, and e is the vector of all ones. A and b are generated as follows.

>> randn('seed',314);
>> A=randn(20,30);
>> b=randn(20,1);

We use prox_gradient to solve the problem. Here g is taken as the indicator function of the feasible set, so Gfun is specified as the zero function (its value on the domain of g) and Gfun_prox is the corresponding orthogonal projection.

>> out =prox_gradient(@(x)log_sum_exp(A*x+b),@(x)A'*exp(A*x+b)/sum(exp(A*x+b)),...
@(x)0,@(x,a)proj_l1_ball(x-1)+1,1,zeros(30,1));
*********************
prox_gradient
*********************
#iter       fun. val.     L val.
     1        8.266494    1.000000
     2        7.487822    4.000000
     3        7.443980    4.000000
     4        7.431942    4.000000
     5        7.423753    4.000000
     :               :               :
    28        7.349317    4.000000
    29        7.349317    4.000000
    30        7.349317    4.000000
    31        7.349317    4.000000
Stopping because the norm of the difference between consecutive iterates is too small
----------------------------------
Optimal value =        7.349317 
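
Note that the handle @(x,a)proj_l1_ball(x-1)+1 realizes the projection onto the shifted set by translating by the vector of ones, projecting onto the standard l1 ball, and translating back; the scaling argument a is ignored since the prox of an indicator function is the projection regardless of the constant. Assuming proj_l1_ball with a single argument projects onto the l1 ball of radius 1, the feasibility of the returned point can be checked by (output not reproduced here)

>> norm(out-1,1)   % should not exceed 1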


Example 3:

The prox_gradient solver does not require the function f to be convex. In cases where f is nonconvex (but still differentiable), the method is not guaranteed to converge to an optimal solution, but only to a stationary point. As an illustration, consider the problem

min { x'*A*x : ||x||_2 <= 1 },

where

>> A=[1,1,4;1,1,4;4,4,-2]

A =

     1     1     4
     1     1     4
     4     4    -2

A is not positive semidefinite, and thus the problem is nonconvex. prox_gradient is guaranteed to converge to one of the stationary points, which in this case are either the zeros vector or a normalized eigenvector corresponding to a nonpositive eigenvalue. Computing the spectral decomposition of A

>> [U,D]=eig(A)

U =

   -0.4082    0.7071    0.5774
   -0.4082   -0.7071    0.5774
    0.8165         0    0.5774


D =

    -6     0     0
     0     0     0
     0     0     6

Thus, there are five stationary points: the zeros vector and the first two columns of U along with their negatives. Starting with the initial vector [0;-1;0] actually results in the optimal solution.

>>  out =prox_gradient(@(x)x'*A*x,@(x)2*A*x,@(x)0,@(x,a)proj_Euclidean_ball(x),1,[0;-1;0]);
*********************
prox_gradient
*********************
#iter       fun. val.     L val.
     1        1.000000    8.000000
     2       -3.538462    8.000000
     3       -5.537778    8.000000
     4       -5.925659    8.000000
     5       -5.988165    8.000000
     6       -5.998111    8.000000
     7       -5.999698    8.000000
     8       -5.999952    8.000000
     9       -5.999992    8.000000
    10       -5.999999    8.000000
    11       -6.000000    8.000000
    12       -6.000000    8.000000
    13       -6.000000    8.000000
    14       -6.000000    8.000000
Stopping because the norm of the difference between consecutive iterates is too small
----------------------------------
Optimal value =       -6.000000 
>> out

out =

   -0.4082
   -0.4083
    0.8165

Starting from [1;1;1] produces the zeros vector, which is merely a stationary point and not a global minimizer.

>>   out =prox_gradient(@(x)x'*A*x,@(x)2*A*x,@(x)0,@(x,a)proj_Euclidean_ball(x),1,[1;1;1]);
*********************
prox_gradient
*********************
#iter       fun. val.     L val.
     1        6.000000   16.000000
     2        0.375000   16.000000
     3        0.023437   16.000000
     4        0.001465   16.000000
     5        0.000092   16.000000
     6        0.000006   16.000000
     7        0.000000   16.000000
     8        0.000000   16.000000
     9        0.000000   16.000000
    10        0.000000   16.000000
Stopping because the norm of the difference between consecutive iterates is too small
----------------------------------
Optimal value =        0.000000 
>> out

out =

   1.0e-06 *

    0.5506
    0.5506
    0.5506