General Description:
prox_gradient employs the proximal gradient method for solving problems of the form

min { f(x) + lambda*g(x) : x in R^n }
Assumptions:
f is continuously differentiable (not necessarily convex; see Example 3), g is proper closed and convex, and lambda is a positive scalar.
Available Oracles:
function value of f, gradient of f, function value of g, proximal mapping of a positive constant times g
(the function values of f and g are not required in the economical setting)
Usage:
out = prox_gradient(Ffun,Ffun_grad,Gfun,Gfun_prox,lambda,startx,[par])
[out,fmin] = prox_gradient(Ffun,Ffun_grad,Gfun,Gfun_prox,lambda,startx,[par])
[out,fmin,parout] = prox_gradient(Ffun,Ffun_grad,Gfun,Gfun_prox,lambda,startx,[par])
Input:
Ffun - function handle for f
Ffun_grad - function handle for the gradient of f
Gfun - function handle for g (if g is extended real-valued, it is enough to specify the operation of g on its domain)
Gfun_prox - function handle for the proximal mapping of a positive constant times g; called as Gfun_prox(x,a), it should return prox_{a*g}(x) (see the sketch after this input list)
lambda - positive scalar
startx - starting vector
par - struct containing different values required for the operation of prox_gradient; the fields used below include max_iter, eps, const_flag, regret_flag, eta, eco_flag, and Lstart
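For concreteness, here is a minimal sketch of the Gfun_prox convention for g(x)=norm(x,1), whose proximal mapping is componentwise soft thresholding (the package's prox_l1, used in the examples below, presumably implements this operation):

% Given x and a positive scalar a, return prox_{a*g}(x); for the l1 norm
% this is componentwise soft thresholding.
my_prox_l1 = @(x,a) sign(x).*max(abs(x)-a,0);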
Output:
out - optimal solution (up to a tolerance)
fmin - optimal value (up to a tolerance)
parout - a struct containing additional information related to the convergence. The fields of parout are:
iterNum - number of performed iterations
funValVec - vector of all function values generated by the method
Method Description:
Employs the proximal gradient method for solving

min { f(x) + lambda*g(x) : x in R^n }

using the following update scheme:

x_{k+1} = prox_{t_k*lambda*g}( x_k - t_k*grad f(x_k) ),
where the stepsize t_k is given by t_k=1/L_k. L_k is either constant (if const_flag is true) or chosen by backtracking (the default). The backtracking procedure either keeps L_k as it is or increases it, each time by a multiplicative factor of eta. If regret_flag is true, then at the beginning of each iteration L_k is divided by eta.
The method stops when at least one of the following conditions holds: (1) the number of iterations exceeds max_iter; (2) the norm of the difference between two consecutive iterates is smaller than eps.
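The following is a minimal MATLAB sketch of this scheme (not the package's actual code); the sufficient-decrease test used to accept a backtracking step is the standard quadratic upper-bound condition, which we assume is the package's internal criterion:

function x = prox_grad_sketch(Ffun,Ffun_grad,Gfun_prox,lambda,x,par)
% Minimal sketch of the update scheme described above (assumed behavior).
L = par.Lstart;
for k = 1:par.max_iter
    if par.regret_flag, L = L/par.eta; end      % regret: first decrease L_k
    g = Ffun_grad(x);
    while true
        xnew = Gfun_prox(x - g/L, lambda/L);    % prox-grad step with t_k = 1/L
        % accept the step if the quadratic upper bound holds (or L_k is constant)
        if par.const_flag || Ffun(xnew) <= Ffun(x) + g'*(xnew-x) + (L/2)*norm(xnew-x)^2
            break
        end
        L = par.eta*L;                          % otherwise increase L_k by eta
    end
    if norm(xnew-x) < par.eps, x = xnew; return; end   % stopping criterion (2)
    x = xnew;
end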
Example 1:
Consider the problem

min { 0.5*||A*x - b||_2^2 + 2*||x||_1 : x in R^100 }
where A and b are generated by the following commands
>> randn('seed',315);
>> A=randn(80,100);
>> b=randn(80,1);
We solve the problem using 100 iterations of prox_gradient by taking f(x)=0.5*norm(A*x-b,2)^2, lambda=2 and g(x)=norm(x,1). We use the fact that A'*(A*x-b) is the gradient of f, and the function prox_l1, which is part of the package (see the package documentation for the list of available prox functions).
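As a quick sanity check (hypothetical code, not part of the package), the handle A'*(A*x-b) can be compared against a finite-difference estimate of the directional derivative of f:

x0 = randn(100,1); d = randn(100,1); h = 1e-6;
f = @(x) 0.5*norm(A*x-b,2)^2;
(f(x0+h*d)-f(x0-h*d))/(2*h)     % finite-difference directional derivative
d'*(A'*(A*x0-b))                % analytic value; the two should agree closely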
>> clear par
>> par.max_iter=100;
>> [out,fmin,parout] = prox_gradient(@(x)0.5*norm(A*x-b,2)^2,@(x)A'*(A*x-b),...
@(x)norm(x,1),@(x,a)prox_l1(x,a),2,zeros(100,1),par);
*********************
prox_gradient
*********************
#iter      fun. val.       L val.
     1     44.647100       256.000000
     2     23.720870       256.000000
     3     20.469023       256.000000
     4     19.039811       256.000000
     5     18.151359       256.000000
     6     17.535138       256.000000
     :         :               :
    97     14.990022       256.000000
    98     14.989947       256.000000
    99     14.989876       256.000000
   100     14.989808       256.000000
----------------------------------
Optimal value =       14.989744
Note that all the Lipschitz estimates were chosen as 256, meaning that the backtracking procedure had an effect only at the first iteration (in which the initial Lipschitz estimate 1 was increased to 256).
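Since the gradient of f is Lipschitz continuous with constant norm(A)^2, backtracking can also be bypassed altogether. The following hypothetical run fixes the stepsize via const_flag and Lstart (assuming, per the method description above, that const_flag keeps L_k equal to Lstart throughout):

>> clear par2
>> par2.max_iter=100;
>> par2.const_flag=true;        % keep L_k constant (no backtracking)
>> par2.Lstart=norm(A)^2;       % exact Lipschitz constant of grad f
>> out = prox_gradient(@(x)0.5*norm(A*x-b,2)^2,@(x)A'*(A*x-b),...
@(x)norm(x,1),@(x,a)prox_l1(x,a),2,zeros(100,1),par2);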
If we allow regretting, then the Lipschitz estimates are different.
>> par.regret_flag=true;
>> out=prox_gradient(@(x)0.5*norm(A*x-b,2)^2,@(x)A'*(A*x-b),@(x)norm(x,1),...
@(x,a)prox_l1(x,a),2,zeros(100,1),par);
*********************
prox_gradient
*********************
#iter      fun. val.       L val.
     1     44.647100       256.000000
     2     23.720870       128.000000
     3     18.879081       128.000000
     4     17.456805       128.000000
     5     16.738304       128.000000
     6     16.282443       256.000000
     7     16.012172       128.000000
     :         :               :
    97     14.988546        64.000000
    98     14.988546       256.000000
    99     14.988546       128.000000
   100     14.988546        64.000000
----------------------------------
Optimal value =       14.988546
Note that the obtained function value is better than the one reached by the version without regretting. In fact, in this example, running prox_gradient with 1000 iterations stops after only 102 iterations, since the stopping criterion on the norm of the difference between consecutive iterates is met.
>> par.max_iter=1000;
*********************
prox_gradient
*********************
#iter      fun. val.       L val.
     1     44.647100       256.000000
     2     23.720870       128.000000
     3     18.879081       128.000000
     4     17.456805       128.000000
     :         :               :
    97     14.988546        64.000000
    98     14.988546       256.000000
    99     14.988546       128.000000
   100     14.988546        64.000000
   101     14.988546        64.000000
   102     14.988546       256.000000
Stopping because the norm of the difference between consecutive iterates is too small
----------------------------------
Optimal value =       14.988546
In the economical setting (par.eco_flag=true) the function values of f and g are never computed, so the corresponding handles can be left empty:
>> clear par
>> par.eco_flag=true;
>> par.max_iter=100;
>> par.Lstart=256;
>> out = prox_gradient([],@(x)A'*(A*x-b),[],@(x,a)prox_l1(x,a),2,zeros(100,1),par);
Printing is suppressed in the economical mode, but we can compute the obtained function value "manually":
>> 0.5*norm(A*out-b)^2+2*norm(out,1)
ans =
   14.9897
Example 2:
Consider the problem

min { log( sum_{i=1}^{20} exp(A_i*x + b_i) ) : ||x - e||_1 <= 1 }

where e is the vector of all ones, A_1,A_2,...,A_20 are the rows of A, and b_1,b_2,...,b_20 are the components of b. A and b are generated as follows.
>> randn('seed',314);
>> A=randn(20,30);
>> b=randn(20,1);
We use prox_gradient to solve the problem. Since g is the indicator of the feasible set, it is enough to pass @(x)0 for Gfun (g vanishes on its domain), and its proximal mapping, for any positive constant, is the projection onto the set; by the translation rule for proximal mappings, this projection is given by @(x,a)proj_l1_ball(x-1)+1.
>> out = prox_gradient(@(x)log_sum_exp(A*x+b),@(x)A'*exp(A*x+b)/sum(exp(A*x+b)),...
@(x)0,@(x,a)proj_l1_ball(x-1)+1,1,zeros(30,1));
*********************
prox_gradient
*********************
#iter      fun. val.       L val.
     1      8.266494         1.000000
     2      7.487822         4.000000
     3      7.443980         4.000000
     4      7.431942         4.000000
     5      7.423753         4.000000
     :         :               :
    28      7.349317         4.000000
    29      7.349317         4.000000
    30      7.349317         4.000000
    31      7.349317         4.000000
Stopping because the norm of the difference between consecutive iterates is too small
----------------------------------
Optimal value =        7.349317
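The handles above exponentiate A*x+b directly and can overflow for large arguments; a numerically safer (hypothetical) alternative shifts by the maximum before exponentiating (the package's log_sum_exp may already do this internally):

lse = @(z) max(z) + log(sum(exp(z-max(z))));     % stable log-sum-exp
smx = @(z) exp(z-max(z))/sum(exp(z-max(z)));     % stable softmax (its gradient)
Ffun      = @(x) lse(A*x+b);
Ffun_grad = @(x) A'*smx(A*x+b);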
Example 3:
The prox_gradient solver does not require the function f to be convex. In cases where f is nonconvex (but still differentiable), the method is not guaranteed to converge to an optimal solution, but only to a stationary point. As an illustration, consider the problem

min { x'*A*x : ||x||_2 <= 1 }
where
>> A=[1,1,4;1,1,4;4,4,-2]
A =
     1     1     4
     1     1     4
     4     4    -2
A is not positive semidefinite, and thus the problem is nonconvex. prox_gradient is guaranteed to converge to one of the stationary points, which in this case is either the zero vector or any normalized eigenvector corresponding to a nonpositive eigenvalue. Computing the spectral decomposition of A:
>> [U,D]=eig(A)
U =
   -0.4082    0.7071    0.5774
   -0.4082   -0.7071    0.5774
    0.8165         0    0.5774
D =
    -6     0     0
     0     0     0
     0     0     6
Thus, there are 5 stationary points: the zero vector and the first two columns of U along with their negatives. Starting with the initial vector [0;-1;0] actually results in the optimal solution.
>> out = prox_gradient(@(x)x'*A*x,@(x)2*A*x,@(x)0,@(x,a)proj_Euclidean_ball(x),1,[0;-1;0]);
*********************
prox_gradient
*********************
#iter      fun. val.       L val.
     1      1.000000         8.000000
     2     -3.538462         8.000000
     3     -5.537778         8.000000
     4     -5.925659         8.000000
     5     -5.988165         8.000000
     6     -5.998111         8.000000
     7     -5.999698         8.000000
     8     -5.999952         8.000000
     9     -5.999992         8.000000
    10     -5.999999         8.000000
    11     -6.000000         8.000000
    12     -6.000000         8.000000
    13     -6.000000         8.000000
    14     -6.000000         8.000000
Stopping because the norm of the difference between consecutive iterates is too small
----------------------------------
Optimal value =       -6.000000
>> out
out =
   -0.4082
   -0.4083
    0.8165
Starting from [1;1;1] produces the zero vector, which is just a stationary point.
>> out = prox_gradient(@(x)x'*A*x,@(x)2*A*x,@(x)0,@(x,a)proj_Euclidean_ball(x),1,[1;1;1]);
*********************
prox_gradient
*********************
#iter      fun. val.       L val.
     1      6.000000        16.000000
     2      0.375000        16.000000
     3      0.023437        16.000000
     4      0.001465        16.000000
     5      0.000092        16.000000
     6      0.000006        16.000000
     7      0.000000        16.000000
     8      0.000000        16.000000
     9      0.000000        16.000000
    10      0.000000        16.000000
Stopping because the norm of the difference between consecutive iterates is too small
----------------------------------
Optimal value =        0.000000
>> out
out =
   1.0e-06 *
    0.5506
    0.5506
    0.5506
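A point is stationary for this problem exactly when it is a fixed point of the prox-grad mapping x -> P_C(x - t*grad f(x)) for any stepsize t > 0; the following hypothetical check verifies this for the computed out, with the Euclidean-ball projection written inline rather than via the package's proj_Euclidean_ball:

proj = @(x) x/max(norm(x),1);             % projection onto the unit ball
t = 0.1;                                  % any positive stepsize works
norm(out - proj(out - t*(2*A*out)))       % approx. 0 at a stationary point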