
Table of Contents

  1. Lasso
  2. Matrix decomposition
  3. Multi-period portfolio optimization
  4. Stochastic optimization

We can always use the following MATLAB code to evaluate a proximal operator (via CVX) for testing purposes:

  function x = prox_f_cvx(v, lambda, f, A)
  % Evaluate the proximal operator of f at v with parameter lambda, using CVX.
  % For testing purposes only; the box constraint below is part of the test setup.
      [n, T] = size(v);
      cvx_begin quiet
          variable x(n,T)
          minimize(f(x,A) + (1/(2*lambda))*square_pos(norm(x - v,'fro')))
          subject to
              x <= 1;
              x >= -1;
      cvx_end
  end
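
For instance (a hypothetical call, assuming CVX is installed), the proximal operator of the $\ell_1$ norm at a random point can be evaluated as follows; note that the box constraint built into the helper above also applies.

  % Hypothetical usage: prox of f(x) = ||x||_1 at a random point v.
  n = 10;
  v = randn(n, 1);
  lambda = 0.5;
  f = @(x, A) norm(x, 1);    % the matrix argument A is unused for this f
  x = prox_f_cvx(v, lambda, f, []);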

7.1 Lasso

Lasso is short for Least Absolute Shrinkage and Selection Operator. The problem is:

$$\text{minimize} \quad (1/2)\|Ax - b\|_2^2 + \gamma\|x\|_1$$

7.1.1 Proximal gradient method

It carries out variable selection (via the $\ell_1$ heuristic) and model fitting (via the least squares term). Consider the splitting:

$$f(x) = (1/2)\|Ax - b\|_2^2, \qquad g(x) = \gamma\|x\|_1$$

With the gradient and proximal operator:

$$\nabla f(x) = A^T(Ax - b), \qquad \mathrm{prox}_g(x) = S_\gamma(x) = (x - \gamma)_+ - (-x - \gamma)_+$$

The proximal gradient method is then:

$$x^{k+1} := \mathrm{prox}_{\lambda^k g}\left(x^k - \lambda^k \nabla f(x^k)\right) = S_{\lambda^k \gamma}\left(x^k - \lambda^k A^T(Ax^k - b)\right)$$

The proximal gradient method is also called ISTA (iterative shrinkage-thresholding algorithm), and the accelerated version is known as FISTA (fast ISTA); the fast version basically adds a momentum term (a minimal code sketch of the basic iteration is given after the list below). Since the proximal gradient method can be interpreted as optimizing f and g separately:

  • We can further accelerate the algorithm with parallel matrix-vector multiplication.
  • Or even use the Gram matrix, as mentioned in the chapter “Evaluating Proximal Operator”.
  • If we want solutions for multiple values of γ, we can use the solution for the largest γ as a warm start.
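
Below is a minimal MATLAB sketch of the ISTA iteration above, under the assumption that the problem data A, b and the regularization weight gamma are already defined; the step size and iteration count are arbitrary choices for illustration.

  % A minimal ISTA sketch for the lasso (hypothetical data A, b, gamma).
  lambda = 1/normest(A)^2;                 % fixed step size <= 1/||A||_2^2
  x = zeros(size(A,2), 1);
  for k = 1:500
      grad = A'*(A*x - b);                 % gradient of f(x) = (1/2)||Ax - b||_2^2
      v = x - lambda*grad;                 % gradient step
      x = max(0, v - lambda*gamma) - max(0, -v - lambda*gamma);   % soft thresholding S_{lambda*gamma}
  end

FISTA would add a momentum step, evaluating the gradient at an extrapolated point built from the two most recent iterates.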

7.1.2 ADMM

Since f is a quadratic function, we can use the same tricks as before:

  • Reuse of the factorization.
  • Warm starting from the previous solution, if an iterative method is used.
  • If n is much smaller than m, we can precompute the Gram matrix $A^TA$.
$$\begin{aligned} x^{k+1} &:= (A^TA + (1/\lambda)I)^{-1}\left(A^Tb + (1/\lambda)(z^k - u^k)\right) \\ z^{k+1} &:= S_{\lambda\gamma}(x^{k+1} + u^k) \\ u^{k+1} &:= u^k + x^{k+1} - z^{k+1} \end{aligned}$$
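
A minimal MATLAB sketch of these updates for the case where m is at least n, assuming problem data A, b, gamma and a fixed lambda, with the factorization cached and reused across iterations, might look like:

  % ADMM for the lasso, skinny case (m >= n); factorization computed once.
  n = size(A, 2);
  R = chol(A'*A + (1/lambda)*eye(n));      % upper-triangular Cholesky factor, reused
  Atb = A'*b;
  x = zeros(n,1); z = zeros(n,1); u = zeros(n,1);
  for k = 1:500
      q = Atb + (1/lambda)*(z - u);
      x = R \ (R' \ q);                    % x-update via the cached factorization
      v = x + u;
      z = max(0, v - lambda*gamma) - max(0, -v - lambda*gamma);   % soft thresholding
      u = u + x - z;
  end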

7.1.3 Test

Here we should pay attention to the update of x. With $A \in \mathbf{R}^{m \times n}$, if m is larger than n we should use the expression above, while when m is smaller than n we can rewrite the update to accelerate it:

$$\begin{aligned} (A^TA + (1/\lambda)I)x &= A^Tb + (1/\lambda)(z^k - u^k) \triangleq q \\ A(A^TA + (1/\lambda)I)x &= Aq \\ (AA^TA + (1/\lambda)A)x &= Aq \\ (AA^T + (1/\lambda)I)Ax &= Aq \\ A^TAx &= A^T(AA^T + (1/\lambda)I)^{-1}Aq \triangleq p \end{aligned}$$

Using this in the original equation we have:

$$(A^TA + (1/\lambda)I)x = q \;\Longrightarrow\; p + (1/\lambda)x = q \;\Longrightarrow\; x = \lambda(q - p)$$

The corresponding MATLAB code is:

   % Fat case (m < n): factor the m-by-m matrix I + lambda*A*A' once and reuse it.
   L = chol(speye(m) + lambda*(A*A'), 'lower');
   L = sparse(L); U = sparse(L');
   q = Atb + (1/lambda)*(z - u);    % Atb = A'*b, precomputed
   x = lambda*(q - lambda*(A'*(U \ ( L \ (A*q) ))));

The original code can be found here, or on my GitHub page. The resulting run times are:

  • CVX time elapsed: 25.06 seconds.
  • Proximal gradient time elapsed: 0.35 seconds.
  • Fast prox gradient time elapsed: 0.17 seconds.
  • ADMM time elapsed: 0.04 seconds.

7.2 Matrix decomposition

The problem is to decompose a matrix A into a sum of components $X_i$:

$$\begin{array}{ll} \text{minimize} & \phi_1(X_1) + \gamma_2\phi_2(X_2) + \cdots + \gamma_N\phi_N(X_N) \\ \text{subject to} & X_1 + X_2 + \cdots + X_N = A \end{array}$$

The functions $\phi_i(X_i)$ can usually be seen as ‘penalties’ that drive the $X_i$ toward the properties we want:

  • Squared Frobenius norm: $\phi(X) = \|X\|_F^2 = \sum_{i,j} X_{i,j}^2$, to encourage X to be small.
  • Entrywise $\ell_1$ norm: $\phi(X) = \|X\|_1 = \sum_{i,j} |X_{i,j}|$, to encourage X to be sparse.
  • Sum-column-norm: $\phi(X) = \sum_j \|x_j\|_2$, to encourage column sparsity (can be interpreted as group lasso regularization).
  • Elementwise constraints: $X_{i,j} \in C_{i,j}$; for instance, we may want to fix some entries (a fixed sparsity pattern).
  • Separable convex function: $\phi(X) = \sum_{i=1}^m \sum_{j=1}^n \phi_{i,j}(X_{i,j})$; for instance, constraints on sub-blocks of the matrix.
  • Semidefinite cone constraint: $X \succeq 0$.
  • Nuclear norm: $\phi(X) = \|X\|_* = \mathbf{tr}\,(X^TX)^{1/2}$, to encourage X to be low rank.

For example, take $\phi_1$ to be the squared Frobenius norm, $\phi_2$ the entrywise $\ell_1$ norm, and $\phi_3$ the nuclear norm; the problem can then be rewritten as:

$$\text{minimize} \quad \|A - (X_2 + X_3)\|_F^2 + \gamma_2\|X_2\|_1 + \gamma_3\|X_3\|_*$$

So we will decompose A into a sum of a small matrix $X_1$, a sparse matrix $X_2$, and a low-rank matrix $X_3$.

7.2.1 ADMM

Consider the splitting:

$$f(X) = \sum_{i=1}^N \phi_i(X_i), \qquad g(X) = I_{\mathcal{C}}(X)$$

where $X = (X_1, \ldots, X_N)$, and:

$$\mathcal{C} = \left\{ (X_1, \ldots, X_N) \;\middle|\; \textstyle\sum_{i=1}^N X_i = A \right\}$$

Here f evaluates the objective function, and g projects onto the feasible set $\mathcal{C}$. The projection is fairly simple; it is similar to a translation of the centroid:

$$(\Pi_{\mathcal{C}}(X))_i = X_i - \bar{X} + (1/N)A, \qquad \bar{X} = \frac{1}{N}\sum_{i=1}^N X_i$$

So the final algorithm looks as follows:

$$\begin{aligned} X_i^{k+1} &:= \mathrm{prox}_{\lambda\phi_i}\!\left(X_i^k - \bar{X}^k + (1/N)A - U^k\right) \\ U^{k+1} &:= U^k + \bar{X}^{k+1} - (1/N)A \end{aligned}$$

7.2.2 Test

Take the former example: $\phi_1$ the squared Frobenius norm, $\phi_2$ the entrywise $\ell_1$ norm, $\phi_3$ the nuclear norm. Write $B = \bar{X}^k - (1/N)A + U^k$. Our updates of the $X_i$ are then:

$$\begin{aligned} X_1 &:= \mathrm{prox}_{\lambda\phi_1}(X_1 - B) = \tfrac{1}{1+\lambda}(X_1 - B) \\ X_2 &:= \mathrm{prox}_{\lambda\phi_2}(X_2 - B) = S_{\gamma_2\lambda}(X_2 - B) \\ X_3 &:= \mathrm{prox}_{\lambda\phi_3}(X_3 - B) = U\,\mathbf{diag}\!\left(S_{\gamma_3\lambda}(\sigma(X_3 - B))\right)V^T \end{aligned}$$

where $U$, $\sigma$, and $V$ come from the SVD of $X_3 - B$.

The corresponding code is:

  X_1 = (1/(1+lambda))*(X_1 - B);                    % prox of squared Frobenius norm
  X_2 = prox_l1(X_2 - B, lambda*g2);                 % prox of entrywise l1 norm (soft thresholding)
  X_3 = prox_matrix(X_3 - B, lambda*g3, @prox_l1);   % prox of nuclear norm (singular value thresholding)

where prox_matrix is defined as:

  function [ Vout ] = prox_matrix(X, eta, prox_l1)
    % Singular value thresholding: apply the elementwise prox_l1 handle
    % to the singular values of X and reassemble the matrix.
    [U,S,V] = svd(X);        %  X = U*S*V'
    Spos = prox_l1(S, eta);  %  threshold the (diagonal) singular values
    Vout = U * Spos * V';
  end
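
Putting these pieces together, one possible form of the full iteration is sketched below; this is illustrative only, assuming the data matrix A, the parameters lambda, g2, g3, and the prox_l1 soft-thresholding helper used above are available.

  % Illustrative ADMM loop for the three-component decomposition (N = 3).
  [m, n] = size(A);
  X_1 = zeros(m,n); X_2 = zeros(m,n); X_3 = zeros(m,n); U = zeros(m,n);
  for k = 1:300
      Xbar = (X_1 + X_2 + X_3)/3;
      B = Xbar - A/3 + U;                                % common offset in each prox
      X_1 = (1/(1+lambda))*(X_1 - B);
      X_2 = prox_l1(X_2 - B, lambda*g2);
      X_3 = prox_matrix(X_3 - B, lambda*g3, @prox_l1);
      U = U + (X_1 + X_2 + X_3)/3 - A/3;                 % dual update
  end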

We get the following output:

  CVX (vs true):
  |V| = 0.31;  |X_1| = 26.23
  nnz(S) = 49; nnz(X_2) = 53
  rank(L) = 4; rank(X_3) = 4

  ADMM (vs true):
  |V| = 0.31;  |X_1| = 26.18
  nnz(S) = 49; nnz(X_2) = 52
  rank(L) = 4; rank(X_3) = 4

  ADMM vs CVX solutions (in Frobenius norm):
  X_1: 3.59e-01; X_2: 6.15e-01; X_3: 5.30e-01

7.3 Multi-period portfolio optimization

We optimize the sum of a risk-adjusted negative return $f_t$ and a transaction cost $g_t$ over a multi-period portfolio investment:

$$\text{minimize} \quad \sum_{t=1}^T f_t(x_t) + \sum_{t=1}^T g_t(x_t - x_{t-1})$$

with constraints that rule out any short position ($x_t \ge 0$) and limit the total amount invested:

$$x_t \ge 0, \qquad \sum_{i=1}^n x_{t,i} \le 1$$

Assume that $f_t$ and $g_t$ are closed proper convex and that the $g_t$ are fully separable, i.e. the transaction cost in any period is the sum of the transaction costs for each asset. Let $X = [x_1, \ldots, x_T] \in \mathbf{R}^{n \times T}$ denote the matrix of the portfolio sequence.

Consider the splitting:

$$f(X) = \sum_{t=1}^T f_t(x_t), \qquad g(X) = \sum_{t=1}^T g_t(x_t - x_{t-1})$$

Where f is separable across the columns of X and g is separable across the rows of X.

Recall the update formulas of ADMM:

$$\begin{aligned} x^{k+1} &:= \mathrm{prox}_{\lambda f}(z^k - u^k) \\ z^{k+1} &:= \mathrm{prox}_{\lambda g}(x^{k+1} + u^k) \\ u^{k+1} &:= u^k + x^{k+1} - z^{k+1} \end{aligned}$$

The update of each column of x is solved using a CVX solver:

$$\begin{array}{ll} \text{minimize} & f_t(x_t) + (1/2\lambda)\|x_t - z_t^k + u_t^k\|_2^2 \\ \text{subject to} & x_t \ge 0, \quad \sum_{i=1}^n x_{t,i} \le 1 \end{array}$$
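
A minimal CVX sketch of this per-column subproblem is given below; the risk model $f_t(x) = -\mu^T x + \kappa\, x^T \Sigma x$ and the data mu, Sigma (assumed PSD), kappa, as well as the n-by-T matrices x, z, u, are hypothetical and serve only to make the example concrete.

  % Hypothetical x-update for one period t (assumes mu, Sigma, kappa, lambda, x, z, u exist).
  n = length(mu);
  v = z(:,t) - u(:,t);                 % z_t^k - u_t^k
  cvx_begin quiet
      variable xt(n)
      minimize( -mu'*xt + kappa*quad_form(xt, Sigma) + (1/(2*lambda))*sum_square(xt - v) )
      subject to
          xt >= 0;                     % no short positions
          sum(xt) <= 1;                % budget limit
  cvx_end
  x(:,t) = xt;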

The update of each row of z solves the following problem:

$$\begin{array}{ll} \text{minimize} & \sum_{t=1}^T g_{t,i}(z_{t,i} - z_{t-1,i}) + (1/2\lambda)\|z_i - x_i^{k+1} - u_i^k\|_2^2 \\ \text{subject to} & z_1 = 0 \end{array}$$

Code can be found on the Stanford page, or on my GitHub. The following image shows the time series of asset holdings.

In this case, the ADMM method converges to the same optimal point as the CVX solver; however, it is much slower.

7.4 Stochastic optimization

We solve a stochastic optimization problem, where $\pi$ is a probability distribution over K scenarios and $f^{(k)}$ is the closed proper convex objective function for scenario k:

$$\text{minimize} \quad \mathbf{E}\, f(x) = \sum_{k=1}^K \pi_k f^{(k)}(x)$$

Reformulate the problem in consensus form by introducing a consensus constraint:

$$\begin{array}{ll} \text{minimize} & \sum_{k=1}^K \pi_k f^{(k)}(x^{(k)}) \\ \text{subject to} & x^{(1)} = \cdots = x^{(K)} \end{array}$$

To use the ADMM method, f takes the form of the objective above, while g is the indicator of the consensus (feasible) set; its proximal operator is a projection, obtained by taking the average of the local solutions $x^{(k)}$, as sketched below.
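
A minimal MATLAB sketch of the resulting consensus ADMM loop, assuming a hypothetical helper prox_fk(v, lambda, k) that evaluates the proximal operator of $\pi_k f^{(k)}$ at v, and that n, K, lambda are given:

  % Consensus ADMM sketch (illustrative only; prox_fk is a hypothetical helper).
  X = zeros(n, K); U = zeros(n, K);    % one column per scenario
  z = zeros(n, 1);
  for iter = 1:200
      for k = 1:K                      % these K updates can run in parallel
          X(:,k) = prox_fk(z - U(:,k), lambda, k);
      end
      z = mean(X + U, 2);              % projection onto the consensus set (averaging)
      U = U + X - repmat(z, 1, K);     % dual update
  end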

Example code can be found here. ADMM and CVX solve the problem in a similar amount of time, though we should note that in ADMM the variable updates can be processed in parallel.
