Model fitting looks for the model that best fits the measurements. The objective normally has an observation error term ($l$, for loss) and a regularization term ($r$); $l$ and $r$ are chosen to be convex. The following shows the problem of fitting a linear model:

$$\text{minimize}\quad l(Ax - b) + r(x)$$

Assume $l$ is additive:

$$l(Ax - b) = \sum_{i=1}^{m} l_i(a_i^T x - b_i)$$

Here $l_i$ is the loss of the $i$th training example, $a_i$ is the feature vector of example $i$ (the $i$th system input), and $b_i$ is the output (response) of example $i$ (the $i$th observation).
- If $r$ is the squared $\ell_2$ norm, $r(x) = \lambda\|x\|_2^2$, this is Tikhonov regularization, or a ridge penalty.
- If $r$ is the $\ell_1$ norm, $r(x) = \lambda\|x\|_1$, this is a lasso penalty (both are illustrated in the sketch after this list).
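As a concrete illustration, here is a minimal MATLAB sketch that evaluates both regularized objectives on synthetic data (all sizes and values are arbitrary assumptions):

% evaluate the two regularized least-squares objectives on synthetic data
m = 50; n = 10;                        % number of examples and features
A = randn(m, n); b = randn(m, 1);      % inputs a_i' (rows of A) and observations b_i
x = randn(n, 1); lambda = 0.1;         % a candidate model and penalty weight
loss  = 0.5 * sum((A*x - b).^2);       % additive squared loss over the m examples
ridge = loss + lambda * sum(x.^2);     % Tikhonov / ridge: loss + lambda*||x||_2^2
lasso = loss + lambda * sum(abs(x));   % lasso: loss + lambda*||x||_1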
In some cases, part of the parameters should not be regularized (e.g., offset parameters). There are two ways to split the problem for distributed solution:
- Split across training examples.
- Split across features.
8.1 Examples
- Regression
- Classification
- Image segmentation, denoising, decomposition.
8.2 Splitting across examples
A large amount of relatively low-dimensional data. Goal: solve the problem in a distributed way. Partition $A$ and $b$ (example inputs and measurements) by rows:
$$A = \begin{bmatrix} A_1 \\ \vdots \\ A_N \end{bmatrix}, \qquad b = \begin{bmatrix} b_1 \\ \vdots \\ b_N \end{bmatrix}$$

The problem becomes:

$$\text{minimize}\quad \sum_{i=1}^{N} l_i(A_i x - b_i) + r(x)$$

Rewrite the problem in consensus form to enable distributed computation (turning it into a standard ADMM problem):
$$\begin{array}{ll} \text{minimize} & \sum_{i=1}^{N} l_i(A_i x_i - b_i) + r(z) \\ \text{subject to} & x_i - z = 0, \quad i = 1, \ldots, N \end{array}$$

Using the scaled form of the ADMM updates (see 7.3 sharing problems for more details):

$$\begin{aligned} x_i^{k+1} &:= \operatorname*{argmin}_{x_i}\left( l_i(A_i x_i - b_i) + (\rho/2)\|x_i - z^k + u_i^k\|_2^2 \right) \\ z^{k+1} &:= \operatorname*{argmin}_{z}\left( r(z) + (N\rho/2)\|\bar{x}^{k+1} - z + \bar{u}^k\|_2^2 \right) \\ u_i^{k+1} &:= u_i^k + x_i^{k+1} - z^{k+1} \end{aligned}$$

A generic sketch of this loop is given after the list of examples below:

- Lasso
- Sparse Logistic Regression
- Support Vector Machine
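The following MATLAB skeleton shows the shape of the consensus loop; `local_x_update` and `prox_r` are hypothetical placeholders for the problem-specific subproblem solvers, and all sizes are assumptions:

% consensus ADMM skeleton for splitting across examples (sketch)
% A{i}, b{i} hold the i-th data block; rho is the penalty parameter
N = numel(A); n = size(A{1}, 2); rho = 1.0;
x = zeros(n, N); u = zeros(n, N); z = zeros(n, 1);
for k = 1:100
    for i = 1:N                                % parallelizable (e.g. parfor)
        x(:,i) = local_x_update(A{i}, b{i}, z - u(:,i), rho);  % hypothetical
    end
    xbar = mean(x, 2); ubar = mean(u, 2);      % collect and average
    z = prox_r(xbar + ubar, N, rho);           % hypothetical prox of r
    u = u + x - repmat(z, 1, N);               % dual update
end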
8.2.1 Group Lasso
For the lasso, the loss $l$ is the squared $\ell_2$ norm and $r$ is the $\ell_1$ norm. The ADMM updates are:
$$\begin{aligned} x_i^{k+1} &:= \operatorname*{argmin}_{x_i}\left( (1/2)\|A_i x_i - b_i\|_2^2 + (\rho/2)\|x_i - z^k + u_i^k\|_2^2 \right) \\ z^{k+1} &:= S_{\lambda/\rho N}(\bar{x}^{k+1} + \bar{u}^k) \\ u_i^{k+1} &:= u_i^k + x_i^{k+1} - z^{k+1} \end{aligned}$$

The x-update is a ridge regression subproblem with a closed-form solution (see the sketch after this list). The differences from the serial version are:
- The updates of the different blocks of variables $x_i$ can be carried out in parallel.
- The collection and averaging steps needed to form $\bar{x}^{k+1}$ and $\bar{u}^k$.
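A minimal MATLAB sketch of the two updates; $S_\kappa$ is elementwise soft thresholding, and the variable names are my assumptions:

% x-update for block i: ridge regression with a closed-form solution
% (the factorization of Ai'*Ai + rho*I can be cached across iterations)
xi = (Ai'*Ai + rho*eye(n)) \ (Ai'*bi + rho*(z - ui));

% z-update: elementwise soft thresholding with threshold lambda/(rho*N)
kappa = lambda / (rho * N);
v = xbar + ubar;
z = sign(v) .* max(abs(v) - kappa, 0);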
8.2.2 Distributed l_1-regularized logistic regression
This can be compared with the serial version; a reference function and script using L-BFGS for the distributed computation can be found on the Stanford ADMM page.
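Each x-update minimizes a smooth objective (logistic loss plus the quadratic ADMM term), which is what the L-BFGS solver is applied to. A hedged sketch of that objective and its gradient, using `fminunc`'s quasi-Newton algorithm as a stand-in for L-BFGS (names and data layout are my assumptions; labels $b_j \in \{-1, +1\}$):

% hand the smooth x-update objective of block i to a quasi-Newton solver
opts = optimoptions('fminunc', 'SpecifyObjectiveGradient', true);
xi = fminunc(@(x) xupdate_obj(x, A{i}, b{i}, z - u(:,i), rho), x(:,i), opts);

% objective and gradient of the block-i x-update:
%   f(x) = sum_j log(1 + exp(-b_j * a_j'*x)) + (rho/2)*||x - v||^2,  v = z - u_i
function [f, g] = xupdate_obj(x, Ai, bi, v, rho)
    m = bi .* (Ai * x);                          % classification margins
    f = sum(log1p(exp(-m))) + (rho/2) * sum((x - v).^2);
    g = -Ai' * (bi ./ (1 + exp(m))) + rho * (x - v);
end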
8.2.3 Support Vector Machine
Here we model a linear support vector machine (SVM), which is another linear model fitting problem: find the linear model $w^T x_j + b$ over the feature vectors $x_j$ that best fits the observations $y_j$, where each $y_j$ is a binary variable.
That is, if the observation $y_j$ is 1 we want $w^T x_j + b \to 1$, and if $y_j$ is -1 we want $w^T x_j + b \to -1$. This gives the optimization problem:
$$\text{minimize}\quad \sum_{j=1}^{M} \left(1 - y_j(w^T x_j + b)\right)$$

In practice, we can truncate the output of the model to 1 or -1, so it is better to optimize:

$$\text{minimize}\quad \sum_{j=1}^{M} \left(1 - y_j(w^T x_j + b)\right)_+$$

where we have $M$ observations in total and $(\cdot)_+ = \max(0, \cdot)$ is the positive part. The problem is equivalent to:
$$\text{minimize}\quad \sum_{j=1}^{M} \left(1 + \begin{bmatrix} -y_j x_j^T & -y_j \end{bmatrix} \begin{bmatrix} w \\ b \end{bmatrix}\right)_+$$

By forming

$$A = \begin{bmatrix} -y_1 x_1^T & -y_1 \\ \vdots & \vdots \\ -y_M x_M^T & -y_M \end{bmatrix}, \qquad x = \begin{bmatrix} w \\ b \end{bmatrix},$$

we have the reformed problem:
$$\text{minimize}\quad \mathbf{1}^T(Ax + \mathbf{1})_+$$

Adding a regularization term on the linear model weights $w$:

$$\text{minimize}\quad \mathbf{1}^T(Ax + \mathbf{1})_+ + (1/2\lambda)\|w\|_2^2$$

If we apply the distributed model, where $i$ indexes a subset of the samples, we have:

$$\text{minimize}\quad \sum_{i=1}^{N} \mathbf{1}^T(A_i x + \mathbf{1})_+ + (1/2\lambda)\|w\|_2^2$$

Applying the consensus variable $z$:
$$\begin{array}{ll} \text{minimize} & \sum_{i=1}^{N} \mathbf{1}^T(A_i x_i + \mathbf{1})_+ + (1/2\lambda)\|w\|_2^2 \\ \text{subject to} & x_i = z, \quad i = 1, \ldots, N \end{array}$$

We further simplify the problem with a small adjustment to the regularization term: instead of regularizing $w$ we regularize $z$ directly. The problem becomes:

$$\begin{array}{ll} \text{minimize} & \sum_{i=1}^{N} \mathbf{1}^T(A_i x_i + \mathbf{1})_+ + (1/2\lambda)\|z\|_2^2 \\ \text{subject to} & x_i = z, \quad i = 1, \ldots, N \end{array}$$

The corresponding ADMM updates are:
$$\begin{aligned} x_i^{k+1} &:= \operatorname*{argmin}_{x_i}\left( \mathbf{1}^T(A_i x_i + \mathbf{1})_+ + (\rho/2)\|x_i - z^k + u_i^k\|_2^2 \right) \\ z^{k+1} &:= \operatorname*{argmin}_{z}\left( (1/2\lambda)\|z\|_2^2 + \sum_{i=1}^{N} (\rho/2)\|x_i^{k+1} - z + u_i^k\|_2^2 \right) \\ u_i^{k+1} &:= u_i^k + x_i^{k+1} - z^{k+1} \end{aligned}$$

The x-update is itself an optimization problem, solved here with CVX:
cvx_begin
    variable x_var(n)
    % hinge term for block i plus the scaled-form ADMM proximal term
    minimize ( sum(pos(A{i}*x_var + 1)) + rho/2*sum_square(x_var - z(:,i) + u(:,i)) )
cvx_end
The z-update is simple; from the first-order optimality condition we have:

$$(1/\lambda)z^{k+1} + \sum_{i=1}^{N} \left(-\rho\,(x_i^{k+1} - z^{k+1} + u_i^k)\right) = 0 \quad\Longrightarrow\quad z^{k+1} = \frac{\rho N}{(1/\lambda) + N\rho}\,(\bar{x}^{k+1} + \bar{u}^k)$$

This gives the final ADMM updates for the linear SVM:
$$\begin{aligned} x_i^{k+1} &:= \operatorname*{argmin}_{x_i}\left( \mathbf{1}^T(A_i x_i + \mathbf{1})_+ + (\rho/2)\|x_i - z^k + u_i^k\|_2^2 \right) \\ z^{k+1} &:= \frac{\rho N}{(1/\lambda) + N\rho}\,(\bar{x}^{k+1} + \bar{u}^k) \\ u_i^{k+1} &:= u_i^k + x_i^{k+1} - z^{k+1} \end{aligned}$$

Code and a script can be found on the Stanford ADMM page (a distributed formulation, but solved serially).
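Putting the three updates together, a minimal serial sketch of the loop (CVX for the x-update as above, with the consensus variable stored as a single vector; names and constants are my assumptions):

% serial simulation of consensus ADMM for the linear SVM (sketch)
% A{i} is the i-th row block of the stacked matrix [-y_j*x_j', -y_j]
N = numel(A); n = size(A{1}, 2); rho = 1.0; lambda = 0.1;
x = zeros(n, N); u = zeros(n, N); z = zeros(n, 1);
for k = 1:50
    for i = 1:N                                    % parallelizable x-updates
        cvx_begin quiet
            variable xi(n)
            minimize( sum(pos(A{i}*xi + 1)) + rho/2*sum_square(xi - z + u(:,i)) )
        cvx_end
        x(:,i) = xi;
    end
    xbar = mean(x, 2); ubar = mean(u, 2);          % collect and average
    z = (rho*N / (1/lambda + rho*N)) * (xbar + ubar);  % closed-form z-update
    u = u + x - repmat(z, 1, N);                   % dual update
end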
8.3 Splitting across Features
Model fitting problems with a modest number of examples and a large number of features.
- NLP (natural language processing): features such as pairs of adjacent words (bigrams), etc.
- Bioinformatics: features such as DNA mutations, etc.
Partition the parameter vector $x$ as $x = (x_1, \ldots, x_N)$ and $A$ by columns as $A = [A_1, \ldots, A_N]$, so that $Ax = \sum_{i=1}^{N} A_i x_i$. The problem becomes:
$$\text{minimize}\quad l\Big(\sum_{i=1}^{N} A_i x_i - b\Big) + \sum_{i=1}^{N} r_i(x_i)$$

Reformulate as a consensus problem:

$$\begin{array}{ll} \text{minimize} & l\Big(\sum_{i=1}^{N} z_i - b\Big) + \sum_{i=1}^{N} r_i(x_i) \\ \text{subject to} & A_i x_i - z_i = 0, \quad i = 1, \ldots, N \end{array}$$

The corresponding scaled form of ADMM is:
$$\begin{aligned} x_i^{k+1} &:= \operatorname*{argmin}_{x_i}\left( r_i(x_i) + (\rho/2)\|A_i x_i - z_i^k + u_i^k\|_2^2 \right) \\ z^{k+1} &:= \operatorname*{argmin}_{z}\left( l\Big(\sum_{i=1}^{N} z_i - b\Big) + \sum_{i=1}^{N} (\rho/2)\|A_i x_i^{k+1} - z_i + u_i^k\|_2^2 \right) \\ u_i^{k+1} &:= u_i^k + A_i x_i^{k+1} - z_i^{k+1} \end{aligned}$$
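A hedged MATLAB sketch of one iteration, taking $l$ to be the squared loss and $r_i$ the $\ell_1$ norm (a feature-split lasso) and using CVX for both subproblems; the data layout and all names are my assumptions:

% one iteration of ADMM split across features (sketch)
% A{i}: the columns of A for feature block i; Z, U are m-by-N matrices
for i = 1:N                                   % parallelizable x-updates
    cvx_begin quiet
        variable xi(size(A{i}, 2))
        minimize( lambda*norm(xi, 1) + rho/2*sum_square(A{i}*xi - Z(:,i) + U(:,i)) )
    cvx_end
    x{i} = xi;
end
AX = zeros(m, N);                             % collect the products A_i*x_i
for i = 1:N, AX(:,i) = A{i} * x{i}; end
cvx_begin quiet                               % joint z-update over all blocks
    variable Z(m, N)
    minimize( 0.5*sum_square(sum(Z, 2) - b) + rho/2*sum(sum_square(AX + U - Z)) )
cvx_end
U = U + AX - Z;                               % dual update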