Here we list some frequently encountered cases and introduce some commonly used methods. Since the x-update and the z-update have the same structure (we can simply swap the parameter notation), we discuss only the x-update; the same principles apply to the z-update. Consider the general x-update:
\[x^{+} = \arg\min_{x}\left(f(x) + (\rho/2)\|Ax - v\|_{2}^{2}\right)\]
where $v = -Bz + c - u$ is a known constant vector in the x-update.
4.1 Proximity Operator
More detail can be found on the proximal algorithms page (corresponding to the Proximal Algorithms paper by S. Boyd, which is recommended reading).
If we take $A = I$, the x-update simplifies to a proximity operator:
\[x^{+} = \arg\min_{x}\left(f(x) + (\rho/2)\|x - v\|_{2}^{2}\right) = \mathbf{prox}_{(1/\rho)f}(v)\]
This expression is closely related to the Moreau envelope, or Moreau-Yosida regularization, of $f$:
\[M_{(1/\rho)f}(v) = \inf_{x}\left(f(x) + (\rho/2)\|x - v\|_{2}^{2}\right)\]
Some useful results:
- The Moreau envelope of the $\ell_1$ norm is the Huber function, and the proximity operator of the $\ell_1$ norm is the soft-thresholding (shrinkage) operator (see the sketch after this list).
- The proximity operator of the indicator function of a set is the projection onto that set.
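As a concrete illustration, here is a minimal NumPy sketch of both results; the function names and the box constraint are our own choices for illustration, not from any particular library.

```python
import numpy as np

def prox_l1(v, lam):
    """Prox of lam * ||x||_1: elementwise soft-thresholding (shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_box_indicator(v, lo, hi):
    """Prox of the indicator of the box [lo, hi]^n: projection onto the box."""
    return np.clip(v, lo, hi)

# In the x-update with A = I and f the l1 norm, the threshold is 1/rho:
rho = 2.0
v = np.array([1.5, -0.2, 0.7])
x_plus = prox_l1(v, 1.0 / rho)  # argmin_x ||x||_1 + (rho/2)||x - v||^2
```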
4.2 Quadratic Objective Function
Consider the quadratic primal objective function (with P symmetric positive semidefinite):
\[f(x) = (1/2)x^{T}Px + q^{T}x + r\]
Assuming $P + \rho A^{T}A$ is invertible, we can derive the optimal $x$ from the first-order optimality condition:
\[\nabla_{x}\left(f(x) + (\rho/2)\|Ax - v\|_{2}^{2}\right) = Px^{+} + q + \rho A^{T}(Ax^{+} - v) = 0\]
\[x^{+} = (P + \rho A^{T}A)^{-1}(\rho A^{T}v - q)\]
We have several options for solving this linear system:
- Solve directly: form the matrix $P + \rho A^{T}A$ (cost $\mathcal{O}(pn^{2})$ for $A \in \mathbf{R}^{p \times n}$), then compute and cache its factorization ($\mathcal{O}(n^{3})$); each subsequent back-solve costs only $\mathcal{O}(n^{2})$ (see the sketch after this list).
- Further exploit sparsity in $P$ and $A$ to reduce the factorization cost.
- If we update $\rho$, we must refactor $P + \rho A^{T}A$, which is expensive. We should weigh the benefit of updating $\rho$ against the cost of refactorization.
- We can use the matrix inversion lemma (Sherman-Morrison-Woodbury) to simplify the inverse, provided the required inverses exist:
\[(P + \rho A^{T}A)^{-1} = P^{-1} - P^{-1}A^{T}\left((1/\rho)I + A P^{-1} A^{T}\right)^{-1} A P^{-1}\]
This reduces the update to a faster computation when $P^{-1}$ can be applied cheaply or the dimension $p$ is small, since the inner matrix is only $p \times p$.
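A minimal sketch of the factor-once, back-solve-many-times strategy (using NumPy/SciPy; the function name is ours, and we assume $P + \rho A^{T}A$ is symmetric positive definite so a Cholesky factorization applies):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def make_x_update(P, A, q, rho):
    """Return an x-update function that reuses a cached Cholesky factorization.

    Assumes P + rho * A^T A is symmetric positive definite.
    """
    M = P + rho * A.T @ A   # form the matrix once: O(p n^2)
    factor = cho_factor(M)  # factor once: O(n^3)

    def x_update(v):
        # Each ADMM iteration is then only an O(n^2) back-solve.
        return cho_solve(factor, rho * A.T @ v - q)

    return x_update
```

If $\rho$ changes, `make_x_update` must be called again; this is exactly the refactorization cost noted in the list above.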
Adding an equality constraint $Fx = g$, with an additional multiplier variable $\lambda$, the x-update becomes:
\[x^{+} = \arg\min_{x}\left((1/2)x^{T}Px + q^{T}x + r + \lambda^{T}(Fx - g) + (\rho/2)\|Ax - v\|_{2}^{2}\right)\]
The first-order optimality conditions are:
\[\begin{align*} \frac{\partial (\cdot)}{\partial x} &= (P + \rho A^{T}A)x + q - \rho A^{T}v + F^{T}\lambda = 0 \\ \frac{\partial (\cdot)}{\partial \lambda} &= Fx - g = 0 \end{align*}\]
These conditions correspond to the solution of the following (KKT) linear system:
\[\begin{bmatrix} P + \rho A^{T}A & F^{T}\\ F & 0 \end{bmatrix} \begin{bmatrix} x^{+} \\ \lambda \end{bmatrix} + \begin{bmatrix} q - \rho A^{T}v \\ -g \end{bmatrix} = 0\]
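A minimal sketch of solving this KKT system directly (NumPy; the function name is ours, and for simplicity we use a dense solve rather than a cached factorization):

```python
import numpy as np

def x_update_eq_constrained(P, A, F, q, g, v, rho):
    """Solve the KKT system for the equality-constrained x-update."""
    n, m = P.shape[0], F.shape[0]
    # Assemble the block KKT matrix and right-hand side.
    K = np.block([[P + rho * A.T @ A, F.T],
                  [F, np.zeros((m, m))]])
    rhs = np.concatenate([rho * A.T @ v - q, g])
    sol = np.linalg.solve(K, rhs)
    return sol[:n]  # x^+; sol[n:] is the multiplier lambda
```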
4.3 Smooth Objective Terms
Now consider some more complicated cases. When $f$ is smooth, general iterative methods can be used to carry out the x-minimization: the standard Newton's method, or, for large problems, the conjugate gradient method or the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm. Convergence depends on the conditioning of $f$; the quadratic penalty term tends to improve it. Two standard tricks help (see the sketch after this list):
- Early termination: solve the subproblem only approximately, especially in early ADMM iterations.
- Warm start: initialize the iterative method at the previous iterate $x^{k}$.
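A minimal sketch of both tricks using SciPy's L-BFGS implementation (the function name, iteration cap, and tolerance are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def x_update_smooth(f, grad_f, A, v, rho, x_warm):
    """Approximate x-update for smooth f via warm-started, loosely terminated L-BFGS."""
    def obj(x):
        r = A @ x - v
        return f(x) + 0.5 * rho * (r @ r)

    def grad(x):
        return grad_f(x) + rho * A.T @ (A @ x - v)

    # Warm start at the previous iterate; loose tolerances give early termination.
    res = minimize(obj, x_warm, jac=grad, method="L-BFGS-B",
                   options={"maxiter": 50, "gtol": 1e-4})
    return res.x
```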
4.4 Decomposition
If $f$ is block separable, i.e. $f(x) = \sum_{i} f_{i}(x_{i})$ for disjoint blocks $x_{i}$ of $x$, and $A^{T}A$ is block diagonal conformably with this partition, then the quadratic penalty also separates, and the x-update splits into independent subproblems that can be computed in parallel (a sketch follows).
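For example, when $A = I$ the update splits blockwise into independent proximal subproblems. A minimal serial sketch (the actual parallel dispatch is left to whatever framework is at hand):

```python
def x_update_separable(prox_fns, v_blocks, rho):
    """x-update for block-separable f with A = I.

    prox_fns[i] evaluates the prox of (1/rho) * f_i on its block of v.
    Each block is independent, so this loop can be distributed across
    threads, processes, or machines.
    """
    return [prox(v, 1.0 / rho) for prox, v in zip(prox_fns, v_blocks)]
```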