Optimization, Gradient Descent, Linear Algebra.
By Cristian Gutiérrez
3rd of August, 2023
The least squares method is a mathematical procedure for finding the best-fitting curve to a given set of points by minimizing the sum of the squares of the offsets (the residuals) of the points from the curve. The sum of squared offsets is used instead of the absolute values of the offsets because it makes the residuals a continuous, differentiable quantity.
We need to find the value of \(x\) that minimizes
$$ f(x) = \frac{1}{2} ||Ax - b||_{2}^{2}\,. $$where \(A\) is a matrix, \(x\) is our input vector and \(b\) is a vector of labels.
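This objective is straightforward to evaluate numerically. A minimal NumPy sketch, where the matrix \(A\) and labels \(b\) are small made-up values for illustration (the post itself fixes no concrete data):

```python
import numpy as np

# Illustrative synthetic instance (not from the post).
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))   # design matrix
b = rng.normal(size=5)        # label vector

def f(x, A, b):
    """Least squares objective: 0.5 * ||Ax - b||_2^2."""
    r = A @ x - b             # residual vector
    return 0.5 * r @ r

x = np.zeros(2)
print(f(x, A, b))             # at x = 0 this equals 0.5 * ||b||^2
```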
From the squared \(L^2\) norm, and because \(Ax - b\) is a vector, we can expand the expression further. In doing so, recall that \((BA)^T = A^TB^T\).
$$ f(x) = \frac{1}{2} \bigl(Ax - b \bigr)^T\bigl(Ax - b \bigr)= $$ $$ = \frac{1}{2} \bigl( x^TA^T - b^T \bigr)\bigl(Ax - b \bigr)= $$ $$ = \frac{1}{2} \bigl(x^TA^TAx - x^TA^Tb - b^TAx + b^Tb\bigr)\,. $$Also note that the dot product of two vectors commutes: \(a^Tb = b^Ta\).
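The expansion can be sanity-checked numerically: the norm form and the four-term expanded form should agree for any random \(A\), \(b\), and \(x\) (the values below are arbitrary test data, not from the post):

```python
import numpy as np

# Arbitrary test data for the identity check.
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))
b = rng.normal(size=4)
x = rng.normal(size=3)

norm_form = 0.5 * np.linalg.norm(A @ x - b) ** 2
expanded  = 0.5 * (x @ A.T @ A @ x - x @ A.T @ b - b @ A @ x + b @ b)
assert np.isclose(norm_form, expanded)   # both forms compute the same scalar
```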
In order to minimize the function over \(x\), we compute the gradient to see in which direction we should move. Remember that the gradient points in the direction of steepest ascent; thus, by moving in the opposite direction of the gradient we minimize \(f\) in an optimal way:
$$ \nabla_x f(x) = \frac{\partial}{\partial x}\frac{1}{2} \bigl(x^TA^TAx - x^TA^Tb - b^TAx + b^Tb\bigr)=0\,. $$In order to differentiate matrix and vector operations, we write them as sums and then apply ordinary scalar differentiation.
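Instead of solving \(\nabla_x f(x) = 0\) in closed form, we can also follow the negative gradient iteratively. A minimal gradient descent sketch using the standard least-squares gradient \(\nabla_x f(x) = A^T(Ax - b)\); the data, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

# Illustrative synthetic problem (not from the post).
rng = np.random.default_rng(2)
A = rng.normal(size=(20, 3))
b = rng.normal(size=20)

x = np.zeros(3)
lr = 0.01                        # step size, chosen by hand for this example
for _ in range(5000):
    grad = A.T @ (A @ x - b)     # gradient of 0.5 * ||Ax - b||^2
    x -= lr * grad               # move opposite to the gradient

# Compare with the closed-form solution of the normal equations A^T A x = A^T b.
x_star = np.linalg.solve(A.T @ A, A.T @ b)
assert np.allclose(x, x_star, atol=1e-5)
```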
We will use two rules, or known partial derivatives, that will come in handy later on:
$$ \frac{\partial}{\partial x}\, a^Tx = a\,, \qquad \frac{\partial}{\partial x}\, x^TBx = \bigl(B + B^T\bigr)x\,. $$
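Two standard matrix-calculus identities, \(\frac{\partial}{\partial x} a^Tx = a\) and \(\frac{\partial}{\partial x} x^TBx = (B + B^T)x\), can be verified against a central finite-difference approximation; the sizes and values below are arbitrary:

```python
import numpy as np

# Arbitrary data for the check.
rng = np.random.default_rng(3)
n = 4
a = rng.normal(size=n)
B = rng.normal(size=(n, n))
x = rng.normal(size=n)
eps = 1e-6

def num_grad(g, x):
    """Central finite-difference gradient of a scalar function g."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (g(x + e) - g(x - e)) / (2 * eps)
    return grad

# Rule 1: d/dx (a^T x) = a
assert np.allclose(num_grad(lambda v: a @ v, x), a, atol=1e-5)
# Rule 2: d/dx (x^T B x) = (B + B^T) x
assert np.allclose(num_grad(lambda v: v @ B @ v, x), (B + B.T) @ x, atol=1e-4)
```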
Now we can proceed to evaluate our gradient, splitting it into the four terms inside the parentheses:
Note that here we will perform some manipulations to simplify things. From The Matrix Cookbook, we have that:
$$ A^TB^TC^T = (CBA)^T\,. $$Furthermore, \(b^TAx\) is a \(1\times 1\) matrix, a scalar, and since transposing a scalar has no effect, we get:
$$ b^TAx = (x^TA^Tb)^T = x^TA^Tb\,. $$We can now differentiate our third term:
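This scalar identity is easy to confirm numerically with arbitrary values (again, made-up data just for the check):

```python
import numpy as np

# Arbitrary data: b^T A x and x^T A^T b must be the same scalar.
rng = np.random.default_rng(4)
A = rng.normal(size=(3, 2))
b = rng.normal(size=3)
x = rng.normal(size=2)

assert np.isclose(b @ A @ x, x @ A.T @ b)
```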
$$ \frac{\partial}{\partial x}\, b^TAx = \frac{\partial}{\partial x}\,(x^TA^Tb)^T = \frac{\partial}{\partial x}\,x^TA^Tb = A^Tb\,. $$
References