Optimization, Gradient Descent, Linear Algebra.
By Cristian Gutiérrez
3rd of August, 2023
The least squares method is a mathematical procedure for finding the best-fitting curve to a given set of points by minimizing the sum of the squares of the offsets (the residuals) of the points from the curve. The sum of squared offsets is used instead of the absolute values of the offsets because it makes the residuals a continuous, differentiable quantity.
We need to find the value of \(x\) that minimizes
$$ f(x) = \frac{1}{2} ||Ax - b||_{2}^{2}\,. $$where \(A\) is a matrix, \(x\) is our input vector and \(b\) is a vector of labels.
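This objective is straightforward to evaluate numerically. A minimal NumPy sketch, where the matrix \(A\) and labels \(b\) are small made-up values for illustration (the post itself fixes no concrete data):

```python
import numpy as np

# Illustrative synthetic instance (not from the post).
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))   # design matrix
b = rng.normal(size=5)        # label vector

def f(x, A, b):
    """Least squares objective: 0.5 * ||Ax - b||_2^2."""
    r = A @ x - b             # residual vector
    return 0.5 * r @ r

x = np.zeros(2)
print(f(x, A, b))             # at x = 0 this equals 0.5 * ||b||^2
```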
From the squared \(L^2\) norm, and because \(Ax - b\) is a vector, we can expand the expression further. In doing so, recall that \((BA)^T = A^TB^T\).
$$ f(x) = \frac{1}{2} \bigl(Ax - b \bigr)^T\bigl(Ax - b \bigr)= $$ $$ = \frac{1}{2} \bigl( x^TA^T - b^T \bigr)\bigl(Ax - b \bigr)= $$ $$ = \frac{1}{2} \bigl(x^TA^TAx - x^TA^Tb - b^TAx + b^Tb\bigr)\,. $$Also note that the dot product of two vectors commutes: \(a^Tb = b^Ta\).
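The expansion can be sanity-checked numerically: the norm form and the four-term expanded form should agree for any random \(A\), \(b\), and \(x\) (the values below are arbitrary test data, not from the post):

```python
import numpy as np

# Arbitrary test data for the identity check.
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))
b = rng.normal(size=4)
x = rng.normal(size=3)

norm_form = 0.5 * np.linalg.norm(A @ x - b) ** 2
expanded  = 0.5 * (x @ A.T @ A @ x - x @ A.T @ b - b @ A @ x + b @ b)
assert np.isclose(norm_form, expanded)   # both forms compute the same scalar
```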
In order to minimize the function over \(x\), we compute the gradient to see in which direction we should move. Remember that the gradient points in the direction of steepest ascent; thus, by moving in the opposite direction of the gradient we minimize \(f\) in an optimal way:
$$ \nabla_x f(x) = \frac{\partial}{\partial x}\frac{1}{2} \bigl(x^TA^TAx - x^TA^Tb - b^TAx + b^Tb\bigr)=0\,. $$In order to differentiate matrix and vector operations, we write them as sums and then apply ordinary scalar differentiation.
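Instead of solving \(\nabla_x f(x) = 0\) in closed form, we can also follow the negative gradient iteratively. A minimal gradient descent sketch using the standard least-squares gradient \(\nabla_x f(x) = A^T(Ax - b)\); the data, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

# Illustrative synthetic problem (not from the post).
rng = np.random.default_rng(2)
A = rng.normal(size=(20, 3))
b = rng.normal(size=20)

x = np.zeros(3)
lr = 0.01                        # step size, chosen by hand for this example
for _ in range(5000):
    grad = A.T @ (A @ x - b)     # gradient of 0.5 * ||Ax - b||^2
    x -= lr * grad               # move opposite to the gradient

# Compare with the closed-form solution of the normal equations A^T A x = A^T b.
x_star = np.linalg.solve(A.T @ A, A.T @ b)
assert np.allclose(x, x_star, atol=1e-5)
```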
We will use two rules, or known partial derivatives, that will come in handy later on:
$$ \frac{\partial}{\partial x}\, a^Tx = a\,, \qquad \frac{\partial}{\partial x}\, x^TBx = \bigl(B + B^T\bigr)x\,. $$
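Two standard matrix-calculus identities, \(\frac{\partial}{\partial x} a^Tx = a\) and \(\frac{\partial}{\partial x} x^TBx = (B + B^T)x\), can be verified against a central finite-difference approximation; the sizes and values below are arbitrary:

```python
import numpy as np

# Arbitrary data for the check.
rng = np.random.default_rng(3)
n = 4
a = rng.normal(size=n)
B = rng.normal(size=(n, n))
x = rng.normal(size=n)
eps = 1e-6

def num_grad(g, x):
    """Central finite-difference gradient of a scalar function g."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (g(x + e) - g(x - e)) / (2 * eps)
    return grad

# Rule 1: d/dx (a^T x) = a
assert np.allclose(num_grad(lambda v: a @ v, x), a, atol=1e-5)
# Rule 2: d/dx (x^T B x) = (B + B^T) x
assert np.allclose(num_grad(lambda v: v @ B @ v, x), (B + B.T) @ x, atol=1e-4)
```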
Now we can proceed to evaluate our gradient, splitting it into the four terms inside the parentheses:
Note that here we will perform some manipulations to simplify things. From The Matrix Cookbook, we have that:
$$ A^TB^TC^T = (CBA)^T\,. $$Furthermore, \(b^TAx\) is a \(1\times 1\) matrix, a scalar, and since transposing a scalar has no effect, we get:
$$ b^TAx = (x^TA^Tb)^T = x^TA^Tb\,. $$We can now differentiate our third term:
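This scalar identity is easy to confirm numerically with arbitrary values (again, made-up data just for the check):

```python
import numpy as np

# Arbitrary data: b^T A x and x^T A^T b must be the same scalar.
rng = np.random.default_rng(4)
A = rng.normal(size=(3, 2))
b = rng.normal(size=3)
x = rng.normal(size=2)

assert np.isclose(b @ A @ x, x @ A.T @ b)
```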
$$ \frac{\partial}{\partial x}\, b^TAx = \frac{\partial}{\partial x}\,(x^TA^Tb)^T = \frac{\partial}{\partial x}\,x^TA^Tb = A^Tb\,. $$
References