Let’s think about the problem of solving \min||Ax-b|| where A is an m \times n matrix. Let’s instead look at minimizing the objective function f(x) = \|Ax-b\|^2.

First, we’ll calculate the gradient and set it equal to zero, since we know that must be the global minimum of our objective,
\nabla f = 2A^{T}(Ax-b) = 0
\implies A^{T}Ax = A^{T}b

Then, if A^{T}A has full rank (or, equivalently, A has full column rank), then x = (A^{T}A)^{-1}A^{T}b is the unique solution.


\min \|Ax-b\| = dist(b,S)

The proof that this minimum is unique is left as an exercise.

So, the optimal \hat{x} achieves A\hat{x} = \hat{b}.

Lemma 1.
A(A^{T}A)^{-1}A^{T} is the orthogonal projection from \mathbb{R}^m to S = span\{\vec{a_1} \dots \vec{a_n}\}.

Take any column of A, say \vec{a_i} = A\vec{e_i}. Then, A(A^{T}A)^{-1}A^{T}\vec{a_i} = A\vec{e_i} = \vec{a_i}
Now consider any y \bot \{\vec{a_i} \dots \vec{a_n}\}, so y \cdot \vec{a_i} = 0\ \forall i = 0 \dots n.
Then, A(A^{T}A)^{-1}A^{T}y = 0.
So, for any vector y \in \mathbb{R}^m, y = \sum_{i}\alpha_{i}\vec{a_i} + y'
A(A^{T}A)^{-1}A^{T}y = A(A^{T}A)^{-1}A^{T}(A\sum_{i}\alpha_{i}\vec{e_i})
= \sum_{i}\alpha_{i}\vec{a_i} is the projection of y onto span\{\vec{a_1} \dots \vec{a_n}\}. \square

What if the columns of A are not linearly independent, and so, (A^{T}A)^{-1} is not well defined?

Let’s get familiar with what’s known as the Singular Value Decomposition,
A = UDV^T, with the following definitions,
r = rank(A)
\sigma_1 \geq \dots \geq \sigma_r > 0
U = \{\vec{u_1} \dots \vec{u_r}\} orthonormal columns in \mathbb{R}^m
D = Diag(\sigma_1 \dots \sigma_r)
V = \{\vec{v_1} \dots \vec{v_r}\} orthonormal columns in \mathbb{R}^n

Theorem 1.
For any A \in \mathbb{R}^{m \times n}, the singular values are unique. The singular values are paired u_{i}, v_{i} such that ,
Av_i = \sigma_{i}\vec{u_i}
\vec{u_i}^{T}A = \sigma_{i}\vec{v_i}

If A is nonsingular, then A^{-1} = VD^{-1}U^{T}.

If A is nonsingular, m = n = r. Therefore, AA^{-1} = UDV^{T}VD^{-1}U^{T} = UU^{T}. We know U^{T}U = I \implies U^T = U^{-1} \implies UU^T = I. \square

To continue our search for \arg\min_{x}\|Ax-b\|, we will need to be familiar with the pseudo-inverse of A, A^+ = VD^{-1}U^{T}. Note the following properties,
U^{T}U = V^{T}V = D^{-1}D = I which we will use in the following,
AA^{+}A = UDV^{T}VD^{-1}U^{T}UDV^{T} = UDV^{T} = A
A^{+}AA^{+} = A^{+}

Theorem 2.
\arg\min_{x}\|Ax-b\| = A^{+}b

We know that the solution must satisfy A^{T}Ax = A^{T}b. Let’s verify that x = A^{+}b satisfies this condition.
A^{T}Ax = VDU^{T}UDV^{T}VD^{-1}U^{T}b = VDU^{T}b = A^{T}b
But, is this solution unique?
Ax = AA^{+}b = UDV^{T}VD^{-1}U^{T}b = UU^{T}b
Remember, U has orthonormal columns; so, it’s columns form a basis of span\{\vec{a_1} \dots \vec{a_n}\}.
\vec{u_i} \cdot b are the coordinates of b along $\vec{u_i}$.
The projection of b onto span\{\vec{a_1} \dots \vec{a_n}\} is exactly UU^{T}b.
So, A^{+}b is the unique minimum solution. \square