# Singular Value Decomposition

Last time we saw that $\min_x \|Ax-b\|$ is achieved by $x = A^{+}b$, where the pseudoinverse $A^{+} = VD^{-1}U^T$ depends on the singular value decomposition of $A$.

This time we will explore singular value decomposition further.

Theorem 1.
Any $m \times n$ real matrix $A$ can be written as $A = UDV^T$ where $U,V$ have orthonormal columns and $D$ is a diagonal matrix with entries $\sigma_1 \geq \dots \geq \sigma_r > 0$ where $r = \operatorname{rank}(A)$.

Note.
$AV = UDV^{T}V = UD$
$A^{T}U = VD$
or equivalently,
$A\boldsymbol{v^{(i)}} = \sigma_i \boldsymbol{u^{(i)}}$
$A^{T}\boldsymbol{u^{(i)}} = \sigma_i \boldsymbol{v^{(i)}}$

$\boldsymbol{u^{(i)}}$ is a left singular vector
$\boldsymbol{v^{(i)}}$ is a right singular vector
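These relations are easy to check numerically. A minimal sketch with NumPy (the matrix and seed below are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))  # a generic 5x3 matrix, rank 3 almost surely

# np.linalg.svd returns U, the singular values, and V^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Check A v^(i) = sigma_i u^(i) and A^T u^(i) = sigma_i v^(i) for each i
for i in range(len(s)):
    assert np.allclose(A @ Vt[i], s[i] * U[:, i])
    assert np.allclose(A.T @ U[:, i], s[i] * Vt[i])

# Columns of U and V are orthonormal
assert np.allclose(U.T @ U, np.eye(3))
assert np.allclose(Vt @ Vt.T, np.eye(3))
```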

Lemma 1.
$\boldsymbol{v^{(1)}} = \arg\max_{\|\boldsymbol{v}\|=1}\|A\boldsymbol{v}\|^2$

Proof. (Linear Algebra)
$A^{T}A\boldsymbol{v^{(i)}} = \sigma_i A^{T}\boldsymbol{u^{(i)}} = \sigma_i^2 \boldsymbol{v^{(i)}}$

So, each $\boldsymbol{v^{(i)}}$ is an eigenvector of $A^{T}A$
and each $\boldsymbol{u^{(i)}}$ is an eigenvector of $AA^{T}$

For any eigenvector $\boldsymbol{v}$ of $A^{T}A$ with eigenvalue $\lambda > 0$ (all eigenvalues of $A^{T}A$ are nonnegative, since $A^{T}A$ is positive semidefinite),
$A^{T}A\boldsymbol{v} = \lambda\boldsymbol{v}$
define $\boldsymbol{u} = \frac{1}{\sqrt{\lambda}}A\boldsymbol{v}$.

Then, $A\boldsymbol{v} = \sqrt{\lambda}\boldsymbol{u}$
and $A^{T}\boldsymbol{u} = \sqrt{\lambda}\boldsymbol{v}$.

So $\arg\max_{\|\boldsymbol{v}\|=1}\|A\boldsymbol{v}\|^2$ is the top eigenvector of $A^{T}A$ (the maximizer of the Rayleigh quotient $\boldsymbol{v}^T A^{T}A \boldsymbol{v}$ over unit vectors), and hence the top right singular vector of $A$. $\square$
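The correspondence in this proof can be sanity-checked with NumPy: the top eigenvector of $A^{T}A$ matches the top right singular vector of $A$ up to sign, and the top eigenvalue is $\sigma_1^2$ (random matrix used for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))

# Top right singular vector from the SVD
_, s, Vt = np.linalg.svd(A, full_matrices=False)
v1 = Vt[0]

# Top eigenvector of A^T A (eigh returns eigenvalues in ascending order)
evals, evecs = np.linalg.eigh(A.T @ A)
w = evecs[:, -1]

# Same direction up to sign, and top eigenvalue = sigma_1^2
assert np.isclose(abs(v1 @ w), 1.0)
assert np.isclose(evals[-1], s[0] ** 2)
```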

Proof. (Calculus)
Consider the objective function $f(\boldsymbol{v}) = \|A\boldsymbol{v}\|^2 = \boldsymbol{v^{T}}A^{T}A\boldsymbol{v}$.

Then, $\nabla f(\boldsymbol{v}) = 2A^{T}A\boldsymbol{v}$.

Intuitively, if $\boldsymbol{v}$ is constrained to the surface of the unit sphere, $\|\boldsymbol{v}\| = 1$, then at any local optimum $\boldsymbol{v'}$ the gradient $\nabla f(\boldsymbol{v'})$ is parallel to $\boldsymbol{v'}$: otherwise the component of the gradient tangent to the sphere would give a direction along the sphere in which $f$ increases. The converse also holds.

This condition is exactly $2A^{T}A\boldsymbol{v} = \lambda \boldsymbol{v}$ for some scalar $\lambda$.

By the above, all local optima of $\|A\boldsymbol{v}\|^2$ on the sphere are eigenvectors of $A^{T}A$. So, $\max \|A\boldsymbol{v}\|^2$ is achieved by the top eigenvector of $A^{T}A$, i.e., a right singular vector of $A$. $\square$
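The calculus view suggests an iterative scheme: repeatedly step in the gradient direction $2A^{T}A\boldsymbol{v}$ and renormalize onto the sphere. This is power iteration on $A^{T}A$, sketched here with an illustrative random matrix (not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))

# Step in the direction of grad f(v) = 2 A^T A v, then renormalize.
# This is power iteration on A^T A; for a generic starting vector it
# converges to the top eigenvector.
v = rng.standard_normal(4)
v /= np.linalg.norm(v)
for _ in range(500):
    v = A.T @ (A @ v)
    v /= np.linalg.norm(v)

# The limit is the top right singular vector (up to sign),
# and ||Av||^2 equals sigma_1^2
_, s, Vt = np.linalg.svd(A)
assert np.isclose(abs(v @ Vt[0]), 1.0)
assert np.isclose(np.linalg.norm(A @ v) ** 2, s[0] ** 2)
```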

Can we reason similarly about $\boldsymbol{v^{(i)}}, \boldsymbol{u^{(i)}}\ \forall\ i=2 \dots r$?

Lemma 2.
$\boldsymbol{v^{(1)}} = \arg\max_{\|\boldsymbol{v}\|\leq 1}\|A\boldsymbol{v}\|^2$
$\boldsymbol{v^{(2)}} = \arg\max_{\|\boldsymbol{v}\|\leq 1 \atop \boldsymbol{v} \bot \boldsymbol{v^{(1)}}}\|A\boldsymbol{v}\|^2$
$\vdots$

$\boldsymbol{v^{(k)}} = \arg\max_{\|\boldsymbol{v}\|\leq 1 \atop \boldsymbol{v} \bot \boldsymbol{v^{(1)}} ... \boldsymbol{v^{(k-1)}}}\|A\boldsymbol{v}\|^2$

Then, $\boldsymbol{v^{(i)}} \ \forall\ i=2 \dots r$ are right singular vectors of $A$.

We can view the rows $A_{(i)}$ of $A$ as points in $\mathbb{R}^n$. Then, $\boldsymbol{v^{(1)}}$ is the best-fit line (that goes through the origin), or the best-fit one-dimensional subspace. This is represented by,
$\boldsymbol{v^{(1)}} = \arg\min_{\|\boldsymbol{v}\|=1}\sum_{i=1}^{m}d(A_{(i)},\boldsymbol{v})^2$

Proof.
By the Pythagorean Theorem applied to each row, $\sum_{i}\|A_{(i)}\|^2 = \sum_{i}d(A_{(i)},\boldsymbol{v})^2 + \sum_{i}(A_{(i)} \cdot \boldsymbol{v})^2$. The left-hand side is fixed, so minimizing $\sum_{i}d(A_{(i)},\boldsymbol{v})^2$ is equivalent to maximizing $\sum_{i}(A_{(i)} \cdot \boldsymbol{v})^2 = \|A\boldsymbol{v}\|^2$. $\square$
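The Pythagorean decomposition above can be verified numerically for any unit vector; a short sketch (random data for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 3))
v = rng.standard_normal(3)
v /= np.linalg.norm(v)  # unit vector spanning a candidate line

proj = np.outer(A @ v, v)          # projection of each row onto span{v}
dist_sq = np.sum((A - proj) ** 2)  # sum of squared distances to the line

# Pythagoras per row: ||A_(i)||^2 = d(A_(i), v)^2 + (A_(i) . v)^2
assert np.isclose(np.sum(A ** 2), dist_sq + np.linalg.norm(A @ v) ** 2)
```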

Theorem 2.
$V_k = \operatorname{span}\{\boldsymbol{v^{(1)}} ... \boldsymbol{v^{(k)}}\}$
Then, $V_k$ is the best-fit $k$-dimensional subspace for $A$,
$V_k = \arg\min_{V:\dim(V) \leq k}\sum_{i}d(A_{(i)}, V)^2$

Proof.
We proceed by induction on $k$.
$k = 1$: this is the best-fit line statement proved above (via Lemma 1).
Suppose Theorem 2 holds for $(k-1)$-dimensional subspaces, so $V_{k-1}$ is the best-fit $(k-1)$-dimensional subspace.
Let $W$ be the best-fit $k$-dimensional subspace.
Since $\dim(W) = k$ and the projection of $W$ onto $V_{k-1}$ has dimension at most $k-1$, $\exists$ a unit vector $\boldsymbol{w_k} \in W$ such that $\boldsymbol{w_k} \bot V_{k-1}$.
Let $W_{k-1}$ be the orthogonal complement of $\boldsymbol{w_k}$ within $W$, so $W = \operatorname{span}\{\boldsymbol{w_k}\} \oplus W_{k-1}$ and $\dim(W_{k-1}) = k-1$.
Writing $d(A, S)^2 = \sum_{i}d(A_{(i)}, S)^2$, the Pythagorean Theorem applied row by row gives
$d(A, W)^2 = d(A, W_{k-1})^2 - \|A\boldsymbol{w_k}\|^2$
$\geq d(A, V_{k-1})^2 - \|A\boldsymbol{w_k}\|^2$ since $V_{k-1}$ is the best $(k-1)$-dimensional subspace, by the IH
$\geq d(A, V_{k-1})^2 - \|A\boldsymbol{v^{(k)}}\|^2$ since $\boldsymbol{v^{(k)}}$ maximizes $\|A\boldsymbol{v}\|^2$ among unit vectors orthogonal to $V_{k-1}$
$= d(A, V_k)^2$. $\square$
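Theorem 2 can be checked numerically: projecting the rows onto $V_k$ leaves a residual $\sum_{i}d(A_{(i)}, V_k)^2$ that equals the sum of the squared tail singular values $\sum_{i > k}\sigma_i^2$. A sketch with an illustrative random matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
Vk = Vt[:k].T                       # basis for V_k = span{v^(1), ..., v^(k)}
proj = A @ Vk @ Vk.T                # project each row onto V_k
residual = np.sum((A - proj) ** 2)  # sum_i d(A_(i), V_k)^2

# The best-fit residual is the sum of the squared tail singular values
assert np.isclose(residual, np.sum(s[k:] ** 2))
```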

Proof of Theorem 1.
We want to show that $A = \sum_{i=1}^{r}\sigma_{i} \boldsymbol{u^{(i)}}\boldsymbol{v^{(i)}}^T$.

Both sides are $m \times n$ matrices, so it suffices to check that they agree on a basis of $\mathbb{R}^n$. For $j \leq r$, $LHS \cdot \boldsymbol{v^{(j)}} = A\boldsymbol{v^{(j)}} = \sigma_{j}\boldsymbol{u^{(j)}}$, and by orthonormality of the $\boldsymbol{v^{(i)}}$, $RHS \cdot \boldsymbol{v^{(j)}} = \sum_{i=1}^{r}\sigma_{i}\boldsymbol{u^{(i)}}\boldsymbol{v^{(i)}}^T\boldsymbol{v^{(j)}} = \sigma_{j}\boldsymbol{u^{(j)}}$. For any $\boldsymbol{v}$ orthogonal to $\boldsymbol{v^{(1)}}, \dots, \boldsymbol{v^{(r)}}$, both sides give $\boldsymbol{0}$: the RHS by orthogonality, and the LHS because otherwise Lemma 2 would yield a singular value $\sigma_{r+1} > 0$, contradicting $r = \operatorname{rank}(A)$. Hence $A\boldsymbol{v} = (\sum_{i=1}^{r}\sigma_{i}\boldsymbol{u^{(i)}}\boldsymbol{v^{(i)}}^T)\boldsymbol{v}$ for every $\boldsymbol{v} \in \mathbb{R}^n \implies A = \sum_{i=1}^{r}\sigma_{i}\boldsymbol{u^{(i)}}\boldsymbol{v^{(i)}}^T$. $\square$
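The rank-one expansion can be verified directly by rebuilding $A$ from its singular triples (random matrix for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((5, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A as a sum of rank-one terms sigma_i u^(i) v^(i)^T
B = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))
assert np.allclose(A, B)
```

Truncating the same sum after $k$ terms gives the best rank-$k$ approximation of $A$ in Frobenius norm, which is the content of Theorem 2.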