Singular Value Decomposition

Last time we saw that \min \|Ax-b\| is achieved by x = A^{+}b where A^{+} = VD^{-1}U^T depended on the singular value decomposition of A.

This time we will explore singular value decomposition further.

Theorem 1.
Any m \times n real matrix A can be written as A = UDV^T where U,V have orthonormal columns and D is a diagonal matrix with entries \sigma_1 \geq \dots \geq \sigma_r > 0 where r = rank(A).

AV = UDV^{T}V = UD
A^{T}U = VD
or equivalently,
A\boldsymbol{v^{(i)}} = \sigma_i \boldsymbol{u^{(i)}}
A^{T}\boldsymbol{u^{(i)}} = \sigma_i \boldsymbol{v^{(i)}}

\boldsymbol{u^{(i)}} is a left singular vector
\boldsymbol{v^{(i)}} is a right singular vector

Lemma 1.
\boldsymbol{v^{(1)}} = \arg\max_{\|\boldsymbol{v}\|=1}\|A\boldsymbol{v}\|^2

Proof. (Linear Algebra)
A^{T}A\boldsymbol{v} = \sigma A^{T}\boldsymbol{u} = \sigma^2 \boldsymbol{v}

So, each \boldsymbol{v^{(i)}} is an eigenvector of A^{T}A
and each \boldsymbol{u^{(i)}} is an eigenvector of AA^{T}

For any eigenvector \boldsymbol{v} of A^{T}A,
A^{T}A\boldsymbol{v} = \lambda\boldsymbol{v}
define \boldsymbol{u} = \frac{1}{\sqrt{\lambda}}A\boldsymbol{v}

Then, A\boldsymbol{v} = \sqrt{\lambda}u
and A^{T}\boldsymbol{u} = \sqrt{\lambda}v.

So \arg\max_{\|\boldsymbol{v}\|\leq 1}\|A\boldsymbol{v}\|^2 is the top eigenvector of A^{T}A, and hence, the top right singular vector of A. \square

Proof. (Calculus)
Consider the objective function f(\boldsymbol{v}) = \|A\boldsymbol{v}\|^2 = \boldsymbol{v^{T}}A^{T}A\boldsymbol{v}.

Then, \nabla f(\boldsymbol{v}) = 2A^{T}A\boldsymbol{v}.

Intuitively we can see that if \boldsymbol{v} is a point on the surface of a sphere, defined by \|v\| \leq 1, then at any local optimum \boldsymbol{u'}, \nabla f(\boldsymbol{u'}) is parallel to \boldsymbol{u'}. The inverse is also true.


This relationship is represented by 2A^{T}A\boldsymbol{v} = \lambda \boldsymbol{v}.

By the above, we can see that all local optima of \|A\boldsymbol{v}\|^2 are eigenvectors. So, \max \|A\boldsymbol{v}\|^2 is achieved by an eigenvector of A^{T}A, or a right singular vector of A. \square

Can we reason similarly about \boldsymbol{v^{(i)}}, \boldsymbol{u^{(i)}}\ \forall\ i=2 \dots r?

Lemma 2.
\boldsymbol{v^{(1)}} = \arg\max_{\|\boldsymbol{v}\|\leq 1}\|A\boldsymbol{v}\|^2
\boldsymbol{v^{(2)}} = \arg\max_{\|\boldsymbol{v}\|\leq 1 \atop \boldsymbol{v} \bot \boldsymbol{v^{(1)}}}\|A\boldsymbol{v}\|^2

\boldsymbol{v^{(k)}} = \arg\max_{\|\boldsymbol{v}\|\leq 1 \atop \boldsymbol{v} \bot \boldsymbol{v^{(1)}} ... \boldsymbol{v^{(k-1)}}}\|A\boldsymbol{v}\|^2

Then, \boldsymbol{v^{(i)}} \ \forall\ i=2 \dots r are right singular vectors of A.

We can view the rows A_{(i)} of A as points in \mathbb{R}^n. Then, \boldsymbol{v^{(1)}} is the best-fit line (that goes through the origin), or the best-fit one-dimensional subspace. This is represented by,
\boldsymbol{v^{(1)}} = \arg\min_{\|v\|=1}\sum_{i+1}^{m}d(A_{(i)},\boldsymbol{v})^2


\sum_{i}\|A_{(i)}\|^2 = \sum_{i}d(A_{(i)},\boldsymbol{v})^2 + \sum_{i}\|A_{(i)} \cdot \boldsymbol{v}\|^2 is given by the Pythagorean Theorem. So, minimizing \sum_{i}d(A_{(i)},\boldsymbol{v})^2 requires that we proportionally maximize \sum_{i}\|A_{(i)} \cdot \boldsymbol{v}\|^2 = \|A\boldsymbol{v}\|^2. \square


Theorem 2.
V_k = span\{\boldsymbol{v^{(1)}} ... \boldsymbol{v^{(k)}}\}
Then, V_k is the best-fit k-dimensional subspace for A,
V_k = \arg\min_{V:dim(V) \leq k}\sum_{i}d(A_{(i)}, V)^2

We will do an induction on k.
k = 1: Lemma 1
Suppose Theorem 2 is true for (k-1)-dimensional subspace. So, V_{k-1} is the best-fit (k-1)-dimensional subspace.
Let W be the best k-dimensional subspace.
We know that \exists \boldsymbol{w_k} \in W such that \boldsymbol{w_k} \bot V_{k-1} since dim(W \cap V_{k-1}) \leq k-1.
So, W = span\{\boldsymbol{w_k}, W_{k-1}\}
d(A, W)^2 = d(A, \boldsymbol{w_k})^2 + d(A, W_{k-1})^2
\geq d(A, \boldsymbol{w_k})^2 + d(A, V_{k-1})^2 since V_{k-1} is best (k-1)-dimensional subspace, by IH
> d(A, \boldsymbol{v^{(k)}})^2 + d(A, V_{k-1})^2 since \boldsymbol{v^{(k)}} is the best among vectors orthogonal to V_{k-1}
= d(A, V_k)^2. \square

Proof of Theorem 1.
We want to show that A = \sum_{i=1}^{n}\sigma_{i} \boldsymbol{u^{(i)}}\boldsymbol{v^{(i)}}^T.

The LHS and RHS are both matrices. Since LHS \cdot \boldsymbol{v^{(i)}} = A\boldsymbol{v^{(i)}} = \sigma_{i}\boldsymbol{u^{(i)}} and RHS \cdot \boldsymbol{v^{(i)}} = \sum_{i=1}^{n}\sigma_{i}\boldsymbol{u^{(i)}}\boldsymbol{v^{(i)^T}}\boldsymbol{v^{(i)}} = \sum_{i=1}^{n}\sigma_{i}\boldsymbol{u^{(i)}}, we get that A\boldsymbol{v} = (\sum_{i=1}^{n}\sigma_{i}\boldsymbol{u^{(i)}}\boldsymbol{v^{(i)^T}})\boldsymbol{v} for every \boldsymbol{v} \in \mathbb{R}^n \implies A = \sum_{i=1}^{n}\sigma_{i}\boldsymbol{u^{(i)}}\boldsymbol{v^{(i)^T}}