Power Iteration

Recall the singular value decomposition (SVD) of a matrix A = UDV^{T} = \sum_{i=1}^{r}\sigma_i \boldsymbol{u^{(i)}}\boldsymbol{v^{(i)^{T}}}.

We showed that V_{k} = span\{\boldsymbol{v^{(1)}}, ..., \boldsymbol{v^{(k)}}\} is the best-fit subspace for A, so

V_{k} = \arg\min_{V: dim(V) \leq k}\sum_{i=1}^{m}d(A_{(i)}, V)^2,

where A_{(i)} denotes the i-th row of A and d(A_{(i)}, V) is its Euclidean distance to the subspace V.

Writing V_{k} also for the matrix with columns \boldsymbol{v^{(1)}}, ..., \boldsymbol{v^{(k)}}, we have that A_{(i)}V_{k}V_{k}^{T} is the projection of A_{(i)} onto V_{k}, and so AV_{k}V_{k}^{T} is the row-by-row projection of A onto V_{k}.

Let A_k be defined as AV_{k}V_{k}^{T} = \sum_{i=1}^{k}\sigma_i \boldsymbol{u^{(i)}}\boldsymbol{v^{(i)^{T}}}.

Theorem 1.
\|A-A_{k}\|_{F} = \min_{B: rank(B) \leq k}\|A-B\|_{F}
AND
\|A-A_{k}\|_{2} = \max_{\|\boldsymbol{v}\| = 1}\|(A-A_{k}) \cdot \boldsymbol{v}\| = \sigma_{k+1} = \min_{B: rank(B) \leq k}\|A-B\|_{2}

Proof.
We proved part (1) last lecture. We will prove part (2) here. Note that

A-A_{k} = \sum_{i=k+1}^{r}\sigma_{i}\boldsymbol{u^{(i)}}\boldsymbol{v^{(i)^{T}}}
and
\|A-A_{k}\|_{2} = \sigma_{k+1}.

We need to show that for any B with rank(B) \leq k, \|A-B\|_{2} \geq \sigma_{k+1}. Any such B can be written as B=WZ^{T}, where W and Z have \leq k columns.

Since Z has only k columns, the null space of Z^{T} has dimension \geq n-k, so it intersects the (k+1)-dimensional subspace span\{\boldsymbol{v^{(1)}},...,\boldsymbol{v^{(k+1)}}\} nontrivially. Hence \exists \boldsymbol{v} such that
Z^{T}\boldsymbol{v} = 0,
\boldsymbol{v} = \sum_{i=1}^{k+1}\alpha_{i}\boldsymbol{v^{(i)}}, and
\|\boldsymbol{v}\|=1.

(A-B) \cdot \boldsymbol{v} = A \cdot \boldsymbol{v} = \sum_{i=1}^{k+1}\sigma_{i}\alpha_{i}\boldsymbol{u^{(i)}}, since B\boldsymbol{v} = WZ^{T}\boldsymbol{v} = 0. Therefore
\|(A-B) \cdot \boldsymbol{v}\|^{2} = \sum_{i=1}^{k+1}\sigma_{i}^{2}\alpha_{i}^{2} \geq \sigma_{k+1}^{2}\sum_{i=1}^{k+1}\alpha_{i}^{2} = \sigma_{k+1}^{2}. \square
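
As a numerical sanity check of Theorem 1 (a sketch only; the matrix, the rank k, and the comparison matrix B are arbitrary choices), the following NumPy snippet builds A_k from the truncated SVD, verifies both norm identities, and compares against a random rank-k matrix:

import numpy as np

rng = np.random.default_rng(0)
m, n, k = 8, 6, 2                        # arbitrary small sizes for illustration
A = rng.standard_normal((m, n))

# Full SVD: A = U diag(s) Vt, with singular values s in decreasing order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncated SVD: A_k = sum_{i <= k} sigma_i u^(i) v^(i)^T
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Spectral-norm error equals sigma_{k+1}; squared Frobenius error equals the tail sum
print(np.linalg.norm(A - A_k, 2), s[k])
print(np.linalg.norm(A - A_k, 'fro') ** 2, np.sum(s[k:] ** 2))

# Any other rank-k matrix B = W Z^T does at least as badly in spectral norm
W, Z = rng.standard_normal((m, k)), rng.standard_normal((n, k))
B = W @ Z.T
print(np.linalg.norm(A - B, 2) >= s[k])  # True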

We’ve seen that the SVD is useful in more than one way. Now we will look at how to compute it, a central question in numerical analysis. A simple method, power iteration, is presented here:

For each i=1,2,\ldots,n:
    Set \boldsymbol{x} \leftarrow e_{i}
    Repeat t times: {
        \boldsymbol{x} \leftarrow A^{T}A\boldsymbol{x}
        \boldsymbol{x} \leftarrow \frac{\boldsymbol{x}}{\|\boldsymbol{x}\|}
    }
Output the \boldsymbol{x} with the largest \|A\boldsymbol{x}\|.
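
For concreteness, here is a minimal NumPy sketch of this procedure (the function name power_iteration and the bookkeeping that keeps the best of the n runs are our own choices; the iteration itself follows the pseudocode above):

import numpy as np

def power_iteration(A, t):
    # Run t steps of x <- A^T A x (renormalizing each step) starting from
    # each axis vector e_i, and keep the unit vector x that maximizes ||Ax||.
    n = A.shape[1]
    best_x, best_val = None, -np.inf
    for i in range(n):
        x = np.zeros(n)
        x[i] = 1.0                       # start at the axis vector e_i
        for _ in range(t):
            x = A.T @ (A @ x)            # x <- A^T A x
            x = x / np.linalg.norm(x)    # renormalize to a unit vector
        val = np.linalg.norm(A @ x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val

A = np.random.default_rng(1).standard_normal((20, 10))
x, est = power_iteration(A, t=50)
print(est, np.linalg.svd(A, compute_uv=False)[0])  # est should approximate sigma_1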

Theorem 2.
Assume the starting vector satisfies (\boldsymbol{x}\cdot \boldsymbol{v^{(1)}})^2 \ge \frac{1}{n}. Then after t = \frac{\ln n}{2\varepsilon} iterations, \boldsymbol{x} satisfies \|A\boldsymbol{x}\| \geq (1-\varepsilon) \sigma_{1}.

Proof.
\boldsymbol{x} = \sum_{i=1}^{r}\alpha_{i}\boldsymbol{v^{(i)}}
A^{T}A = \sum_{i=1}^{r}\sigma_{i}^{2}\boldsymbol{v^{(i)}}\boldsymbol{v^{(i)^{T}}}
(A^{T}A)^l = \sum_{i=1}^{r}\sigma_{i}^{2l}\boldsymbol{v^{(i)}}\boldsymbol{v^{(i)^{T}}}

Write \boldsymbol{x^{l}} = (A^{T}A)^{l}\boldsymbol{x} for the (unnormalized) vector after l iterations; the normalization steps only rescale \boldsymbol{x}, so they do not affect the ratio below.

\frac{\|A\boldsymbol{x^{l}}\|^{2}}{\|\boldsymbol{x^{l}}\|^{2}} = \frac{\boldsymbol{x}^{T}(A^{T}A)^{l}A^{T}A(A^{T}A)^{l}\boldsymbol{x}}{\boldsymbol{x}^{T}(A^{T}A)^{l}(A^{T}A)^{l}\boldsymbol{x}} = \frac{\sum_{i=1}^{r}\sigma_{i}^{4l+2}\alpha_{i}^{2}}{\sum_{i=1}^{r}\sigma_{i}^{4l}\alpha_{i}^{2}}

Since \sigma_{1} \geq \sigma_{2} \geq \ldots, as l grows both sums place increasingly more relative weight on the largest singular values. To make this quantitative, we bound the denominator.

By Hölder's inequality with exponents \frac{2l+1}{2l} and 2l+1,

\sum \sigma_{i}^{4l}\alpha_{i}^{2} = \sum \sigma_{i}^{4l}(\alpha_{i}^{2})^{\frac{2l}{2l+1}}(\alpha_{i}^{2})^{\frac{1}{2l+1}}
\leq (\sum \sigma_{i}^{4l+2}\alpha_{i}^{2})^{\frac{2l}{2l+1}}(\sum \alpha_{i}^{2})^{\frac{1}{2l+1}}
= (\sum \sigma_{i}^{4l+2}\alpha_{i}^{2})^{\frac{2l}{2l+1}},

using \sum \alpha_{i}^{2} = \|\boldsymbol{x}\|^{2} = 1.

\therefore \frac{\boldsymbol{x^{l^{T}}}A^{T}A\boldsymbol{x^{l}}}{\boldsymbol{x^{l^{T}}}\boldsymbol{x^{l}}} \geq (\sum \sigma_{i}^{4l+2}\alpha_{i}^{2})^{\frac{1}{2l+1}} \geq \alpha_{1}^{\frac{2}{2l+1}}\sigma_{1}^{2}.

Therefore, after l iterations,
\frac{\|A\boldsymbol{x^{l}}\|^{2}}{\|\boldsymbol{x^{l}}\|^{2}} \geq (\frac{1}{n})^{\frac{1}{2l+1}}\sigma_{1}^{2}
= e^{-\frac{\ln n}{2l+1}}\sigma_{1}^{2}

Setting 2l + 1 = \frac{\ln n}{\varepsilon}, this is at least e^{-\varepsilon}\sigma_{1}^{2} \geq (1-\varepsilon)\sigma_{1}^{2}, so the unit vector \boldsymbol{x} produced after l iterations satisfies \|A\boldsymbol{x}\|^{2} \geq (1-\varepsilon)\sigma_{1}^{2}, and hence \|A\boldsymbol{x}\| \geq (1-\varepsilon)\sigma_{1}. \square
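
To get a rough sense of what the bound asks for (the numbers below are chosen only for illustration): with n = 10^{6} and \varepsilon = 0.1, about

t = \frac{\ln 10^{6}}{2 \cdot 0.1} \approx 69

iterations suffice, each costing one multiplication by A and one by A^{T}.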

We note that the initial \boldsymbol{x} in the algorithm is one of the axis vectors e_i. Since \sum_{i=1}^{n}(e_i \cdot \boldsymbol{v^{(1)}})^2 = \|\boldsymbol{v^{(1)}}\|^2 = 1, at least one e_i satisfies the assumption (e_i \cdot \boldsymbol{v^{(1)}})^2 \ge \frac{1}{n}, so trying all n basis vectors guarantees that one of the runs succeeds. Instead of trying all n basis vectors, we could also start with a random unit vector.

Claim.
For a random unit vector \boldsymbol{x}, writing \alpha_{1} = \boldsymbol{x}\cdot\boldsymbol{v^{(1)}},
\mathbb{E}(\alpha_{1}^{2}) = \frac{1}{n}

Proof.
By rotational symmetry we may take \boldsymbol{v^{(1)}} = e_{1}, so that \alpha_{1} = x_{1}. The coordinates of a random unit vector are distributed symmetrically and satisfy \sum x_{i}^{2} = 1.

So, \mathbb{E}(x_{1}^{2}) = \mathbb{E}(x_{2}^{2}) = ... = \mathbb{E}(x_{n}^{2}).

\implies n\mathbb{E}(x_{1}^{2}) = 1 \implies \mathbb{E}(\alpha_{1}^{2}) = \mathbb{E}(x_{1}^{2}) = \frac{1}{n}. \square

Moreover, \alpha_1^2 \ge \frac{1}{n} with some constant probability. So a few runs of the algorithm will suffice with high probability.
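
A quick Monte Carlo sanity check of the claim (a sketch; a random unit vector is drawn by normalizing a Gaussian vector, and by rotational symmetry we take v^{(1)} = e_1, so \alpha_1 is just the first coordinate):

import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 100_000

# Random unit vectors: normalize standard Gaussian vectors.
X = rng.standard_normal((trials, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)
alpha1_sq = X[:, 0] ** 2                 # alpha_1^2 with v^(1) = e_1

print(alpha1_sq.mean(), 1 / n)           # E(alpha_1^2) should be close to 1/n
print(np.mean(alpha1_sq >= 1 / n))       # alpha_1^2 >= 1/n with constant probability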

 
