# 8/28-8/30: Lecture Notes

In last week’s lecture, we covered a basic gradient descent algorithm with a fixed $\eta$. We also found the maximum number of iterations this algorithm can take, $\frac{D^2 L^2}{\epsilon ^2}$. Unfortunately, this is not polynomial in the length of the inputs. The length of the $\epsilon$ input is only $\log \epsilon$, so the algorithm is exponential in the length of this input.

Fortunately, a slight change to the gradient descent algorithm can achieve polynomial runtime. The key idea behind this modification is to alter $\eta$ at each update to find the best value for the next step. The new gradient descent algorithm is:

While $\|\nabla f(x)\|^2 > \alpha\epsilon$,

• find $\delta$ that minimizes $f(x_t - \delta\nabla f(x))$,
• set $x^{t+1} = x^t - \delta\nabla f(x)$.

This new algorithm requires stronger assumptions than the basic gradient descent algorithm. It requires the function to be not just convex, but strongly convex (recall the definition of $\alpha$-strongly convex from last week’s lecture, $\forall y, y^\mathsf{T} \nabla^2 f(x)y > \alpha \|y\|^2$). It also requires a similar upper bound for the Hessian.

Lemma. $\forall y, y^\mathsf{T} \nabla^2 f(x)y < \beta \|y\|^2$ implies that

$f(x) + \nabla f(x) (y-x) + \alpha \|y-x\|^2 \leq f(y) \leq f(x) + \nabla f(x) (y-x) + \beta \|y-x\|^2$

for all $x$ and $y$.

Theorem 1. For an $\alpha$-strongly convex function $f$, assume that $f(x^0) \leq \min f(x) + A$, where $A$ is some arbitrary large value.  Further suppose that for all $x, y$,

$\alpha \|y\|^2 \leq y^\mathsf{T} \nabla^2 f(x) y \leq \beta \|y\|^2$

Then, the modified gradient descent algorithm finds a point $x$ such that $f(x) \leq \min f(x) + \epsilon$ in at most $O(\frac{\beta}{\alpha} \log \frac{A}{\epsilon})$ steps.

Proof. The proof of this is as follows:

\begin{aligned} f(y) &\geq f(x) + \nabla f(x) (y-x) + \frac{\alpha}{2} \|y-x\|^2 \\ &\geq f(x) - \frac{1}{\alpha} \|\nabla f(x)\|^2 + \frac{1}{2\alpha} \|\nabla f(x)\|^2 \\ &\geq f(x) - \frac{1}{2\alpha} \|\nabla f(x)\|^2 . \end{aligned}

Then,

\begin{aligned} f(x^{t+1}) & \leq f(x^t) + \nabla f(x^t) (x^{t+1} - x^t) + \frac{\beta}{2} \|x^{t+1}-x^t\|^2\\ &\leq f(x^t) - \delta \|\nabla f(x^t)\|^2 + \frac{\beta}{2}\delta^2 \|\nabla f(x^t)\|^2\\ &\leq f(x^t) - \frac{1}{2\beta} \|\nabla f(x)\|^2\\ &\leq f(x^t) - \frac{1}{2\beta}(2\alpha)(f(x^t) - f(x^*)) \end{aligned}

Now,

\begin{aligned} f(x^{t+1} - f(x^*)) &\leq (f(x^t) - f(x^*)) - \frac{\alpha}{\beta} (f(x^t) - f(x^*)) \\ &\leq (1-\frac{\alpha}{\beta}) (f(x^t) - f(x^*)) \\ &\leq (1-\frac{\alpha}{\beta})^t (f(x^0) - f(x^*)) \leq (1-\frac{\alpha}{\beta})^t A \end{aligned}

which must be at most $\epsilon$.  So, $(1-\frac{\alpha}{\beta})^tA \leq \epsilon$, and $t \geq \frac{\beta}{\alpha} \log (\frac{A}{\epsilon})$.  Therefore, the runtime of this algorithm is $O(\frac{\beta}{\alpha} \log \frac{A}{\epsilon})$.

In the last portion of class on 8/28, we began Newton’s iteration, an optimization algorithm similar to gradient descent. Newton’s iteration is based on the Newton-Raphson method for finding the zeros of a function. Newton-Raphson iteration approximates a function by its gradient, then guesses where the gradient intersects the zero line as a likely zero, repeating this process until a sufficient approximation has been found. In one dimension, the update is

$x_{t+1} = x^t - \frac{f(x)}{f'(x)}$.

Newton’s iteration first notes that

$f(y) = f(x) + \nabla f(x)(y-x) + \frac{1}{2} (y-x)^\mathsf{T}(\nabla^2 f(x))(y-x)$

is a close approximation of a function, since higher order terms will be relatively small and can be ignored without much effect. Finding the minimum of a function using this approximation requires finding the optimal choice of $y-x = z$. Specifically, the optimal z will satisfy $\min g(z) = \nabla f(x) z + \frac{1}{2} z^\mathsf{T} (\nabla^2 f(x))z$, which can be solved analytically by setting the gradient equal to 0, $\nabla g(z) = \nabla f(x) + (\nabla^2 f(x))z = 0$. This is solved by $z = -(\nabla^2 f(x))^{-1} \nabla f(x)$. Then, Newton’s iteration updates its estimate of the minimum by

$x^{t+1} = x^t - (\nabla^2 f(x))^{-1} \nabla f(x)$.

Suppose $f(x) = \sum_{i=1}^n a_i x_i$ has only real roots. $f(x)$ could also be written as $f(x) \prod_i (x-\lambda_i)$, with $\lambda_n \leq \lambda_{n-1} \leq ... \leq \lambda_1$. Then, $f'(x) = \sum_i \prod_{j \neq i} (x-\lambda_i)$, and $\frac{f(x)}{f'(x)} = \frac{1}{\sum_i \frac{1}{x-\lambda_i}}$. $\frac{f(x)}{f'(x)}$ can also be bounded by $\frac{1}{n}(x-\lambda_1) \leq \frac{f(x)}{f'(x)} \leq x-\lambda_1$ (lemma), since $\frac{1}{\sum_i \frac{1}{x-\lambda_i}} \geq \frac{1}{n}(x-\lambda_1)$ and $\frac{1}{\sum_i \frac{1}{x-\lambda_i}} \leq \frac{1}{\frac{1}{x-\lambda_1}} = x-\lambda_1$.

Theorem 2. Suppose $\lambda_1 \leq x^0 \leq \lambda_1 + A$, where $A$ is some finite, though potentially very large, value. Then, after $O(n\log(\frac{A}{\epsilon}))$ iterations, Newton’s algorithm will find $x^t$ such that $\lambda_1 \leq x^t \leq \lambda_1 + \epsilon$.

Proof.

\begin{aligned} x^t - \lambda_1 &\leq (1-\frac{1}{n})(x^{t-1} - \lambda_1) \\ &\leq (1-\frac{1}{n})^t (x^0 - \lambda_1) \\ &= (1-\frac{1}{n})^t A \leq \epsilon \end{aligned}

This is called linear convergence is linear  ($\log\frac{1}{\epsilon}$). The convergence of the Newton method can be quadratic, when close enough to a root.

Assume that $|f'(x^*)| \geq \alpha$ at $f(x^*) = 0$ and for all $x,y$, $|f'(y) - f'(x)| \leq L|y-x|$. Then, $x^{t+1} - x^* = x^t - x^* - \frac{f(x^t)}{f'(x^t)}$. Now,

\begin{aligned} f(x^*) &= f(x^t) + \int_{x^t}^{x^*} f'(z)dz \\ &= f(x^t) + f'(x^t)(x^* - x^t) + \int_{x^t}^{x^*} (f'(z) - f'(x^t))dz \end{aligned}

Therefore, $x^{t+1} - x^* = -\frac{1}{f'(x^t)} \int_{x^t}^{x^*} (f'(z) - f'(x^t))dz$

so $|x^{t+1} - x^*| \leq \frac{L}{|f'(x^t)|} \int_{x^t}^{x^*} |z-x^t|dz = \frac{L|x^* - x^t|^2}{2|f'(x^t)|}$.

Then, $f'(x^t) \geq f'(x^*) - L|x^t-x^*|$

so $|x^{t+1} - x^*| < \frac{L}{2(\alpha - L|x^t-x^*|)}|x^t - x^*|^2$.

Since $|x^t - x^*| \leq \frac{\alpha}{2L}$

$|x^{t+1} - x^*| \leq \frac{L}{\alpha} |x^t - x^*|^2 < \frac{1}{2} |x^t - x^*|$

and $\frac{L}{\alpha}|x^{t+1}-x^*| \leq (\frac{L}{\alpha}|x^t-x^*|)^2$.

This is quadratic convergence, where after $t$ steps, $\frac{L}{\alpha} |x^t - x^*| \leq (\frac{L}{\alpha} \epsilon_0)^{2^t}$ implying $|x^t - x^*| < \epsilon \frac{\alpha}{L}$ after $\log\log(\frac{\alpha}{L\epsilon_0})$ steps. Note that $f$ has not been assumed to be convex or polynomial, although this convergence only works for a sufficiently close starting point.

Moving back to optimization, Newton’s iteration works generally by updating $x^{t+1} = x^t - (\nabla^2f(x^t))^{-1}\nabla f(x^t)$. By the above proof, assuming $\|(\nabla^2f(x^*)^{-1}\| \leq \frac{1}{\alpha}$ and $\|\nabla^2f(y) - \nabla^2f(x)\| \leq L\|y-x\|^2$, Newton’s iteration has quadratic convergence from points close enough to the optimal.

In fact, Newton’s iteration is a fairly natural idea, since

1. it  is affine invariant (changing coordinate systems won’t affect the convergence)
2. Newton step is the fastest descent direction taking the Hessian into account.