8/28-8/30: Lecture Notes

In last week’s lecture, we covered a basic gradient descent algorithm with a fixed \eta . We also found the maximum number of iterations this algorithm can take, \frac{D^2 L^2}{\epsilon ^2} . Unfortunately, this is not polynomial in the length of the inputs. The length of the \epsilon input is only \log \epsilon , so the algorithm is exponential in the length of this input.

Fortunately, a slight change to the gradient descent algorithm can achieve polynomial runtime. The key idea behind this modification is to alter \eta at each update to find the best value for the next step. The new gradient descent algorithm is:

While \|\nabla f(x)\|^2 > \alpha\epsilon ,

  • find \delta that minimizes f(x_t - \delta\nabla f(x)) ,
  • set x^{t+1} = x^t - \delta\nabla f(x) .

This new algorithm requires stronger assumptions than the basic gradient descent algorithm. It requires the function to be not just convex, but strongly convex (recall the definition of \alpha -strongly convex from last week’s lecture, \forall y, y^\mathsf{T} \nabla^2 f(x)y > \alpha \|y\|^2 ). It also requires a similar upper bound for the Hessian.

Lemma. \forall y, y^\mathsf{T} \nabla^2 f(x)y < \beta \|y\|^2 implies that

f(x) + \nabla f(x) (y-x) + \alpha \|y-x\|^2 \leq f(y) \leq f(x) + \nabla f(x) (y-x) + \beta \|y-x\|^2

for all x and y .

Theorem 1. For an \alpha -strongly convex function f , assume that f(x^0) \leq \min f(x) + A , where A is some arbitrary large value.  Further suppose that for all x, y ,

\alpha \|y\|^2 \leq y^\mathsf{T} \nabla^2 f(x) y \leq \beta \|y\|^2

Then, the modified gradient descent algorithm finds a point x such that f(x) \leq \min f(x) + \epsilon in at most O(\frac{\beta}{\alpha} \log \frac{A}{\epsilon}) steps.

Proof. The proof of this is as follows:

\begin{aligned}  f(y) &\geq f(x) + \nabla f(x) (y-x) + \frac{\alpha}{2} \|y-x\|^2 \\  &\geq f(x) - \frac{1}{\alpha} \|\nabla f(x)\|^2 + \frac{1}{2\alpha} \|\nabla f(x)\|^2 \\  &\geq f(x) - \frac{1}{2\alpha} \|\nabla f(x)\|^2 .  \end{aligned}


\begin{aligned}  f(x^{t+1}) & \leq f(x^t) + \nabla f(x^t) (x^{t+1} - x^t) + \frac{\beta}{2} \|x^{t+1}-x^t\|^2\\  &\leq f(x^t) - \delta \|\nabla f(x^t)\|^2 + \frac{\beta}{2}\delta^2 \|\nabla f(x^t)\|^2\\  &\leq f(x^t) - \frac{1}{2\beta} \|\nabla f(x)\|^2\\  &\leq f(x^t) - \frac{1}{2\beta}(2\alpha)(f(x^t) - f(x^*))  \end{aligned}


\begin{aligned}  f(x^{t+1} - f(x^*)) &\leq (f(x^t) - f(x^*)) - \frac{\alpha}{\beta} (f(x^t) - f(x^*)) \\  &\leq (1-\frac{\alpha}{\beta}) (f(x^t) - f(x^*)) \\  &\leq (1-\frac{\alpha}{\beta})^t (f(x^0) - f(x^*)) \leq (1-\frac{\alpha}{\beta})^t A  \end{aligned}

which must be at most \epsilon .  So, (1-\frac{\alpha}{\beta})^tA \leq \epsilon , and t \geq \frac{\beta}{\alpha} \log (\frac{A}{\epsilon}) .  Therefore, the runtime of this algorithm is O(\frac{\beta}{\alpha} \log \frac{A}{\epsilon}) .

In the last portion of class on 8/28, we began Newton’s iteration, an optimization algorithm similar to gradient descent. Newton’s iteration is based on the Newton-Raphson method for finding the zeros of a function. Newton-Raphson iteration approximates a function by its gradient, then guesses where the gradient intersects the zero line as a likely zero, repeating this process until a sufficient approximation has been found. In one dimension, the update is

x_{t+1} = x^t - \frac{f(x)}{f'(x)} .

Newton’s iteration first notes that

f(y) = f(x) + \nabla f(x)(y-x) + \frac{1}{2} (y-x)^\mathsf{T}(\nabla^2 f(x))(y-x)

is a close approximation of a function, since higher order terms will be relatively small and can be ignored without much effect. Finding the minimum of a function using this approximation requires finding the optimal choice of y-x = z . Specifically, the optimal z will satisfy \min g(z) = \nabla f(x) z + \frac{1}{2} z^\mathsf{T} (\nabla^2 f(x))z, which can be solved analytically by setting the gradient equal to 0, \nabla g(z) = \nabla f(x) + (\nabla^2 f(x))z = 0 . This is solved by z = -(\nabla^2 f(x))^{-1} \nabla f(x) . Then, Newton’s iteration updates its estimate of the minimum by

x^{t+1} = x^t - (\nabla^2 f(x))^{-1} \nabla f(x) .

Suppose f(x) = \sum_{i=1}^n a_i x_i has only real roots. f(x) could also be written as f(x) \prod_i (x-\lambda_i) , with \lambda_n \leq \lambda_{n-1} \leq ... \leq \lambda_1 . Then, f'(x) = \sum_i \prod_{j \neq i} (x-\lambda_i) , and \frac{f(x)}{f'(x)} = \frac{1}{\sum_i \frac{1}{x-\lambda_i}} . \frac{f(x)}{f'(x)} can also be bounded by \frac{1}{n}(x-\lambda_1) \leq \frac{f(x)}{f'(x)} \leq x-\lambda_1 (lemma), since \frac{1}{\sum_i \frac{1}{x-\lambda_i}} \geq \frac{1}{n}(x-\lambda_1) and \frac{1}{\sum_i \frac{1}{x-\lambda_i}} \leq \frac{1}{\frac{1}{x-\lambda_1}} = x-\lambda_1 .

Theorem 2. Suppose \lambda_1 \leq x^0 \leq \lambda_1 + A, where A is some finite, though potentially very large, value. Then, after O(n\log(\frac{A}{\epsilon})) iterations, Newton’s algorithm will find x^t such that \lambda_1 \leq x^t \leq \lambda_1 + \epsilon .


\begin{aligned}  x^t - \lambda_1 &\leq (1-\frac{1}{n})(x^{t-1} - \lambda_1) \\  &\leq (1-\frac{1}{n})^t (x^0 - \lambda_1) \\  &= (1-\frac{1}{n})^t A \leq \epsilon  \end{aligned}

This is called linear convergence is linear  (\log\frac{1}{\epsilon} ). The convergence of the Newton method can be quadratic, when close enough to a root.

Assume that |f'(x^*)| \geq \alpha at f(x^*) = 0 and for all x,y , |f'(y) - f'(x)| \leq L|y-x| . Then, x^{t+1} - x^* = x^t - x^* - \frac{f(x^t)}{f'(x^t)} . Now,

\begin{aligned}  f(x^*) &= f(x^t) + \int_{x^t}^{x^*} f'(z)dz \\  &= f(x^t) + f'(x^t)(x^* - x^t) + \int_{x^t}^{x^*} (f'(z) - f'(x^t))dz  \end{aligned}

Therefore, x^{t+1} - x^* = -\frac{1}{f'(x^t)} \int_{x^t}^{x^*} (f'(z) - f'(x^t))dz

so |x^{t+1} - x^*| \leq \frac{L}{|f'(x^t)|} \int_{x^t}^{x^*} |z-x^t|dz = \frac{L|x^* - x^t|^2}{2|f'(x^t)|} .

Then, f'(x^t) \geq f'(x^*) - L|x^t-x^*|

so |x^{t+1} - x^*| < \frac{L}{2(\alpha - L|x^t-x^*|)}|x^t - x^*|^2 .

Since |x^t - x^*| \leq \frac{\alpha}{2L}

|x^{t+1} - x^*| \leq \frac{L}{\alpha} |x^t - x^*|^2 < \frac{1}{2} |x^t - x^*|

and \frac{L}{\alpha}|x^{t+1}-x^*| \leq (\frac{L}{\alpha}|x^t-x^*|)^2 .

This is quadratic convergence, where after t steps, \frac{L}{\alpha} |x^t - x^*| \leq (\frac{L}{\alpha} \epsilon_0)^{2^t} implying |x^t - x^*| < \epsilon \frac{\alpha}{L} after \log\log(\frac{\alpha}{L\epsilon_0}) steps. Note that f has not been assumed to be convex or polynomial, although this convergence only works for a sufficiently close starting point.

Moving back to optimization, Newton’s iteration works generally by updating x^{t+1} = x^t - (\nabla^2f(x^t))^{-1}\nabla f(x^t) . By the above proof, assuming \|(\nabla^2f(x^*)^{-1}\| \leq \frac{1}{\alpha} and \|\nabla^2f(y) - \nabla^2f(x)\| \leq L\|y-x\|^2 , Newton’s iteration has quadratic convergence from points close enough to the optimal.

In fact, Newton’s iteration is a fairly natural idea, since

  1. it  is affine invariant (changing coordinate systems won’t affect the convergence)
  2. Newton step is the fastest descent direction taking the Hessian into account.