In last week’s lecture, we covered a basic gradient descent algorithm with a fixed . We also found the maximum number of iterations this algorithm can take, . Unfortunately, this is not polynomial in the length of the inputs. The length of the input is only , so the algorithm is exponential in the length of this input.
Fortunately, a slight change to the gradient descent algorithm can achieve polynomial runtime. The key idea behind this modification is to alter at each update to find the best value for the next step. The new gradient descent algorithm is:
- find that minimizes ,
- set .
This new algorithm requires stronger assumptions than the basic gradient descent algorithm. It requires the function to be not just convex, but strongly convex (recall the definition of -strongly convex from last week’s lecture, ). It also requires a similar upper bound for the Hessian.
Lemma. implies that
for all and .
Theorem 1. For an -strongly convex function , assume that , where is some arbitrary large value. Further suppose that for all ,
Then, the modified gradient descent algorithm finds a point such that in at most steps.
Proof. The proof of this is as follows:
which must be at most . So, , and . Therefore, the runtime of this algorithm is .
In the last portion of class on 8/28, we began Newton’s iteration, an optimization algorithm similar to gradient descent. Newton’s iteration is based on the Newton-Raphson method for finding the zeros of a function. Newton-Raphson iteration approximates a function by its gradient, then guesses where the gradient intersects the zero line as a likely zero, repeating this process until a sufficient approximation has been found. In one dimension, the update is
Newton’s iteration first notes that
is a close approximation of a function, since higher order terms will be relatively small and can be ignored without much effect. Finding the minimum of a function using this approximation requires finding the optimal choice of . Specifically, the optimal z will satisfy , which can be solved analytically by setting the gradient equal to 0, . This is solved by . Then, Newton’s iteration updates its estimate of the minimum by
Suppose has only real roots. could also be written as , with . Then, , and . can also be bounded by (lemma), since and .
Theorem 2. Suppose , where is some finite, though potentially very large, value. Then, after iterations, Newton’s algorithm will find such that .
This is called linear convergence is linear (). The convergence of the Newton method can be quadratic, when close enough to a root.
Assume that at and for all , . Then, . Now,
This is quadratic convergence, where after steps, implying after steps. Note that has not been assumed to be convex or polynomial, although this convergence only works for a sufficiently close starting point.
Moving back to optimization, Newton’s iteration works generally by updating . By the above proof, assuming and , Newton’s iteration has quadratic convergence from points close enough to the optimal.
In fact, Newton’s iteration is a fairly natural idea, since
- it is affine invariant (changing coordinate systems won’t affect the convergence)
- Newton step is the fastest descent direction taking the Hessian into account.