# 8/28-8/30: Lecture Notes

In last week’s lecture, we covered a basic gradient descent algorithm with a fixed step size $\eta$. We also found an upper bound on the number of iterations this algorithm can take, $\frac{D^2 L^2}{\epsilon ^2}$. Unfortunately, this is not polynomial in the length of the input. The $\epsilon$ input can be written with only about $\log \frac{1}{\epsilon}$ bits, so the iteration bound is exponential in the length of this input.
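To see the exponential dependence concretely, suppose $\epsilon$ is specified with $k$ bits as $\epsilon = 2^{-k}$ (an illustrative choice). Then $\frac{D^2 L^2}{\epsilon^2} = D^2 L^2 \cdot 4^{k}$, which grows exponentially in the input length $k = \log_2 \frac{1}{\epsilon}$.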

Fortunately, a slight change to the gradient descent algorithm can achieve polynomial runtime. The key idea behind this modification is to re-choose the step size at each update, finding the best value $\delta$ for the next step. The new gradient descent algorithm is:

While $\|\nabla f(x^t)\|^2 > \alpha\epsilon$,

• find $\delta$ that minimizes $f(x^t - \delta\nabla f(x^t))$,
• set $x^{t+1} = x^t - \delta\nabla f(x^t)$.
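As a concrete illustration (not part of the lecture), here is a minimal Python sketch of this algorithm, assuming the caller supplies `f` and `grad` and using a generic one-dimensional minimizer from SciPy for the line search over $\delta$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gradient_descent_line_search(f, grad, x0, alpha, eps, max_iter=10_000):
    """Gradient descent that re-optimizes the step size delta at every iteration."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if g @ g <= alpha * eps:                      # stop once ||grad f(x)||^2 <= alpha * eps
            break
        # line search: pick delta minimizing f(x - delta * grad f(x))
        delta = minimize_scalar(lambda d: f(x - d * g)).x
        x = x - delta * g
    return x

# Toy example: a strongly convex quadratic f(x) = x^T A x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
x_min = gradient_descent_line_search(
    f=lambda x: x @ A @ x,
    grad=lambda x: 2 * A @ x,
    x0=[5.0, -3.0],
    alpha=1.0,          # any valid lower bound on the Hessian eigenvalues works here
    eps=1e-8,
)
```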

This new algorithm requires stronger assumptions than the basic gradient descent algorithm. It requires the function to be not just convex, but strongly convex (recall the definition of $\alpha$-strongly convex from last week’s lecture: for all $x$ and $y$, $y^\mathsf{T} \nabla^2 f(x)y \geq \alpha \|y\|^2$). It also requires an analogous upper bound on the Hessian.

Lemma. Suppose that for all $x$ and $y$, $\alpha \|y\|^2 \leq y^\mathsf{T} \nabla^2 f(x)y \leq \beta \|y\|^2$. Then $f(x) + \nabla f(x) (y-x) + \frac{\alpha}{2} \|y-x\|^2 \leq f(y) \leq f(x) + \nabla f(x) (y-x) + \frac{\beta}{2} \|y-x\|^2$

for all $x$ and $y$.
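As a quick sanity check (not from the lecture), consider the quadratic $f(x) = \frac{1}{2} x^\mathsf{T} Q x$ with $\alpha\|y\|^2 \leq y^\mathsf{T} Q y \leq \beta\|y\|^2$ for all $y$. Here the expansion is exact, $f(y) = f(x) + \nabla f(x)(y-x) + \frac{1}{2}(y-x)^\mathsf{T} Q (y-x)$, and bounding the quadratic term with the eigenvalue bounds recovers both inequalities of the lemma.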

Theorem 1. For an $\alpha$-strongly convex function $f$, assume that $f(x^0) \leq \min f(x) + A$, where $A$ is some finite, though potentially very large, value. Further suppose that for all $x, y$, $\alpha \|y\|^2 \leq y^\mathsf{T} \nabla^2 f(x) y \leq \beta \|y\|^2$.

Then, the modified gradient descent algorithm finds a point $x$ such that $f(x) \leq \min f(x) + \epsilon$ in at most $O(\frac{\beta}{\alpha} \log \frac{A}{\epsilon})$ steps.

Proof. First, the strong convexity (lower) bound from the lemma gives, for any $x$ and $y$, \begin{aligned} f(y) &\geq f(x) + \nabla f(x) (y-x) + \frac{\alpha}{2} \|y-x\|^2 \\ &\geq f(x) - \frac{1}{\alpha} \|\nabla f(x)\|^2 + \frac{1}{2\alpha} \|\nabla f(x)\|^2 \\ &= f(x) - \frac{1}{2\alpha} \|\nabla f(x)\|^2 , \end{aligned} where the second inequality follows by minimizing the right-hand side over $y$ (the minimum is attained at $y - x = -\frac{1}{\alpha}\nabla f(x)$). Taking $y = x^*$, the minimizer of $f$, and rearranging gives $\|\nabla f(x)\|^2 \geq 2\alpha(f(x) - f(x^*))$.

Then, using the upper bound from the lemma and the fact that $\delta$ is chosen by line search (so $x^{t+1}$ is at least as good as the point obtained with step size $\frac{1}{\beta}$), \begin{aligned} f(x^{t+1}) & \leq f(x^t) + \nabla f(x^t) (x^{t+1} - x^t) + \frac{\beta}{2} \|x^{t+1}-x^t\|^2\\ &\leq f(x^t) - \delta \|\nabla f(x^t)\|^2 + \frac{\beta}{2}\delta^2 \|\nabla f(x^t)\|^2\\ &\leq f(x^t) - \frac{1}{2\beta} \|\nabla f(x^t)\|^2\\ &\leq f(x^t) - \frac{1}{2\beta}(2\alpha)(f(x^t) - f(x^*)), \end{aligned} where the last step uses the inequality $\|\nabla f(x^t)\|^2 \geq 2\alpha(f(x^t) - f(x^*))$ derived above.

Now, \begin{aligned} f(x^{t+1}) - f(x^*) &\leq (f(x^t) - f(x^*)) - \frac{\alpha}{\beta} (f(x^t) - f(x^*)) \\ &= \left(1-\frac{\alpha}{\beta}\right) (f(x^t) - f(x^*)), \end{aligned} and applying this contraction repeatedly gives $f(x^t) - f(x^*) \leq (1-\frac{\alpha}{\beta})^t (f(x^0) - f(x^*)) \leq (1-\frac{\alpha}{\beta})^t A$,

which we want to be at most $\epsilon$. Since $(1-\frac{\alpha}{\beta})^t \leq e^{-\frac{\alpha}{\beta}t}$, the bound $(1-\frac{\alpha}{\beta})^t A \leq \epsilon$ holds once $t \geq \frac{\beta}{\alpha} \log (\frac{A}{\epsilon})$. Therefore, the algorithm finds an $\epsilon$-optimal point within $O(\frac{\beta}{\alpha} \log \frac{A}{\epsilon})$ iterations.
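A small numerical check of this contraction (again a sketch, with an arbitrarily chosen quadratic and starting point): on a strongly convex quadratic, each line-search step should shrink the optimality gap by at least a factor of $1-\frac{\alpha}{\beta}$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Strongly convex quadratic f(x) = x^T A x with Hessian 2A, so we may take
# alpha = 1.0 and beta = 20.0 (twice the smallest/largest eigenvalue of A).
A = np.diag([0.5, 10.0])
alpha, beta = 1.0, 20.0
f = lambda x: x @ A @ x            # minimum value 0, attained at x* = 0
grad = lambda x: 2 * A @ x

x = np.array([1.0, 1.0])
gap = f(x)                         # optimality gap f(x^t) - f(x*)
for t in range(15):
    g = grad(x)
    delta = minimize_scalar(lambda d: f(x - d * g)).x   # exact line search
    x = x - delta * g
    new_gap = f(x)
    print(t, new_gap / gap, "<=", 1 - alpha / beta)     # observed vs. guaranteed contraction
    gap = new_gap
```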

In the last portion of class on 8/28, we began Newton’s iteration, an optimization algorithm similar to gradient descent. Newton’s iteration is based on the Newton-Raphson method for finding the zeros of a function. Newton-Raphson iteration approximates the function by its tangent line, then takes the point where that line crosses zero as the next guess for a root, repeating this process until a sufficiently good approximation has been found. In one dimension, the update is $x^{t+1} = x^t - \frac{f(x^t)}{f'(x^t)}$.
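A minimal Python sketch of this one-dimensional update (the example equation $\cos x = x$ and the starting point are illustrative choices, not from the lecture):

```python
import math

def newton_raphson(f, fprime, x0, tol=1e-12, max_iter=100):
    """Newton-Raphson: repeatedly jump to the zero of the tangent line at x."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x = x - step                  # x^{t+1} = x^t - f(x^t) / f'(x^t)
        if abs(step) < tol:
            break
    return x

# Example: solve cos(x) = x, i.e. find the root of f(x) = cos(x) - x.
root = newton_raphson(lambda x: math.cos(x) - x, lambda x: -math.sin(x) - 1, x0=1.0)
```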

Newton’s iteration first notes that $f(y) \approx f(x) + \nabla f(x)(y-x) + \frac{1}{2} (y-x)^\mathsf{T}(\nabla^2 f(x))(y-x)$

is a close approximation of the function near $x$, since the higher-order terms are relatively small and can be ignored without much effect. Minimizing $f$ using this approximation amounts to choosing the best step $z = y - x$. Specifically, the optimal $z$ minimizes $g(z) = \nabla f(x) z + \frac{1}{2} z^\mathsf{T} (\nabla^2 f(x))z$, which can be solved analytically by setting the gradient to zero: $\nabla g(z) = \nabla f(x) + (\nabla^2 f(x))z = 0$, giving $z = -(\nabla^2 f(x))^{-1} \nabla f(x)$. Newton’s iteration therefore updates its estimate of the minimum by $x^{t+1} = x^t - (\nabla^2 f(x^t))^{-1} \nabla f(x^t)$.
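The same update in a minimal multivariate Python sketch (not from the lecture; the test function is an arbitrary smooth, strongly convex choice, and the caller supplies the gradient and Hessian):

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's iteration for minimization: x <- x - (hess f(x))^{-1} grad f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve (hess f(x)) z = grad f(x) instead of forming the inverse explicitly.
        x = x - np.linalg.solve(hess(x), g)
    return x

# Test function: f(x) = exp(x1) + exp(-x2) + ||x||^2 / 2 (smooth and strongly convex).
grad = lambda x: np.array([np.exp(x[0]), -np.exp(-x[1])]) + x
hess = lambda x: np.diag([np.exp(x[0]), np.exp(-x[1])]) + np.eye(2)
x_star = newton_minimize(grad, hess, x0=[2.0, -2.0])
```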

As an application of Newton-Raphson to root finding, suppose the degree-$n$ polynomial $f(x) = \sum_{i=0}^n a_i x^i$ has only real roots. Then $f(x)$ can also be written as $f(x) = a_n \prod_i (x-\lambda_i)$, with $\lambda_n \leq \lambda_{n-1} \leq ... \leq \lambda_1$. Then $f'(x) = a_n \sum_i \prod_{j \neq i} (x-\lambda_j)$, so $\frac{f(x)}{f'(x)} = \frac{1}{\sum_i \frac{1}{x-\lambda_i}}$. For $x > \lambda_1$, this ratio can be bounded by $\frac{1}{n}(x-\lambda_1) \leq \frac{f(x)}{f'(x)} \leq x-\lambda_1$ (lemma): each term satisfies $\frac{1}{x-\lambda_i} \leq \frac{1}{x-\lambda_1}$, so $\frac{1}{\sum_i \frac{1}{x-\lambda_i}} \geq \frac{1}{n}(x-\lambda_1)$, and dropping all but the $i=1$ term gives $\frac{1}{\sum_i \frac{1}{x-\lambda_i}} \leq \frac{1}{\frac{1}{x-\lambda_1}} = x-\lambda_1$.

Theorem 2. Suppose $\lambda_1 \leq x^0 \leq \lambda_1 + A$, where $A$ is some finite, though potentially very large, value. Then, after $O(n\log(\frac{A}{\epsilon}))$ iterations, Newton’s algorithm will find $x^t$ such that $\lambda_1 \leq x^t \leq \lambda_1 + \epsilon$.

Proof. By the lemma, $\frac{1}{n}(x^t - \lambda_1) \leq \frac{f(x^t)}{f'(x^t)} \leq x^t - \lambda_1$ whenever $x^t > \lambda_1$, so each Newton step never overshoots $\lambda_1$ and removes at least a $\frac{1}{n}$ fraction of the remaining gap: \begin{aligned} x^t - \lambda_1 &\leq (1-\frac{1}{n})(x^{t-1} - \lambda_1) \\ &\leq (1-\frac{1}{n})^t (x^0 - \lambda_1) \\ &\leq (1-\frac{1}{n})^t A \leq \epsilon \end{aligned} where the final inequality holds once $t \geq n\log(\frac{A}{\epsilon})$.

This is called linear convergence: the number of iterations needed scales with $\log\frac{1}{\epsilon}$. When close enough to a root, however, the convergence of Newton's method can be quadratic.
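The sketch below (illustrative; the roots and the starting point are arbitrary choices) runs Newton-Raphson on a real-rooted polynomial and prints the gap $x^t - \lambda_1$; it contracts by at least a factor $1-\frac{1}{n}$ per step, and much faster once $x^t$ is close to $\lambda_1$:

```python
import numpy as np

# Monic real-rooted polynomial f(x) = prod_i (x - lambda_i); lambda_1 = 3 is the largest root.
roots = np.array([3.0, 1.0, -2.0, -5.0])
n = len(roots)
f = np.poly1d(np.poly(roots))
fp = f.deriv()

x = roots[0] + 100.0                       # start somewhere to the right of lambda_1
for t in range(25):
    gap = x - roots[0]
    x = x - f(x) / fp(x)                   # Newton step; stays >= lambda_1
    # Guaranteed contraction: (x_{t+1} - lambda_1) <= (1 - 1/n) (x_t - lambda_1).
    print(t, x - roots[0], (x - roots[0]) / gap)
```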

Assume that $|f'(x^*)| \geq \alpha$ at a root $x^*$ (so $f(x^*) = 0$), and that $f'$ is $L$-Lipschitz: for all $x,y$, $|f'(y) - f'(x)| \leq L|y-x|$. Then, $x^{t+1} - x^* = x^t - x^* - \frac{f(x^t)}{f'(x^t)}$. Now, \begin{aligned} f(x^*) &= f(x^t) + \int_{x^t}^{x^*} f'(z)dz \\ &= f(x^t) + f'(x^t)(x^* - x^t) + \int_{x^t}^{x^*} (f'(z) - f'(x^t))dz \end{aligned}

Therefore, since $f(x^*) = 0$, $x^{t+1} - x^* = \frac{1}{f'(x^t)} \int_{x^t}^{x^*} (f'(z) - f'(x^t))dz$

so $|x^{t+1} - x^*| \leq \frac{L}{|f'(x^t)|} \left|\int_{x^t}^{x^*} |z-x^t|dz\right| = \frac{L|x^* - x^t|^2}{2|f'(x^t)|}$.

Then, $|f'(x^t)| \geq |f'(x^*)| - L|x^t-x^*| \geq \alpha - L|x^t-x^*|$

so $|x^{t+1} - x^*| < \frac{L}{2(\alpha - L|x^t-x^*|)}|x^t - x^*|^2$.

If $|x^t - x^*| \leq \frac{\alpha}{2L}$, then $\alpha - L|x^t-x^*| \geq \frac{\alpha}{2}$, so $|x^{t+1} - x^*| \leq \frac{L}{\alpha} |x^t - x^*|^2 \leq \frac{1}{2} |x^t - x^*|$

and $\frac{L}{\alpha}|x^{t+1}-x^*| \leq (\frac{L}{\alpha}|x^t-x^*|)^2$.

This is quadratic convergence: writing $\epsilon_0 = |x^0 - x^*| \leq \frac{\alpha}{2L}$, after $t$ steps $\frac{L}{\alpha} |x^t - x^*| \leq (\frac{L}{\alpha} \epsilon_0)^{2^t} \leq 2^{-2^t}$, so $|x^t - x^*| \leq \epsilon \frac{\alpha}{L}$ after only $\log\log(\frac{1}{\epsilon})$ steps. Note that $f$ has not been assumed to be convex or polynomial, although this rate only holds for a sufficiently close starting point.
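To see the digits of accuracy roughly double each step, here is a small illustration (the function $x^2 - 2$ and the starting point are arbitrary choices, not from the lecture):

```python
import math

x_star = math.sqrt(2.0)                   # true root of f(x) = x^2 - 2
x = 2.0                                   # starting point close enough to the root
for t in range(6):
    x = x - (x * x - 2.0) / (2.0 * x)     # Newton-Raphson update
    # The error |x^t - x*| roughly squares at each step (digits of accuracy double).
    print(t, abs(x - x_star))
```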

Moving back to optimization, Newton’s iteration works generally by updating $x^{t+1} = x^t - (\nabla^2f(x^t))^{-1}\nabla f(x^t)$. By an argument analogous to the one above, assuming $\|(\nabla^2f(x^*))^{-1}\| \leq \frac{1}{\alpha}$ and $\|\nabla^2f(y) - \nabla^2f(x)\| \leq L\|y-x\|$, Newton’s iteration has quadratic convergence from points close enough to the optimum.

In fact, Newton’s iteration is a fairly natural idea, since

1. it is affine invariant: changing the coordinate system does not change the iterates, and hence does not affect the convergence (see the sketch below), and
2. the Newton step is the steepest descent direction when distances are measured in the norm defined by the Hessian.
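A small numerical check of the affine-invariance claim (a sketch with an assumed test function, not from the lecture): run Newton's iteration on $f$ and on $g(y) = f(My)$ for an invertible $M$, and verify that the iterates track each other exactly, $x^t = M y^t$.

```python
import numpy as np

# Test function f(x) = exp(x1) + exp(2 x2) + ||x||^2 / 2 (an arbitrary smooth choice).
def grad_f(x):
    return np.array([np.exp(x[0]), 2 * np.exp(2 * x[1])]) + x

def hess_f(x):
    return np.diag([np.exp(x[0]), 4 * np.exp(2 * x[1])]) + np.eye(2)

M = np.array([[2.0, 1.0], [1.0, 1.0]])           # invertible change of coordinates x = M y

def grad_g(y):                                   # g(y) = f(M y)
    return M.T @ grad_f(M @ y)

def hess_g(y):
    return M.T @ hess_f(M @ y) @ M

x = np.array([1.0, -1.0])
y = np.linalg.solve(M, x)                        # corresponding starting point
for t in range(6):
    x = x - np.linalg.solve(hess_f(x), grad_f(x))
    y = y - np.linalg.solve(hess_g(y), grad_g(y))
    print(t, np.max(np.abs(x - M @ y)))          # stays at (numerical) zero: x^t = M y^t
```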