8/21-8/23: Gradient Descent

The main idea of gradient descent for minimizing a function is to move in the direction of the negative gradient.  In order to be useful as an algorithm, a stopping condition is also required.  Ideally, the algorithm would stop at a local minimum, but the algorithm cannot directly observe when it is at the local minimum.  Instead, the algorithm should stop updating when the gains from updating or the gradient becomes sufficiently small.

Gradient descent updates its estimate of the minimum according to a parameter \eta .  Specifically, if x_t is the algorithm’s estimate of the minimum at time t , then the algorithm updates itself using

x_{t+1} = x_t - \eta \nabla f(x_t).

Using this update, when the algorithm gets close to the minimum, it may jump past the minimum, then follow the gradient and jump back to its estimate from the previous step.  Because of this, \eta is often decreased as the algorithm runs.  This continues to run until the differences between successive steps is within some margin of error \epsilon.

For general functions, this technique isn’t guaranteed to find the global minimum, nor will any algorithm which doesn’t query every possible input for the function.  It’s always possible that some unqueried region contains the minimum.  To account for this case, we will assume that the function we are trying to minimize is convex.  This means that the entire function sits on or above the tangent line or plane at any point.  More precisely:

A function {\bf R}^n \rightarrow {\bf R} is convex if for any two points x,y \in {\bf R}^n, f(y) \geq f(x) + (y-x) \nabla f(x).

For any convex function, a local minimum must also be the global minimum (since at any minimum m , the gradient is 0, so applying the above definition yields f(x) \geq f(m) for any point x in the domain, implying m is the global minimum).

Another characteristic of convex functions is that, if a function f is twice differentiable, then f is convex if and only if the second derivative of f is always nonnegative (since f(y) = f(x) + \int_x^y f'(z)dz and f'(z) = f'(x) + \int_x^z f''(w)dw , and since f''(w) \geq 0 , f'(z) \geq f'(x) and so f(y) \ge f(x) + (y-x)f'(x)).

The concept of the second derivative being positive can be generalized to higher dimensions: the Hessian (matrix of second-order partial derivatives of the function) is Positive SemiDefinite (PSD), meaning that all eigenvalues are nonnegative (positive semidefinite is a generalization of being nonnegative to matrices).  Equivalently, a matrix A (in this case A = \nabla^2 f(x)) is PSD iff for all y \in {\bf R}^ny^\mathsf{T}Ay \geq 0.

Another problem that affects gradient descent algorithm is the possibility of overshooting the minimum and ending up relatively far up a particularly steep section of the function.  In order to avoid this, a second assumption is made: that the gradient is bounded (\|\nabla f(x)\| \leq L , where L is the bound on the gradient).  Then, some bounds can be put on the performance of the algorithm.  Let x^*  be a minimum.  Then,

\|x_{t+1} - x^*\|^2 = \|x_t-x^*-\eta \nabla f(x)\|^2 = \|x_t-x^*\|^2 - 2\eta \nabla f(x_t)(x_t-x^*) + \eta^2\|\nabla f(x)\|^2 \leq \|x_t-x^*\|^2 - 2\eta (f(x_t)-f(x^*)) + \eta^2 L^2.

If f(x_t) - f(x^*) > \epsilon (meaning the algorithm will continue), then this is also \leq \|x_t - x^*\|^2 - 2\eta\epsilon + \eta^2 L^2 .  Then, to maximize the reduction in f , set \eta = \frac{\epsilon}{L^2} .  Furthermore, if the domain is finite and of diameter D (meaning for any pair of points x,y in the domain, \|x-y\| \leq D ), then the maximum number of iterations the algorithm can take is \frac{D^2 L^2}{\epsilon ^2} . We can either output the point with the minimum function value among all the points encountered, or stop when the norm of the gradient is at most \epsilon/D and output the last point.

The concept of convexity can also be refined.  While convex functions have a nonnegative Hessian, a strongly convex function has a strictly positive Hessian.  A function f is \alpha -strongly convex if, for all x in the domain, x^\mathsf{T}(\nabla^2 f(x))x \geq \alpha ||x||^2 .  From this, f being \alpha -strongly convex implies that f(y) \geq f(x) + \nabla f(x)(y-x) + \frac{\alpha}{2}\|y-x\|^2 .

An proof of the above fact follows from the intermediate value theorem: For any x,y there exists a z \in [x,y] s.t. f(y)=f(x)+(y-x)\cdot \nabla f(x)+\frac{1}{2}(y-x)^\mathsf{T}\nabla^2 f(z) (y-x). (Why?)

At the end of the lecture, we saw a second variant of gradient descent.  This algorithm operates by repeatedly finding a scalar t such that f(x - t\nabla f(x) is minimized, then updating the current x value by x = x - t\nabla f(x) until \|\nabla f(x)\|^2 \leq \epsilon\alpha. We will analyze this algorithm in the next lecture.