9-6 Lecture Notes

First we revisit the definition of a convex function. Previously, we have seen that a convex function is one that satisfies f(y) \geq f(x) + \nabla f(x) (y-x) if f is differentiable, or satisfies \nabla^2 f(x) \ge 0 if f is twice differentiable. However, both of these definitions require f to be differentiable, and convex functions include a broader range of functions than just differentiable ones.

Convex functions are functions that satisfy \forall x,y,\lambda\in(0,1) f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y) . Intuitively, this definition says that the function value of a convex combination (a nonnegative weighted average) of any pair of points will have be at most the convex combination of the values at those points. This definition requires no differentiability assumption on f .

The definition of a gradient can be generalized to analyze non-differentiable functions. A subgradient is the set of all possible vectors that could satisfy the previously seen definition of a gradient. Specifically, while a gradient satisfies f(y) - f(x) \geq \nabla f(x) \cdot (y-x) , a subgradient of f at x is the set \{v : \forall y f(y) - f(x) \geq v\cdot (y-x)\} . For differentiable portions of f , the subgradient is the gradient. However, at non-differentiable points in f , the subgradient takes on a wider range of values, and defines all candidate planes with the entire function on one side of the plane.

For example, the linear program
\min \quad c^T x \\ s.t. \quad Ax \geq b
can be reformulated as \min f(x) = c^Tx + h(x) , where h(x) = 0 if Ax \geq b , and h(x) = \infty otherwise. Here, h is not differentiable, and so neither is f . However, this function can still be shown to be convex, using our new definition of convex. For a pair x,y , if both x and y fall in the feasible region of the program, then h will be zero for both, and h(\lambda x + (1-\lambda)y) = \lambda h(x) + (1-\lambda)h(y) , since any nonnegative combination of x,y is also feasible. Otherwise, h = \infty for at least one of x or y . In this case, \lambda h(x) + (1-\lambda)h(y) = \infty , so h(\lambda x + (1-\lambda)y) \leq \lambda h(x) + (1-\lambda)h(y) . More generally, we could replace Ax \ge b with any convex set, i.e., a set K s.t. for any pair x,y \in K the segment [x,y] \subseteq K.

This suggests a new form of gradient descent algorithm, projected (proximal) gradient descent. The algorithm takes a usual gradient descent step, then projects the predicted point to the nearest point in the feasible set. Specifically, the algorithm updates its estimate by x = argmin_w(h(w) + \frac{1}{2\eta} \|w-(x-\eta\nabla g(x))\|^2) = prox(x-\eta\nabla g(x)) .

Theorem 1. Given f(x) = g(x) + h(x) , with g,h convex, g differentiable, and \nabla g Lipshitz (meaning \|\nabla g(y) - \nabla g(x)\| \leq L\|y-x\| ), after t iterations, f(x^t) - f(x^*) \leq \frac{1}{2\eta t}(\|x^0-x^*\|^2) or x^{t+1} = x^t (implying that x^t is a minimum of f ).

To see the claim in parenthesis above, note that

x^{t+1} = argmin_w(h(w) + \frac{1}{2\eta} \|w - (x^t - \eta \nabla g(x))\|^2) . Then,

\begin{aligned}  0 &\in \partial (h(w) + \frac{1}{2\eta}\|w-(x-\eta\nabla g(x))\|^2) \\  &\in \partial h(w) + \frac{1}{\eta}(w-x+\eta\nabla g(x)) \\  &\in \partial h(w) + 0 + \nabla g(x) \implies 0 \in \partial h(w) + \nabla g(x) \end{aligned}
Therefore, 0 \in \partial h(x) + \nabla g(x) = \partial f(x) , implying that x is a minimum.

Lemma 1. \forall y, g(y) \leq g(x) + \nabla g(x) (y-x) + \frac{L}{2} \|y-x\|^2 .  We have seen the proof for this last week.

Lemma 2. \forall y, f(x^{t+1}) \leq f(y) + \frac{1}{\eta}(x^t-x^{t+1})(x^t-y) - \frac{1}{2\eta} \|x^t-x^{t+1}\|^2

Proof of Theorem 1. By Lemma 2, with y = x^t , f(x^{t+1}) \leq f(x^t) - \frac{1}{2\eta} \|x^t-x^{t+1}\|^2 .  By Lemma 2, with  y=x^* ,

f(x^{t+1}) \leq f(x^*) + \frac{1}{\eta}(x^t-x^{t+1})(x^t-x^*) - \frac{1}{2\eta} \|x^t-x^{t+1}\|^2  , then, rearranging the parts with \frac{1}{\eta} and simplifying the expression yields

f(x^{t+1} - f(x^*)) \leq \frac{1}{2\eta}(\|x^t-x^*\|^2 - \|x^{t+1}-x^*\|^2)

Then, summing over t=1,2,...,T yields T(f(x^T) - f(x^*)) \leq \frac{1}{2\eta} \|x^0-x^*\|^2, so

f(x^T) \leq f(x^*) + \frac{1}{2\eta T} \|x^0-x^*\|^2 .

Then, after T = \frac{1}{2\eta\epsilon}\|x^0-x^*\|^2 iterations, f(x^T) \leq f(x^*) + \epsilon .

Proof of Lemma 2. Note that g(y) \leq g(x) + \nabla g(x) \cdot (y-x) + \frac{L}{2} \|y-x\|^2 ,

h(y) \geq h(x) + \partial h(x)\cdot (y-x) , and

\frac{1}{\eta} (x^t - x^{t+1}) - \nabla g(x) \in \partial h(x^{t+1}) .

Then, using these, \begin{aligned}  f(x^{t+1}) &\leq g(x^t) - \nabla g(x^t)\cdot (x^t - x^{t+1}) + \frac{L}{2}\|x^t - x^{t+1}\|^2 + h(x^{t+1})\\  &\leq g(y) - \nabla g(x^t)\cdot(y-x^t) - \nabla g(x^t)\cdot (x^t - x^{t+1}) + \frac{1}{2\eta}\|x^t-x^{t+1}\|^2 \\  &\qquad+ h(y) - \left(\frac{1}{\eta}(x^t - x^{t+1}) - \nabla g(x^t))\cdot (y-x^{t+1}\right)\\  &=f(y) - \frac{1}{\eta} (x^t-x^{t+1})\cdot (y-x^{t}) - \frac{1}{2\eta} \|x^t - x^{t+1}\|^2  \end{aligned}

as claimed.