First we revisit the definition of a convex function. Previously, we have seen that a convex function is one that satisfies if is differentiable, or satisfies if is twice differentiable. However, both of these definitions require to be differentiable, and convex functions include a broader range of functions than just differentiable ones.

Convex functions are functions that satisfy . Intuitively, this definition says that the function value of a convex combination (a nonnegative weighted average) of any pair of points will have be at most the convex combination of the values at those points. This definition requires no differentiability assumption on .

The definition of a gradient can be generalized to analyze non-differentiable functions. A subgradient is the set of all possible vectors that could satisfy the previously seen definition of a gradient. Specifically, while a gradient satisfies , a subgradient of at is the set . For differentiable portions of , the subgradient is the gradient. However, at non-differentiable points in , the subgradient takes on a wider range of values, and defines all candidate planes with the entire function on one side of the plane.

For example, the linear program

can be reformulated as , where if , and otherwise. Here, is not differentiable, and so neither is . However, this function can still be shown to be convex, using our new definition of convex. For a pair , if both and fall in the feasible region of the program, then will be zero for both, and , since any nonnegative combination of is also feasible. Otherwise, for at least one of or . In this case, , so . More generally, we could replace with any convex set, i.e., a set s.t. for any pair the segment .

This suggests a new form of gradient descent algorithm, projected (proximal) gradient descent. The algorithm takes a usual gradient descent step, then projects the predicted point to the nearest point in the feasible set. Specifically, the algorithm updates its estimate by .

**Theorem 1.** Given , with convex, differentiable, and Lipshitz (meaning ), after iterations, or (implying that is a minimum of ).

To see the claim in parenthesis above, note that

. Then,

Therefore, , implying that is a minimum.

**Lemma 1.** . We have seen the proof for this last week.

**Lemma 2.**

**Proof of Theorem 1.** By Lemma 2, with , . By Lemma 2, with ,

, then, rearranging the parts with and simplifying the expression yields

Then, summing over yields , so

.

Then, after iterations, .

**Proof of Lemma 2.** Note that ,

, and

.

Then, using these,

as claimed.