First we revisit the definition of a convex function. Previously, we have seen that a convex function is one that satisfies $f(y) \geq f(x) + \nabla f(x)^\top (y - x)$ for all $x, y$ if $f$ is differentiable, or satisfies $\nabla^2 f(x) \succeq 0$ for all $x$ if $f$ is twice differentiable. However, both of these definitions require $f$ to be differentiable, and convex functions include a broader range of functions than just differentiable ones.
Convex functions are functions that satisfy $f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y)$ for all $x, y$ and all $\lambda \in [0, 1]$. Intuitively, this definition says that the function value at a convex combination (a nonnegative weighted average) of any pair of points is at most the convex combination of the function values at those points. This definition requires no differentiability assumption on $f$.
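For example, $f(x) = |x|$ is convex but not differentiable at $0$: by the triangle inequality, $|\lambda x + (1 - \lambda) y| \leq \lambda |x| + (1 - \lambda) |y|$ for any $\lambda \in [0, 1]$.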
The definition of a gradient can be generalized to analyze non-differentiable functions. A subgradient is any vector that could satisfy the previously seen first-order definition of the gradient. Specifically, while the gradient satisfies $f(y) \geq f(x) + \nabla f(x)^\top (y - x)$ for all $y$, the set of subgradients (the subdifferential) of $f$ at $x$ is
$$\partial f(x) = \{\, g : f(y) \geq f(x) + g^\top (y - x) \text{ for all } y \,\}.$$
Where $f$ is differentiable, the only subgradient is the gradient. However, at non-differentiable points of $f$, the subdifferential contains a wider range of vectors, and its elements define all candidate planes with the entire function on one side of the plane.
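For the earlier example $f(x) = |x|$, the subdifferential is $\partial f(x) = \{1\}$ for $x > 0$ and $\{-1\}$ for $x < 0$, while at the non-differentiable point $x = 0$ it is the entire interval $\partial f(0) = [-1, 1]$: every line through the origin with slope in $[-1, 1]$ lies below $|x|$.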
For example, the linear program $\min_x c^\top x$ subject to $Ax \leq b$ can be reformulated as the unconstrained problem $\min_x c^\top x + I(x)$, where $I(x) = 0$ if $Ax \leq b$, and $I(x) = \infty$ otherwise. Here, $I$ is not differentiable, and so neither is the objective $c^\top x + I(x)$. However, this function can still be shown to be convex, using our new definition of convexity. For a pair $x, y$ and $\lambda \in [0, 1]$, if both $x$ and $y$ fall in the feasible region of the program, then $I$ will be zero for both, and $I(\lambda x + (1 - \lambda) y) = 0 = \lambda I(x) + (1 - \lambda) I(y)$, since any convex combination of feasible points is also feasible. Otherwise, $I = \infty$ for at least one of $x$ or $y$. In this case, $\lambda I(x) + (1 - \lambda) I(y) = \infty$ for $\lambda \in (0, 1)$ (the endpoints are immediate), so $I(\lambda x + (1 - \lambda) y) \leq \lambda I(x) + (1 - \lambda) I(y)$. Since $c^\top x$ is convex and sums of convex functions are convex, the reformulated objective is convex as well. More generally, we could replace the feasible region $\{x : Ax \leq b\}$ with any convex set, i.e., a set $K$ s.t. for any pair $x, y \in K$, the segment $\{\lambda x + (1 - \lambda) y : \lambda \in [0, 1]\}$ is contained in $K$.
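As a small numerical sanity check of this argument (the matrix $A$, vector $b$, and random sampling below are arbitrary illustrative choices, not part of the notes), the indicator of a feasible region satisfies the convexity inequality, including the infinite cases:

import numpy as np

# Indicator of the feasible region {x : A x <= b}: 0 if feasible, infinity otherwise.
A = np.array([[1.0, 1.0], [-1.0, 2.0]])
b = np.array([1.0, 2.0])

def indicator(x):
    return 0.0 if np.all(A @ x <= b + 1e-12) else np.inf

rng = np.random.default_rng(1)
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    lam = rng.uniform(0.01, 0.99)  # strictly between 0 and 1
    lhs = indicator(lam * x + (1 - lam) * y)
    rhs = lam * indicator(x) + (1 - lam) * indicator(y)
    assert lhs <= rhs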
This suggests a new form of gradient descent algorithm, projected (proximal) gradient descent. The algorithm takes a usual gradient descent step, then projects the predicted point to the nearest point in the feasible set. Specifically, the algorithm updates its estimate by $x_{t+1} = \Pi_K(x_t - \eta \nabla f(x_t))$, where $\Pi_K(y) = \arg\min_{z \in K} \|z - y\|$ denotes the projection onto $K$.
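A minimal sketch of this update in code, assuming a Euclidean unit ball as the feasible set and a simple quadratic objective; the helper names, step size, and iteration count below are illustrative choices rather than anything fixed by the notes:

import numpy as np

def project_unit_ball(y):
    # Euclidean projection onto K = {x : ||x|| <= 1}: scale y back onto the ball if needed.
    norm = np.linalg.norm(y)
    return y if norm <= 1.0 else y / norm

def projected_gradient_descent(grad_f, x0, eta, T, project=project_unit_ball):
    # Repeatedly take a gradient step, then project back onto the feasible set:
    # x_{t+1} = Pi_K(x_t - eta * grad_f(x_t)).
    x = x0
    for _ in range(T):
        x = project(x - eta * grad_f(x))
    return x

# Example: minimize f(x) = ||x - c||^2 over the unit ball, with c outside the ball;
# the constrained minimizer is c / ||c|| = [1, 0].
c = np.array([2.0, 0.0])
grad_f = lambda x: 2.0 * (x - c)
x_hat = projected_gradient_descent(grad_f, x0=np.zeros(2), eta=0.25, T=100)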
Theorem 1. Given the problem $\min_{x \in K} f(x)$, with $K$ convex, $f$ convex and differentiable, and $\nabla f$ $L$-Lipschitz (meaning $\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|$ for all $x, y$), after $T$ iterations of projected gradient descent with step size $\eta = 1/L$, either
$$f(x_T) - f(x^*) \leq \frac{L \|x_0 - x^*\|^2}{2T},$$
where $x^* = \arg\min_{x \in K} f(x)$, or $x_{t+1} = x_t$ for some $t < T$ (implying that $x_t$ is a minimum of $f$ over $K$).
To see the claim in parentheses above, note that if $x_{t+1} = x_t$, then $x_t = \Pi_K(x_t - \eta \nabla f(x_t))$, and the standard property of Euclidean projection onto a convex set gives $(x_t - \eta \nabla f(x_t) - x_t)^\top (y - x_t) \leq 0$ for all $y \in K$, i.e., $\nabla f(x_t)^\top (y - x_t) \geq 0$ for all $y \in K$. Then, by convexity,
$$f(y) \geq f(x_t) + \nabla f(x_t)^\top (y - x_t) \geq f(x_t) \quad \text{for all } y \in K.$$
Therefore, $f(y) \geq f(x_t)$ for every $y \in K$, implying that $x_t$ is a minimum.
Lemma 1. If $\nabla f$ is $L$-Lipschitz, then $f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2} \|y - x\|^2$ for all $x, y$. We have seen the proof for this last week.
Lemma 2. For the projected gradient descent update $x_{t+1} = \Pi_K(x_t - \frac{1}{L} \nabla f(x_t))$ and any $z \in K$,
$$f(x_{t+1}) \leq f(z) + \frac{L}{2} \|x_t - z\|^2 - \frac{L}{2} \|x_{t+1} - z\|^2.$$
Proof of Theorem 1. By Lemma 2, with $z = x_t$, $f(x_{t+1}) \leq f(x_t) - \frac{L}{2} \|x_{t+1} - x_t\|^2 \leq f(x_t)$, so the function values never increase. By Lemma 2, with $z = x^*$,
$$f(x_{t+1}) \leq f(x^*) + \frac{L}{2} \|x_t - x^*\|^2 - \frac{L}{2} \|x_{t+1} - x^*\|^2,$$
then, rearranging the parts with $f$ and simplifying the expression yields
$$f(x_{t+1}) - f(x^*) \leq \frac{L}{2} \left( \|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2 \right).$$
Then, summing over $t = 0, \dots, T-1$ yields the telescoping bound
$$\sum_{t=0}^{T-1} \left( f(x_{t+1}) - f(x^*) \right) \leq \frac{L}{2} \left( \|x_0 - x^*\|^2 - \|x_T - x^*\|^2 \right) \leq \frac{L}{2} \|x_0 - x^*\|^2,$$
so, since the gaps $f(x_{t+1}) - f(x^*)$ are non-increasing in $t$,
$$T \left( f(x_T) - f(x^*) \right) \leq \sum_{t=0}^{T-1} \left( f(x_{t+1}) - f(x^*) \right) \leq \frac{L}{2} \|x_0 - x^*\|^2.$$
Then, after $T$ iterations,
$$f(x_T) - f(x^*) \leq \frac{L \|x_0 - x^*\|^2}{2T}.$$
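The rate in Theorem 1 can be sanity-checked empirically. In the sketch below, the quadratic objective, box constraint, starting point, and minimizer are all illustrative choices; the bound is computed exactly as stated in the theorem, with $L = 20$ the Lipschitz constant of the chosen gradient:

import numpy as np

# f(x) = (x1 - 2)^2 + 10*(x2 + 1)^2 over the box K = [0, 1]^2.
# grad f = [2(x1 - 2), 20(x2 + 1)] is L-Lipschitz with L = 20.
L = 20.0
f = lambda x: (x[0] - 2.0) ** 2 + 10.0 * (x[1] + 1.0) ** 2
grad_f = lambda x: np.array([2.0 * (x[0] - 2.0), 20.0 * (x[1] + 1.0)])
project = lambda y: np.clip(y, 0.0, 1.0)  # Euclidean projection onto the box

x_star = np.array([1.0, 0.0])  # coordinate-wise minimizer of f over the box
x0 = np.array([0.0, 1.0])
x = x0
for T in range(1, 501):
    x = project(x - (1.0 / L) * grad_f(x))  # one projected gradient step
    bound = L * np.dot(x0 - x_star, x0 - x_star) / (2.0 * T)
    assert f(x) - f(x_star) <= bound + 1e-9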
Proof of Lemma 2. Note that by Lemma 1, $f(x_{t+1}) \leq f(x_t) + \nabla f(x_t)^\top (x_{t+1} - x_t) + \frac{L}{2} \|x_{t+1} - x_t\|^2$; by convexity of $f$, $f(x_t) \leq f(z) + \nabla f(x_t)^\top (x_t - z)$; and, by the projection property applied to $x_{t+1} = \Pi_K(x_t - \frac{1}{L} \nabla f(x_t))$ with $z \in K$, $\left( x_t - \frac{1}{L} \nabla f(x_t) - x_{t+1} \right)^\top (z - x_{t+1}) \leq 0$, i.e., $\nabla f(x_t)^\top (x_{t+1} - z) \leq L (x_t - x_{t+1})^\top (x_{t+1} - z)$.
Then, using these,
$$f(x_{t+1}) \leq f(z) + \nabla f(x_t)^\top (x_{t+1} - z) + \frac{L}{2} \|x_{t+1} - x_t\|^2 \leq f(z) + L (x_t - x_{t+1})^\top (x_{t+1} - z) + \frac{L}{2} \|x_{t+1} - x_t\|^2 = f(z) + \frac{L}{2} \|x_t - z\|^2 - \frac{L}{2} \|x_{t+1} - z\|^2,$$
where the final equality uses $\|x_t - z\|^2 = \|x_t - x_{t+1}\|^2 + 2 (x_t - x_{t+1})^\top (x_{t+1} - z) + \|x_{t+1} - z\|^2$, as claimed.