## Interior Point Method

We study another class of optimization algorithms: interior point methods.  These require more specific structure than general convex optimization, but still apply beyond linear programming.  These algorithms work by following an interior path, determined by successive minima of a modified objective function. We will study one of the simplest variants, the primal-dual interior point method, in complete detail.

Consider the primal program (P)

$\min \boldsymbol{c^T}\boldsymbol{x}$
$s.t.\ A\boldsymbol{x} = \boldsymbol{b},\ \boldsymbol{x} \geq 0$

with corresponding dual (D)

$\max \boldsymbol{b^T}\boldsymbol{y}$
$s.t.\ A^T\boldsymbol{y} + \boldsymbol{s} = \boldsymbol{c},\ \boldsymbol{s} \geq 0$

Note that weak duality gives (P) $\geq$ (D): for feasible $\boldsymbol{x}$ and $(\boldsymbol{y}, \boldsymbol{s})$, $0 \leq \boldsymbol{c}^T\boldsymbol{x} - \boldsymbol{b}^T\boldsymbol{y} = \boldsymbol{c}^T\boldsymbol{x} - \boldsymbol{x}^TA^T\boldsymbol{y} = \boldsymbol{c}^T\boldsymbol{x} - \boldsymbol{x}^T(\boldsymbol{c}-\boldsymbol{s}) = \boldsymbol{x}^T\boldsymbol{s}$.  Equality holds exactly when the complementary slackness conditions hold: for each $j$, either the primal variable $x_j$ is 0, or its corresponding dual constraint is tight (zero slack $s_j$).

Modify the primal objective to $P(t) = \min \boldsymbol{c^T}\boldsymbol{x} - t \sum_{j=1}^n \ln(x_j)$, with the same constraints.  This modification punishes small $x_j$ very severely (at $x_j=0$ the penalty is $\infty$), and the resulting problem is relatively easy to optimize.  As $t$ approaches 0, the modified objective approaches the original, so we can solve the original program by solving the modified one for successively smaller $t$.
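To see the effect of the barrier concretely, here is a minimal sketch on a toy problem (the LP and all numbers below are made up for illustration, not from the lecture): for $\min x_1 + 2x_2$ subject to $x_1 + x_2 = 1$, $x \geq 0$, we eliminate $x_2 = 1 - x_1$ and minimize the one-dimensional barrier objective by bisection on its derivative.

```python
# Toy barrier problem:  min  x1 + 2*x2   s.t.  x1 + x2 = 1,  x >= 0.
# Eliminating x2 = 1 - x1 leaves a 1-D problem on (0, 1):
#   g(x1) = x1 + 2*(1 - x1) - t*(ln x1 + ln(1 - x1))

def barrier_minimizer(t, lo=1e-12, hi=1.0 - 1e-12, iters=200):
    """Find the root of g'(x1) = -1 - t/x1 + t/(1 - x1) by bisection.
    g' is strictly increasing on (0, 1), so the root is unique."""
    def gprime(x):
        return -1.0 - t / x + t / (1.0 - x)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if gprime(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# As t -> 0 the barrier minimizer approaches the LP optimum x = (1, 0).
for t in (1.0, 0.1, 0.01, 0.001):
    x1 = barrier_minimizer(t)
    print(t, x1, x1 + 2 * (1 - x1))
```

As $t$ shrinks, the printed minimizer moves toward $x_1 = 1$ and the objective value toward the LP optimum 1, illustrating why successively reducing $t$ solves the original program.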

The minimizer of $P(t)$ satisfies the conditions $X\boldsymbol{s} = t\boldsymbol{1}, A\boldsymbol{x}=\boldsymbol{b}, \boldsymbol{x}\geq 0, A^T\boldsymbol{y}+\boldsymbol{s}=\boldsymbol{c}$, where $X=diag(\boldsymbol{x})$.  We can start with a solution for some $t$ such that $\boldsymbol{x}$ is feasible in (P), $(\boldsymbol{y}, \boldsymbol{s})$ is feasible in (D), and $\boldsymbol{x}^T\boldsymbol{s}=tn$.  Our goal is to iteratively find new solutions with lower $t$, eventually driving $t$ to 0.  In doing this, we can't quite maintain $X\boldsymbol{s}=t\boldsymbol{1}$, but we can keep the values close, with an $\epsilon$-approximate solution satisfying $\|X\boldsymbol{s}-t\boldsymbol{1}\| \leq \epsilon t$.
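On the central path itself the parameter $t$ measures the duality gap exactly, which is why $\boldsymbol{x}^T\boldsymbol{s} = tn$ is the natural normalization:

```latex
% Summing the n coordinate equations of X s = t 1 gives the duality gap:
\boldsymbol{x}^T\boldsymbol{s} \;=\; \sum_{j=1}^{n} x_j s_j \;=\; \sum_{j=1}^{n} t \;=\; tn
```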

Now, suppose we have a valid $(x, y, s)$ for parameter $t$, and we want to make $t$ smaller.  We fix some $\alpha < 1$ and let $\hat{t} = \alpha t$.  Then, solve the following system of linear equations for $(\Delta x, \Delta y, \Delta s)$:

$A \Delta x = 0$

$A^T \Delta y + \Delta s = 0$

$S \Delta x + X \Delta s = t \boldsymbol{1} - X s$, where $S = diag(\boldsymbol{s})$

We then update $x\leftarrow x + \Delta x$, $y \leftarrow y + \Delta y$, $s \leftarrow s + \Delta s$, and $t \leftarrow \alpha t$.
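As a concrete illustration, here is a minimal numpy sketch of one such step on a toy LP (the data $A, b, c$ and the starting point are made up for illustration; for clarity it solves the full $(2n+m) \times (2n+m)$ system rather than the reduced $(m+n) \times (m+n)$ one mentioned later):

```python
import numpy as np

# Toy LP:  min c^T x  s.t.  A x = b, x >= 0.
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
c = np.array([1.0, 2.0])
m, n = A.shape

# A strictly feasible (but off-central-path) starting point.
x = np.array([0.5, 0.5])
y = np.zeros(m)
s = c - A.T @ y          # so that A^T y + s = c
t = x @ s / n            # current duality-gap parameter

# Assemble the Newton system:
#   A dx = 0,   A^T dy + ds = 0,   S dx + X ds = t*1 - X s
X, S = np.diag(x), np.diag(s)
K = np.block([
    [A,                np.zeros((m, m)), np.zeros((m, n))],
    [np.zeros((n, n)), A.T,              np.eye(n)],
    [S,                np.zeros((n, m)), X],
])
rhs = np.concatenate([np.zeros(m), np.zeros(n), t * np.ones(n) - x * s])
d = np.linalg.solve(K, rhs)
dx, dy, ds = d[:n], d[n:n + m], d[n + m:]

x, y, s = x + dx, y + dy, s + ds   # then shrink t <- alpha * t and repeat
print(x, np.linalg.norm(x * s - t))
```

After the step, $A\boldsymbol{x} = \boldsymbol{b}$ still holds exactly (since $A\Delta x = 0$), and the deviation $\|X\boldsymbol{s} - t\boldsymbol{1}\|$ shrinks, as Lemma 1 below predicts.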

Lemma 1.
Let $\varepsilon < 1/2$. If $(x, y, s)$ is a solution of $P(t)$ with $\| Xs - t\boldsymbol{1} \| \leq \varepsilon t$, then $(\hat{x}, \hat{y}, \hat{s}) = (x, y, s) + (\Delta x, \Delta y, \Delta s)$ is a $\frac{1+\varepsilon}{(1-\varepsilon)^2}{\varepsilon}^2$-approximate solution with the same $t$.

Lemma 2.
Let $\alpha \geq 1 - \frac{c}{\sqrt{n}}$ for a suitable constant $c$. If $(x, y, s)$ is an $\varepsilon = \frac{1}{5}$-approximate solution with $t$, then $(\hat{x}, \hat{y}, \hat{s})$ is an $\varepsilon = \frac{1}{5}$-approximate solution with $\alpha t$.

Theorem 1.
This interior point method reaches a solution with duality gap at most $\delta$ in $O(\sqrt{n} \log \frac{x_0^T s_0}{n \delta})$ iterations, where each iteration is the solution of an $(m+n) \times (m+n)$ linear system.

Proof. (Theorem 1)
Start at $t_0 = 5 \frac{x_0^T s_0}{n}$ with $\varepsilon = \frac{1}{5}$. After each iteration, $t_{i+1} \leq \alpha t_i$, maintaining $\varepsilon = \frac{1}{5}$. To reach $t = 5 \delta$, we need ${\alpha}^k \leq \frac{\delta n}{x_0^T s_0}$. Since $\ln \frac{1}{\alpha} \geq \frac{c}{\sqrt{n}}$, taking $k \geq \frac{\sqrt{n}}{c} \log \frac{x_0^T s_0}{n \delta}$ suffices. $\square$

Proof. (Lemma 1)
Let $(x, y, s)$ satisfy

$Ax = b$
$A^T y + s = c$
$\| Xs - t\boldsymbol{1} \| \leq \varepsilon t$
$x \geq 0$

and let $(\Delta x, \Delta y, \Delta s)$ satisfy

$A \Delta x = 0$
$A^T \Delta y + \Delta s = 0$
$S \Delta x + X \Delta s = t\boldsymbol{1} - Xs$

Note the following:

(i) $(\Delta x)^T \Delta s = (\Delta x)^T (-A^T \Delta y) = -(A \Delta x)^T \Delta y = 0$

(ii) $(1-\varepsilon)t \leq x_j s_j \leq (1+\varepsilon)t$, so $x_j \geq \frac{(1-\varepsilon)t}{s_j}$ and $s_j \geq \frac{(1-\varepsilon)t}{x_j}$

So

$(1-\varepsilon)t {\| X^{-1}\Delta x \|}^2 = (\Delta x)^T X^{-1} \, diag\left(\frac{(1-\varepsilon)t}{x_j}\right) \Delta x$
$\leq (\Delta x)^T X^{-1} S \Delta x$ (by (ii))
$= (\Delta x)^T X^{-1} (t\boldsymbol{1} - Xs - X \Delta s)$
$= (\Delta x)^T X^{-1} (t\boldsymbol{1} - Xs)$ (by (i))
$\leq \| X^{-1} \Delta x \| \| t\boldsymbol{1} - Xs \|$
$\leq \varepsilon t \| X^{-1} \Delta x \|$
$\implies \| X^{-1} \Delta x \| \leq \frac{\varepsilon}{1-\varepsilon} < 1$

Similarly, $\| S^{-1} \Delta s \| \leq \frac{\varepsilon}{1-\varepsilon}$.
So, $\hat{s} = s + \Delta s = S(\boldsymbol{1} + S^{-1}\Delta s) \geq 0$, and similarly $\hat{x} \geq 0$, so $(\hat{x}, \hat{y}, \hat{s})$ is feasible. Next,

$\hat{x}_j \hat{s}_j = (x_j + \Delta x_j)(s_j + \Delta s_j) = x_j s_j + x_j \Delta s_j + s_j \Delta x_j + \Delta x_j \Delta s_j = x_j s_j + (t - x_j s_j) + \Delta x_j \Delta s_j = t + \Delta x_j \Delta s_j$

So,

$\| t\boldsymbol{1} - \hat{X}\hat{s} \| \leq \sum_j | t - \hat{x}_j \hat{s}_j | = \sum_j |\Delta x_j \Delta s_j| = \sum_j \frac{|\Delta x_j|}{x_j} \frac{|\Delta s_j|}{s_j} x_j s_j$
$\leq (1+\varepsilon)t \sum_j \frac{|\Delta x_j|}{x_j} \frac{|\Delta s_j|}{s_j}$
$\leq (1+\varepsilon)t \left(\sum_j \left(\frac{\Delta x_j}{x_j}\right)^2\right)^{\frac{1}{2}} \left(\sum_j \left(\frac{\Delta s_j}{s_j}\right)^2\right)^{\frac{1}{2}}$
$= (1+\varepsilon)t \| X^{-1}\Delta x \| \| S^{-1}\Delta s \|$
$\leq \frac{1+\varepsilon}{(1-\varepsilon)^2}{\varepsilon}^2 t$ $\square$

Proof. (Lemma 2)
$\| \hat{X}\hat{s} - \hat{t}\boldsymbol{1} \| = \| \hat{X}\hat{s} - t\boldsymbol{1} + (1-\alpha) t\boldsymbol{1} \| \leq \| \hat{X}\hat{s} - t\boldsymbol{1} \| + (1-\alpha) t \| \boldsymbol{1} \| \leq {\varepsilon}^2 \frac{1+\varepsilon}{(1-\varepsilon)^2} t + (1-\alpha)t\sqrt{n}$

We need ${\varepsilon}^2 \frac{1+\varepsilon}{(1-\varepsilon)^2} t + (1-\alpha)t\sqrt{n} \leq \frac{1}{5} \hat{t} = \frac{\alpha}{5} t$. Let $\alpha = 1- \frac{c}{\sqrt{n}}$. Then, we need $\frac{1+\varepsilon}{(1-\varepsilon)^2} {\varepsilon}^2 + c \leq \frac{1-\frac{c}{\sqrt{n}}}{5}$. Plugging in $\varepsilon = \frac{1}{5}$, we get $\frac{\frac{6}{5}}{\frac{16}{25}} \cdot \frac{1}{25} + c = \frac{3}{40} + c \leq \frac{1}{5} - \frac{c}{5\sqrt{n}}$. Solving for $c$, any $c \leq \frac{\frac{1}{8}}{1 + \frac{1}{5\sqrt{n}}}$ works, so $c = \frac{1}{16}$ suffices. $\square$

## Ellipsoid Method

First, let's look at using the ellipsoid method to solve LP feasibility (recall that we've shown that finding the optimal solution to an LP reduces to solving feasibility). Given constraints $Ax \leq b$, we want to output $x : Ax \leq b$ or indicate that the set of constraints is infeasible.
The algorithm maintains an ellipsoid, which we can describe by

$E(z, A) = \{ x : (x-z)^T A^{-1} (x-z) \leq 1\}$

Example 1. $A = I, z = 0 \implies E(0, I) = \{x : \|x\|^2 \leq 1\}$

Example 2. $A = \begin{pmatrix} a^2 & 0 \\ 0 & b^2 \end{pmatrix}, z=(1, 1) \implies E(z, A) = \{x : \frac{(x_1 - 1)^2}{a^2} + \frac{(x_2 - 1)^2}{b^2} \leq 1\}$

Algorithm

• Initialize $E_0 = E(0, R^2 I)$ for sufficiently large $R$
• Repeat:
– check if the center $z$ is feasible
– if not, take a violated constraint $a \cdot x \leq a \cdot z$
– compute the minimum volume ellipsoid $E_{t+1}$ containing $E_t \cap \{x : a \cdot x \leq a \cdot z\}$

Lemma 1. $bits(R) \leq poly(n, bits(A), bits(b))$

Lemma 2. $Vol(E_{t+1}) \leq e^{\frac{-1}{2n+2}} Vol(E_t)$

Lemma 3. Given $E(z, A)$, the minimum volume ellipsoid containing $E(z, A) \cap \{x : a \cdot x \leq a \cdot z\}$ is $E'(z', A')$ where

$z' = z - \frac{1}{n+1} \frac{Aa}{\sqrt{a^T Aa}}$

$A' = \frac{n^2}{n^2 - 1} (A - \frac{2}{n+1} \frac{Aaa^T A}{a^T Aa})$

For example, for $A = I, z = 0, a = e_1 = (1, 0, \dots, 0)^T$ this is

$z' = -\frac{1}{n+1} (1, 0, \dots, 0)^T$

$A' = \frac{n^2}{n^2 - 1} (I - \frac{2}{n+1} e_1 e_1^T) = \frac{n^2}{n^2 - 1}\, diag\left(\frac{n-1}{n+1}, 1, \dots, 1\right)$

Theorem 1. For a linear program, the ellipsoid algorithm either finds a feasible $x$ or lets us pass to a lower-dimensional subspace after a polynomial number of iterations.

Proof. (Lemmas 2 and 3) Since $E(z, A) = z + A^{\frac{1}{2}}E(0, I)$ and an affine map scales all volumes by the same factor, it suffices to consider the unit ball and the smallest ellipsoid containing one of its half-balls. Without loss of generality, assume $a = e_1 = (1, 0, \dots, 0)^T$. Then, by symmetry, the center of $E'$ moves only along $e_1$, say to $-z\boldsymbol{e_1}$ for some $z \geq 0$, and $E'$ has axis lengths $(1-z), r, \dots, r$ for some $r$. By minimality, $(0, 1, 0, \dots, 0)$ is on the boundary of $E'$.
So, writing only the two relevant coordinates,

$\left( \begin{pmatrix} 0 \\ 1 \end{pmatrix} - \begin{pmatrix} -z \\ 0 \end{pmatrix} \right)^T \begin{pmatrix} \frac{1}{(1-z)^2} & 0 \\ 0 & \frac{1}{r^2} \end{pmatrix} \left( \begin{pmatrix} 0 \\ 1 \end{pmatrix} - \begin{pmatrix} -z \\ 0 \end{pmatrix} \right) = 1$

or

$\frac{z^2}{(1-z)^2} + \frac{1}{r^2} = 1 \implies r = \frac{1-z}{\sqrt{1-2z}}$

The volume of $E'$ is $Vol(B_n) \cdot \prod_i axislength_i = Vol(B_n) \cdot (1-z) \cdot r^{n-1} = Vol(B_n) \cdot \frac{(1-z)^n}{(1-2z)^{\frac{n-1}{2}}}$

Minimizing $f(z) = \frac{(1-z)^n}{(1-2z)^{\frac{n-1}{2}}}$ we get $z = \frac{1}{n+1}$, so the axis lengths are $1 - z = \frac{n}{n+1}$ and $r = \frac{\frac{n}{n+1}}{\sqrt{\frac{n-1}{n+1}}} = \frac{n}{\sqrt{n^2 - 1}}$, and the volume ratio is

$\frac{Vol(E')}{Vol(B_n)} = \frac{n^n}{(n+1)(n^2 - 1)^{\frac{n-1}{2}}} = \left(\frac{n}{n+1}\right)\left(\frac{n^2}{n^2 - 1}\right)^{\frac{n-1}{2}} = \left(1 - \frac{1}{n+1}\right)\left(1 + \frac{1}{n^2 - 1}\right)^{\frac{n-1}{2}}$

Using the bound $1+x \leq e^x$, this is at most $e^{-\frac{1}{n+1}} \cdot e^{\frac{1}{2(n+1)}} = e^{-\frac{1}{2n+2}}$

To prove the theorem, we observe in addition that if $P$ is feasible and bounded, it contains $n+1$ affinely independent vertices $x^0, \dots, x^n$. The simplex formed by them has volume

$V = \frac{1}{n!} \left| \det \begin{pmatrix} 1 & 1 & \dots & 1 \\ x^0 & x^1 & \dots & x^n \end{pmatrix} \right|$

By Cramer's rule, each coordinate $x_i^j = \frac{\det(C_i^j)}{\det(C^j)}$, where $x^j$ is the solution of $C^j x = d^j$ for some subset of inequalities.
$V = \frac{1}{n!} \left| \det \begin{pmatrix} 1 & \dots & 1 \\ \frac{\det(C_i^0)}{\det(C^0)} & \dots & \frac{\det(C_i^n)}{\det(C^n)} \end{pmatrix} \right| = \frac{1}{n!} \left| \det \begin{pmatrix} \det(C^0) & \dots & \det(C^n) \\ \det(C_1^0) & \dots & \det(C_1^n) \\ \vdots & & \vdots \\ \det(C_n^0) & \dots & \det(C_n^n) \end{pmatrix} \right| \cdot \prod_j \frac{1}{|\det(C^j)|}$

The first matrix has integer entries, and its determinant is nonzero (since $V > 0$), so its absolute value is at least 1. Therefore

$V \geq \frac{1}{n!} \prod_j \frac{1}{|\det(C^j)|} \geq 2^{-poly(n, bits(A, b))}$ $\square$

## Proving Strong Duality with Simplex

Consider the primal program $P$

$\max c^T x$
$Ax \leq b$
$x \geq 0$

and the dual $D$

$\min b^T y$
$A^T y \geq c$
$y \geq 0$

where $\max c^T x \leq \min b^T y$. We will once again prove strong duality, but this time using the simplex algorithm. Let $z^* = c^T x^*$ be an optimal, bounded primal value. Add slack variables $x_{n+1}, \dots, x_{n+m}$ such that

$\max z = \sum_{i=1}^{n} c_i x_i$
$\sum_{j=1}^{n} a_{ij}x_{j} + x_{n+i} = b_i \quad (i = 1, \dots, m)$
$x_1, \dots, x_{n+m} \geq 0$

So, at the end of our simplex algorithm, we reach

$z = z^* + \sum_{j=1}^{n+m} \bar{c}_j x_j$, with $\bar{c}_j \leq 0$ for all $j$.

Let $y^*_i = -\bar{c}_{n+i} \geq 0$. We will show that

(1) $z^* = b^T y^*$
(2) $A^T y^* \geq c$

Proof. We will first prove (1).

$z = z^* + \sum_{j=1}^n \bar{c}_j x_j - \sum_{i=1}^m y^*_i x_{n+i}$
$= z^* + \sum_{j=1}^n \bar{c}_j x_j - \sum_{i=1}^m y^*_i b_i + \sum_{i=1}^m y^*_i (\sum_j a_{ij}x_j)$
$= z^* - \sum_{i=1}^m y^*_i b_i + \sum_{j=1}^n (\bar{c}_j + \sum_{i=1}^m a_{ij}y^*_i)x_j$

holds for all $x$. Setting $x = 0$ gives $z = 0$, so $z^* = b^T y^*$.

Now we show (2). Using (1), we are left with

$z = \sum_{j=1}^n (\bar{c}_j + \sum_{i=1}^m a_{ij}y^*_i)x_j = \sum_{j=1}^n c_j x_j$

which holds for all $x$.
So,

$\bar{c}_j + \sum_{i=1}^m a_{ij}y^*_i = c_j$ with $\bar{c}_j \leq 0 \implies A^T y^* \geq c$

At the optimal solution we get $z^* = z^* + \sum_{j=1}^n \bar{c}_j x^*_j - \sum_{i=1}^m y^*_i x_{n+i}^*$, and since the second and third terms are each sums of nonpositive terms summing to $0$, each is $0$:

$x^*_j > 0 \implies \bar{c}_j = 0$
$y^*_i > 0 \implies x^*_{n+i} = b_i - \sum_{j=1}^n a_{ij}x^*_j = 0$

This is what is referred to as "complementary slackness".

## Strong Duality

Last time we showed that $\max \{\boldsymbol{c^T}\boldsymbol{x} : A\boldsymbol{x}\leq\boldsymbol{b}, \boldsymbol{x}\geq 0\} \leq \min \{\boldsymbol{b^T}\boldsymbol{y} : A^{T}\boldsymbol{y}\geq\boldsymbol{c}, \boldsymbol{y}\geq 0\}$, also known as the Weak Duality Theorem.

Let's now take a look at other standard forms for LPs. Last time we worked mostly with

$\max \boldsymbol{c^{T}x}$
$s.t.$ $A\boldsymbol{x} \leq \boldsymbol{b}$
$\boldsymbol{x} \geq 0$

We have already seen that we can switch the objective between minimizing and maximizing a function by negating the original objective. Now we want to show that the following two forms are equally expressive,

$\max \boldsymbol{c^{T}x}$
$s.t.$ $A\boldsymbol{x} = \boldsymbol{b}$
$\boldsymbol{x} \geq 0$

$\max \boldsymbol{c^{T}x}$
$s.t.$ $A\boldsymbol{x} \leq \boldsymbol{b}$

We can transform an inequality constraint of the form $\boldsymbol{a}\cdot\boldsymbol{x} \leq b$ into an equality constraint by adding a nonnegative slack variable such that $\boldsymbol{a} \cdot \boldsymbol{x} + s = b,\ s \geq 0$. An equality constraint $\boldsymbol{a}\cdot\boldsymbol{x} = b$ can easily be transformed into inequality constraints, $\boldsymbol{a}\cdot\boldsymbol{x} \leq b$ and $-\boldsymbol{a}\cdot\boldsymbol{x} \leq -b$. A free variable can be expressed with nonnegative ones as follows: for each original variable $x$, we introduce two variables $x_1, x_2 \geq 0$ such that $x = x_1 - x_2$.
So constraint $\boldsymbol{a}\cdot\boldsymbol{x} \leq b$ would be replaced with $\boldsymbol{a}\cdot(\boldsymbol{x_1}-\boldsymbol{x_2}) \leq b$ and nonnegativity constraints $\boldsymbol{x_1}, \boldsymbol{x_2} \geq 0$.

Let's now look at relationships between a primal LP that we will call $(P)$ and its dual $(D)$.

Theorem 1. (Strong Duality) Given a primal LP $(P)$ and its dual $(D)$, one of the following four possibilities must hold:

1. $(P)$ Infeasible; $(D)$ Unbounded
2. $(P)$ Unbounded; $(D)$ Infeasible
3. $(P)$ Infeasible; $(D)$ Infeasible
4. $(P)$ Bounded & Feasible; $(D)$ Bounded & Feasible with equal optimal values

Proof. We can rule out the following three possibilities using Weak Duality:

• $(P)$ Unbounded; $(D)$ Unbounded
• $(P)$ Unbounded; $(D)$ Bounded & Feasible
• $(P)$ Bounded & Feasible; $(D)$ Unbounded

Let's turn our attention to ruling out

• $(P)$ Bounded & Feasible; $(D)$ Infeasible

and by duality,

• $(P)$ Infeasible; $(D)$ Bounded & Feasible

We will need the following lemma.

Lemma 1. (Farkas' Lemma) For any $A \in \mathbb{R}^{m \times n},\ \boldsymbol{b} \in \mathbb{R}^m$, exactly one of the following holds:

(1) $\exists \boldsymbol{x} : \boldsymbol{x}\geq 0,\ A\boldsymbol{x} = \boldsymbol{b}$
(2) $\exists \boldsymbol{y} : A^{T}\boldsymbol{y} \geq 0,\ \boldsymbol{b^T}\boldsymbol{y} < 0$

Proof. (1) says that $\boldsymbol{b} \in C = cone\{\boldsymbol{a^{(1)}}, \dots, \boldsymbol{a^{(n)}}\}$, the cone generated by the columns of $A$. If $\boldsymbol{b}$ is not in $C$, then there exists a hyperplane that separates $\boldsymbol{b}$ from $C$. We can take this hyperplane to pass through $\boldsymbol{0}$, with normal $\boldsymbol{y}$. (2) then follows. $\square$

We now use Farkas' Lemma to prove Strong Duality. Let primal $(P)$ be the following feasible and bounded LP,

$\max \boldsymbol{c^{T}x}$
$s.t.$ $A\boldsymbol{x} = \boldsymbol{b}$
$\boldsymbol{x} \geq 0$

and the dual $(D)$ is

$\min \boldsymbol{b^{T}y}$
$s.t.$ $A^{T}\boldsymbol{y} \geq \boldsymbol{c}$

and let $z > OPT(P)$.
Now consider the matrix $\begin{pmatrix} A \\ \boldsymbol{c^T} \end{pmatrix}$ and vector $\begin{pmatrix} \boldsymbol{b} \\ z \end{pmatrix}$. Apply Farkas' Lemma, so either

(1) $\exists \boldsymbol{x}\geq 0 : A\boldsymbol{x} = \boldsymbol{b},\ \boldsymbol{c^{T}x} = z$ — this cannot hold, since every feasible $\boldsymbol{x}$ has $\boldsymbol{c^T}\boldsymbol{x} \leq OPT(P) < z$; or

(2) $\exists (\boldsymbol{y}, \alpha) : A^T\boldsymbol{y}+\alpha {\bf c}\ge 0,\ \boldsymbol{b^{T}y} + \alpha z <0$.

Multiplying the first inequality with ${\bf x}^T$ for a feasible primal ${\bf x} \geq 0$, and using $Ax=b$, we get ${\bf b}^Ty + \alpha {\bf c}^T x \ge 0$. Subtracting ${\bf b}^Ty + \alpha z < 0$ gives $\alpha({\bf c}^Tx - z) > 0$, and since ${\bf c}^T x < z$, we conclude that $\alpha < 0$. Hence,

$\boldsymbol{b^T}(\frac{-\boldsymbol{y}}{\alpha}) < z$
$A^{T}(\frac{-\boldsymbol{y}}{\alpha}) \geq \boldsymbol{c}$

which gives us a feasible dual solution $(-{\bf y}/\alpha)$! Moreover, this gives us $OPT(P) \leq \boldsymbol{b^T}(\frac{-\boldsymbol{y}}{\alpha}) < z$. Since $z$ is an arbitrary value greater than $OPT(P)$, the dual is feasible with $OPT(P) = OPT(D)$. $\square$

## Duality

We will now look at a problem that is more general than solving linear systems of the form $A\boldsymbol{x} = \boldsymbol{b}$. Linear programs are of the form:

$\min \boldsymbol{c^T}\boldsymbol{x}$
$s.t.$ $A\boldsymbol{x} \geq \boldsymbol{b}$

Note that linear programs have a linear objective ($\boldsymbol{c^T}\boldsymbol{x}$) and linear constraints ($A\boldsymbol{x} \geq \boldsymbol{b}$). We can easily see how to reduce solving a linear system $A\boldsymbol{x} = \boldsymbol{b}$ to a linear program, which would look like

$\min \boldsymbol{0}$
$s.t.$ $A\boldsymbol{x} \geq \boldsymbol{b}$
$A\boldsymbol{x} \leq \boldsymbol{b}$

Let's take a look at a few examples of linear programs.

Example 1. (Max Flow) Let $G(V, E)$ be a graph with directed edges and capacities $c_{ij} \geq 0$. Let vertices $s, t \in V$ be the source and the sink respectively.
Our goal is to find flows $f_{ij} \geq 0$ such that $f_{ij} \leq c_{ij}$ and $\sum_{i}f_{ij} = \sum_{l}f_{jl}$ for all $j \neq s,t$ (flow conservation). We want to maximize our net flow, which is the total flow out of the source $s$, so our linear program is

$\max \sum_{j}f_{sj}$
$s.t.$ $\sum_{i}f_{ij} = \sum_{l}f_{jl}$ $\forall j \in V \setminus \{s,t\}$
$f_{ij} \geq 0$ $\forall i,j \in V$
$f_{ij} \leq c_{ij}$ $\forall i,j \in V$

Example 2. (Low-cost Diet) We need 180 units of protein, 100 units of carbs, and 20 units of fat. We can get sushi for 9 units of cost, tomato soup for 2 units of cost, and chicken massaman for 6 units of cost. A single unit of sushi gives 15 protein, 20 carbs, and 5 fat; tomato soup gives 5 protein, 30 carbs, 15 fat; chicken massaman gives 40 protein, 10 carbs, 8 fat. What is the minimum cost satisfactory diet?

Let's assign the following variables:

$x =$ units of sushi
$y =$ units of tomato soup
$z =$ units of chicken massaman

Then, we get the following linear program,

$\min 9x + 2y + 6z$
$s.t.$
$(1)\ \ \ \ 15x + 5y + 40z \geq 180$
$(2)\ \ \ \ 20x + 30y + 10z \geq 100$
$(3)\ \ \ \ 5x + 15y + 8z \geq 20$
$x, y, z \geq 0$

Let's note the costs of some solutions to the above program.

$x = 12,\ y = z = 0 \implies cost = 108$
$x = 2,\ y = 1,\ z = 4 \implies cost = 44$

Is there a better solution? If not, how can we show that there is no better solution? Let's work towards a lower bound for $OPT$, the value of the optimal solution to the above program. We will first try taking a multiple of one of our constraints,

$\frac{1}{7}(1) = \frac{15}{7}x + \frac{5}{7}y + \frac{40}{7}z \geq \frac{180}{7}$

Since each cost coefficient dominates the corresponding coefficient on the left, and $x, y, z \geq 0$,

$9x + 2y + 6z \geq \frac{180}{7} = 25\tfrac{5}{7}$

So, $25\frac{5}{7} \leq OPT \leq 44$. Can we tighten this lower bound? We will explore this question on another example.
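The bounding trick can be checked mechanically: any nonnegative multipliers whose combined constraint is dominated by the cost vector certify a lower bound. A small sketch (the second multiplier vector is hand-picked for illustration, not claimed optimal):

```python
# Certify lower bounds for the diet LP by nonnegative combinations of
# its constraints.  Multipliers below are hand-picked (illustrative only).
c = [9, 2, 6]                                    # costs: sushi, soup, massaman
rows = [[15, 5, 40], [20, 30, 10], [5, 15, 8]]   # protein, carbs, fat per unit
rhs = [180, 100, 20]                             # required amounts

def lower_bound(y):
    """If y >= 0 and sum_i y[i]*rows[i] <= c componentwise, then every
    feasible diet costs at least sum_i y[i]*rhs[i]."""
    assert all(yi >= 0 for yi in y)
    combo = [sum(y[i] * rows[i][j] for i in range(3)) for j in range(3)]
    assert all(combo[j] <= c[j] + 1e-9 for j in range(3)), "not dominated"
    return sum(y[i] * rhs[i] for i in range(3))

print(lower_bound([1 / 7, 0, 0]))   # the 180/7 bound derived above
print(lower_bound([0.15, 0, 0]))    # a slightly larger multiplier: bound 27.0
```

Finding the best such multipliers is itself a linear program: the dual, which is exactly where the next example is headed.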
Consider the following LP,

$\max 4x_1 + x_2 + 5x_3 + 3x_4$
$s.t.$
$(1)\ \ \ \ x_1 - x_2 - x_3 + 3x_4 \leq 1$
$(2)\ \ \ \ 5x_1 + x_2 + 3x_3 + 8x_4 \leq 55$
$(3)\ \ \ \ -x_1 + 2x_2 + 3x_3 - 5x_4 \leq 3$
$x_1, x_2, x_3, x_4 \geq 0$

Let $z^{*}$ be the value of the optimal solution. Again, let's try some solutions and note the values

$\boldsymbol{x} = (0, 0, 1, 0) \implies z^* \geq 5$
$\boldsymbol{x} = (2, 1, 1, \frac{1}{3}) \implies z^* \geq 15$

How can we tell whether this solution is optimal? Multiplying constraint (2) by $\frac{5}{3}$ gives

$\frac{5}{3}(2) = \frac{25}{3}x_1 + \frac{5}{3}x_2 + 5x_3 + \frac{40}{3}x_4 \leq \frac{275}{3}$

and since each objective coefficient is at most the corresponding coefficient on the left (and $x \geq 0$),

$4x_1 + x_2 + 5x_3 + 3x_4 \leq \frac{275}{3} \implies z^* \leq \frac{275}{3}$

We can try another linear combination of constraints

$(2) + (3) = 4x_1 + 3x_2 + 6x_3 + 3x_4 \leq 58 \implies z^* \leq 58$

What is the best possible upper bound? We would like to choose multipliers $y_1, y_2, y_3 \geq 0$ for constraints $(1), (2), (3)$. Then, we get

$(y_1 + 5y_2 - y_3)x_1 + (-y_1 + y_2 + 2y_3)x_2 + (-y_1 + 3y_2 + 3y_3)x_3 + (3y_1 + 8y_2 - 5y_3)x_4 \leq y_1 + 55y_2 + 3y_3$

For the combination to dominate the objective at minimum total cost, we'd like

$\min y_1 + 55y_2 + 3y_3$
$s.t.$
$y_1 + 5y_2 - y_3 \geq 4$
$-y_1 + y_2 + 2y_3 \geq 1$
$-y_1 + 3y_2 + 3y_3 \geq 5$
$3y_1 + 8y_2 - 5y_3 \geq 3$
$y_1, y_2, y_3 \geq 0$

Any feasible $\boldsymbol{y}$ gives a valid upper bound on $z^*$. More generally, the solution to the linear program

$\min \boldsymbol{b^T}\boldsymbol{y}$
$s.t.$ $A^T \boldsymbol{y} \geq \boldsymbol{c}$, $\boldsymbol{y} \geq 0$

gives an upper bound for the linear program

$\max \boldsymbol{c^T} \boldsymbol{x}$
$s.t.$ $A\boldsymbol{x} \leq \boldsymbol{b}$, $\boldsymbol{x} \geq 0$

or more concisely,

Theorem 1.
(Weak Duality) $\max\{\boldsymbol{c^T}\boldsymbol{x}:A\boldsymbol{x} \leq \boldsymbol{b}, \boldsymbol{x} \geq 0\} \leq \min\{\boldsymbol{b^T}\boldsymbol{y}:A^T \boldsymbol{y} \geq \boldsymbol{c}, \boldsymbol{y} \geq 0\}$

## Power Iteration

Recall the singular value decomposition (SVD) of a matrix $A = UDV^{T} = \sum_{i=1}^{r}\sigma_i \boldsymbol{u^{(i)}}\boldsymbol{v^{(i)^{T}}}$. We showed that $V_{k} = span\{\boldsymbol{v^{(1)}}, ..., \boldsymbol{v^{(k)}}\}$ is the best-fit subspace for $A$, so

$V_{k} = \arg\min_{V: dim(V) \leq k}\sum_{i=1}^{m}d(A_{(i)}, V)^2$

We have that $A_{(i)}V_{k}V_{k}^{T}$ is the projection of $A_{(i)}$ to $V_{k}$, and so $AV_{k}V_{k}^{T}$ is the projection of $A$ to $V_{k}$. Let $A_k$ be defined as $AV_{k}V_{k}^{T} = \sum_{i=1}^{k}\sigma_i \boldsymbol{u^{(i)}}\boldsymbol{v^{(i)^{T}}}$.

Theorem 1. $\|A-A_{k}\|_{F} = \min_{B: rank(B) \leq k}\|A-B\|_{F}$ AND $\|A-A_{k}\|_{2} = \max_{\|\boldsymbol{v}\| = 1}\|(A-A_{k}) \cdot \boldsymbol{v}\| = \sigma_{k+1} = \min_{B: rank(B) \leq k}\|A-B\|_{2}$

Proof. We proved part (1) last lecture. We will prove part (2) here. Note that $A-A_{k} = \sum_{i=k+1}^{r}\sigma_i\boldsymbol{u^{(i)}}\boldsymbol{v^{(i)^{T}}}$ and $\|A-A_{k}\|_{2} = \sigma_{k+1}$. We need to show that for any $B$ with $rank(B) \leq k$, $\|A-B\|_{2} \geq \sigma_{k+1}$. Any such $B$ can be written as $B=WZ^{T}$ where $W, Z$ have $\leq k$ columns. Since $Z$ has only $k$ columns, $\exists \boldsymbol{v} \in span\{\boldsymbol{v^{(1)}},...,\boldsymbol{v^{(k+1)}}\}$ such that $Z^{T}\boldsymbol{v} = 0$, $\boldsymbol{v} = \sum_{i=1}^{k+1}\alpha_{i}\boldsymbol{v^{(i)}}$, and $\|\boldsymbol{v}\|=1$. Then

$(A-B) \cdot \boldsymbol{v} = A \cdot \boldsymbol{v} = \sum_{i=1}^{k+1}\sigma_{i}\alpha_{i}\boldsymbol{u^{(i)}}$
$\|(A-B) \cdot \boldsymbol{v}\|^{2} = \sum_{i=1}^{k+1}\sigma_{i}^{2}\alpha_{i}^{2} \geq \sigma_{k+1}^{2}$. $\square$

We've seen that the SVD is useful in more than one way. Now we will look at how to compute it, a central question in numerical analysis.
A simple method, power iteration, is presented here:

For each $i=1,2,\ldots,n$:
  Set $\boldsymbol{x} \leftarrow e_i$
  Repeat $t$ times:
    $\boldsymbol{x} \leftarrow A^{T}A\boldsymbol{x}$
    $\boldsymbol{x} \leftarrow \frac{\boldsymbol{x}}{\|\boldsymbol{x}\|}$

Theorem 2. Assume $(x\cdot v^{(1)})^2 \ge \frac{1}{n}$ for the initial $x$. After $t = \frac{\log n}{2\varepsilon}$ iterations, $\boldsymbol{x}$ satisfies $\|A\boldsymbol{x}\| \geq (1-\varepsilon) \sigma_{1}$.

Proof. Write $\boldsymbol{x} = \sum_{i=1}^{r}\alpha_{i}\boldsymbol{v^{(i)}}$. Then

$A^{T}A = \sum_{i=1}^{r}\sigma_{i}^{2}\boldsymbol{v^{(i)}}\boldsymbol{v^{(i)^{T}}}$
$(A^{T}A)^l = \sum_{i=1}^{r}\sigma_{i}^{2l}\boldsymbol{v^{(i)}}\boldsymbol{v^{(i)^{T}}}$

Let $\boldsymbol{x^{l}} = (A^{T}A)^{l}\boldsymbol{x}$ be the (unnormalized) iterate after $l$ steps. Then

$\frac{\boldsymbol{x}^{T}(A^{T}A)^{l}A^{T}A(A^{T}A)^{l}\boldsymbol{x}}{\boldsymbol{x}^{T}(A^{T}A)^{l}(A^{T}A)^{l}\boldsymbol{x}} = \frac{\sum_{i=1}^{r}\sigma_{i}^{4l+2}\alpha_{i}^{2}}{\sum_{i=1}^{r}\sigma_{i}^{4l}\alpha_{i}^{2}}$

Since $\sigma_{1} \geq \sigma_{2} \geq ...$, the weight on larger $\sigma_{i}$ increases with $l$. By Hölder's inequality,

$\sum \sigma_{i}^{4l}\alpha_{i}^{2} = \sum \sigma_{i}^{4l}(\alpha_{i}^{2})^{\frac{2l}{2l+1}}(\alpha_{i}^{2})^{\frac{1}{2l+1}} \leq (\sum \sigma_{i}^{4l+2}\alpha_{i}^{2})^{\frac{2l}{2l+1}}(\sum \alpha_{i}^{2})^{\frac{1}{2l+1}} = (\sum \sigma_{i}^{4l+2}\alpha_{i}^{2})^{\frac{2l}{2l+1}}$

$\therefore \frac{\boldsymbol{x^{l^{T}}}A^{T}A\boldsymbol{x^{l}}}{\boldsymbol{x^{l^{T}}}\boldsymbol{x^{l}}} \geq (\sum \sigma_{i}^{4l+2}\alpha_{i}^{2})^{\frac{1}{2l+1}} \geq \alpha_{1}^{\frac{2}{2l+1}}\sigma_{1}^{2}$

Therefore, after $l$ iterations,

$\frac{\|A\boldsymbol{x^{l}}\|^{2}}{\|\boldsymbol{x^{l}}\|^{2}} \geq (\frac{1}{n})^{\frac{1}{2l+1}}\sigma_{1}^{2} = e^{-\frac{\ln n}{2l+1}}\sigma_{1}^{2}$

Setting $2l + 1 = \frac{\ln n}{\varepsilon}$ gives $\frac{\|A\boldsymbol{x^{l}}\|^{2}}{\|\boldsymbol{x^{l}}\|^{2}} \geq e^{-\varepsilon}\sigma_{1}^{2} \geq (1-\varepsilon)\sigma_{1}^{2}.$ $\square$

We note that the initial $x$ must satisfy the assumption $(x \cdot v^{(1)})^2 \ge \frac{1}{n}$; since $\sum_{i=1}^{n}(e_i \cdot v^{(1)})^2 = \|v^{(1)}\|^2 = 1$, at least one of the axis vectors $e_i$ does.
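A minimal numpy sketch of the iteration (the matrix is a random toy example; the starting vector $e_1$ plays the role of one of the axis vectors in the algorithm):

```python
import numpy as np

# Power iteration on a small random matrix (toy data for illustration).
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
sigma1 = np.linalg.svd(A, compute_uv=False)[0]   # true top singular value

x = np.eye(5)[0]                 # start at the axis vector e_1
for _ in range(100):             # repeat: x <- A^T A x, then normalize
    x = A.T @ (A @ x)
    x /= np.linalg.norm(x)

print(np.linalg.norm(A @ x) / sigma1)   # ratio approaches 1
```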
Instead of trying all $n$ basis vectors, we could also start with a random unit vector $x$ with coordinates $x_1, \dots, x_n$.

Claim. $\mathbb{E}(\alpha_{1}^{2}) = \frac{1}{n}$

Proof. The constraint $\sum x_{i}^{2} = 1$ is symmetric in the coordinates, so $\mathbb{E}(x_{1}^{2}) = \mathbb{E}(x_{2}^{2}) = ... = \mathbb{E}(x_{n}^{2})$, which implies $n\mathbb{E}(x_{1}^{2}) = 1$, i.e. $\mathbb{E}(x_{1}^{2}) = \frac{1}{n}$. Since the distribution of $x$ is rotation-invariant, the same holds for $\alpha_1 = x \cdot v^{(1)}$. $\square$

Moreover, $\alpha_1^2 \ge \frac{1}{n}$ with some constant probability. So a few runs of the algorithm will suffice with high probability.

## Singular Value Decomposition

Last time we saw that $\min \|Ax-b\|$ is achieved by $x = A^{+}b$ where $A^{+} = VD^{-1}U^T$ depended on the singular value decomposition of $A$. This time we will explore the singular value decomposition further.

Theorem 1. Any $m \times n$ real matrix $A$ can be written as $A = UDV^T$ where $U,V$ have orthonormal columns and $D$ is a diagonal matrix with entries $\sigma_1 \geq \dots \geq \sigma_r > 0$ where $r = rank(A)$.

Note.

$AV = UDV^{T}V = UD$
$A^{T}U = VD$

or equivalently,

$A\boldsymbol{v^{(i)}} = \sigma_i \boldsymbol{u^{(i)}}$
$A^{T}\boldsymbol{u^{(i)}} = \sigma_i \boldsymbol{v^{(i)}}$

Here $\boldsymbol{u^{(i)}}$ is a left singular vector and $\boldsymbol{v^{(i)}}$ is a right singular vector.

Lemma 1. $\boldsymbol{v^{(1)}} = \arg\max_{\|\boldsymbol{v}\|=1}\|A\boldsymbol{v}\|^2$

Proof. (Linear Algebra) $A^{T}A\boldsymbol{v} = \sigma A^{T}\boldsymbol{u} = \sigma^2 \boldsymbol{v}$, so each $\boldsymbol{v^{(i)}}$ is an eigenvector of $A^{T}A$, and each $\boldsymbol{u^{(i)}}$ is an eigenvector of $AA^{T}$. Conversely, for any eigenvector $\boldsymbol{v}$ of $A^{T}A$ with $A^{T}A\boldsymbol{v} = \lambda\boldsymbol{v}$, define $\boldsymbol{u} = \frac{1}{\sqrt{\lambda}}A\boldsymbol{v}$. Then, $A\boldsymbol{v} = \sqrt{\lambda}\boldsymbol{u}$ and $A^{T}\boldsymbol{u} = \sqrt{\lambda}\boldsymbol{v}$. So $\arg\max_{\|\boldsymbol{v}\|\leq 1}\|A\boldsymbol{v}\|^2$ is the top eigenvector of $A^{T}A$, and hence, the top right singular vector of $A$. $\square$

Proof.
(Calculus) Consider the objective function $f(\boldsymbol{v}) = \|A\boldsymbol{v}\|^2 = \boldsymbol{v^{T}}A^{T}A\boldsymbol{v}$. Then, $\nabla f(\boldsymbol{v}) = 2A^{T}A\boldsymbol{v}$. Intuitively, if $\boldsymbol{v}$ ranges over the unit sphere $\|\boldsymbol{v}\| = 1$, then at any local optimum $\boldsymbol{v}$, $\nabla f(\boldsymbol{v})$ is parallel to $\boldsymbol{v}$; the converse also holds. This relationship is represented by $2A^{T}A\boldsymbol{v} = \lambda \boldsymbol{v}$ for some $\lambda$. By the above, all local optima of $\|A\boldsymbol{v}\|^2$ on the sphere are eigenvectors of $A^{T}A$. So, $\max \|A\boldsymbol{v}\|^2$ is achieved by an eigenvector of $A^{T}A$, i.e., a right singular vector of $A$. $\square$

Can we reason similarly about $\boldsymbol{v^{(i)}}, \boldsymbol{u^{(i)}}\ \forall\ i=2 \dots r$?

Lemma 2. Define

$\boldsymbol{v^{(1)}} = \arg\max_{\|\boldsymbol{v}\|\leq 1}\|A\boldsymbol{v}\|^2$
$\boldsymbol{v^{(2)}} = \arg\max_{\|\boldsymbol{v}\|\leq 1 \atop \boldsymbol{v} \bot \boldsymbol{v^{(1)}}}\|A\boldsymbol{v}\|^2$
$\vdots$
$\boldsymbol{v^{(k)}} = \arg\max_{\|\boldsymbol{v}\|\leq 1 \atop \boldsymbol{v} \bot \boldsymbol{v^{(1)}} ... \boldsymbol{v^{(k-1)}}}\|A\boldsymbol{v}\|^2$

Then, $\boldsymbol{v^{(i)}}\ \forall\ i=2 \dots r$ are right singular vectors of $A$.

We can view the rows $A_{(i)}$ of $A$ as points in $\mathbb{R}^n$. Then, $\boldsymbol{v^{(1)}}$ spans the best-fit line (through the origin), or the best-fit one-dimensional subspace. This is represented by,

$\boldsymbol{v^{(1)}} = \arg\min_{\|v\|=1}\sum_{i=1}^{m}d(A_{(i)},\boldsymbol{v})^2$

Proof. By the Pythagorean Theorem,

$\sum_{i}\|A_{(i)}\|^2 = \sum_{i}d(A_{(i)},\boldsymbol{v})^2 + \sum_{i}\|A_{(i)} \cdot \boldsymbol{v}\|^2$

Since the left-hand side is fixed, minimizing $\sum_{i}d(A_{(i)},\boldsymbol{v})^2$ is equivalent to maximizing $\sum_{i}\|A_{(i)} \cdot \boldsymbol{v}\|^2 = \|A\boldsymbol{v}\|^2$. $\square$

Theorem 2. $V_k = span\{\boldsymbol{v^{(1)}} ...
\boldsymbol{v^{(k)}}\}$. Then, $V_k$ is the best-fit k-dimensional subspace for $A$,

$V_k = \arg\min_{V:dim(V) \leq k}\sum_{i}d(A_{(i)}, V)^2$

Proof. We will do an induction on $k$.

$k = 1$: Lemma 1.

Suppose Theorem 2 is true for $(k-1)$-dimensional subspaces, so $V_{k-1}$ is the best-fit $(k-1)$-dimensional subspace. Let $W$ be the best k-dimensional subspace. We know that $\exists \boldsymbol{w_k} \in W$ such that $\boldsymbol{w_k} \bot V_{k-1}$, since the projection of $W$ onto $V_{k-1}$ has dimension at most $k-1$. So, $W = span\{\boldsymbol{w_k}, W_{k-1}\}$ for some $(k-1)$-dimensional subspace $W_{k-1}$ of $W$ orthogonal to $\boldsymbol{w_k}$, and

$d(A, W)^2 = d(A, \boldsymbol{w_k})^2 + d(A, W_{k-1})^2$
$\geq d(A, \boldsymbol{w_k})^2 + d(A, V_{k-1})^2$ (since $V_{k-1}$ is the best $(k-1)$-dimensional subspace, by the inductive hypothesis)
$\geq d(A, \boldsymbol{v^{(k)}})^2 + d(A, V_{k-1})^2$ (since $\boldsymbol{v^{(k)}}$ is the best among vectors orthogonal to $V_{k-1}$)
$= d(A, V_k)^2$. $\square$

Proof of Theorem 1. We want to show that $A = \sum_{i=1}^{r}\sigma_{i} \boldsymbol{u^{(i)}}\boldsymbol{v^{(i)}}^T$. The LHS and RHS are both matrices, so it suffices to compare their action on every vector. Since $LHS \cdot \boldsymbol{v^{(i)}} = A\boldsymbol{v^{(i)}} = \sigma_{i}\boldsymbol{u^{(i)}}$ and $RHS \cdot \boldsymbol{v^{(i)}} = \sum_{j=1}^{r}\sigma_{j}\boldsymbol{u^{(j)}}\boldsymbol{v^{(j)^T}}\boldsymbol{v^{(i)}} = \sigma_{i}\boldsymbol{u^{(i)}}$ by orthonormality, and both sides vanish on vectors orthogonal to $\boldsymbol{v^{(1)}}, \dots, \boldsymbol{v^{(r)}}$, we get that $A\boldsymbol{v} = (\sum_{i=1}^{r}\sigma_{i}\boldsymbol{u^{(i)}}\boldsymbol{v^{(i)^T}})\boldsymbol{v}$ for every $\boldsymbol{v} \in \mathbb{R}^n \implies A = \sum_{i=1}^{r}\sigma_{i}\boldsymbol{u^{(i)}}\boldsymbol{v^{(i)^T}}$

## Rational Approximation

Given a real number $r$, or a rational with a large numerator and denominator, how can we approximate it by a rational with small integers? Let's recall the Euclidean algorithm to find $\gcd(a,b)$ via the following example:

$\gcd(209, 171)$: $209 = 171 \cdot 1 + 38$
$\gcd(171, 38)$: $171 = 38 \cdot 4 + 19$
$\gcd(38, 19) = 19$

In general, for $a = b \cdot q + r$, we use $\gcd(a,b) = \gcd(b,r)$, which is much faster than factorization!
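The steps of the example can be written directly; the quotient $q$ at each step $a = b \cdot q + r$ is recorded because these quotients reappear below as the terms of a continued fraction expansion:

```python
# The Euclidean algorithm from the example, also recording the quotient
# q at each step a = b*q + r.
def euclid(a, b):
    quotients = []
    while b:
        quotients.append(a // b)
        a, b = b, a % b
    return a, quotients

print(euclid(209, 171))   # (19, [1, 4, 2]) -- matching 209 = 171*1 + 38, etc.
```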
To see this algorithm in another setting, we recall continued fractions. For this, we will consider the example of approximating $e$ by a rational.

$e = 2 + 0.71828\dots = 2 + (e-2)$
$= 2 + \frac{1}{1 + \frac{3-e}{e-2}}$
$= 2 + \frac{1}{1 + \frac{1}{2 + \frac{3e-8}{3-e}}}$

Truncating this expansion gives the approximations

$2,\quad 2 + \frac{1}{1} = 3,\quad 2 + \frac{1}{1 + \frac{1}{2}} = 2\tfrac{2}{3},\quad \dots$

Theorem 1. For any integer $B>0$, the best rational approximation of any real $r$ using denominator at most $B$ is given by a continued fraction approximation of $r$:

$r = a_0 + \frac{1}{a_1 + \frac{1}{a_2 + \frac{1}{a_3 + \frac{1}{a_4 + \dots}}}}$

The theorem will follow from the next lemma. Let $\frac{p_k}{q_k}$ denote the $k$-th continued fraction approximation (convergent) of $r$.

Lemma 2.
$(1)$ $p_{k+1} = a_{k}p_{k} + p_{k-1}$, $q_{k+1} = a_{k}q_{k} + q_{k-1}$, with $p_{-1} = 0, p_{0} = 1, q_{-1} = 1, q_{0} = 0$
$(2)$ $p_{k}q_{k+1}-p_{k+1}q_{k} = (-1)^{k} \implies |\frac{p_k}{q_k} - \frac{p_{k+1}}{q_{k+1}}| = \frac{1}{q_{k}q_{k+1}}$
$(3)$ $\forall r, \forall B \geq b$, there is at most one $\frac{a}{b}$ such that $|r - \frac{a}{b}| < \frac{1}{2bB}$
$(4)$ the $\frac{a}{b}$ from $(3)$, if it exists, is given by some $\frac{p_k}{q_k}$

Proof. $(1)$ is left as homework. $(2)$ is by substitution. Note that $r - \frac{p_k}{q_k}$ alternates in sign and converges.

$(3)$ Suppose there were two such approximations $\frac{a_1}{b_1} \neq \frac{a_2}{b_2}$ with $b_1, b_2 \leq B$. Then

$|\frac{a_1}{b_1} - \frac{a_2}{b_2}| \leq |\frac{a_1}{b_1}-r| + |\frac{a_2}{b_2}-r| < \frac{1}{2b_{1}B} + \frac{1}{2b_{2}B} \leq \frac{1}{b_{1}b_{2}}$

But for any two $\frac{a_1}{b_1} \neq \frac{a_2}{b_2}$, $|\frac{a_1}{b_1} - \frac{a_2}{b_2}| = |\frac{a_{1}b_{2} - a_{2}b_{1}}{b_{1}b_{2}}| \geq \frac{1}{b_{1}b_{2}}$, where we reach a contradiction.
$(4)$ Suppose there were some $\frac{a}{b}$, not equal to any $\frac{p_k}{q_k}$, with $b \leq B$ and $|\frac{a}{b}-r| < \frac{1}{2bB}$. Choose $k$ such that $q_k \leq B < q_{k+1}$. Since $r$ lies between $\frac{p_k}{q_k}$ and $\frac{p_{k+1}}{q_{k+1}}$,

$|\frac{a}{b}-r| \geq \min(|\frac{a}{b} - \frac{p_k}{q_k}|, |\frac{a}{b}-\frac{p_{k+1}}{q_{k+1}}|) \geq \frac{1}{bq_{k+1}}$

so $\frac{1}{2bB} > \frac{1}{bq_{k+1}} \implies q_{k+1} > 2B$. Then

$|r-\frac{p_k}{q_k}| \leq \frac{1}{q_{k}q_{k+1}} < \frac{1}{2q_{k}B}$

But by $(3)$ such a fraction is unique, so $\frac{a}{b} = \frac{p_k}{q_k}$, a contradiction. $\square$

We will now use Lemma 2 to complete our algorithm for solving $Ax=b$ exactly. Recall we have $\bar{x}$ where $A\bar{x} = b\ (mod\ p^m)$. For each coordinate $\bar{x}_i$, our goal is to find integers $c_i, d$ such that $d\bar{x}_i = c_{i}\ (mod\ p^m)$. Then $A\frac{c}{d} = b$ exactly, since $\|Ac\|, \|b\| \ll p^m$.

Lemma 3. Suppose there exist integers $f,g$ such that $|f|,|g|<\frac{1}{2}\sqrt{h}$ and $gs = f\ (mod\ h)$. Let the continued fraction approximations of $\frac{s}{h}$ be $\frac{w_1}{v_1}, \dots, \frac{w_k}{v_k}, \dots$ and define $u_i = v_{i}s - w_{i}h$. Let $k$ be the smallest index such that $|u_k| \leq \sqrt{h}$. Then $\frac{f}{g} = \frac{u_k}{v_k}$.

This lemma solves our problem: we can quickly convert $\bar{x}$ to a rational solution via the continued fraction approximations of $\frac{\bar{x}_i}{p^m}$, finding the corresponding $\frac{u_k}{v_k}$.

Proof. Note that $\{w_i\},\{v_i\}$ are increasing while $|u_i|$ is decreasing (why?). Write $f = gs - th$ for some integer $t$. Then

$|\frac{s}{h} - \frac{t}{g}| = |\frac{f}{hg}| = |\frac{fg}{hg^2}| < \frac{1}{2g^2}$

By Lemma 2, $\frac{t}{g}$ must be some $\frac{w_j}{v_j}$ $\implies |u_j| \leq |f| \leq \frac{1}{2}\sqrt{h}$, so $j \geq k$, since $|u_i|$ is decreasing. But

$u_{j}v_{k}-u_{k}v_{j} = (v_{j}s-w_{j}h)v_k - (v_{k}s-w_{k}h)v_j = (w_{k}v_j - w_{j}v_k)h = 0\ (mod\ h)$

and

$|u_{j}v_{k}-u_{k}v_{j}| \leq (|u_j|+|u_k|)|v_j| \leq (\frac{1}{2}\sqrt{h} + \sqrt{h})\frac{1}{2}\sqrt{h} < h$

$\implies u_{j}v_{k}=u_{k}v_{j}$, i.e. $\frac{u_k}{v_k} = \frac{u_j}{v_j} = \frac{f}{g}$. $\square$

## Regression

Let's think about the problem of solving $\min_x \|Ax-b\|$, where $A$ is an $m \times n$ matrix.
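Lemma 3 translates directly into a reconstruction procedure. Below is a sketch (the function name is ours): given $s$ and $h$, it walks the convergents of $\frac{s}{h}$ via the Euclidean algorithm and returns the first $(u_k, v_k)$ with $|u_k| \leq \sqrt{h}$:

```python
from math import isqrt

def rational_reconstruct(s, h):
    # Convergents w/v of s/h come from the Euclidean algorithm on (s, h);
    # u = v*s - w*h shrinks as v grows. Stop at the first |u| <= sqrt(h).
    w_prev, w, v_prev, v = 0, 1, 1, 0
    a, b = s, h
    while b:
        t = a // b
        w_prev, w = w, t * w + w_prev
        v_prev, v = v, t * v + v_prev
        a, b = b, a - t * b
        u = v * s - w * h
        if abs(u) <= isqrt(h):
            return u, v   # u/v = f/g by Lemma 3

# Recover f/g = 1/3 from its image s = f * g^{-1} mod h:
h = 101
s = 1 * pow(3, -1, h) % h          # s = 34, and 3*34 = 1 (mod 101)
print(rational_reconstruct(s, h))  # (1, 3)
```

The loop always terminates with a return: the final convergent of $\frac{s}{h}$ has $u = 0$.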
Let's instead look at minimizing the objective function $f(x) = \|Ax-b\|^2$. First, we'll calculate the gradient and set it equal to zero; since the objective is convex, this characterizes the global minimum:

$\nabla f = 2A^{T}(Ax-b) = 0 \implies A^{T}Ax = A^{T}b$

Then, if $A^{T}A$ has full rank (or, equivalently, $A$ has full column rank), $x = (A^{T}A)^{-1}A^{T}b$ is the unique solution.

Geometrically, writing $S = span\{\vec{a_1} \dots \vec{a_n}\}$ for the span of the columns of $A$,

$\min_x \|Ax-b\| = dist(b,S)$

The proof that this minimum is unique is left as an exercise. So, the optimal $\hat{x}$ achieves $A\hat{x} = \hat{b}$, where $\hat{b}$ is the projection of $b$ onto $S$.

Lemma 1. $A(A^{T}A)^{-1}A^{T}$ is the orthogonal projection from $\mathbb{R}^m$ to $S = span\{\vec{a_1} \dots \vec{a_n}\}$.

Proof. Take any column of $A$, say $\vec{a_i} = A\vec{e_i}$. Then

$A(A^{T}A)^{-1}A^{T}\vec{a_i} = A(A^{T}A)^{-1}(A^{T}A)\vec{e_i} = A\vec{e_i} = \vec{a_i}$

Now consider any $y' \bot \{\vec{a_1} \dots \vec{a_n}\}$, so that $y' \cdot \vec{a_i} = 0$ for all $i = 1 \dots n$, i.e. $A^{T}y' = 0$; then $A(A^{T}A)^{-1}A^{T}y' = 0$. So, writing any vector $y \in \mathbb{R}^m$ as $y = \sum_{i}\alpha_{i}\vec{a_i} + y'$ with $y' \bot S$,

$A(A^{T}A)^{-1}A^{T}y = A(A^{T}A)^{-1}A^{T}(A\sum_{i}\alpha_{i}\vec{e_i}) = \sum_{i}\alpha_{i}\vec{a_i}$

which is the projection of $y$ onto $span\{\vec{a_1} \dots \vec{a_n}\}$. $\square$

What if the columns of $A$ are not linearly independent, so that $(A^{T}A)^{-1}$ is not well defined? Let's get familiar with what's known as the Singular Value Decomposition, $A = UDV^T$, with the following definitions:

$r = rank(A)$
$\sigma_1 \geq \dots \geq \sigma_r > 0$
$U = \{\vec{u_1} \dots \vec{u_r}\}$ orthonormal columns in $\mathbb{R}^m$
$D = Diag(\sigma_1 \dots \sigma_r)$
$V = \{\vec{v_1} \dots \vec{v_r}\}$ orthonormal columns in $\mathbb{R}^n$

Theorem 1. For any $A \in \mathbb{R}^{m \times n}$, the singular values are unique, and they come in pairs $\vec{u_i}, \vec{v_i}$ such that

$A\vec{v_i} = \sigma_{i}\vec{u_i}$ AND $\vec{u_i}^{T}A = \sigma_{i}\vec{v_i}^{T}$

Fact. If $A$ is nonsingular, then $A^{-1} = VD^{-1}U^{T}$.

Proof. If $A$ is nonsingular, $m = n = r$.
Therefore, $AA^{-1} = UDV^{T}VD^{-1}U^{T} = UU^{T}$. We know $U^{T}U = I \implies U^{T} = U^{-1} \implies UU^{T} = I$. $\square$

To continue our search for $\arg\min_{x}\|Ax-b\|$, we will need to be familiar with the pseudo-inverse of $A$, $A^+ = VD^{-1}U^{T}$. Note the following properties,

$U^{T}U = V^{T}V = D^{-1}D = I$

which we will use in the following:

$AA^{+}A = UDV^{T}VD^{-1}U^{T}UDV^{T} = UDV^{T} = A$
$A^{+}AA^{+} = A^{+}$

Theorem 2. $\arg\min_{x}\|Ax-b\| = A^{+}b$

Proof. We know that the solution must satisfy $A^{T}Ax = A^{T}b$. Let's verify that $x = A^{+}b$ satisfies this condition:

$A^{T}Ax = VDU^{T}UDV^{T}VD^{-1}U^{T}b = VDU^{T}b = A^{T}b$

But is this solution unique? Note that

$Ax = AA^{+}b = UDV^{T}VD^{-1}U^{T}b = UU^{T}b$

Remember, $U$ has orthonormal columns, so its columns form a basis of $span\{\vec{a_1} \dots \vec{a_n}\}$, and $\vec{u_i} \cdot b$ are the coordinates of $b$ along $\vec{u_i}$.
The projection of $b$ onto $span\{\vec{a_1} \dots \vec{a_n}\}$ is exactly $UU^{T}b$.
So, $A^{+}b$ attains the minimum; among all minimizers it is the one of minimum norm. $\square$
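Numerically, this is exactly what `numpy.linalg.pinv` computes (via the SVD). A small sketch with a rank-deficient $A$, where $(A^{T}A)^{-1}$ does not exist:

```python
import numpy as np

# Rank-deficient A: the third column equals the sum of the first two.
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 2.],
              [1., 0., 1.]])
b = np.array([1., 2., 3., 4.])

x = np.linalg.pinv(A) @ b                     # x = A^+ b
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)  # minimum-norm least squares

# Both give the same fitted values UU^T b, and the residual is
# orthogonal to the column span: A^T (Ax - b) = 0.
print(np.allclose(A @ x, A @ x_ls), np.allclose(A.T @ (A @ x - b), 0))
```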

## Faster Gaussian Elimination

Last time we saw how to solve $Ax = b$ with a complexity of $O(n^{4}(\log n + b))$. We now want to see if we can solve the same question with $O(n^{3})$ complexity.

Consider solving $Ax = b\ (mod\ p)$ where $p$ is a prime. We can still use Gaussian Elimination and, without loss of generality, assume that $a_{ij} \in \{0, \dots, p-1\}$.

Let's work through an example with $p = 2$:

$\begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} x = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$

$= \left[\begin{array}{ccc|c} 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 \end{array}\right] =_{(3)-(1)} \left[\begin{array}{ccc|c} 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 1 \\ 0 & 1 & 0 & 0 \end{array}\right] =_{(3)-(2)} \left[\begin{array}{ccc|c} 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 \end{array}\right]$

We then solve for $x$ to get

$x = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix}$
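The elimination above, including pivoting and the mod-$p$ inverse introduced below, fits in a short sketch (the helper name is ours; it assumes $A$ is square and invertible mod $p$):

```python
def solve_mod_p(A, b, p):
    # Gauss-Jordan elimination over Z_p.
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]       # augmented matrix
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] % p)
        M[col], M[piv] = M[piv], M[col]                # swap in a nonzero pivot
        inv = pow(M[col][col], -1, p)                  # multiplicative inverse mod p
        M[col] = [v * inv % p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(v - f * w) % p for v, w in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

A = [[1, 0, 1], [0, 1, 1], [1, 1, 1]]
print(solve_mod_p(A, [0, 1, 0], 2))  # [1, 0, 1], matching the example
```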

In general, we will need to be familiar with the multiplicative inverse which is

$b^{-1} = a \in \{1, \dots, p-1\}$ s.t. $ab = 1\ (mod\ p)$

For example,

$5^{-1} = 3\ (mod\ 7)$
$4^{-1} = 2\ (mod\ 7)$
$6^{-1} = 6\ (mod\ 7)$

So, the coefficients of $A$ remain in $\{0, 1, \dots, p-1\}$ throughout, and the complexity is $O(n^{3}\log p)$.
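In Python the modular inverse is built in: for Python 3.8+, `pow` with exponent `-1` runs the extended Euclidean algorithm, and by Fermat's little theorem $b^{p-2} \bmod p$ gives the same answer for prime $p$:

```python
p = 7
for b in [5, 4, 6]:
    inv = pow(b, -1, p)             # extended-Euclid inverse mod p
    assert inv == pow(b, p - 2, p)  # Fermat's little theorem, p prime
    print(b, inv, b * inv % p)      # last column is always 1
```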

Can we use this to solve $Ax = b$ faster?

It turns out, YES, we can!

1. First, solve $A\bar{x}=b\ (mod\ p^m)$ for a “large enough” $m$.
2. Convert $\bar{x}$ to rational $x$ with “not too large” numerator and denominator.

We want $m$ to be large enough that the resulting $\bar{x}$ is the unique solution to $A\bar{x}=b\ (mod\ p^m)$ and determines the rational solution to $Ax=b$. How large, exactly, is that?

We know that $A$ has entries with $b$ bits. So,

$|det(A)| \leq (2^{2b}n)^n = 2^{(2b+\log{n})n}$

This implies (via Cramer's rule) that every numerator and denominator of the coordinates of $x$ is an integer $q \leq 2^{(2b+\log{n})n}$.

Assume $p < 2^{2b}$ and $p$ does not divide $det(A)$.

Now, our high level approach to solving $Ax = b$ will be to take the following steps:

1. Solve $A\bar{x} = b\ (mod\ p^m)$ in $\tilde{O}(n^3 + mn^2)$.

2. Set $m = O(n\log{n})$ and recover $x$ from $\bar{x}$ such that $Ax = b$.

First, we will try to find $x_{i} = \frac{c_{i}}{d}$, where $c$ is an integer vector and $d$ is an integer with $|c_i|, d \leq q$. Why?

$Ac = dA\bar{x}\ (mod\ p^m) = db\ (mod\ p^m)$
$A\frac{c}{d} = b\ (mod\ p^m)$

$|A_{i}c| \leq \|A_i\| \max_{j}|c_j| \leq 2^{2b+\log{n}} q \leq 2^{(2b + \log{n})(n+1)}$, so $m > \frac{(2b + \log{n})(n+1)}{\log{p}}$ will suffice.

We now present the algorithm to solve the first step:

1. Find $C$ such that

$AC = I\ (mod\ p)$

2. Set $b_0 = b$. For $i = 0 \dots m-1$

$x_i = Cb_i\ (mod\ p)$
$b_{i+1} = \frac{1}{p}(b_i - Ax_i)$

3. Calculate $\bar{x}$ and $A\bar{x}$.

$\bar{x} = \sum_{i=0}^{m-1} x_{i}p^{i}$
So,
$A\bar{x} = \sum_{i=0}^{m-1} Ax_{i}p^{i}$
$= \sum_{i=0}^{m-1} (b_{i} - pb_{i+1})p^{i}$
$= \sum_{i=0}^{m-1} (b_{i}p^{i} - b_{i+1}p^{i+1})$
$= b_0 - b_{m}p^m$
$= b_0\ (mod\ p^m)$
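The three steps can be sketched end-to-end on a tiny example (pure Python; the helper names are ours). With $AC = I\ (mod\ p)$ in hand, each digit $x_i$ costs one matrix-vector product:

```python
def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def lift_solve(A, b, C, p, m):
    # Steps 1-3 above: given C with AC = I (mod p), build
    # xbar = sum_i x_i p^i satisfying A xbar = b (mod p^m).
    bi, digits = b[:], []
    for _ in range(m):
        xi = [sum(c * v for c, v in zip(row, bi)) % p for row in C]  # x_i = C b_i (mod p)
        bi = [(u - v) // p for u, v in zip(bi, mat_vec(A, xi))]      # exact division by p
        digits.append(xi)
    return [sum(d[j] * p**i for i, d in enumerate(digits)) for j in range(len(b))]

A = [[1, 2], [3, 4]]   # det = -2, coprime to p = 5
C = [[3, 1], [4, 2]]   # AC = I (mod 5)
b = [1, 1]
p, m = 5, 6
xbar = lift_solve(A, b, C, p, m)
print([v % p**m for v in mat_vec(A, xbar)])  # [1, 1]: A xbar = b (mod 5^6)
```

Here the true solution is $x = (-1, 1)$, and indeed $\bar{x} \equiv (-1, 1)\ (mod\ 5^6)$.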

Next time we will look at how to approximate a given real/rational by a rational with small integers using the Euclidean algorithm for greatest common divisor. We will use this to complete the solution to $Ax=b$.