
Resource list

2025-03-09T00:00:00+00:00

This post collects resources I have used during my studies to learn convex optimization and computer systems, along with my review of each one. It is intended to be a living document, and I will update the list every once in a while.

Disclaimer: I am not an expert in optimization or computing, but I am writing this post because I feel like I have spent enough time learning both to be able to recommend good resources.

Optimization

This section mostly focuses on convex optimization.

Convex Optimization by Boyd and Vandenberghe

This textbook, which is freely available here, is the classic introduction to convex optimization. It has corresponding lectures by Stephen Boyd, which can be found here. Additionally, Professor Boyd teaches this course at Stanford University, and the course webpage with all the lecture notes can be found here.

The textbook has three parts: theory, applications, and algorithms. The theory part mostly focuses on what convexity is, why it is important, and which operations preserve convexity. It then covers the classical convex optimization problems such as linear programs, quadratic programs, second-order cone programs, and semidefinite programs. Finally, it covers duality and the Karush-Kuhn-Tucker (KKT) conditions.

I think this book is a good introduction to convex optimization, but it is not my favorite textbook to learn optimization algorithms from. Although part 3 of the textbook does cover optimization algorithms, it mostly focuses on primal interior point methods, which are not used much by popular convex solvers. Additionally, I do not care much for the presentation of duality; it feels quite prescriptive and a bit unmotivated. Boyd presents duality as a “structured way to create lower bounds for optimization problems”. Although that is true, I believe the real importance of duality in convex optimization is to aid in algorithm design.

Nevertheless, I recommend that any beginner read the first part of this book before reading any other textbook on this list.

Numerical Optimization by Nocedal and Wright

This is one of the best textbooks for learning about optimization algorithms. The beginning of the book discusses optimization algorithms for unconstrained nonconvex problems. To be honest I have not read much of that part. Chapters 12-17 are a goldmine for learning about algorithms for constrained convex optimization. I also love the presentation of duality in Chapter 12 far more than Boyd’s treatment. In this textbook duality is presented as a tool in algorithm design.

I also enjoy the historical perspective this textbook brings to optimization algorithms. First, the simplex method for linear programming is discussed, then its drawbacks (worst-case exponential runtime) are presented, and then interior point methods are introduced. This textbook, in my opinion, has the best explanation of how primal and primal-dual interior point methods work. The book also discusses active set methods for quadratic programming and presents the material in a very clear way.

My only complaints about this book are that primal-dual interior point methods are described only for linear and quadratic programming, not for second-order cone programming or semidefinite programming, and that operator splitting methods such as forward-backward splitting, ADMM, and PDHG are not discussed.

Large-Scale Convex Optimization by Ryu and Yin

This textbook can be freely found here with accompanying videos here.

This is my favorite textbook of all time by a long shot, and it is the best resource I have found for understanding operator-splitting methods. I had always heard about projected gradient descent, the proximal point method, ISTA, augmented Lagrangian methods, Douglas-Rachford splitting, ADMM, PDHG, etc., and viewed them as separate concepts. However, this textbook unifies all of these algorithms as fixed-point iterations of some averaged operator. Even reading just the first three chapters will be an eye-opening experience for those interested in operator-splitting methods.

Convex Analysis and Nonsmooth Optimization by Drusvyatskiy

This set of course notes discusses the fundamental concepts in convex analysis in a rigorous way. I used these notes in my Convex Analysis class and enjoyed them.

Convex Analysis and Monotone Operator Theory by Bauschke and Combettes

Disclaimer: I have not read this textbook, but it has been highly praised for its rigorous treatment of monotone operator theory, which is fundamental for rigorously analyzing operator splitting methods.

Computing

The Cherno

The YouTube playlist is here. This is the best video series for learning C++. I love that he discusses not only C++ syntax, but also what is happening under the hood in terms of how data and the program are stored in memory.

Building an 8-bit computer

This video series by Ben Eater is a fantastic introduction to computer architecture. He starts from simple and/or/not gates and builds an 8-bit computer on a breadboard while explaining everything you need to know along the way.

Although this series does not discuss important topics for performance such as cache and branch prediction, this is a fantastic video series for a beginner who wants to understand how a CPU works.

Computer Systems Programming

This video series is from the 15-213: Introduction to Computer Systems course at Carnegie Mellon University and follows this textbook.

This is my favorite resource for learning about lower-level concepts in computing such as assembly, memory caches, and writing more performant code.

Performance Aware Programming

This course by Casey Muratori is fantastic for understanding how to write code in a performance-oriented way. I am currently working through it.

Math

Elementary Analysis by Kenneth Ross

I learned basic analysis from this textbook. The book is well written, I like the examples, and I like that it is not overly rigorous like Rudin. I think it is a good intro to analysis.

Introductory Functional Analysis by Erwin Kreyszig

I have skimmed the first few chapters. The author does a good job of motivating the need for the abstraction of metric and Banach spaces. Most of the material in the first two chapters was pretty easy to grasp with knowledge of proofs, linear algebra, and analysis. I started reading this textbook so I could understand Convex Analysis and Monotone Operator Theory by Bauschke and Combettes, as they work in Hilbert spaces. I also think this material is pretty interesting in its own right.

Solving the Lasso Problem

2023-10-25T00:00:00+00:00

Introduction

Lasso regression is an important regularization technique for linear regression that can also perform variable selection. What this means is that solutions to the lasso problem tend to be sparse (contain zeros), which allows us to rule out certain independent variables in our model. A great resource to familiarize yourself with lasso is this video.

Lasso is of incredible importance in statistics, signal processing, compressed sensing, and image processing. In this post we will look at a variety of optimization techniques for solving the lasso problem.

The lasso optimization problem is an unconstrained optimization problem which can be written as follows:

\[\begin{split} \underset{x}{\text{minimize}} \quad & \frac{1}{2}\|Ax-b\|_2^2 + \lambda \|x\|_1 \\ \end{split}\]

where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $\lambda > 0$ is the regularization weight.

A first thought for solving this problem might be gradient descent. However, the objective function is non-smooth (due to the $\ell_1$ penalty), so we cannot use gradient descent. A second thought is to use subgradient descent, which is a generalization of gradient descent to non-smooth functions. This would work, but subgradient descent has an extremely slow worst-case convergence rate of $\mathcal{O}(1 / \sqrt{t})$ (meaning you need four times the iterations to double the accuracy), so we will look at better algorithms.

Reformulating as Quadratic Program

One of the easiest things to do is to reformulate this problem as a Quadratic Program (QP) by introducing auxiliary variables $t_i$ that bound $|x_i|$:

\[\begin{split} \underset{x,t}{\text{minimize}} \quad & \frac{1}{2}\|Ax-b\|_2^2 + \lambda \sum_{i=1}^n t_i \\ \text{subject to} \quad & -t_i \leq x_i \leq t_i, \quad i = 1, \ldots, n \end{split}\]

To see how this was done, refer to the Mosek Modeling Cookbook.

We can then feed this into a QP solver such as OSQP and then get an answer. This works, but it feels wasteful to turn an unconstrained problem into a constrained one and then use a generic QP solver. There should be better algorithms that exploit the structure of our problem where we have a smooth plus a non-smooth term.
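As a rough sketch of what handing the problem to a QP solver involves, the snippet below assembles the reformulated problem for the stacked variable $z = (x, t)$ in the generic form $\frac{1}{2}z^\top P z + q^\top z$ subject to $Gz \leq 0$. The function name `lasso_qp_data` and this particular inequality layout are my own choices for illustration. A specific solver such as OSQP expects its own input format, so treat this as a template rather than solver-ready code.

```julia
using LinearAlgebra

# Build (P, q, G) so that, for z = [x; t],
#   0.5*z'P*z + q'z + 0.5*b'b == 0.5*norm(A*x - b)^2 + lambda*sum(t)
# and G*z <= 0 encodes -t_i <= x_i <= t_i.
function lasso_qp_data(A::AbstractMatrix, b::AbstractVector, lambda::Real)
    n = size(A, 2)
    AtA = A' * A
    Atb = A' * b
    Id = Matrix{Float64}(I, n, n)
    Z = zeros(n, n)
    P = [AtA Z; Z Z]                      # the quadratic term only involves x
    q = [-Atb; fill(float(lambda), n)]    # -A'b acts on x, lambda acts on t
    G = [Id -Id; -Id -Id]                 # x - t <= 0 and -x - t <= 0
    return P, q, G
end
```

Note that the problem size doubles from $n$ to $2n$ variables and gains $2n$ inequality constraints, which is part of why this route feels wasteful.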

ISTA

ISTA, or the Iterative Shrinkage-Thresholding Algorithm, is an application of the proximal gradient method to the Lasso problem.

The proximal gradient method solves problems of the following form, where $f$ is differentiable and $g$ may be non-smooth:

\[\begin{split} \underset{x}{\text{minimize}} \quad & f(x) + g(x) \\ \end{split}\]

The algorithm looks as follows:

\[x_{k+1} = \boldsymbol{\text{prox}}_{\eta g}(x_k - \eta \nabla f(x_k))\]

where $\eta$ is the gradient descent stepsize for $f$, which we take to be the inverse of the Lipschitz constant of $\nabla f$.

The proximal operator $\boldsymbol{\text{prox}}_{\eta g}$ is a generalization of the projection operation and is defined as follows

\[\boldsymbol{\text{prox}}_{\eta g}(v) = \underset{x}{\text{argmin}} \left(g(x) + \frac{1}{2\eta}\|x-v\|_2^2\right)\]

You can think of the proximal operator as returning a point which balances minimizing the function and staying close to the current point. The proximal operators of many functions are known in closed form. For more information on proximal operators and algorithms using proximal operators, refer to .

The idea behind the proximal gradient method is to perform a gradient descent step assuming we are just going to be minimizing the smooth function $f$ and then do an evaluation of the proximal operator for $g$ which can be interpreted as a gradient descent step on the smoothed version of $g$. More technically we do a gradient descent step on the Moreau envelope of $g$.
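To make the Moreau envelope remark concrete, here is the standard identity behind it, stated for a closed convex function $g$. The symbol $M_{\eta g}$ for the envelope is my notation, not something used elsewhere in this post.

\[\begin{split} M_{\eta g}(v) &= \underset{x}{\text{min}} \left(g(x) + \frac{1}{2\eta}\|x-v\|_2^2\right) \\ \nabla M_{\eta g}(v) &= \frac{1}{\eta}\left(v - \boldsymbol{\text{prox}}_{\eta g}(v)\right) \\ \boldsymbol{\text{prox}}_{\eta g}(v) &= v - \eta \nabla M_{\eta g}(v) \end{split}\]

The last line says that evaluating the proximal operator at $v$ is exactly a gradient step of size $\eta$ on the envelope, which is smooth even when $g$ is not.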

Alternating between these two steps, we eventually minimize our original objective function.

For Lasso we will take

\[\begin{split} f(x) &= \frac{1}{2}\|Ax-b\|_2^2 \\ g(x) &= \lambda\|x\|_1 \\ \end{split}\]

It can be shown that the proximal operator for the $\ell_1$ norm is the soft threshold operator, applied elementwise:

\[\boldsymbol{\text{prox}}_{\eta \|\cdot\|_1}(v) = \mathcal{S}_\eta(v) = \text{sign}(v)\max(|v|-\eta,0)\]
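As a small worked example with numbers I picked for illustration, take $\eta = 1$ and $v = (3, -0.5, 1.2)$. Each entry is shrunk toward zero by $1$, and entries with magnitude below $1$ are set exactly to zero:

\[\mathcal{S}_1\left((3, -0.5, 1.2)\right) = \left(1 \cdot \max(3-1,0), \; -1 \cdot \max(0.5-1,0), \; 1 \cdot \max(1.2-1,0)\right) = (2, 0, 0.2)\]

This exact zeroing of small entries is what produces sparse iterates.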

We can now write out the ISTA iterates as follows

\[x_{k+1} = \mathcal{S}_{\lambda/L}\left(x_k - \frac{1}{L}A^\top(Ax_k-b)\right)\]

where $L$ is the maximum eigenvalue of $A^\top A$, which is the Lipschitz constant of $\nabla f$, so the stepsize is $\eta = 1/L$ and the threshold is $\eta\lambda = \lambda/L$. $L$ can be computed quickly via power iteration.

It can be shown that this algorithm has a worst-case convergence rate of $\mathcal{O}(1 / t)$, meaning that if we double the number of iterations, we double the accuracy of the solution. This is already better than the subgradient method, but it is not the best we can do.
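A minimal sketch of the ISTA iteration above is given below. It is not the implementation used for the timings in this post. The function names, the fixed iteration counts, and the all-ones starting vector for power iteration are assumptions made for illustration. Power iteration approaches $L$ from below, so in practice you may want more iterations or a small safety factor on the estimate.

```julia
using LinearAlgebra

# Elementwise soft-thresholding: the proximal operator of kappa * ||.||_1.
soft_threshold(v, kappa) = sign.(v) .* max.(abs.(v) .- kappa, 0)

# Estimate the largest eigenvalue of A'A by power iteration.
function power_iteration(A; iters=200)
    v = normalize(ones(size(A, 2)))
    for _ in 1:iters
        v = normalize(A' * (A * v))
    end
    return norm(A * v)^2    # Rayleigh quotient v'A'Av, since norm(v) == 1
end

# ISTA: a gradient step on f followed by soft-thresholding.
function ista(A, b, lambda; iters=1000)
    L = power_iteration(A)
    x = zeros(size(A, 2))
    for _ in 1:iters
        x = soft_threshold(x - (A' * (A * x - b)) / L, lambda / L)
    end
    return x
end
```

Each iteration costs two matrix-vector products and one elementwise operation, with no linear solves.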

FISTA

ISTA was used for a while, but many researchers noticed that it can be painfully slow to converge. In 2009, Beck and Teboulle introduced FISTA (Fast Iterative Shrinkage-Thresholding Algorithm), which uses momentum to accelerate ISTA and achieves a worst-case convergence rate of $\mathcal{O}(1 / t^2)$, meaning that if we double the number of iterations, we quadruple the accuracy of the solution. FISTA can be thought of as applying ideas from Nesterov’s Accelerated Gradient to ISTA.

This algorithm can be written as follows

\[\begin{align} x_k &= \mathcal{S}_{\lambda/L}\left(y_k - \frac{1}{L}A^\top(Ay_k-b)\right) \\ t_{k+1} &= \frac{1+\sqrt{1+4t_k^2}}{2} \\ y_{k+1} &= x_k + \left(\frac{t_k-1}{t_{k+1}}\right)(x_k-x_{k-1}) \end{align}\]
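The three updates above can be sketched as follows. The initialization $y_1 = x_0 = 0$ and $t_1 = 1$ is the usual choice but is not stated in this post, so treat it as an assumption. The function name `fista` and the fixed iteration count are also illustrative, and `opnorm(A)^2` is used here in place of power iteration to get $L$.

```julia
using LinearAlgebra

# Elementwise soft-thresholding: the proximal operator of kappa * ||.||_1.
soft_threshold(v, kappa) = sign.(v) .* max.(abs.(v) .- kappa, 0)

function fista(A, b, lambda; iters=500)
    L = opnorm(A)^2                  # largest eigenvalue of A'A
    x_prev = zeros(size(A, 2))       # x_0
    y = copy(x_prev)                 # y_1 = x_0 (assumed initialization)
    t = 1.0                          # t_1 = 1  (assumed initialization)
    for _ in 1:iters
        # ISTA step taken from the extrapolated point y instead of x
        x = soft_threshold(y - (A' * (A * y - b)) / L, lambda / L)
        t_next = (1 + sqrt(1 + 4 * t^2)) / 2
        # Momentum: extrapolate along the direction of the last move
        y = x + ((t - 1) / t_next) * (x - x_prev)
        x_prev, t = x, t_next
    end
    return x_prev
end
```

The per-iteration cost is essentially the same as ISTA; the only extra work is the scalar update of $t_k$ and one vector combination.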

ADMM

The final algorithm we will consider for Lasso is the Alternating Direction Method of Multipliers (ADMM). This algorithm was introduced in the mid-1970s, but became popular again after Stephen Boyd et al. published their paper in 2011. This algorithm attempts to solve problems of the following form

\[\begin{split} \underset{x,z}{\text{minimize}} \quad & f(x) + g(z) \\ \text{subject to} \quad & x = z, \\ \end{split}\]

The iterates of the algorithm are as follows:

\[\begin{align} x_{k+1} &= \underset{x}{\text{argmin}} \; \mathcal{L}_\rho(x,z_k,y_k) \\ z_{k+1} &= \underset{z}{\text{argmin}} \;\mathcal{L}_\rho(x_{k+1},z,y_k) \\ y_{k+1} &= y_k + \rho(x_{k+1}-z_{k+1}) \end{align}\]

where

\[\mathcal{L}_\rho(x,z,y) = f(x) + g(z) + y^T(x-z) + \frac{\rho}{2}\|x-z\|_2^2\]

For the case of Lasso, we have

\[\begin{align} f(x) &= \frac{1}{2}\|Ax-b\|_2^2 \\ g(z) &= \lambda\|z\|_1 \end{align}\]

and the ADMM iterates become

\[\begin{align} x_{k+1} &= (A^TA+\rho I)^{-1}(A^Tb+\rho z_k -y_k) \\ z_{k+1} &= \mathcal{S}_{\lambda/\rho} (x_{k+1}+y_k/\rho) \\ y_{k+1} &= y_k + \rho(x_{k+1}-z_{k+1}) \end{align}\]

Here, $\rho > 0$ is the penalty parameter of the augmented Lagrangian, which plays the role of a stepsize.
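A minimal sketch of these ADMM iterates is shown below. It is not the code behind the timings in this post. The function name, the default `rho`, and the fixed iteration count are assumptions for illustration. The one design choice worth pointing out is that the matrix $A^\top A + \rho I$ never changes, so it can be factorized once and reused in every $x$ update.

```julia
using LinearAlgebra

# Elementwise soft-thresholding: the proximal operator of kappa * ||.||_1.
soft_threshold(v, kappa) = sign.(v) .* max.(abs.(v) .- kappa, 0)

function admm_lasso(A, b, lambda; rho=1.0, iters=500)
    n = size(A, 2)
    F = cholesky(Symmetric(A' * A + rho * I))   # factor once, reuse every iteration
    Atb = A' * b
    z = zeros(n)
    y = zeros(n)
    for _ in 1:iters
        x = F \ (Atb + rho * z - y)                     # x update: linear solve
        z = soft_threshold(x + y / rho, lambda / rho)   # z update: prox of g
        y = y + rho * (x - z)                           # dual update
    end
    return z    # z carries the exact zeros produced by soft-thresholding
end
```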

Results

Now we will test the QP version of Lasso, ISTA, FISTA, and ADMM to see which is fastest. To generate the data, I generated a random $A \in \mathbb{R}^{m \times n}$ with $m < n$, then generated a random sparse vector $x_{*}$, and calculated $b=Ax_*$.
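One way such test data could be generated is sketched below. The Gaussian entries, the number of nonzeros `k`, the seed, and the function name are all assumptions of mine, since the post does not specify the distributions or sizes that were actually used.

```julia
using Random

# Random underdetermined system (m < n) with a k-sparse ground truth.
function make_lasso_data(m, n, k; seed=0)
    rng = MersenneTwister(seed)
    A = randn(rng, m, n)
    x_true = zeros(n)
    support = randperm(rng, n)[1:k]     # indices of the nonzero entries
    x_true[support] = randn(rng, k)
    b = A * x_true                      # noiseless measurements
    return A, b, x_true
end

A, b, x_true = make_lasso_data(50, 200, 10)
```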

The stopping criterion for all solvers was reaching an objective value within $10^{-4}$ of the optimal objective function value.

Convergence of ISTA, FISTA, and ADMM with varying stepsizes

Algorithm	Solve Time (sec)
ADMM (rho=50)	0.197
ADMM (rho=100)	0.202
ADMM (rho=10)	1.097
FISTA	1.652
OSQP	2.271
ISTA	8.880

It should be mentioned that the ISTA, FISTA, and ADMM implementations are quite naive and unoptimized, whereas the OSQP solver is written in pure C. The slowest algorithm by far was ISTA, followed by reformulating Lasso as a QP and solving it with OSQP, followed by FISTA. The fastest algorithm was ADMM, although its solve time depends noticeably on the choice of $\rho$. The code to generate the plots can be found here.

Hypothesis Testing

2023-09-01T00:00:00+00:00

Introduction

Let's say you flip a coin ten times and it comes up heads six out of ten times. Would you think this coin is biased? Probably not.

Now let's say you flip a coin a thousand times and it comes up heads six hundred times. You would probably think it is biased. But why?

In both cases the coin comes up heads 60% of the time. How can we quantify our intuition here? We will turn to the world of hypothesis testing to answer this question.

Null and Alternative Hypothesis

The null hypothesis is the statement we are trying to develop evidence against, and the alternative hypothesis is its complement.

Here our null hypothesis is that the coin is fair. The alternative hypothesis is that the coin is biased.

P-Value

One useful question for us to ask is: what is the chance that we see a result this extreme or more extreme due to random chance, assuming the coin were unbiased?

The statistical term for this is the p-value.

If our p-value is small, it means that it is unlikely that we would see a result as extreme as six hundred heads out of one thousand flips if the coin were fair. So a small p-value would lead us to conclude that the coin is indeed biased.

Level of Significance

But how small of a p-value is small enough for us to conclude the coin is biased? In statistics, this value is called $\alpha$ and a typical value for $\alpha$ is 0.05. But what does this $\alpha$ really mean?

It is the chance that we conclude that the coin is biased when in reality it isn’t. This kind of error is called a type I error.

So if $p < 0.05$ then we will conclude that the coin is biased, knowing that if the coin is actually fair there is a 5% chance that we wrongly conclude it is biased.

Example

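Let’s dig into the math of computing p-values. By definition, the p-value is the chance that we see a result at least as extreme as six hundred heads out of a thousand flips, assuming the coin were unbiased. Since the alternative hypothesis is that the coin is biased in either direction, we count both six hundred or more heads and four hundred or fewer heads as at least as extreme.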

Let $X$ be a Bernoulli random variable. If $X=0$ then the coin landed on tails, and if $X=1$ the coin landed on heads. Since we are assuming the coin is unbiased we can write the probability mass function, expected value, and variance of $X$ as follows

\[\mathbb{P}[X=0] = 0.5\] \[\mathbb{P}[X=1] = 0.5\] \[\mathbb{E}[X] = 0.5\] \[\mathbb{V}[X] = 0.25\]

The number of heads that we get is the sum of a thousand independent copies of $X$, one per flip. We will define this new random variable as $H$. We can write its expected value and variance as follows

\[\mathbb{E}[H] = 500\] \[\mathbb{V}[H] = 250\]
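For completeness, here is the short calculation behind those two numbers, writing $X_1, \ldots, X_{1000}$ for the individual flips (the subscripted notation is mine). Expectations always add, and variances add because the flips are independent:

\[\mathbb{E}[H] = \sum_{i=1}^{1000} \mathbb{E}[X_i] = 1000 \times 0.5 = 500\] \[\mathbb{V}[H] = \sum_{i=1}^{1000} \mathbb{V}[X_i] = 1000 \times 0.25 = 250\]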

Since $H$ is the sum of a large number of independent and identically distributed random variables, we can use the central limit theorem to conclude that, in addition to having the expected value and variance written above, $H$ is approximately normally distributed. We can then calculate the z-score of flipping 600 heads as follows:

\[z = \frac{x-\mu}{\sigma} = \frac{600 - \mathbb{E}[H]}{\sqrt{\mathbb{V}[H]}} \approx 6.32\]

To get the p-value from the z-score we use the standard Gaussian cumulative distribution function $\Phi$ as follows:

\[p = 2(1 - \Phi(z)) \approx 2.5 \times 10^{-10}\]

This p-value is less than 0.05, so we reject the null hypothesis and conclude that the coin is biased.
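For comparison, we can run the same calculation for the ten-flip case from the introduction, where $\mathbb{E}[H] = 5$ and $\mathbb{V}[H] = 2.5$. Here $\Phi$ denotes the standard normal cumulative distribution function, and the normal approximation is only rough for so few flips, so the numbers below are indicative rather than exact.

\[z = \frac{6 - 5}{\sqrt{2.5}} \approx 0.63\] \[p = 2(1 - \Phi(0.63)) \approx 0.53\]

This p-value is far above 0.05, so six heads out of ten gives us no reason to reject the null hypothesis, which matches the intuition from the start of the post.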

Code

Below is some Julia code that conducts a hypothesis test for a coin flip example. The input parameters are the number of heads and the number of coin flips. The source file can be found here. I encourage you to play around with the number of heads and total number of flips to get an intuition for what is statistically significant and what isn’t.

using Distributions

# Input parameters
num_heads = 60
num_flips = 100
@assert num_heads <= num_flips "Number of heads must be less than or equal to the number of flips"

alpha = 0.05 # Significance level

# Mean and variance of the number of heads if the coin were unbiased
mu = 0.5 * num_flips
variance = 0.25 * num_flips

# Z-score of our observation of num_heads
z = (num_heads - mu) / sqrt(variance)

# Two-sided p-value 2 * (1 - cdf(z)); ccdf is the upper tail 1 - cdf, computed more accurately
p = 2 * ccdf(Normal(), abs(z))

println("p-value: ", p)
if p < alpha
    println("p < $alpha so we can reject the null hypothesis and conclude the coin is biased")
else
    println("p >= $alpha so we cannot reject the null hypothesis and we cannot conclude that the coin is biased")
end
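As a cross-check on the normal approximation, the p-value can also be computed exactly from the binomial distribution. The sketch below uses only Julia's built-in arbitrary-precision integers, so it needs no packages. The function name is hypothetical, and doubling the upper tail relies on the symmetry of a fair coin, so this particular shortcut would not carry over to a null hypothesis other than $p = 0.5$.

```julia
# Exact two-sided p-value for a fair coin, using the binomial distribution directly.
function exact_two_sided_pvalue(num_heads::Integer, num_flips::Integer)
    @assert 0 <= num_heads <= num_flips
    # By symmetry, seeing k heads is as extreme as seeing num_flips - k heads.
    k = max(num_heads, num_flips - num_heads)
    # P(H >= k) = sum_{j >= k} C(n, j) / 2^n, computed exactly as a rational number.
    tail = sum(binomial(big(num_flips), j) for j in k:num_flips) // big(2)^num_flips
    return min(1.0, 2 * Float64(tail))
end

println(exact_two_sided_pvalue(6, 10))
println(exact_two_sided_pvalue(600, 1000))
```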

Convex Solvers

2023-02-03T00:00:00+00:00

Introduction

Convex optimization is a class of optimization concerned with minimizing a convex function over a convex set. One important feature of convex optimization is that any local minimum of a convex problem is also a global minimum, which means that a global minimum can be found efficiently. Mathematically, a convex problem can be written as follows

\[\begin{align*} \min_{x} & \; f(x) \\ \textrm{s.t.} & \; h_{i}(x) = 0, \quad i = 1, \ldots, m \\ & \; g_{i}(x) \leq 0, \quad i = 1, \ldots, p \end{align*}\]

where $x \in \mathbb{R}^n$ is the optimization variable, $f(x)$ is a convex objective function, $h_{i}(x)=0$ are affine equality constraints, and $g_{i}(x) \leq 0$ are convex inequality constraints.

This post focuses on the intuition behind different classes of convex solvers and provides references for further reading at the end of each section.

Active Set Methods

An inequality constraint is said to be active at a point $x_0$ if $g_i(x_0) = 0$. We can define the optimal active set as the set of all constraints that are active at the optimal solution $x^*$. All the equality constraints are always in the optimal active set.

Active set methods take advantage of the fact that equality constrained problems are easier to solve than inequality constrained problems. These methods start with a guess of the optimal active set and solve the resulting equality constrained subproblem. They then use information from the solution of the subproblem, for example the sign of the dual variables, to add and remove constraints from the current guess of the active set.

If a good guess of the optimal active set is known, these methods can be very fast and take only a handful of iterations. As a result, these methods warmstart well and are advantageous in applications like model predictive control, since it is unlikely that the optimal active set changes drastically between two consecutive solves.

The disadvantage of active set methods is that the theoretical worst-case runtime is exponential in the number of constraints since in the worst case, all combinations of constraints must be tested.

One famous example of an active set algorithm is the simplex method, which was invented by George Dantzig for Linear Programs (LPs). In the simplex method, all iterates are vertices of the feasible set (which is a polytope). This is not the case for quadratic programs (QPs) or more complex optimization problems. Another active set solver is qpOASES.

Each iteration of an active set method for QPs solves an equality constrained QP, so we will now see how equality constrained QPs can be solved in a simple way. For equality constrained QPs, the KKT conditions, which are necessary and sufficient conditions for optimality, are linear and thus can be solved in one Newton step. We can write the equality constrained QP as follows, where $Q \succ 0$.

\[\begin{align*} \min_{x} & \; \frac{1}{2}x^\top Qx + q^\top x \\ \textrm{s.t.} & \; Ax=b\\ \end{align*}\]

We can write the KKT conditions for this problem as follows, where $\lambda$ is a vector of dual variables

\[\begin{align*} Qx+q+A^\top \lambda &= 0 \\ Ax-b &= 0 \end{align*}\]

We can see that this system of equations is linear in the primal and dual variables so we can find the optimal primal and dual solution by solving the following system of equations

\[\begin{bmatrix} Q & A^\top \\ A & 0 \end{bmatrix} \begin{bmatrix} x \\ \lambda \end{bmatrix} = \begin{bmatrix} -q \\ b \end{bmatrix}\]

Thus we can see that solving the equality constrained quadratic program amounts to nothing more than solving a linear system.
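As a small illustration, the KKT system above can be assembled and solved directly. This is a dense sketch with a function name of my own choosing. Real solvers would exploit sparsity and use a symmetric factorization rather than a generic backslash solve.

```julia
using LinearAlgebra

# Solve  min 0.5*x'Q*x + q'x  s.t.  A*x = b  through its KKT system.
function solve_eq_qp(Q, q, A, b)
    n, m = size(Q, 1), size(A, 1)
    K = [Q A'; A zeros(m, m)]       # KKT matrix
    sol = K \ [-q; b]               # one linear solve gives primal and dual
    return sol[1:n], sol[n+1:end]   # (x, lambda)
end

# Example: minimize x1^2 + x2^2 - 2*x1 - 5*x2 subject to x1 + x2 = 1
x, lambda = solve_eq_qp([2.0 0.0; 0.0 2.0], [-2.0, -5.0], [1.0 1.0], [1.0])
```

For this example the stationarity conditions $2x_1 - 2 + \lambda = 0$ and $2x_2 - 5 + \lambda = 0$ together with $x_1 + x_2 = 1$ give $x = (-0.25, 1.25)$ and $\lambda = 2.5$ by hand, which is what the linear solve should return up to rounding.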

To learn more about the details of active set methods, see Chapter 16, Section 5 of Numerical Optimization by Nocedal and Wright.

Interior Point Methods (IPMs)

As the name suggests, interior point methods solve optimization problems in a way that the iterates lie in the interior of the feasible set. There are two main variants of IPMs: primal and primal-dual. In primal IPMs, we only compute iterates of the primal variables, and in primal-dual IPMs we compute iterates of both the primal and dual variables.

Primal IPMs make use of the fact that the following two problems are equivalent:

\[\begin{align*} \min_{x} & \; f(x) \\ \textrm{s.t.} & \; x \in \mathcal{D}\\ \end{align*}\]

where $\mathcal{D}$ is some convex set and

\[\begin{align*} \min_{x} & \; f(x) + \mathcal{I}_{\mathcal{D}}(x)\\ \end{align*}\]

where $\mathcal{I}_{\mathcal{D}}(x)$ is the indicator function on $\mathcal{D}$ which is defined as

\[\mathcal{I}_{\mathcal{D}}(x) = \begin{cases} 0 & \text{if} \; x \in \mathcal{D} \\ \infty & \text{if} \; x \notin \mathcal{D} \end{cases}\]

However, we cannot directly solve the minimization problem with the indicator function since it is nonsmooth at the boundary of the set $\mathcal{D}$, where it jumps from zero to infinity. Thus primal IPMs seek to replace the indicator function with a smooth approximation called a log-barrier function. This log-barrier function is roughly zero when $x$ is well inside $\mathcal{D}$ and steeply approaches infinity as $x$ approaches the boundary of $\mathcal{D}$.

This steepness is controlled by a “barrier parameter.” The steeper the barrier function is, the better it approximates the indicator function, but the less smooth and worse conditioned the minimization becomes. Initially the unconstrained problem is solved with a shallow barrier, and then it is successively re-solved with steeper and steeper barriers. This barrier can be thought of as a force field that pushes the iterates away from the boundary of the feasible set, and the amount by which this force field pushes the iterates is controlled by the barrier parameter.
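To see the barrier idea in action, here is a toy one-dimensional sketch: minimize $(x-2)^2$ subject to $x \leq 1$, whose solution is $x = 1$. The indicator is replaced by $-\mu \log(1 - x)$, and the problem is re-solved with Newton's method for a decreasing sequence of barrier parameters $\mu$, each solve starting from the previous answer. The problem, the schedule of $\mu$ values, the tolerances, and the function name are all illustrative choices of mine and are not taken from any particular solver.

```julia
# Toy problem: minimize (x - 2)^2 subject to x <= 1 (solution x = 1).
# Barrier subproblem for mu > 0: phi(x) = (x - 2)^2 - mu * log(1 - x), for x < 1.
function barrier_method(; mus=(1.0, 1e-1, 1e-2, 1e-3, 1e-4), newton_iters=50)
    x = 0.0                                   # strictly feasible starting point
    path = Float64[]                          # minimizer found for each mu
    for mu in mus
        phi(u) = (u - 2)^2 - mu * log(1 - u)
        for _ in 1:newton_iters
            g = 2 * (x - 2) + mu / (1 - x)    # phi'(x)
            abs(g) < 1e-9 && break
            h = 2 + mu / (1 - x)^2            # phi''(x), always positive
            step = -g / h                     # Newton step
            t = 1.0
            # Backtrack until the trial point is strictly feasible and phi decreases.
            while x + t * step >= 1 || phi(x + t * step) > phi(x) + 1e-4 * t * g * step
                t /= 2
            end
            x += t * step
        end
        push!(path, x)
    end
    return path
end

println(barrier_method())
```

Setting $\phi'(x) = 0$ for this toy problem gives the closed form $x(\mu) = (6 - \sqrt{4 + 8\mu})/4$, so the iterates stay strictly inside the feasible set and approach $x = 1$ only as $\mu \to 0$.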

Primal-dual IPMs take a slightly different approach. They apply Newton’s method to the KKT conditions of the problem, along with some other tricks such as taking a prediction step followed by a correction step, which allows the algorithm to reuse the factorization of the KKT matrix. This famed trick is called Mehrotra’s predictor-corrector. The primal-dual IPM is famously used by SpaceX in their rocket landing algorithm.

One large drawback of using IPMs in real-time systems is that they cannot be warmstarted effectively, and warmstarting is a desirable property for real-time solvers.

One example of an IPM is ECOS.

To learn more about primal IPMs, see Chapter 11 in Convex Optimization by Boyd and Vandenberghe. To learn more about primal-dual IPMs, see Chapter 14 and Chapter 16, Section 6 in Nocedal and Wright.

First-Order Methods

Both Active Set methods and IPMs typically rely on “second-order” information. Second order information is the information about the curvature of a function which is given by its second derivative, or for the multivariable case its Hessian. Using second order information allows these methods to converge quickly in few iterations, but each iteration requires the factorization of a large matrix which is a very expensive operation.

The number of floating point operations for matrix factorization scales with $\mathcal{O}(n^3)$ and storing the Hessian scales with $\mathcal{O}(n^2)$ where $n$ is the number of variables in your problem.

First-order methods, on the other hand, only use first-order information, which is information about the slope of a function given by its first derivative, or its gradient in the multivariable case. First-order methods do not require matrix factorizations at each iteration and only require matrix-vector multiplications. The number of floating point operations for a matrix-vector multiplication scales with $\mathcal{O}(n^2)$, so each iteration of a first-order method is cheaper than an iteration of a second-order method. However, first-order methods require more iterations to converge, since each iteration uses less information.

First order methods are more or less gradient descent algorithms with some modifications to handle constraints such as projections. For extremely large scale problems first order methods are preferable due to the high cost of factorizing and storing large matrices.
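As one concrete instance of gradient descent plus a projection, here is a sketch of projected gradient descent for a box-constrained QP, $\min \frac{1}{2}x^\top Q x + q^\top x$ subject to $lo \leq x \leq hi$. The problem class, the function name, and the fixed iteration count are my own choices for illustration, and this is not how OSQP or SCS work internally.

```julia
using LinearAlgebra

# Projected gradient descent for  min 0.5*x'Q*x + q'x  s.t.  lo <= x <= hi.
function projected_gradient(Q, q, lo, hi; iters=500)
    step = 1 / eigmax(Symmetric(Q))         # one-time setup: 1 / Lipschitz constant
    x = clamp.(zeros(length(q)), lo, hi)    # feasible starting point
    for _ in 1:iters
        # Gradient step, then projection onto the box (an elementwise clamp)
        x = clamp.(x - step * (Q * x + q), lo, hi)
    end
    return x
end

x = projected_gradient([2.0 0.0; 0.0 1.0], [-4.0, 1.0], 0.0, 1.0)
```

Inside the loop the only expensive operation is the matrix-vector product `Q * x`. For very large problems the eigenvalue computation in the setup could be replaced by a few steps of power iteration.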

The low per-iteration cost of first-order methods does come with some drawbacks. First-order methods are extremely sensitive to ill-conditioned objectives and badly scaled problem data. Thus an extremely fast and robust implementation of a first-order method must scale and precondition the problem data.

Some examples of first order solvers are OSQP and SCS.

Summary

Here we will quickly summarize the advantages and disadvantages of each method.

Active Set

Advantages: Easy to warmstart, fast if you have a good guess for the active set

Disadvantages: Worst-case exponential runtime in the number of constraints, bad for large problems

IPMs

Advantages: Fast and robust for medium sized problems

Disadvantages: Bad for large problems, cannot warmstart, large code footprint

First-Order

Advantages: Small code footprint, good for large problems, can be good for medium sized problems with customization, ease of customization, easy to warmstart

Disadvantages: Highly sensitive to scaling and conditioning so they need scaling and preconditioning