Comparing optimization algorithms for Lasso
Lasso regression is an important regularization technique for linear regression that can also perform variable selection. In practice, this means that solutions to the lasso problem tend to be sparse (contain zeros), which allows us to rule out certain independent variables in our model. A great resource to familiarize yourself with lasso is this video.
Lasso is of incredible importance in statistics, signal processing, compressed sensing, and image processing. In this post we will look at a variety of optimization techniques for solving the lasso problem.
The lasso optimization problem is an unconstrained optimization problem which can be written as follows:

$$\min_{x \in \mathbb{R}^n} \ \frac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1$$

where $A \in \mathbb{R}^{m \times n}$ is the data matrix, $b \in \mathbb{R}^m$ is the vector of observations, $x \in \mathbb{R}^n$ is the vector of regression coefficients, and $\lambda > 0$ is the regularization parameter controlling the amount of sparsity.
A first thought to solve this problem might be gradient descent; however, the objective function is non-smooth (due to the $\ell_1$ penalty), so we cannot use gradient descent. A second thought is to use subgradient descent, which is a generalization of gradient descent to non-smooth functions. This would work, but subgradient descent has an extremely slow worst-case convergence rate of $O(1/\sqrt{k})$, where $k$ is the iteration number.
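To make this concrete, here is what one subgradient step would look like (purely for illustration; none of the timed methods below use it). The particular choice of subgradient at zero and the diminishing step size $t_k$ are standard conventions rather than anything specified in the original benchmark:

$$g^k = A^T(Ax^k - b) + \lambda\, s^k, \qquad s^k_i = \begin{cases} \operatorname{sign}(x^k_i) & x^k_i \neq 0 \\ 0 & x^k_i = 0, \end{cases} \qquad x^{k+1} = x^k - t_k\, g^k.$$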
One of the easiest things to do would be reformulating this problem as a Quadratic Program (QP) as follows:

$$\begin{aligned}
\min_{x,\, t} \quad & \frac{1}{2}\|Ax - b\|_2^2 + \lambda\, \mathbf{1}^T t \\
\text{subject to} \quad & -t \le x \le t
\end{aligned}$$
To see how this reformulation is derived, see the Mosek Modeling Cookbook.
We can then feed this into a QP solver such as OSQP (sketched below) and get an answer. This works, but it feels wasteful to turn an unconstrained problem into a constrained one and then use a generic QP solver. There should be better algorithms that exploit the structure of our problem, where we have a smooth term plus a non-smooth term.
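For reference, here is a minimal sketch of how the QP above could be handed to OSQP through its Python interface, stacking the decision variable as $[x;\,t]$. The function name and the setup choices are my own for illustration and are not necessarily how the benchmarked code was written:

```python
import numpy as np
import scipy.sparse as sp
import osqp

def solve_lasso_qp(A, b, lam):
    """Sketch: solve min 0.5*||Ax-b||^2 + lam*1't  s.t. -t <= x <= t with OSQP."""
    m, n = A.shape
    # Decision variable z = [x; t] in R^{2n}
    P = sp.block_diag([sp.csc_matrix(A.T @ A), sp.csc_matrix((n, n))], format="csc")
    q = np.concatenate([-A.T @ b, lam * np.ones(n)])
    # Constraints  x - t <= 0  and  -x - t <= 0, written as l <= Gz <= u
    I = sp.eye(n)
    G = sp.vstack([sp.hstack([I, -I]), sp.hstack([-I, -I])], format="csc")
    l = -np.inf * np.ones(2 * n)
    u = np.zeros(2 * n)
    solver = osqp.OSQP()
    solver.setup(P=sp.triu(P, format="csc"), q=q, A=G, l=l, u=u, verbose=False)
    res = solver.solve()
    return res.x[:n]  # the x-part of the stacked variable
```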
ISTA, or the Iterative Shrinkage-Thresholding Algorithm, is an application of the proximal gradient method to the lasso problem.
The proximal gradient method solves problems of the following form:

$$\min_x \ f(x) + g(x)$$

where $f$ is smooth (differentiable) and $g$ is possibly non-smooth but has a proximal operator that is cheap to evaluate. The algorithm looks as follows:

$$x^{k+1} = \operatorname{prox}_{t g}\!\left(x^k - t\,\nabla f(x^k)\right)$$

where $t > 0$ is a step size. The proximal operator of a function $g$ with parameter $t$ is defined as

$$\operatorname{prox}_{t g}(v) = \operatorname*{argmin}_x \left( g(x) + \frac{1}{2t}\,\|x - v\|_2^2 \right).$$
You can think of the proximal operator as returning a point which balances minimizing the function and staying close to the current point. The proximal operators of many common functions are known in closed form. For more information on proximal operators and the algorithms built around them, refer to Parikh and Boyd's monograph Proximal Algorithms.
The idea behind the proximal gradient method is to take a gradient descent step as if we were only minimizing the smooth function $f$, and then apply the proximal operator of the non-smooth function $g$ to account for the term we ignored.
Alternating between these two steps, we eventually minimize our original objective function.
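The two-step structure is easy to see in code. Below is a generic sketch (the helper names grad_f, prox_g, and step are hypothetical; a lasso-specific version follows in the ISTA discussion):

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, x0, step, n_iter=1000):
    """Generic proximal gradient method for min f(x) + g(x).

    grad_f(x): gradient of the smooth term f
    prox_g(v, t): proximal operator of g with parameter t
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        # Gradient step on the smooth term, then a prox step on the non-smooth term.
        x = prox_g(x - step * grad_f(x), step)
    return x
```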
For Lasso we will take

$$f(x) = \frac{1}{2}\|Ax - b\|_2^2, \qquad g(x) = \lambda\|x\|_1,$$

so that $\nabla f(x) = A^T(Ax - b)$.
It can be shown that the proximal operator for the $\ell_1$ norm is the soft-thresholding operator

$$\operatorname{prox}_{t\lambda\|\cdot\|_1}(v)_i = S_{\lambda t}(v)_i = \operatorname{sign}(v_i)\,\max(|v_i| - \lambda t,\ 0).$$
We can now write out the ISTA iterates as follows:

$$x^{k+1} = S_{\lambda t}\!\left(x^k - t\,A^T(Ax^k - b)\right)$$

where $t$ is the step size, typically chosen as $t = 1/L$ with $L = \lambda_{\max}(A^T A)$, the Lipschitz constant of $\nabla f$.
It can be shown that this algorithm has a worst-case convergence rate of $O(1/k)$ in the objective value.
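Here is a minimal NumPy sketch of ISTA with the choices above and a fixed step size $t = 1/\lambda_{\max}(A^T A)$. The fixed iteration count is an illustration choice; this is not the exact implementation used in the timings below:

```python
import numpy as np

def soft_threshold(v, tau):
    """Soft-thresholding operator S_tau(v), the prox of tau*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, b, lam, n_iter=1000):
    """ISTA for min 0.5*||Ax - b||^2 + lam*||x||_1 with a fixed step size."""
    t = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / lambda_max(A^T A)
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)          # gradient of the smooth term
        x = soft_threshold(x - t * grad, lam * t)
    return x
```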
ISTA was used for a while, but many researchers noticed that it can be painfully slow to converge. In 2009, Beck and Teboulle introduced FISTA (the Fast Iterative Shrinkage-Thresholding Algorithm), which adds momentum to accelerate ISTA and achieves a worst-case convergence rate of $O(1/k^2)$.
This algorithm can be written as follows:

$$\begin{aligned}
x^{k} &= S_{\lambda t}\!\left(y^{k} - t\,A^T(Ay^{k} - b)\right) \\
\tau_{k+1} &= \frac{1 + \sqrt{1 + 4\tau_k^2}}{2} \\
y^{k+1} &= x^{k} + \frac{\tau_k - 1}{\tau_{k+1}}\left(x^{k} - x^{k-1}\right)
\end{aligned}$$

with $y^1 = x^0$, $\tau_1 = 1$, and the same step size $t$ as in ISTA.
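A corresponding FISTA sketch, reusing the soft_threshold helper from the ISTA snippet (again with a fixed step size and iteration count purely for illustration):

```python
import numpy as np

def fista(A, b, lam, n_iter=1000):
    """FISTA for min 0.5*||Ax - b||^2 + lam*||x||_1 with a fixed step size."""
    t = 1.0 / np.linalg.norm(A, 2) ** 2    # step size 1 / lambda_max(A^T A)
    x = np.zeros(A.shape[1])
    y = x.copy()                            # momentum point y^1 = x^0
    tau = 1.0                               # momentum parameter tau_1 = 1
    for _ in range(n_iter):
        x_prev = x
        grad = A.T @ (A @ y - b)
        x = soft_threshold(y - t * grad, lam * t)          # ISTA step from y
        tau_next = (1.0 + np.sqrt(1.0 + 4.0 * tau ** 2)) / 2.0
        y = x + ((tau - 1.0) / tau_next) * (x - x_prev)    # momentum extrapolation
        tau = tau_next
    return x
```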
The final algorithm we will consider for Lasso is the Alternating Direction Method of Multipliers (ADMM). This algorithm was introduced in the mid-1970s, but became popular again after Stephen Boyd et al. published their paper in 2011.
ADMM solves problems that split into two parts,

$$\min_{x,\,z} \ f(x) + g(z) \quad \text{subject to} \quad x - z = 0,$$

and the iterates of the algorithm (in scaled form) are as follows:

$$\begin{aligned}
x^{k+1} &= \operatorname*{argmin}_x \left( f(x) + \frac{\rho}{2}\,\|x - z^k + u^k\|_2^2 \right) \\
z^{k+1} &= \operatorname*{argmin}_z \left( g(z) + \frac{\rho}{2}\,\|x^{k+1} - z + u^k\|_2^2 \right) \\
u^{k+1} &= u^k + x^{k+1} - z^{k+1}
\end{aligned}$$

where $\rho > 0$ is a penalty parameter and $u$ is the scaled dual variable.
For the case of Lasso, we have

$$f(x) = \frac{1}{2}\|Ax - b\|_2^2, \qquad g(z) = \lambda\|z\|_1,$$

and the ADMM iterates become

$$\begin{aligned}
x^{k+1} &= \left(A^T A + \rho I\right)^{-1}\left(A^T b + \rho\,(z^k - u^k)\right) \\
z^{k+1} &= S_{\lambda/\rho}\!\left(x^{k+1} + u^k\right) \\
u^{k+1} &= u^k + x^{k+1} - z^{k+1}
\end{aligned}$$
Here, $S_{\lambda/\rho}$ is the same soft-thresholding operator as before, and the matrix $A^T A + \rho I$ can be factored once up front and reused in every $x$-update, which makes the iterations cheap after the first.
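A minimal NumPy/SciPy sketch of these iterates is below (reusing soft_threshold; factoring $A^T A + \rho I$ once with a Cholesky factorization and fixing the iteration count are my implementation choices, not necessarily those of the benchmarked code):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def admm_lasso(A, b, lam, rho=50.0, n_iter=500):
    """ADMM (scaled form) for min 0.5*||Ax - b||^2 + lam*||x||_1."""
    m, n = A.shape
    factor = cho_factor(A.T @ A + rho * np.eye(n))  # factor once, reuse every iteration
    Atb = A.T @ b
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)
    for _ in range(n_iter):
        x = cho_solve(factor, Atb + rho * (z - u))   # x-update: ridge-like linear solve
        z = soft_threshold(x + u, lam / rho)         # z-update: soft-thresholding
        u = u + x - z                                # scaled dual update
    return z                                         # z is the (exactly sparse) iterate
```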
Now we will test the QP version of Lasso (solved with OSQP), ISTA, FISTA, and ADMM to see which is fastest. To generate the data, I created a random problem instance (a random data matrix $A$ and observation vector $b$).
The stopping criterion for all solvers was coming within a fixed tolerance of the optimum.
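For completeness, here is a hypothetical sketch of the kind of harness one could use for such timings. The problem dimensions, noise level, and λ below are placeholders rather than the actual experimental settings, and it runs each sketch for a fixed number of iterations instead of the tolerance-based stopping rule described above:

```python
import time
import numpy as np

# Placeholder problem setup; not the dimensions used for the table below.
rng = np.random.default_rng(0)
m, n = 500, 1000
A = rng.standard_normal((m, n))
x_true = np.where(rng.random(n) < 0.1, rng.standard_normal(n), 0.0)  # sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(m)
lam = 0.1

for name, solver in [("ISTA", ista), ("FISTA", fista), ("ADMM", admm_lasso)]:
    start = time.perf_counter()
    x_hat = solver(A, b, lam)
    print(f"{name}: {time.perf_counter() - start:.3f} sec")
```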
| Algorithm | Solve Time (sec) |
|---|---|
| ADMM (rho=50) | 0.197 |
| ADMM (rho=100) | 0.202 |
| ADMM (rho=10) | 1.097 |
| FISTA | 1.652 |
| OSQP | 2.271 |
| ISTA | 8.880 |
It should be mentioned that the ISTA, FISTA, and ADMM implementations are quite naive and unoptimized, while the OSQP solver is written in pure C. The slowest algorithm by far was ISTA, followed by reformulating Lasso as a QP and solving it with OSQP, then FISTA; the fastest algorithm was ADMM. The code to generate the plots can be found here.