Using the LASSO to Forecast Returns

1. Motivating Example

A Popular Goal. Financial economists have been looking for variables that predict stock returns for as long as there have been financial economists. For some recent examples, think about Jegadeesh and Titman (1993), which shows that a stock’s current returns are predicted by the stock’s returns over the previous $12$ months, Hou (2007), which shows that the current returns of the smallest stocks in an industry are predicted by the lagged returns of the largest stocks in the industry, and Cohen and Frazzini (2008), which shows that a stock’s current returns are predicted by the lagged returns of its major customers.

Two-Step Process. When you think about it, finding these sorts of variables actually consists of two separate problems, identification and estimation. First, you have to use your intuition to identify a new predictor, $x_t$ , and then you have to use statistics to estimate this new predictor’s quality,

(1) $\begin{align*} r_{n,t} &= \hat{\theta}_0 + \hat{\theta}_1 \cdot x_{t-1} + \epsilon_{n,t}, \end{align*}$

where $\hat{\theta}_0$ and $\hat{\theta}_1$ are estimated coefficients, $r_{n,t}$ is the return on the $n$ th stock, and $\epsilon_{n,t}$ is the regression residual. If knowing $x_{t-1}$ reveals a lot of information about what a stock’s future returns will be, then $|\hat{\theta}_1|$ and the associated $R^2$ will be large.

Can’t Always Use Intuition. But modern financial markets are big, fast, and dense. Predictability doesn’t always occur at scales that are easy for people to intuit, which undermines the standard approach to the first problem. For instance, the lagged returns of the Federal Signal Corporation were a significant predictor for more than $70{\scriptstyle \%}$ of all NYSE-listed telecom stocks during a $34$ -minute stretch on October $5$ th, 2010. Can you really fish this particular variable out from the sea of spurious predictors using intuition alone? And how exactly are you supposed to do this in under $34$ minutes?

Using Statistics Instead. In a recent working paper (link), Mao Ye, Adam Clark-Joseph, and I show how to replace this intuition step with statistics and use the least absolute shrinkage and selection operator (LASSO) to identify rare, short-lived, “sparse” signals in the cross-section of returns. This post uses simulations to show how the LASSO can be used to forecast returns.

2. Using the LASSO

LASSO Definition. The LASSO is a penalized-regression technique that was introduced in Tibshirani (1996). It simultaneously identifies and estimates the most important coefficients using a far shorter sample period by betting on sparsity: it assumes that only a handful of variables actually matter at any point in time. Formally, using the LASSO means solving the problem below,

(2) $\begin{align*} \hat{\boldsymbol \vartheta} &= \underset{{\boldsymbol \vartheta} \in \mathbf{R}^Q}{\mathrm{arg}\,\mathrm{min}} \, \left\{ \, \frac{1}{2 \cdot T} \cdot \sum_{t=1}^T \left(r_t - \vartheta_0 - {\textstyle \sum_{q=1}^Q} \vartheta_q \cdot x_{q,t-1}\right)^2 + \lambda \cdot \sum_{q=1}^Q \left|\vartheta_q\right| \, \right\}, \end{align*}$

where $r_t$ is a stock’s return at time $t$ , $\hat{\boldsymbol \vartheta}$ is a $(Q \times 1)$ -dimensional vector of estimated coefficients, $x_{q,t-1}$ is the value of the $q$ th predictor at time $(t-1)$ , $T$ is the number of time periods in the sample, and $\lambda$ is a penalty parameter. Equation (2) looks complicated at first, but it’s not. It’s a simple extension of an OLS regression. In fact, if you ignore the right-most term (the penalty function, $\lambda \cdot \sum_q \left|\vartheta_q\right|$ ), then this optimization problem is simply an OLS regression.
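To make Equation (2) concrete, here is a minimal sketch of one common way to solve it, cyclic coordinate descent, written with NumPy only. The function names (`lasso_objective`, `lasso_fit`) and the fixed number of sweeps are my own illustrative choices, not something taken from the working paper or its code; the intercept is left unpenalized, as in Equation (2).

```python
import numpy as np

def lasso_objective(r, X, intercept, coefs, lam):
    """Value of the objective in Equation (2) at a candidate solution."""
    resid = r - intercept - X @ coefs
    return 0.5 * np.mean(resid ** 2) + lam * np.sum(np.abs(coefs))

def lasso_fit(r, X, lam, n_sweeps=200):
    """Cyclic coordinate descent for Equation (2); returns (intercept, coefs).

    Assumption: a fixed number of sweeps stands in for a convergence check.
    """
    T, Q = X.shape
    x_mean, r_mean = X.mean(axis=0), r.mean()
    Xc, rc = X - x_mean, r - r_mean      # centering absorbs the unpenalized intercept
    col_ms = (Xc ** 2).mean(axis=0)      # (1/T) * sum_t x_{q,t}^2 for each predictor
    coefs = np.zeros(Q)
    resid = rc.copy()                    # residual at the current coefficients
    for _ in range(n_sweeps):
        for q in range(Q):
            if col_ms[q] == 0.0:
                continue
            # Covariance of predictor q with the residual that excludes predictor q.
            rho = Xc[:, q] @ resid / T + col_ms[q] * coefs[q]
            # One-dimensional minimizer: soft-threshold at lam, then rescale.
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ms[q]
            resid -= Xc[:, q] * (new - coefs[q])
            coefs[q] = new
    return r_mean - x_mean @ coefs, coefs
```

Each inner step minimizes the objective exactly along one coordinate, so the objective never increases from one update to the next. Setting `lam=0` with fewer predictors than observations would recover the OLS fit in the limit.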

Penalty Function. But, it’s this penalty function that’s the secret to the LASSO’s success, allowing the estimator to give preferential treatment to the largest coefficients and completely ignore the smaller ones. To better understand how the LASSO does this, consider the solution to Equation (2) when the right-hand-side variables are uncorrelated and have unit variance:

(3) $\begin{align*} \hat{\vartheta}_q &= \mathrm{sgn}[\hat{\theta}_q] \cdot (|\hat{\theta}_q| - \lambda)_+. \end{align*}$

Here, $\hat{\theta}_q$ represents what the standard OLS coefficient would have been if we had an infinite amount of data, $\mathrm{sgn}[x] = \sfrac{x}{|x|}$ , and $(x)_+ = \max\{0,\,x\}$ . On one hand, this solution means that, if OLS would have estimated a large coefficient, $|\hat{\theta}_q| \gg \lambda$ , then the LASSO is going to deliver a similar estimate, $\hat{\vartheta}_q \approx \hat{\theta}_q$ . On the other hand, the solution implies that, if OLS would have estimated a sufficiently small coefficient, $|\hat{\theta}_q| < \lambda$ , then the LASSO is going to pick $\hat{\vartheta}_q = 0$ . Because the LASSO can set all but a handful of coefficients to zero, it can be used to identify the most important predictors even when the sample length is much shorter than the number of possible predictors, $T \ll Q$ . Morally speaking, if only $K \ll Q$ of the predictors have non-zero coefficients, then you should only need a few more than $K$ observations to select and then estimate the size of these few important coefficients.
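The soft-thresholding rule in Equation (3) fits in a few lines. The sketch below uses a hypothetical helper name, `soft_threshold`, and made-up coefficient values purely for illustration.

```python
def soft_threshold(theta_ols, lam):
    """Equation (3): shrink an OLS coefficient toward zero by lam, flooring at zero."""
    sign = (theta_ols > 0) - (theta_ols < 0)   # sgn[x], with sgn[0] taken to be 0
    return sign * max(abs(theta_ols) - lam, 0.0)

# Illustrative values only: one large, one small, and one negative coefficient.
for theta in (0.50, 0.02, -0.30):
    print(theta, soft_threshold(theta, lam=0.05))
```

With $\lambda = 0.05$, the large coefficient is shrunk only slightly, the small one is set exactly to zero, and the negative one keeps its sign while moving toward zero by $\lambda$.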

3. Simulation Analysis

I run $1,000$ simulations to show how to use the LASSO to forecast future returns. You can find all of the relevant code here.

Data Simulation. Each simulation involves generating returns for $Q = 100$ stocks for $T = 1,150$ periods. Each period, the returns of all $Q=100$ stocks are governed by the returns of a subset of $K=5$ stocks, $\mathcal{K}_t$ , together with an idiosyncratic shock,

(4) $\begin{align*} r_{q,t} &= 0.15 \cdot \sum_{q' \in \mathcal{K}_t} r_{q',t-1} + 0.001 \cdot \epsilon_{q,t}, \end{align*}$

where $\epsilon_{q,t} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,1)$ . This cast of $K = 5$ sparse signals changes over time, leading to the time subscript on $\mathcal{K}_t$ . Specifically, I assume that there is a $1{\scriptstyle \%}$ chance that each signal changes every period, so each signal lasts $\sfrac{(1 - 0.01)}{0.01} = 99$ periods on average.
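A minimal sketch of this data-generating process is below. It is not the code behind the post. The function name `simulate_returns`, the pure-noise first period, and the rule that a switching signal is replaced by a stock drawn uniformly from the currently inactive ones are all assumptions I make to fill in details that Equation (4) leaves open.

```python
import numpy as np

def simulate_returns(Q=100, T=1150, K=5, coef=0.15, noise_sd=0.001,
                     switch_prob=0.01, seed=0):
    """Simulate Equation (4); returns (returns, signals).

    returns[t, q] is stock q's return in row t; signals[t] holds the K stocks
    whose lagged returns drive row t (the set K_t).
    """
    rng = np.random.default_rng(seed)
    returns = np.zeros((T, Q))
    signals = np.zeros((T, K), dtype=int)
    active = rng.choice(Q, size=K, replace=False)
    returns[0] = noise_sd * rng.standard_normal(Q)   # assumption: first row is pure noise
    signals[0] = active
    for t in range(1, T):
        for k in range(K):
            if rng.random() < switch_prob:           # each signal switches with 1% probability
                candidates = np.setdiff1d(np.arange(Q), active)
                active[k] = rng.choice(candidates)   # assumption: uniform draw from inactive stocks
        signals[t] = active
        common = coef * returns[t - 1, active].sum() # same signal term for every stock
        returns[t] = common + noise_sd * rng.standard_normal(Q)
    return returns, signals

returns, signals = simulate_returns()
```

Because every stock loads on the same $K=5$ lagged returns with coefficient $0.15$, the common component follows an AR(1)-like recursion with persistence $5 \times 0.15 = 0.75$, so the simulated returns stay stationary.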

Fitting Models to the Data. For each period from $t=151$ to $t=1,150$ , I estimate the LASSO on the first stock, $q=1$ , as defined in Equation (2) using the previous $T=50$ periods of data where the $Q$ possible predictors are the $Q=100$ stocks. This means using $T=50$ time periods to estimate a model with $Q=100$ potential right-hand-side variables. As useful benchmarks, I also estimate the autoregressive model from Equation (1) and an oracle regression. In the oracle specification, I estimate an OLS regression with the $K=5$ true predictors as the right-hand-side variables. Obviously, in the real world you don’t know what the true predictors are, but this specification gives an estimate of the best fit you could achieve. After fitting each model to the previous $50$ periods of data, I then make an out-of-sample forecast in the $51$ st period.
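The rolling-window procedure can be sketched as follows for the two OLS benchmarks. Everything here is illustrative: the helper names (`ols_fit`, `rolling_forecasts`), the zero-based indexing convention, and the small fixed-signal toy data set are my assumptions. A LASSO solver that returns an `(intercept, coefs)` pair could be passed in as `fit`, with `columns_at` returning all $Q$ columns.

```python
import numpy as np

def ols_fit(r, X):
    """OLS with an intercept; returns (intercept, coefs)."""
    design = np.column_stack([np.ones(len(r)), X])
    beta, *_ = np.linalg.lstsq(design, r, rcond=None)
    return beta[0], beta[1:]

def rolling_forecasts(returns, target, columns_at, fit=ols_fit, window=50, start=150):
    """One-step-ahead forecasts of returns[t, target] for rows t >= start.

    Each model is fit on the `window` most recent (lagged predictor, return)
    pairs and then applied to the predictors observed in row t - 1.
    """
    T = returns.shape[0]
    forecasts = np.full(T, np.nan)
    for t in range(start, T):
        cols = np.asarray(columns_at(t))
        X = returns[t - window - 1:t - 1][:, cols]   # lagged predictors
        r = returns[t - window:t, target]            # returns they should predict
        intercept, coefs = fit(r, X)
        forecasts[t] = intercept + returns[t - 1, cols] @ coefs
    return forecasts

# Toy data (assumption): 20 stocks, a fixed set of 5 signals, unit-variance noise.
rng = np.random.default_rng(0)
Q, T, active = 20, 400, [3, 7, 11, 15, 19]
returns = np.zeros((T, Q))
returns[0] = rng.standard_normal(Q)
for t in range(1, T):
    returns[t] = 0.15 * returns[t - 1, active].sum() + rng.standard_normal(Q)

oracle = rolling_forecasts(returns, 0, lambda t: active)   # true predictors known
autoreg = rolling_forecasts(returns, 0, lambda t: [0])     # own lagged return only
```

In the full simulation the set of true predictors changes over time, so an oracle version would need to decide which set to use within each window. The sketch sidesteps that by holding the signals fixed.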

Forecasting Regressions. I then check how closely these forecasts line up with the realized returns of the first asset by analyzing the adjusted $R^2$ statistics from a bunch of forecasting regressions. For example, I take the LASSO’s return forecast in periods $t=151$ to $t=1,150$ and estimate the regression below,

(5) $\begin{align*} r_{1,t+1} &= \alpha + \beta \times \left( \frac{f_{1,t}^{\scriptscriptstyle \mathrm{LASSO}} - \mu^{\scriptscriptstyle \mathrm{LASSO}}}{\sigma^{\scriptscriptstyle \mathrm{LASSO}}} \right) + \varepsilon_{1,t+1}, \end{align*}$

where $\alpha$ and $\beta$ are estimated coefficients, $r_{1,t+1}$ denotes the first stock’s realized return in period $(t+1)$ , $f_{1,t}^{\scriptscriptstyle \mathrm{LASSO}}$ denotes the LASSO’s forecast of the first stock’s return in period $(t+1)$ , $\mu^{\scriptscriptstyle \mathrm{LASSO}}$ and $\sigma^{\scriptscriptstyle \mathrm{LASSO}}$ represent the mean and standard deviation of this out-of-sample forecast from period $t=151$ to $t=1,150$ , and $\varepsilon_{1,t+1}$ is the regression residual. The figure below shows that the average adjusted- $R^2$ statistic from $1,000$ simulations is $4.40{\scriptstyle \%}$ for the LASSO, whereas this statistic is only $1.29{\scriptstyle \%}$ when making your return forecasts using an autoregressive model,

(6) $\begin{align*} r_{1,t+1} &= \alpha + \beta \times \left( \frac{f_{1,t}^{\scriptscriptstyle \mathrm{OLS}} - \mu^{\scriptscriptstyle \mathrm{OLS}}}{\sigma^{\scriptscriptstyle \mathrm{OLS}}} \right) + \varepsilon_{1,t+1}. \end{align*}$
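The forecasting regressions in Equations (5) and (6) share the same form, so a single helper covers both. The sketch below assumes the realized returns and the forecasts have already been aligned, so that element $i$ of `forecast` is the forecast of element $i$ of `realized`. The name `forecasting_adj_r2` is hypothetical.

```python
import numpy as np

def forecasting_adj_r2(realized, forecast):
    """Regress realized returns on a standardized forecast.

    Returns (alpha, beta, adjusted R^2) for a regression with one slope
    coefficient, as in Equations (5) and (6).
    """
    realized = np.asarray(realized, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    # Standardize the forecast with its own mean and standard deviation.
    z = (forecast - forecast.mean()) / forecast.std(ddof=1)
    design = np.column_stack([np.ones_like(z), z])
    (alpha, beta), *_ = np.linalg.lstsq(design, realized, rcond=None)
    resid = realized - alpha - beta * z
    ssr = np.sum(resid ** 2)
    sst = np.sum((realized - realized.mean()) ** 2)
    n = len(realized)
    r2 = 1.0 - ssr / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)   # one regressor plus an intercept
    return alpha, beta, adj_r2
```

Standardizing the forecast does not change the $R^2$; it only rescales $\beta$ so that it measures the change in realized returns per one-standard-deviation change in the forecast.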

4. Tuning Parameter

Penalty Parameter Choice. Fitting the LASSO to the data involves selecting a penalty parameter, $\lambda$ . I do this by selecting the penalty parameter that has the highest out-of-sample forecasting $R^2$ during the first $100$ periods of the data. This is why the forecasting regressions above only use data starting at $t=151$ instead of $t=51$ . The figure below shows the distribution of penalty parameter choices across the $1,000$ simulations. The discrete $0.0005$ jumps come from the discrete grid of possible $\lambda$ s that I considered when running the code.
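Here is a rough sketch of this kind of grid search on a burn-in sample. It is not the original code. The helper names, the toy data, the grid values, and the definition of out-of-sample $R^2$ as one minus the ratio of squared forecast errors to the variance of realized returns are all assumptions on my part.

```python
import numpy as np

def lasso_fit(r, X, lam, n_sweeps=25):
    """Coordinate descent for Equation (2); returns (intercept, coefs)."""
    T, Q = X.shape
    x_mean, r_mean = X.mean(axis=0), r.mean()
    Xc, resid = X - x_mean, r - r_mean
    col_ms = (Xc ** 2).mean(axis=0)
    coefs = np.zeros(Q)
    for _ in range(n_sweeps):
        for q in range(Q):
            if col_ms[q] == 0.0:
                continue
            rho = Xc[:, q] @ resid / T + col_ms[q] * coefs[q]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ms[q]
            resid = resid - Xc[:, q] * (new - coefs[q])
            coefs[q] = new
    return r_mean - x_mean @ coefs, coefs

def oos_r2(returns, target, lam, window=50, burn_in=150):
    """Out-of-sample R^2 of rolling LASSO forecasts within the burn-in rows."""
    rows = range(window + 1, burn_in)
    forecasts = []
    for t in rows:
        X = returns[t - window - 1:t - 1]        # all stocks, lagged one period
        r = returns[t - window:t, target]
        intercept, coefs = lasso_fit(r, X, lam)
        forecasts.append(intercept + returns[t - 1] @ coefs)
    realized = returns[list(rows), target]
    sse = np.sum((realized - np.array(forecasts)) ** 2)
    sst = np.sum((realized - realized.mean()) ** 2)
    return 1.0 - sse / sst

def select_penalty(returns, target, grid):
    """Pick the grid value with the highest burn-in out-of-sample R^2."""
    scores = {lam: oos_r2(returns, target, lam) for lam in grid}
    return max(scores, key=scores.get), scores

# Toy data (assumption): 20 stocks, fixed signals, unit-variance noise.
rng = np.random.default_rng(0)
Q, T, active = 20, 150, [3, 7, 11, 15, 19]
returns = np.zeros((T, Q))
returns[0] = rng.standard_normal(Q)
for t in range(1, T):
    returns[t] = 0.15 * returns[t - 1, active].sum() + rng.standard_normal(Q)

grid = [0.02, 0.05, 0.10, 0.20]                  # hypothetical grid
best_lam, scores = select_penalty(returns, target=0, grid=grid)
```

The appropriate grid depends on the scale of the returns, so the values above are only meaningful for this unit-variance toy data set.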

Number of Predictors. Finally, if you look at the panel labeled “Oracle” in the adjusted $R^2$ figure, you’ll notice that the LASSO’s out-of-sample forecasting power is about a third of the true model’s forecasting power, $\sfrac{4.40}{12.84} = 0.34$ . This is because the LASSO doesn’t do a perfect job of picking out the $K=5$ sparse signals. The right panel of the figure below shows that the LASSO usually only picks out the most important of these $K=5$ signals. What’s more, the left panel shows that the LASSO also locks onto lots of spurious signals. This result suggests that you might be able to improve the LASSO’s forecasting power by choosing a higher penalty parameter, $\lambda$ .

5. When Does It Fail?

Placebo Tests. I conclude this post by looking at two alternative simulations where the LASSO shouldn’t add any forecasting power. In the first alternative setting, there are no shocks. That is, the returns for the $Q=100$ stocks are simulated using the model below,

(7) $\begin{align*} r_{q,t} &= 0.00 \cdot \sum_{q' \in \mathcal{K}_t} r_{q',t-1} + \sigma \cdot \epsilon_{q,t}. \end{align*}$

In the second setting, there are too many shocks: $K =75$ . The figures below show that, in both these settings, the LASSO doesn’t add any forecasting power. Thus, running these simulations offers a pair of nice placebo tests showing that the LASSO really is picking up sparse signals in the cross-section of returns.