Intuition Behind the Bayesian LASSO

1. Motivating Question

Imagine you’ve just seen Apple’s most recent return, $r$ , which is Apple’s long-run expected return, $\mu^\star$ , plus some random noise, $\epsilon \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0, \, 1)$ :

(1) $\begin{align*} r &= \mu^\star + \epsilon. \end{align*}$

You want to use this realized return, $r$ , to estimate Apple’s long-run expected return, $\mu^\star$ . The LASSO is a popular way to solve this problem. The LASSO estimates Apple’s long-run expected return, $\mu^\star$ , by choosing a $\hat{\mu}$ that’s as close as possible to the realized $r$ while taking into account an absolute-value penalty,

(2) $\begin{align*} \hat{\mu}(r) = \arg \min_{\mu \in \mathrm{R}} \left\{ \, {\textstyle \frac{1}{2}} \cdot (r - \mu)^2 + \lambda \cdot |\mu| \, \right\}, \end{align*}$

where $\lambda \geq 0$ is the strength of this penalty. If you use the LASSO, then you’ll estimate:

(3) $\begin{align*} \hat{\mu}(r) = \begin{cases} \mathrm{Sign}(r) \cdot (|r| - \lambda) &\text{if } |r| > \lambda, \text{ and} \\ 0 &\text{if } |r| \leq \lambda. \end{cases} \end{align*}$

Suppose that you chose $\lambda = 1.0{\scriptstyle \%}$ . If Apple’s most recent stock return was $r = 0.3{\scriptstyle \%}$ , then the LASSO will pick $\hat{\mu} = 0{\scriptstyle \%}$ . And, if Apple’s most recent stock return was $r = -0.7{\scriptstyle \%}$ , then the LASSO will still pick $\hat{\mu} = 0{\scriptstyle \%}$ . But, if Apple’s most recent stock return was $r = 1.2{\scriptstyle \%}$ , then the LASSO will give an estimate of $\hat{\mu} = 0.2{\scriptstyle \%}$ .
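The soft-thresholding rule in Equation (3) is short enough to check by hand. Here is a minimal sketch that reproduces the three examples above; the function name `lasso_estimate` is just an illustrative choice, and all numbers are in percent.

```python
def lasso_estimate(r, lam):
    """Soft-thresholding rule from Equation (3): shrink r toward zero by lam."""
    if abs(r) <= lam:
        return 0.0  # dead zone: returns no larger than lam in magnitude are ignored
    sign = 1.0 if r > 0 else -1.0
    return sign * (abs(r) - lam)


# Returns and penalty are in percent, matching the example in the text.
for r in (0.3, -0.7, 1.2):
    print(f"r = {r:+.1f}%  ->  mu_hat = {lasso_estimate(r, lam=1.0):.1f}%")
```

The first two returns fall inside the dead zone and map to zero, while the third is pulled toward zero by exactly $\lambda$.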

The LASSO seems like it’s throwing away lots of information. In the example above, you didn’t adjust your estimate of Apple’s long-run expected return at all when you saw returns of $0.3{\scriptstyle \%}$ and $-0.7{\scriptstyle \%}$ . So, it’s surprising that, if Apple’s long-run expected return, $\mu^\star$ , was drawn from a Laplace distribution,

(4) $\begin{align*} \mathrm{Pr}( \mu^\star = \mu ) = {\textstyle \frac{\lambda}{2}} \cdot e^{- \lambda \cdot |\mu|}, \end{align*}$

then using the LASSO to estimate $\mu^\star$ would be the Bayesian thing to do (Park and Casella, 2008). If $\mu^\star \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{Laplace}(\lambda = 1.0{\scriptstyle \%})$ , then it’s correct to just ignore any return smaller than $1.0{\scriptstyle \%}$ in absolute value when estimating $\mu^\star$ .
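Mechanically, the link comes from Bayes’ rule. As a sketch, assuming the noise has unit variance in the same units as $r$ (as in Equation (1)): multiply the Normal likelihood of $r$ by the Laplace prior in Equation (4), take the negative log, and collect every term that doesn’t involve $\mu$ into a constant,

$\begin{align*} - \log \mathrm{Pr}(\mu^\star = \mu|r) &= - \log \mathrm{Pr}(r|\mu) - \log \mathrm{Pr}(\mu) + \text{const} \\ &= {\textstyle \frac{1}{2}} \cdot (r - \mu)^2 + \lambda \cdot |\mu| + \text{const}. \end{align*}$

Minimizing this expression over $\mu$ is the same problem as Equation (2). That settles the algebra, but it doesn’t give much intuition.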

Why is this? If you cross your eyes and squint, you can sort of see why the Laplace distribution might be linked to the LASSO. Both use the Greek-letter $\lambda$ and involve $|\mu|$ . But, lots of distributions use the absolute-value operator (e.g., the Wishart distribution). And, there are lots of Greek letters. That’s how letters work. I could just as easily have called the scale parameter in the Laplace distribution $\alpha$ , $\beta$ , or $\gamma$ instead of $\lambda$ . So, what’s special about the Laplace distribution? What is it about the Laplace distribution that makes using the LASSO correct? How can it ever be Bayesian to throw information away?

2. Simpler Problem

To answer these questions, let’s start by looking at a simpler inference problem. Suppose that Apple’s long-run expected return is drawn from a Normal distribution, $\mu^\star \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0, \, \sigma_\mu^2)$ :

(5) $\begin{align*} \mathrm{Pr}(\mu^\star = \mu) &= {\textstyle \frac{1}{\sigma_\mu \cdot \sqrt{2 \cdot \pi}}} \cdot e^{- \frac{1}{2 \cdot \sigma_\mu^2} \cdot (\mu - 0)^2}. \end{align*}$

If $\mu^\star$ is drawn from a Normal distribution, then you definitely don’t want to use the LASSO.

Bayes’ rule tells you that:

(6) $\begin{align*} \mathrm{Pr}(\mu^\star = \mu|r) &\propto \mathrm{Pr}(r|\mu) \times \mathrm{Pr}(\mu) \\ &= \left\{ \, {\textstyle \frac{1}{\sqrt{2 \cdot \pi}}} \cdot e^{- \frac{1}{2} \cdot (r - \mu)^2} \, \right\} \times \left\{ \, {\textstyle \frac{1}{\sigma_\mu \cdot \sqrt{2 \cdot \pi}}} \cdot e^{- \frac{1}{2 \cdot \sigma_\mu^2} \cdot (\mu - 0)^2} \, \right\}. \end{align*}$

$\mathrm{Pr}(\mu^\star = \mu|r)$ is the posterior likelihood that Apple’s long-run expected return is $\mu$ given that you’ve just seen a realized return of $r$ . $\mathrm{Pr}(r|\mu)$ is the probability that Apple realizes a return of $r$ if its long-run expected return is $\mu$ . And, $\mathrm{Pr}(\mu)$ is the probability that Apple’s long-run expected return is $\mu^\star = \mu$ in the first place.

You want to choose the $\hat{\mu}$ that maximizes this posterior likelihood $\mathrm{Pr}(\mu^\star = \mu|r)$ , or equivalently, that minimizes the negative of the log of this posterior likelihood:

(7) $\begin{align*} \hat{\mu}(r) = \arg \min_{\mu \in \mathrm{R}} \left\{ \, (r - \mu)^2 + (\sfrac{1}{\sigma_\mu^2}) \cdot (\mu - 0)^2 \, \right\}. \end{align*}$

When Apple’s long-run expected return is drawn from a Normal distribution, you want to choose a $\hat{\mu}$ that’s as close as possible to $r$ while taking into account a quadratic penalty rather than an absolute-value penalty. When $\mu^\star$ is drawn from a Normal distribution, you’re never going to ignore small realized returns.

On one hand, you could pick a $\hat{\mu}$ that’s really close to Apple’s recent return to make $(r - \hat{\mu})^2$ small. On the other hand, you could pick a $\hat{\mu}$ close to $0$ to make $(\sfrac{1}{\sigma_\mu^2}) \cdot (\hat{\mu} - 0)^2$ small. Your priors determine what you do:

(8) $\begin{align*} \hat{\mu}(r) = \left( {\textstyle \frac{\sigma_\mu^2}{1.0{\scriptstyle \%}^2 + \sigma_\mu^2}} \right) \cdot r. \end{align*}$
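This expression comes from the first-order condition of the quadratic problem. As a sketch, write the noise standard deviation out explicitly as $\sigma_\epsilon$ (equal to $1.0{\scriptstyle \%}$ here), so that the objective is $(\sfrac{1}{\sigma_\epsilon^2}) \cdot (r - \mu)^2 + (\sfrac{1}{\sigma_\mu^2}) \cdot \mu^2$ . Setting its derivative with respect to $\mu$ equal to zero gives:

$\begin{align*} 0 = - {\textstyle \frac{2}{\sigma_\epsilon^2}} \cdot (r - \hat{\mu}) + {\textstyle \frac{2}{\sigma_\mu^2}} \cdot \hat{\mu} \quad \Rightarrow \quad \hat{\mu}(r) = \left( {\textstyle \frac{\sigma_\mu^2}{\sigma_\epsilon^2 + \sigma_\mu^2}} \right) \cdot r. \end{align*}$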

If you don’t have very strong priors about Apple’s long-run expected return ( $\sigma_\mu \gg 1.0{\scriptstyle \%}$ ), then you’re going to pick $\hat{\mu} \approx r$ since $\sfrac{\sigma_\mu^2}{(1.0{\scriptstyle \%}^2 + \sigma_\mu^2)} \approx 1$ . By contrast, if you have very strong priors ( $\sigma_\mu \ll 1.0{\scriptstyle \%}$ ), then you’re going to pick $\hat{\mu} \approx 0{\scriptstyle \%}$ since $\sfrac{\sigma_\mu^2}{(1.0{\scriptstyle \%}^2 + \sigma_\mu^2)} \approx 0$ . To illustrate, suppose that you’re really sure that Apple’s long-run expected return is close to $0{\scriptstyle \%}$ with $\sigma_{\mu} = 0.1{\scriptstyle \%}$ . Then, if you see Apple realize a return of $r = 6.0{\scriptstyle \%}$ , you’re going to think that this realization was probably due to a positive random shock, $\epsilon \approx 5.94{\scriptstyle \%}$ , and only pick $\hat{\mu} \approx 0.06{\scriptstyle \%}$ .
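To make the shrinkage in Equation (8) concrete, here is a small sketch that evaluates it for a weak prior and a strong prior. The function name `normal_prior_estimate` and the weak-prior value of $10{\scriptstyle \%}$ are illustrative choices; everything is in percent.

```python
def normal_prior_estimate(r, sigma_mu, sigma_eps=1.0):
    """Equation (8): scale r by sigma_mu^2 / (sigma_eps^2 + sigma_mu^2)."""
    weight = sigma_mu**2 / (sigma_eps**2 + sigma_mu**2)
    return weight * r


# A realized return of 6% under a weak prior and under a strong prior.
weak = normal_prior_estimate(6.0, sigma_mu=10.0)   # weight = 100/101
strong = normal_prior_estimate(6.0, sigma_mu=0.1)  # weight = 1/101
print(f"weak prior:   mu_hat = {weak:.4f}%")
print(f"strong prior: mu_hat = {strong:.4f}%")
```

With the weak prior you keep almost all of the realized return. With the strong prior you keep about one percent of it.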

3. Mixture Model

Now, let’s tweak the setup slightly. Suppose that, instead of being constant, the standard deviation of Apple’s long-run expected return can be either high or low,

(9) $\begin{align*} \overline{\sigma}_{\mu} \gg \sigma_{\epsilon} = 1.0{\scriptstyle \%} \gg \underline{\sigma}_{\mu}, \end{align*}$

with the high value much larger than $\sigma_{\epsilon} = 1.0{\scriptstyle \%}$ and the low value much smaller than $\sigma_{\epsilon} = 1.0{\scriptstyle \%}$ . Each case is equally likely: $\mathrm{Pr}(\sigma_\mu = \overline{\sigma}_{\mu} ) = \mathrm{Pr}( \sigma_\mu = \underline{\sigma}_{\mu} ) = \sfrac{1}{2}$ . It turns out that you’re going to behave a lot like someone using the LASSO when you estimate Apple’s long-run expected return in this mixture model.

Regardless of the model, if you want to estimate Apple’s long-run expected return, then you have to use Bayes’ rule. And, just like before, Bayes’ rule tells you that:

(10) $\begin{align*} \mathrm{Pr}(\mu^\star = \mu|r) \propto \mathrm{Pr}(r|\mu) \times \mathrm{Pr}(\mu). \end{align*}$

But, now there’s an extra layer to the problem. The standard deviation of Apple’s long-run expected return can either be high or low,

(11) $\begin{align*} \mathrm{Pr}(\mu) = {\textstyle \frac{1}{2}} \cdot \mathrm{Pr}(\mu|\sigma_\mu = \overline{\sigma}_\mu) + {\textstyle \frac{1}{2}} \cdot \mathrm{Pr}(\mu|\sigma_\mu = \underline{\sigma}_\mu). \end{align*}$

You don’t know which it is. But, if you knew that $\sigma_{\mu} = \overline{\sigma}_{\mu} = 10{\scriptstyle \%} \gg 1.0{\scriptstyle \%}$ , then you’d pick $\hat{\mu} = (\sfrac{100}{101}) \cdot r$ . Whereas, if you knew that $\sigma_{\mu} = \underline{\sigma}_{\mu} = 0.10{\scriptstyle \%} \ll 1.0{\scriptstyle \%}$ , then you’d pick $\hat{\mu} = (\sfrac{1}{101}) \cdot r$ . Your estimate when $\sigma_\mu = \overline{\sigma}_{\mu}$ is going to be really different from your estimate when $\sigma_\mu = \underline{\sigma}_{\mu}$ .

Let’s flesh out what this means. You want to estimate Apple’s long-run expected return, $\mu^\star$ , by choosing the $\hat{\mu}$ that maximizes the posterior likelihood $\mathrm{Pr}(\mu^\star = \mu|r)$ ,

(12) $\begin{align*} \hat{\mu}(r) = \arg \max_{\mu \in \mathrm{R}} \left\{ \, {\textstyle \frac{1}{\sqrt{2 \cdot \pi}}} \cdot e^{- \frac{1}{2} \cdot (r - \mu)^2} \, \right\} \times \left\{ \, {\textstyle \frac{1}{2}} \cdot {\textstyle \frac{1}{\overline{\sigma}_\mu \cdot \sqrt{2 \cdot \pi}}} \cdot e^{- \frac{1}{2 \cdot \overline{\sigma}_\mu^2} \cdot (\mu - 0)^2} + {\textstyle \frac{1}{2}} \cdot {\textstyle \frac{1}{\underline{\sigma}_\mu \cdot \sqrt{2 \cdot \pi}}} \cdot e^{- \frac{1}{2 \cdot \underline{\sigma}_\mu^2} \cdot (\mu - 0)^2} \, \right\}. \end{align*}$

It’s hard to solve for $\hat{\mu}(r)$ analytically when $\overline{\sigma}_{\mu}$ and $\underline{\sigma}_{\mu}$ can take on arbitrary values, but the assumption that $\overline{\sigma}_{\mu} \gg 1.0{\scriptstyle \%} \gg \underline{\sigma}_{\mu}$ simplifies things nicely. And, the resulting analysis reveals why you’re going to do something LASSO-esque when learning about Apple’s long-run expected return in this mixture model.

There are $2$ cases. First, consider the case where Apple realizes a really big return, $|r| \gg 1.0{\scriptstyle \%}$ . This really big return would be really unlikely if $\sigma_\mu = \underline{\sigma}_\mu$ because $\underline{\sigma}_\mu \ll 1.0{\scriptstyle \%}$ is really small. So, you can safely assume that $\sigma_\mu = \overline{\sigma}_{\mu}$ and just solve the optimization problem from Section 2:

(13) $\begin{align*} \hat{\mu}(r) = \arg \min_{\mu \in \mathrm{R}} \left\{ \, (r - \mu)^2 + (\sfrac{1}{\overline{\sigma}_\mu^2}) \cdot (\mu - 0)^2 \, \right\}. \end{align*}$

But, we saw in Section 2 that, if your priors are really weak ( $\overline{\sigma}_\mu \gg 1.0{\scriptstyle \%}$ ), then you should ignore them since $\sfrac{\overline{\sigma}_\mu^2}{(1.0{\scriptstyle \%}^2 + \overline{\sigma}_\mu^2)} \approx 1$ . So, you’re going to set $\hat{\mu}(r) \approx r$ whenever $|r| \gg 1.0{\scriptstyle \%}$ , just like someone using the LASSO.

Now, consider the other case where Apple realizes a really small return, $|r| \ll 1.0{\scriptstyle \%}$ . Again, this really small return would be really unlikely if $\sigma_\mu = \overline{\sigma}_\mu$ because $\overline{\sigma}_\mu \gg 1.0{\scriptstyle \%}$ is really big. So, you can assume that $\sigma_\mu = \underline{\sigma}_{\mu}$ and just solve the optimization problem:

(14) $\begin{align*} \hat{\mu}(r) = \arg \min_{\mu \in \mathrm{R}} \left\{ \, (r - \mu)^2 + (\sfrac{1}{\underline{\sigma}_\mu^2}) \cdot (\mu - 0)^2 \, \right\}. \end{align*}$

But, now the opposite logic holds. If your priors are really strong ( $\underline{\sigma}_\mu \ll 1.0{\scriptstyle \%}$ ), then you should ignore $r$ since $\sfrac{\underline{\sigma}_\mu^2}{(1.0{\scriptstyle \%}^2 + \underline{\sigma}_\mu^2)} \approx 0$ . So, you’re going to set $\hat{\mu}(r) \approx 0$ whenever $|r| \ll 1.0{\scriptstyle \%}$ . This is the LASSO’s dead zone!

The figure below shows that, as the high and low standard deviations get more extreme, you’re going to behave more and more like someone using the LASSO when learning about Apple’s long-run expected return in this mixture model. But, the insight is more general than that. You’re going to behave like someone using the LASSO any time a small realized return, $r$ , tells you that you should be using stronger priors about Apple’s long-run expected return, $\mu^\star$ .
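The two cases can also be checked numerically. Here is a minimal sketch, assuming `numpy` is available, that maximizes the posterior in Equation (12) by brute-force grid search. The function names, the grid size, and the values $\overline{\sigma}_\mu = 10{\scriptstyle \%}$ and $\underline{\sigma}_\mu = 0.10{\scriptstyle \%}$ are illustrative choices rather than anything taken from the original figure code.

```python
import numpy as np


def normal_pdf(x, sigma):
    """Density of a mean-zero Normal with standard deviation sigma."""
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def mixture_map_estimate(r, sigma_hi=10.0, sigma_lo=0.1, sigma_eps=1.0):
    """Grid-search maximizer of the posterior in Equation (12), all in percent."""
    grid = np.linspace(-abs(r) - 1.0, abs(r) + 1.0, 200_001)
    likelihood = normal_pdf(r - grid, sigma_eps)
    # Equal-weight mixture of a diffuse prior and a tight prior, as in Equation (11).
    prior = 0.5 * normal_pdf(grid, sigma_hi) + 0.5 * normal_pdf(grid, sigma_lo)
    return float(grid[np.argmax(likelihood * prior)])


for r in (0.3, -0.7, 6.0):
    print(f"r = {r:+.1f}%  ->  mu_hat = {mixture_map_estimate(r):+.3f}%")
```

Small returns get an estimate that is essentially zero, while a large return gets an estimate close to $r$ itself. The switch between the two regimes is abrupt in this two-point mixture, so the match to the LASSO is qualitative rather than exact.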

4. Laplace Distribution

If Apple’s long-run expected return is drawn from a Laplace distribution, then you face an estimation problem just like the one in the mixture model above. Andrews and Mallows (1974) show that a Laplace distribution can be re-written as the weighted average of Normal distributions with different variances,

(15) $\begin{align*} {\textstyle \frac{\lambda}{2}} \cdot e^{- \lambda \cdot |\mu|} = \int_0^\infty \, \left\{ \, {\textstyle \frac{1}{\sigma_\mu \cdot \sqrt{2 \cdot \pi}}} \cdot e^{- \frac{1}{2 \cdot \sigma_{\mu}^2} \cdot (\mu - 0)^2} \, \right\} \times \left\{ \, {\textstyle \frac{\lambda^2}{2}} \cdot e^{- \frac{\lambda^2}{2} \cdot \sigma_{\mu}^2} \, \right\} \times \mathrm{d}\sigma_\mu^2, \end{align*}$

where the weights on the variance, $\sigma_\mu^2$ , follow an Exponential distribution with rate $\sfrac{\lambda^2}{2}$ . The Exponential distribution spreads its mass widely, with lots of weight near zero and a long right tail. If the variance of Apple’s long-run expected return is distributed $\sigma_{\mu}^2 \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{Exponential}(\sfrac{\lambda^2}{2})$ , then the corresponding standard deviations could be either really large or really small. We just saw that this is exactly what needs to happen for a LASSO-like estimation strategy to be optimal. There are lots of distributions for $\sigma_{\mu}$ that have this property; we just saw another one above. But, if you use $\sigma_{\mu}^2 \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{Exponential}(\sfrac{\lambda^2}{2})$ , then the probabilities of realizing large and small values of $\sigma_\mu$ line up in such a way that it’s precisely optimal to use the LASSO.
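The Andrews and Mallows identity is easy to sanity-check numerically. Here is a minimal sketch, assuming `numpy`, with the mixing weights placed on the variance $\sigma_\mu^2$ at rate $\sfrac{\lambda^2}{2}$ . The function names, the grid size, and the truncation point are illustrative choices. The integral is taken over the standard deviation after the change of variables $\mathrm{d}\sigma_\mu^2 = 2 \cdot \sigma_\mu \cdot \mathrm{d}\sigma_\mu$ , which keeps the integrand bounded near zero.

```python
import numpy as np


def laplace_pdf(mu, lam):
    """Laplace density from Equation (4)."""
    return 0.5 * lam * np.exp(-lam * abs(mu))


def scale_mixture_pdf(mu, lam, n_grid=200_001):
    """Mix Normal(0, s) densities over s with Exponential(rate = lam^2 / 2) weights."""
    # With s = sigma^2 and ds = 2 * sigma * dsigma, the 1/sigma in the Normal
    # density cancels, leaving a smooth integrand in sigma.
    sigma = np.linspace(1e-8, 12.0 / lam, n_grid)
    integrand = (lam**2 / np.sqrt(2.0 * np.pi)) * np.exp(
        -0.5 * (mu / sigma) ** 2 - 0.5 * (lam * sigma) ** 2
    )
    # Trapezoid rule written out by hand to avoid version-specific numpy helpers.
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(sigma)))


for mu in (0.0, 0.5, -2.0):
    print(f"mu = {mu:+.1f}: mixture = {scale_mixture_pdf(mu, 1.0):.6f}, "
          f"Laplace = {laplace_pdf(mu, 1.0):.6f}")
```

The two columns should agree up to numerical integration error.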

In the original paper, there are a ton of extra hyper-parameters. For example, $\sigma_{\epsilon}$ is a random variable. This clearly isn’t necessary. You just need the standard deviation of Apple’s long-run expected return to fluctuate wildly around $\sigma_{\epsilon}$ . You can get a situation where the LASSO is really close to being optimal with just $\overline{\sigma}_{\mu} \gg \sigma_{\epsilon} \gg \underline{\sigma}_{\mu}$ .

Also, in the original paper, there’s a lengthy discussion about properly “conditioning on $\sigma_{\epsilon}$ .” The authors include this bizarre example of how the posterior distribution of $\hat{\mu}(r)$ might not be unimodal if you don’t condition on $\sigma_{\epsilon}$ that, for me anyways, always seems to come out of left field. And, textbooks typically brush this point under the rug, calling it a technical condition. But, the analysis above shows that it’s not just a technical condition. It’s actually really important!

To see why, consider estimating Apple’s long-run expected return in a mixture model with

(16) $\begin{align*} \overline{\sigma}_{\mu} = 10{\scriptstyle \%} \gg \sigma_{\epsilon} = \underline{\sigma}_{\mu} = 0.10{\scriptstyle \%}. \end{align*}$

The only difference from before is that $\sigma_{\epsilon} = 0.10{\scriptstyle \%}$ instead of $\sigma_{\epsilon} = 1.0{\scriptstyle \%}$ . If $\sigma_{\epsilon}$ isn’t sufficiently large relative to $\underline{\sigma}_{\mu}$ , then you’re never going to ignore Apple’s realized return when $|r|$ is small. With these new numbers, $\hat{\mu}(0.50{\scriptstyle \%}) = \sfrac{0.10{\scriptstyle \%}^2}{(0.10{\scriptstyle \%}^2 + 0.10{\scriptstyle \%}^2)} \cdot 0.50{\scriptstyle \%} = 0.25{\scriptstyle \%}$ rather than roughly $0.005{\scriptstyle \%}$ . When choosing a distribution for $\sigma_\mu$ , you’ve got to make sure that the high standard-deviation outcomes are big enough and the low standard-deviation outcomes are small enough relative to $\sigma_{\epsilon}$ . Otherwise, a LASSO-like estimation strategy can’t be optimal.
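The arithmetic behind this comparison can be sketched in a few lines. The helper name `shrinkage_weight` is an illustrative choice, all quantities are in percent, and the calculation conditions on the low standard-deviation state, as in the small-return case of Section 3.

```python
def shrinkage_weight(sigma_mu, sigma_eps):
    """Fraction of the realized return kept under a Normal prior with known sigma_mu."""
    return sigma_mu**2 / (sigma_eps**2 + sigma_mu**2)


r = 0.50
before = shrinkage_weight(0.10, 1.0) * r   # sigma_eps = 1.0%: estimate is nearly zero
after = shrinkage_weight(0.10, 0.10) * r   # sigma_eps = 0.10%: half of r survives
print(f"sigma_eps = 1.00%: mu_hat = {before:.4f}%")
print(f"sigma_eps = 0.10%: mu_hat = {after:.4f}%")
```

Once the noise is no larger than the tight prior, half of a small return passes straight through to the estimate, so there’s no dead zone left.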

FYI: Here’s the code to create the figures.