Imagine that we’re trying to predict the cross-section of expected returns, and we’ve got a sneaking suspicion that $x_n$ might be a good predictor. So, we regress today’s returns on $x_n$ to see if our hunch is right,

$$r_n = \hat{\beta} \cdot x_n + \hat{\epsilon}_n.$$
The logic is straightforward. If $x_n$ explains enough of the variation in today’s returns, then $x_n$ must be a good predictor and we should include it in our model of tomorrow’s returns, $\mathrm{E}[r_n'] = \hat{\beta} \cdot x_n$.
But, how much variation is “enough variation”? After all, even if $x_n$ doesn’t actually predict tomorrow’s returns, we’re still going to fit today’s returns better if we use an additional right-hand-side variable,

$$\frac{1}{N} \sum_{n=1}^N \hat{\epsilon}_n^2 \, \Big|_{\text{with } x_n} \;\leq\; \frac{1}{N} \sum_{n=1}^N \hat{\epsilon}_n^2 \, \Big|_{\text{without } x_n}.$$
The effect is mechanical. If we want to explain all of the variation in today’s returns, then all we have to do is include $N$ right-hand-side variables in our OLS regression. With $N$ linearly independent right-hand-side variables we can always perfectly fit the returns of $N$ stocks, no matter what variables we choose.
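This mechanical effect is easy to verify. Here’s a minimal sketch (the simulation setup and names are my own): regress $N$ pure-noise returns on $N$ pure-noise regressors, and the in-sample fit is perfect.

```python
import numpy as np

# With N linearly independent right-hand-side variables, OLS fits
# N stock returns exactly, even when everything is pure noise.
rng = np.random.default_rng(0)

N = 50                        # number of stocks
r = rng.normal(size=N)        # today's returns: pure noise
X = rng.normal(size=(N, N))   # N arbitrary regressors (a.s. linearly independent)

beta_hat, *_ = np.linalg.lstsq(X, r, rcond=None)
residuals = r - X @ beta_hat

print(np.max(np.abs(residuals)))   # ~0: a perfect in-sample fit
```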
The Bayesian information criterion (BIC) tells us that we should include $x_n$ as a right-hand-side variable only if it explains at least $\log(N)/N$ of the residual variation,

$$1 - \frac{\sum_{n=1}^N \hat{\epsilon}_n^2}{\sum_{n=1}^N r_n^2} \;\geq\; \frac{\log(N)}{N}.$$
But, where does this $\log(N)/N$ penalty come from? And, why is following this rule the Bayesian thing to do? Bayesian updating involves learning about a parameter value by combining prior beliefs with evidence from realized data. So, what parameter are we learning about when using the Bayesian information criterion? And, what are our prior beliefs? These questions are the topic of today’s post.
Instead of diving directly into our predictor-selection problem (should we include $x_n$ in our model?), let’s pause for a second and solve our parameter-estimation problem (how should we estimate the coefficient on $x_n$?). Suppose the data-generating process for returns is

$$r_n = \beta \cdot x_n + \epsilon_n,$$
where $\beta \sim \mathrm{N}(0, \sigma_\beta^2)$, $\epsilon_n \sim \mathrm{N}(0, \sigma_\epsilon^2)$, and $x_n$ is normalized so that $\frac{1}{N} \sum_{n=1}^N x_n^2 = 1$. For simplicity, let’s also assume that $\sigma_\epsilon = 1$ in the analysis below.
If we see $N$ returns from this data-generating process, $\mathbf{r} = (r_1, \ldots, r_N)$, then we can estimate $\beta$ by choosing the parameter value that would maximize the posterior probability of realizing these returns:

$$\hat{\beta} = \arg\max_{\beta} \, \Pr(\beta \mid \mathbf{r}) = \arg\min_{\beta} \, \underbrace{-\tfrac{1}{N} \cdot \log\!\big[\Pr(\mathbf{r} \mid \beta) \cdot \Pr(\beta)\big]}_{L(\beta)}.$$
This is known as maximum a posteriori (MAP) estimation, and the second equality in the expression above points out how we can either maximize the posterior probability or minimize the negative log of this probability,

$$L(\beta) = -\tfrac{1}{N} \cdot \log\!\big[\Pr(\mathbf{r} \mid \beta) \cdot \Pr(\beta)\big].$$
We can think about $L(\beta)$ as the average improbability of the realized returns given $\beta$.
So, what is the answer? Because $\epsilon_n \sim \mathrm{N}(0, 1)$ and $\beta \sim \mathrm{N}(0, \sigma_\beta^2)$, we know that

$$\begin{aligned} L(\beta) &= \frac{1}{N} \sum_{n=1}^N \Big\{ \tfrac{1}{2} \cdot (r_n - \beta \cdot x_n)^2 + \tfrac{1}{2} \cdot \log(2\pi) \Big\} \\ &\qquad + \frac{1}{N} \cdot \Big\{ \frac{\beta^2}{2 \cdot \sigma_\beta^2} + \tfrac{1}{2} \cdot \log(2\pi \cdot \sigma_\beta^2) \Big\}, \end{aligned}$$
where the first line is $-\tfrac{1}{N} \cdot \log \Pr(\mathbf{r} \mid \beta)$ and the second line is $-\tfrac{1}{N} \cdot \log \Pr(\beta)$. What’s more, because we’re specifically choosing $\hat{\beta}$ to minimize $L(\beta)$, we also know that

$$0 = L'(\hat{\beta}) = -\frac{1}{N} \sum_{n=1}^N (r_n - \hat{\beta} \cdot x_n) \cdot x_n + \frac{1}{N} \cdot \frac{\hat{\beta}}{\sigma_\beta^2}.$$
And, solving this first-order condition for $\hat{\beta}$ tells us exactly how to estimate $\beta$:

$$\hat{\beta} = \Big(1 + \frac{1}{N \cdot \sigma_\beta^2}\Big)^{-1} \cdot \frac{1}{N} \sum_{n=1}^N x_n \cdot r_n.$$
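To make the formula concrete, here’s a small numerical sketch (the simulation setup, seed, and names are my own, and `scipy` does the brute-force minimization): the closed-form shrinkage estimator above matches a direct numerical minimization of $L(\beta)$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)

N, sigma_beta = 200, 0.5
x = rng.normal(size=N)
x = x / np.sqrt(np.mean(x**2))       # normalize so (1/N) * sum x_n^2 = 1
beta = sigma_beta * rng.normal()     # beta ~ N(0, sigma_beta^2)
r = beta * x + rng.normal(size=N)    # returns with unit-variance noise

def L(b):
    """Average improbability -(1/N)*log[Pr(r|b)*Pr(b)], dropping constants."""
    return (0.5 / N) * np.sum((r - b * x) ** 2) + b**2 / (2 * N * sigma_beta**2)

b_ols = np.mean(x * r)                           # plain OLS estimate
b_map = b_ols / (1 + 1 / (N * sigma_beta**2))    # closed-form MAP estimate
b_num = minimize_scalar(L).x                     # numerical minimizer of L

print(b_map, b_num)   # the two agree; note |b_map| < |b_ols| (shrinkage)
```

The prior pulls the estimate toward zero: the MAP estimate is the OLS estimate scaled down by $1/(1 + \frac{1}{N \sigma_\beta^2})$.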
Now that we’ve seen the solution to our parameter-estimation problem, let’s get back to solving our predictor-selection problem. Should we include $x_n$ in our predictive model of tomorrow’s returns? It turns out that answering this question means learning about the prior variance of $\beta$. Is $\beta$ equally likely to take on any value, $\sigma_\beta = \infty$? Or, should we assume that $\beta = 0$ regardless of the evidence, $\sigma_\beta = 0$?
To see where the first choice comes from, let’s think about the priors we’re implicitly adopting when we include $x_n$ in our predictive model. Since including $x_n$ means using the OLS estimate, this means looking for a $\sigma_\beta$ such that $\hat{\beta} = \frac{1}{N} \sum_{n=1}^N x_n \cdot r_n$. Inspecting the solution to our parameter-estimation problem reveals that

$$\lim_{\sigma_\beta \to \infty} \, \Big(1 + \frac{1}{N \cdot \sigma_\beta^2}\Big)^{-1} \cdot \frac{1}{N} \sum_{n=1}^N x_n \cdot r_n = \frac{1}{N} \sum_{n=1}^N x_n \cdot r_n.$$
So, by including $x_n$, we’re adopting an agnostic prior, $\sigma_\beta = \infty$, that $\beta$ is equally likely to be any value under the sun.
To see where the second choice comes from, let’s think about the priors we’re implicitly adopting when we exclude $x_n$ from our predictive model. This means looking for a $\sigma_\beta$ such that, regardless of the realized data, $\hat{\beta} = 0$. Again, inspecting the formula for $\hat{\beta}$ reveals that

$$\lim_{\sigma_\beta \to 0} \, \Big(1 + \frac{1}{N \cdot \sigma_\beta^2}\Big)^{-1} \cdot \frac{1}{N} \sum_{n=1}^N x_n \cdot r_n = 0.$$
So, by excluding $x_n$, we’re adopting a religious prior, $\sigma_\beta = 0$, that $\beta = 0$ regardless of any new evidence.
Thus, when we decide whether to include $x_n$ in our predictive model, what we’re really doing is learning about our priors. So, after seeing returns, $\mathbf{r}$, we can decide whether to include $x_n$ in our predictive model by choosing the prior variance, $\sigma_\beta \in \{0, \infty\}$, that maximizes the posterior probability of realizing these returns,

$$\hat{\sigma}_\beta = \arg\max_{\sigma_\beta \in \{0, \infty\}} \, \Pr(\sigma_\beta \mid \mathbf{r}) = \arg\min_{\sigma_\beta \in \{0, \infty\}} \, \underbrace{-\tfrac{1}{N} \cdot \log \Pr(\mathbf{r} \mid \sigma_\beta)}_{L(\sigma_\beta)},$$
where the second equality in the expression above points out how we can either maximize the posterior probability or minimize the negative log of this probability, $L(\sigma_\beta)$, i.e., its average improbability. Either way, if we estimate $\hat{\sigma}_\beta = \infty$, then we should include $x_n$; whereas, if we estimate $\hat{\sigma}_\beta = 0$, then we shouldn’t.
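Here’s a sketch of this model-selection-as-prior-learning idea in code. Because the data-generating process is Gaussian, the marginal likelihood for any finite prior variance is exactly Gaussian, $\mathbf{r} \mid \sigma_\beta \sim \mathrm{N}(0, \mathbf{I} + \sigma_\beta^2 \cdot \mathbf{x}\mathbf{x}^\top)$, so we can compare a tiny $\sigma_\beta$ (standing in for $0$) against a large one (standing in for $\infty$). The setup and names are my own; the noise is made exactly orthogonal to $x_n$ so the comparison isn’t clouded by sampling luck.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)

N = 200
x = rng.normal(size=N)
x = x / np.sqrt(np.mean(x**2))      # (1/N) * sum x_n^2 = 1
eps = rng.normal(size=N)
eps -= x * (x @ eps) / (x @ x)      # noise orthogonal to x, for a clean demo

def avg_improbability(r, sigma_beta):
    """L(sigma_beta) = -(1/N) * log Pr(r | sigma_beta)."""
    cov = np.eye(N) + sigma_beta**2 * np.outer(x, x)
    return -multivariate_normal(mean=np.zeros(N), cov=cov).logpdf(r) / N

r_signal = 0.5 * x + eps   # x really does matter here (beta = 0.5)
r_noise = eps              # x is irrelevant here (beta = 0)

# When x matters, the agnostic prior wins; when it doesn't, the dogmatic one does.
print(avg_improbability(r_signal, 10.0) < avg_improbability(r_signal, 1e-6))  # True
print(avg_improbability(r_noise, 10.0) > avg_improbability(r_noise, 1e-6))    # True
```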
4. Why log(N)/N?
The posterior probability of the realized returns given our choice of priors is given by

$$\Pr(\mathbf{r} \mid \sigma_\beta) = \int_{-\infty}^{\infty} \Pr(\mathbf{r} \mid \beta) \cdot \Pr(\beta \mid \sigma_\beta) \, \mathrm{d}\beta.$$
In this section, we’re going to see how to evaluate this integral. And, in the process, we’re going to see precisely where that $\log(N)/N$ penalty term in the Bayesian information criterion comes from.
Here’s the key insight in plain English. The realized returns are affected by noise shocks, $\epsilon_n$. By definition, excluding $x_n$ from our predictive model means that we aren’t learning about $\beta$ from the realized returns, so there’s no way for these noise shocks to affect either our estimate of $\beta$ or our posterior-probability calculations. By contrast, if we include $x_n$ in our predictive model, then we are learning about $\beta$ from the realized returns, so these noise shocks will distort both our estimate of $\beta$ and our posterior-probability calculations. The distortions caused by these noise shocks are going to be the source of the $\log(N)/N$ penalty term in the Bayesian information criterion.
Now, here’s the same insight in Mathese. Take a look at the Taylor expansion of $L(\beta)$ around $\beta = \hat{\beta}$,

$$L(\beta) = L(\hat{\beta}) + \tfrac{1}{2} \cdot L''(\hat{\beta}) \cdot (\beta - \hat{\beta})^2.$$
There’s no first-order term because $\hat{\beta}$ is chosen to minimize $L(\beta)$, and there are no higher-order terms because both $\epsilon_n$ and $\beta$ are normally distributed. From the formula for $L(\beta)$ we can calculate that

$$L''(\hat{\beta}) = \frac{1}{N} \sum_{n=1}^N x_n^2 + \frac{1}{N \cdot \sigma_\beta^2} = 1 + \frac{1}{N \cdot \sigma_\beta^2}.$$
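As a sanity check on this curvature formula (a sketch with my own made-up numbers): $L(\beta)$ is exactly quadratic here, so a finite-difference second derivative taken at any point should return $1 + 1/(N \cdot \sigma_\beta^2)$.

```python
import numpy as np

rng = np.random.default_rng(4)

N, sigma_beta = 100, 0.5
x = rng.normal(size=N)
x = x / np.sqrt(np.mean(x**2))       # (1/N) * sum x_n^2 = 1
r = 0.3 * x + rng.normal(size=N)

def L(b):
    """Average improbability of the returns given slope b (constants dropped)."""
    return (0.5 / N) * np.sum((r - b * x) ** 2) + b**2 / (2 * N * sigma_beta**2)

h, b0 = 1e-4, 0.123                   # any expansion point works: L is quadratic
curvature = (L(b0 + h) - 2 * L(b0) + L(b0 - h)) / h**2

print(curvature, 1 + 1 / (N * sigma_beta**2))   # both ~1.04 here
```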
Recall that $L(\beta)$ measures the average improbability of realizing $\mathbf{r}$ given that the slope is $\beta$. So, if $L''(\hat{\beta}) = \infty$ for a given choice of priors, then having any $\beta \neq \hat{\beta}$ is infinitely improbable under those priors. And, this is exactly what we find when we exclude $x_n$ from our predictive model, $\lim_{\sigma_\beta \to 0} L''(\hat{\beta}) = \infty$. By contrast, if we include $x_n$ in our predictive model, then $\lim_{\sigma_\beta \to \infty} L''(\hat{\beta}) = 1$, meaning that we are willing to entertain the idea that $\beta \neq \hat{\beta}$ due to distortions caused by the noise shocks.
To see why these distortions warrant a penalty, all we have to do is evaluate the integral. First, let’s think about the case where we exclude $x_n$ from our predictive model. We just saw that, if $\sigma_\beta = 0$, then we are unwilling to consider any parameter values besides $\hat{\beta} = 0$. So, the integral equation for our posteriors given that $\sigma_\beta = 0$ simplifies to

$$\Pr(\mathbf{r} \mid \sigma_\beta = 0) = \Pr(\mathbf{r} \mid \beta = 0).$$
This means that the average improbability of realizing $\mathbf{r}$ given the $\sigma_\beta = 0$ priors is given by

$$L(\sigma_\beta = 0) = \frac{1}{N} \sum_{n=1}^N \tfrac{1}{2} \cdot r_n^2 + \tfrac{1}{2} \cdot \log(2\pi).$$
To calculate our posterior beliefs when we include $x_n$, let’s use the Taylor expansion around $\hat{\beta}$ again. Since $L''(\hat{\beta}) = 1$ when $\sigma_\beta = \infty$, the integral splits into two terms,

$$\Pr(\mathbf{r} \mid \sigma_\beta = \infty) \approx \Pr(\mathbf{r} \mid \hat{\beta}) \cdot \int_{-\infty}^{\infty} e^{-\frac{N}{2} \cdot (\beta - \hat{\beta})^2} \, \mathrm{d}\beta.$$
The first term is the probability of observing the realized returns assuming that $\beta = \hat{\beta}$. The second term is a penalty that accounts for the fact that $\beta$ might be different from the estimated $\hat{\beta}$ in finite samples. Due to the central-limit theorem, this difference between $\beta$ and $\hat{\beta}$ is going to shrink at a rate of $1/\sqrt{N}$:

$$\int_{-\infty}^{\infty} e^{-\frac{N}{2} \cdot (\beta - \hat{\beta})^2} \, \mathrm{d}\beta = \sqrt{2\pi/N}.$$
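This Gaussian integral is easy to spot-check numerically (a throwaway sketch; the `N` and `b_hat` values are arbitrary):

```python
import numpy as np
from scipy.integrate import quad

N, b_hat = 250, 0.3
# Integrate exp(-N/2 * (b - b_hat)^2) over a window wide enough to hold
# essentially all of the mass (the integrand's width is about 1/sqrt(N)).
value, _ = quad(lambda b: np.exp(-0.5 * N * (b - b_hat) ** 2), b_hat - 1, b_hat + 1)

print(value, np.sqrt(2 * np.pi / N))   # both ~0.1585
```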
So, the average improbability of realizing $\mathbf{r}$ given the $\sigma_\beta = \infty$ priors is given by

$$L(\sigma_\beta = \infty) = \frac{1}{N} \sum_{n=1}^N \tfrac{1}{2} \cdot (r_n - \hat{\beta} \cdot x_n)^2 + \tfrac{1}{2} \cdot \log(2\pi) + \frac{\log(N)}{2 \cdot N} + O(1/N),$$

where $O(1/N)$ is big-“O” notation denoting terms that shrink faster than $\log(N)/N$ as $N \to \infty$.
Bringing everything together, hopefully it’s now clear why we can decide whether to include $x_n$ in our predictive model by checking whether

$$L(\sigma_\beta = \infty) \leq L(\sigma_\beta = 0) \quad \Leftrightarrow \quad \frac{\hat{\beta}^2}{2} \geq \frac{\log(N)}{2 \cdot N},$$

where $\hat{\beta}^2/2 = \frac{1}{N} \sum_{n=1}^N \tfrac{1}{2} \cdot r_n^2 - \frac{1}{N} \sum_{n=1}^N \tfrac{1}{2} \cdot (r_n - \hat{\beta} \cdot x_n)^2$ is the drop in average improbability that comes from fitting $\hat{\beta}$ via OLS. In other words, we should include $x_n$ only if it explains at least $\log(N)/N$ of the residual variation, $\hat{\beta}^2 \geq \log(N)/N$.
The $\frac{\log(N)}{2 \cdot N}$ penalty term accounts for the fact that you’re going to be overfitting the data in sample when you include more right-hand-side variables. This criterion was first proposed in Schwarz (1978), who showed that the criterion becomes exact as $N \to \infty$. The Bayesian information criterion is often written as an optimization problem as well:

$$\min_{\beta} \; \Big\{ \frac{1}{N} \sum_{n=1}^N \tfrac{1}{2} \cdot (r_n - \beta \cdot x_n)^2 + \frac{\log(N)}{2 \cdot N} \cdot \mathbf{1}_{\{\beta \neq 0\}} \Big\}.$$
Both ways of writing down the criterion are the same. They just look different due to formatting. There is one interesting idea that pops out of writing down the Bayesian information criterion as an optimization problem. Solving for the optimal $\beta^\star$ suggests that you should completely ignore any predictors with sufficiently small OLS coefficients:

$$\beta^\star = \begin{cases} \hat{\beta} & \text{if } |\hat{\beta}| \geq \sqrt{\log(N)/N} \\ 0 & \text{otherwise,} \end{cases}$$

where $\hat{\beta} = \frac{1}{N} \sum_{n=1}^N x_n \cdot r_n$ is the OLS estimate.
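As a final sketch (again with my own notation and a toy simulation), here’s the hard-thresholding rule in code, along with a check that it gives the same answer as directly comparing the two candidate values of the penalized objective:

```python
import numpy as np

rng = np.random.default_rng(3)

def bic_select(r, x):
    """BIC choice between beta = beta_hat (OLS) and beta = 0."""
    N = len(r)
    b_ols = np.mean(x * r)             # OLS slope when (1/N) * sum x_n^2 = 1
    def objective(b):
        return (0.5 / N) * np.sum((r - b * x) ** 2) \
            + (np.log(N) / (2 * N)) * float(b != 0)
    # Directly compare the two candidate solutions of the optimization problem...
    b_direct = b_ols if objective(b_ols) <= objective(0.0) else 0.0
    # ...and check that the closed-form threshold agrees.
    b_thresh = b_ols if abs(b_ols) >= np.sqrt(np.log(N) / N) else 0.0
    assert b_direct == b_thresh
    return b_thresh

N = 400
x = rng.normal(size=N)
x = x / np.sqrt(np.mean(x**2))
eps = rng.normal(size=N)
eps -= x * (x @ eps) / (x @ x)     # noise orthogonal to x, for a clean demo

print(bic_select(1.0 * x + eps, x))   # |beta_hat| = 1 > threshold: kept
print(bic_select(eps, x))             # beta_hat = 0 < threshold: dropped, 0.0
```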