Traders are constantly looking for variables that predict returns. If $x$ is the only candidate variable traders are considering, then it's easy to use the Bayesian information criterion to check whether $x$ predicts returns. Previously, I showed that using the univariate version of the Bayesian information criterion means solving

$$\min_{\beta} \ \Big\{ \ \tfrac{1}{N} \sum_{n=1}^{N} \big( r_n - \beta \cdot x_n \big)^2 \ + \ \tfrac{\log N}{N} \cdot 1\{\beta \neq 0\} \ \Big\} \qquad (\text{●})$$

after standardizing things so that $\hat{\mathrm{E}}[x] = 0$ and $\widehat{\mathrm{Var}}[x] = 1$. If the solution is some $\hat{\beta} \neq 0$, then $x$ predicts returns. Notation: Parameters with hats are in-sample estimates. e.g., if $r_n = \beta \cdot x_n + \epsilon_n$, then $\hat{\beta} = \tfrac{1}{N} \sum_{n=1}^{N} x_n \cdot r_n$.
But, what if there's more than one variable? There's an obvious multivariate extension of (●):

$$\min_{\beta_1, \ldots, \beta_V} \ \Big\{ \ \tfrac{1}{N} \sum_{n=1}^{N} \big( r_n - \textstyle\sum_{v=1}^{V} \beta_v \cdot x_{v,n} \big)^2 \ + \ \tfrac{\log N}{N} \cdot \textstyle\sum_{v=1}^{V} 1\{\beta_v \neq 0\} \ \Big\} \qquad (\text{⣿})$$

So, you might guess that it'd be easy to check which subset of variables predicts returns by evaluating (⣿). But, it's not. To evaluate the multivariate version of the Bayesian information criterion, traders would have to check $2^V$ different parameter values, one for each possible subset of variables. That's a combinatorial nightmare when $V$ is large. Thus, traders can't take a strictly Bayesian approach to variable selection when there are lots of variables to choose from.
Why is evaluating (⣿) so hard? That’s the topic of today’s post.
2. Non-Convex Problem
Let’s start by looking at what makes (●) so easy. The key insight is that you face the same-sized penalty, $\tfrac{\log N}{N}$, no matter which nonzero $\beta$ you choose when solving the univariate version of the Bayesian information criterion:

$$\tfrac{\log N}{N} \cdot 1\{\beta \neq 0\} \ = \ \tfrac{\log N}{N} \quad \text{for every } \beta \neq 0$$

So, if you’re going to set $\beta \neq 0$, then you might as well choose the value that minimizes your prediction error:

$$\hat{\beta} \ = \ \arg\min_{\beta} \ \tfrac{1}{N} \sum_{n=1}^{N} \big( r_n - \beta \cdot x_n \big)^2 \ = \ \tfrac{1}{N} \sum_{n=1}^{N} x_n \cdot r_n$$

Thus, to evaluate (●), all you have to do is check 2 parameter values, $\beta = 0$ and $\beta = \hat{\beta}$, and see which one gives a smaller result. Practically speaking, this means running an OLS regression, $r_n = \beta \cdot x_n + \epsilon_n$, and checking whether or not the penalized residual variance, $\tfrac{1}{N} \sum_{n=1}^{N} \hat{\epsilon}_n^2 + \tfrac{\log N}{N}$, is smaller than the raw return variance, $\tfrac{1}{N} \sum_{n=1}^{N} r_n^2$.
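This two-candidate check can be sketched in a few lines of Python. This is my own illustration on simulated data, not the post's code; the variable names and the true slope of 0.3 are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 1_000

# Simulated candidate variable and returns (x genuinely predicts r here).
x = rng.standard_normal(N)
x = (x - x.mean()) / x.std()          # standardize: mean 0, variance 1
r = 0.3 * x + rng.standard_normal(N)

# OLS slope; with standardized x this is just the sample mean of x * r.
beta_hat = np.mean(x * r)
resid = r - beta_hat * x

# Univariate BIC check: beta = 0 versus beta = beta_hat.
penalty = np.log(N) / N
bic_zero = np.mean(r ** 2)                  # raw return variance
bic_beta = np.mean(resid ** 2) + penalty    # penalized residual variance

print("x predicts returns?", bic_beta < bic_zero)
```

Only two candidate values get compared, which is why the univariate problem is cheap despite being non-convex.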
Most explanations for why (⣿) is hard to evaluate focus on the fact that (⣿) is a non-convex optimization problem (e.g., see here and here). But, the univariate version of the Bayesian information criterion is also a non-convex optimization problem. Just look at the region around $\beta = 0$ in the figure to the left, which shows the objective function from (●). So, non-convexity can only be part of the explanation for why (⣿) is hard to evaluate. Increasing the number of variables must add a missing ingredient.
3. The Missing Ingredient
If there are many variables to consider, then these variables can be correlated. Correlation. This is the missing ingredient that makes evaluating (⣿) hard. Let’s look at a numerical example to see why.
Suppose there are $N$ stocks and only $V = 3$ variables. The diagram above summarizes the correlation structure between these variables and returns. The red bar is the total variation in returns. The blue bars represent the portion of this variation that’s related to each of the variables. If you can draw a vertical line through a pair of bars (i.e., the bars overlap), then the associated variables are correlated. So, because the first 2 blue bars don’t overlap, $x_1$ and $x_2$ are perfectly uncorrelated in-sample:

$$\widehat{\mathrm{Cor}}[x_1, x_2] \ = \ 0$$

Whereas, because the first 2 blue bars both overlap the third, $x_1$ and $x_2$ are both correlated with $x_3$:

$$\widehat{\mathrm{Cor}}[x_1, x_3] \ \neq \ 0 \quad \text{and} \quad \widehat{\mathrm{Cor}}[x_2, x_3] \ \neq \ 0$$

Finally, longer overlaps denote larger correlations. So, $x_3$ is the single best predictor of returns since the third blue bar has the longest overlap with the top red bar:

$$\big| \widehat{\mathrm{Cor}}[r, x_3] \big| \ > \ \big| \widehat{\mathrm{Cor}}[r, x_1] \big|, \ \big| \widehat{\mathrm{Cor}}[r, x_2] \big|$$
And, this creates a problem. If you had to pick only 1 variable to predict returns, then you’d pick $x_3$:

$$\mathrm{BIC}[\{x_3\}] \ < \ \mathrm{BIC}[\{x_1\}], \ \mathrm{BIC}[\{x_2\}]$$

where $\mathrm{BIC}[\,\cdot\,]$ denotes the value of the objective function in (⣿) for a given subset of variables. But, $\{x_1, x_2\}$ is actually the subset of variables that minimizes (⣿) in this example:

$$\mathrm{BIC}[\{x_1, x_2\}] \ < \ \mathrm{BIC}[S] \quad \text{for every other subset } S \subseteq \{x_1, x_2, x_3\}$$

In other words, the variable that best predicts returns on its own isn’t part of the subset of variables that collectively best predict returns. In other examples it might be, but there’s no quick way to figure out which kind of example we’re in because (⣿) is a non-convex optimization problem. Until you actually plug $\{x_1, x_2\}$ into (⣿), there’s absolutely no reason to suspect that either variable belongs to the subset that minimizes (⣿). Think about it. $x_1$ and $x_2$ are both worse choices than $x_3$ on their own. And, if you start with $x_3$ and add either variable, things get even worse:

$$\mathrm{BIC}[\{x_3, x_1\}], \ \mathrm{BIC}[\{x_3, x_2\}] \ > \ \mathrm{BIC}[\{x_3\}]$$
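The headline phenomenon, that the best single predictor need not belong to the best subset, is easy to reproduce by brute force. The construction below is my own (it does not reproduce the post's exact numbers): two uncorrelated predictors $x_1$ and $x_2$, plus a noisy mix $x_3$ that is correlated with both, evaluated over all $2^3 = 8$ subsets:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N = 1_000
penalty = np.log(N) / N

def standardize(v):
    return (v - v.mean()) / v.std()

# x1 and x2 are (nearly) uncorrelated; x3 mixes both plus noise, so it is
# correlated with each and has the single largest correlation with r.
x1 = standardize(rng.standard_normal(N))
x2 = standardize(rng.standard_normal(N))
x3 = standardize(x1 + x2 + rng.standard_normal(N))
r = x1 + x2 + 0.5 * rng.standard_normal(N)
X = np.column_stack([x1, x2, x3])

def bic(subset):
    """Penalized in-sample residual variance for one subset of columns."""
    if not subset:
        return np.mean(r ** 2)
    Xs = X[:, subset]
    beta, *_ = np.linalg.lstsq(Xs, r, rcond=None)
    resid = r - Xs @ beta
    return np.mean(resid ** 2) + penalty * len(subset)

# Brute force over all 2^3 = 8 subsets of {x1, x2, x3}.
subsets = [s for k in range(4) for s in itertools.combinations(range(3), k)]
best = min(subsets, key=bic)
best_single = min([(0,), (1,), (2,)], key=bic)

print("best single predictor:", best_single)   # x3 wins on its own
print("best subset overall:  ", best)          # but {x1, x2} wins jointly
```

Note that the code has no way to find the winner other than evaluating every subset, which is exactly the point.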
With many correlated variables, you can never tell how close you are to the subset of variables that best predicts returns. To evaluate (⣿), you’ve got to check all possible combinations. There are no shortcuts.
4. Birthday-Paradox Math
If correlation makes it hard to evaluate (⣿), then shouldn’t we be able to fix the problem by only considering independent variables? Yes… but only in a fairytale world with an infinite number of stocks, $N \to \infty$. The problem is unavoidable in the real world, where there are almost as many candidate variables as there are stocks, because independent variables are still going to be correlated in finite samples.
Suppose there are $V$ independent variables that might predict returns. Although these variables are independent, they won’t be perfectly uncorrelated in finite samples. So, let’s characterize the maximum in-sample correlation between any pair of variables. After standardizing each variable so that $\hat{\mathrm{E}}[x_v] = 0$ and $\widehat{\mathrm{Var}}[x_v] = 1$, the in-sample correlation between $x_v$ and $x_{v'}$ when $v \neq v'$ is roughly normally distributed:

$$\widehat{\mathrm{Cor}}[x_v, x_{v'}] \ = \ \tfrac{1}{N} \sum_{n=1}^{N} x_{v,n} \cdot x_{v',n} \ \sim \ \mathrm{N}\big( 0, \ \tfrac{1}{N} \big)$$
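A quick simulation (my own sketch; the sample sizes are arbitrary) confirms that pairwise in-sample correlations between truly independent, standardized variables are spread out with a standard deviation of roughly $1/\sqrt{N}$:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N, V = 400, 50

# V truly independent variables, each standardized in-sample.
X = rng.standard_normal((N, V))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# All V * (V - 1) / 2 pairwise in-sample correlations.
cors = np.array([np.mean(X[:, i] * X[:, j])
                 for i, j in itertools.combinations(range(V), 2)])

print(f"number of pairs:     {len(cors)}")
print(f"std of correlations: {cors.std():.4f}")
print(f"1 / sqrt(N):         {1 / np.sqrt(N):.4f}")
```

None of the true correlations is nonzero, yet with 1,225 pairs in play, some in-sample estimates will inevitably land far from zero.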
Although $\mathrm{Cor}[x_v, x_{v'}] = 0$ for truly independent variables, our in-sample estimates $\widehat{\mathrm{Cor}}[x_v, x_{v'}]$ won’t be exactly zero in finite samples.
Since the normal distribution is symmetric, the probability that $x_v$ and $x_{v'}$ have an in-sample correlation more extreme than some cutoff $c > 0$ is:

$$\Pr\Big[ \ \big| \widehat{\mathrm{Cor}}[x_v, x_{v'}] \big| > c \ \Big] \ = \ 2 \cdot \Phi\big( \! - c \cdot \sqrt{N} \big)$$

where $\Phi(\cdot)$ denotes the standard normal CDF. So, since there are $\tfrac{V \cdot (V - 1)}{2}$ pairs of variables, we know that the probability that no pair has a correlation more extreme than $c$ is:

$$\Pr\Big[ \ \max_{v \neq v'} \big| \widehat{\mathrm{Cor}}[x_v, x_{v'}] \big| \leq c \ \Big] \ = \ \Big( 1 - 2 \cdot \Phi\big( \! - c \cdot \sqrt{N} \big) \Big)^{V \cdot (V - 1) / 2}$$
Here’s the punchline. Because $\tfrac{V \cdot (V-1)}{2}$ shows up as an exponent in the equation above, the probability that all pairwise in-sample correlations happen to be really small is going to shrink exponentially fast as traders consider more and more variables. This means that finite-sample effects are always going to make evaluating (⣿) computationally intractable in real-world settings with many variables, even when the variables are truly uncorrelated as $N \to \infty$. e.g., the in-sample correlation from the previous example might seem like an unreasonably high number for independent random variables, something that should only happen when $N$ is tiny. But, the figure above shows that even with many stocks, there’s still a substantial chance of observing an in-sample correlation at least that large once enough candidate variables are in play.
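The closed-form probability above needs only the standard normal CDF, which can be written with `math.erfc` as $\Phi(x) = \tfrac{1}{2}\,\mathrm{erfc}(-x/\sqrt{2})$. The cutoff $c = 0.10$ and sample size $N = 500$ below are assumptions for illustration, not the post's numbers:

```python
from math import erfc, sqrt

def phi(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * erfc(-x / sqrt(2.0))

def prob_all_small(N, V, c):
    """P(no pair of V independent variables has |in-sample cor| > c)."""
    p_pair_extreme = 2.0 * phi(-c * sqrt(N))
    n_pairs = V * (V - 1) // 2
    return (1.0 - p_pair_extreme) ** n_pairs

# The exponent V * (V - 1) / 2 makes this collapse as V grows.
N, c = 500, 0.10
for V in (10, 100, 300):
    print(f"V = {V:3d}: P(all |cor| <= {c}) = {prob_all_small(N, V, c):.3g}")
```

Holding $N$ and $c$ fixed, moving from 10 to 300 candidate variables takes the probability from a fighting chance down to effectively zero.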
5. What This Means
Market efficiency has been an “organizing principle for 30 years of empirical work” in academic finance. The principle is based on a negative feedback loop: predictable returns suggest profitable trading strategies, but implementing these strategies eliminates the initial predictability. So, if there are enough really smart traders, then they’re going to find the subset of variables that predicts returns and eliminate this predictability. That’s market efficiency.
Even for really smart traders, finding the subset of variables that predicts returns is hard. And, this problem gets harder when there are more candidate variables to choose from. But, while researchers have thought about this problem in the past, they’ve primarily focused on the dangers of p-hacking (e.g., see here and here). If traders regress returns on $V$ different variables, then they should expect that roughly 1 in 20 of these regressions is going to produce a statistically significant coefficient with a $p$-value below $5\%$ even if none of the variables actually predicts returns. So, researchers have focused on correcting p-values.
But, this misses the point. Searching for the subset of variables that predicts returns isn’t hard because you have to adjust your p-values. It’s hard because it requires a brute-force search through the powerset of all possible subsets of predictors. It’s hard because any optimization problem with a hard-thresholding rule like (⣿) can be re-written as an integer programming problem,

$$\min_{\beta_1, \ldots, \beta_V, \ z_1, \ldots, z_V} \ \Big\{ \ \tfrac{1}{N} \sum_{n=1}^{N} \big( r_n - \textstyle\sum_{v=1}^{V} \beta_v \cdot x_{v,n} \big)^2 \ + \ \tfrac{\log N}{N} \cdot \textstyle\sum_{v=1}^{V} z_v \ \Big\} \quad \text{s.t.} \ \ z_v \in \{0, 1\} \ \text{ and } \ \beta_v \cdot (1 - z_v) = 0,$$

which means that it’s NP-hard (e.g., see here, here, or here). It’s hard because, in the diagram above, finding the subset of variables that predicts returns is equivalent to finding the cheapest way to cover the red bar with blue bars at a cost of $\tfrac{\log N}{N}$ per blue bar, which means solving the knapsack problem.
So, the fact that Bayesian variable selection doesn’t scale is a really central problem. It means that, even if there are lots and lots of really smart traders, they may not be able to find the best subset of variables for predicting returns. You’re probably on a secure wifi network right now. This network is considered “secure” because cracking its passcode would involve a brute-force search over an astronomically large number of parameter values, which would take over 1 billion billion years. So, if there are over $V = 313$ predictors documented in top academic journals, why shouldn’t we consider the subset of variables that actually predicts returns “secure”, too? After all, finding it would involve a brute-force search over $2^{313}$ parameter values. We might be able to approximate it. But, the exact collection of variables may as well be in a vault somewhere.
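A back-of-the-envelope calculation makes the vault analogy concrete. The evaluation rate of $10^{18}$ subsets per second is my own (deliberately generous) assumption:

```python
# Back-of-the-envelope: even at an implausibly fast 10^18 subset
# evaluations per second, enumerating all 2^313 subsets is out of reach.
V = 313
subsets = 2 ** V
rate = 10 ** 18                          # evaluations per second (assumed)
seconds_per_year = 60 * 60 * 24 * 365
years = subsets // (rate * seconds_per_year)

print(f"2^{V} subsets:            {subsets:.3e}")
print(f"years at 10^18 evals/sec: {years:.3e}")
```

The search takes on the order of $10^{68}$ years, which dwarfs the billion-billion-year figure for the wifi passcode.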