Phase Change in High-Dimensional Inference

1. Introduction

In my paper Local Knowledge in Financial Markets (2014), I study a problem where assets have Q \gg 1 different attributes and traders try to identify which K \ll Q of these attributes matter via price changes:

(1)   \begin{align*} \Delta p_n &= p_n - \mathrm{E}[p_n] = \sum_{q=1}^Q \beta_q \cdot x_{n,q} + \epsilon_n \qquad \text{where} \qquad K = \Vert {\boldsymbol \beta} \Vert_{\ell_0} = \sum_{q=1}^Q 1_{\{ \beta_q \neq 0 \}} \notag \end{align*}

with each asset’s exposure to a given attribute given by x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,1) and the noise given by \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma^2). In the limit as K,N,Q \to \infty, \sfrac{K}{Q} \to 0, (N - K) \cdot \beta \to \infty, and \beta = \sfrac{1}{\sqrt{K}}, there exist both a signal opacity bound, N_O, and a signal recovery bound, N_R:

(2)   \begin{align*} N_O \sim K \cdot \log \left( \frac{Q}{N_O} \right) \qquad \text{and} \qquad N_R \sim K \cdot \log \left( \frac{Q}{K} \right) \notag \end{align*}

with N_O \leq N_R in units of transactions. I explain what I mean by “\sim” in Section 4 below. These 2 thresholds separate the regions where traders are arbitrarily bad at identifying the shocked attributes (i.e., N < N_O) from the regions where traders can almost surely identify the shocked attributes (i.e., N > N_R). i.e., if traders have seen fewer than N_O transactions, then they have no idea which shocks took place; whereas, if traders have seen more than N_R transactions, then they can pinpoint exactly which shocks took place.

In this post, I show that the signal opacity and recovery bounds become arbitrarily close in a large market. The analysis in this post primarily builds on work done in Donoho and Tanner (2009) and Wainwright (2009).

2. Motivating Example

This sort of inference problem pops up all the time in financial settings. Suppose you moved away from Chicago a year ago, and now you’re moving back and looking for a house. When studying a list of recent sales prices, you find yourself a bit surprised. People seem to have changed their preferences for 1 of 7 different amenities: ^{(1)}a 2 car garage, ^{(2)}a third bedroom, ^{(3)}a half-circle driveway, ^{(4)}granite countertops, ^{(5)}energy efficient appliances, ^{(6)}central A/C, or ^{(7)}a walk-in closet. The mystery amenity is raising the sale price of some houses by \beta > 0 dollars. How many sales do you need to see in order to figure out which of the 7 amenities realized the shock?

The answer is 3. How did I arrive at this number? Suppose you found one house with amenities \{1,3,5,7\}, a second house with amenities \{2, 3, 6, 7\}, and a third house with amenities \{4, 5, 6,7\}. The combination of the price changes for these 3 houses reveals exactly which amenity has been shocked. i.e., if only the first house’s price was too high, \Delta p_1 = p_1 - \mathrm{E}[p_1] = \beta, then Chicagoans must have changed their preferences for 2 car garages:

(3)   \begin{equation*}     \begin{bmatrix} \Delta p_1 \\ \Delta p_2 \\ \Delta p_3 \end{bmatrix}      =     \begin{bmatrix} \beta \\ 0 \\ 0 \end{bmatrix}      =      \begin{bmatrix}        1 & 0 & 1 & 0 & 1 & 0 & 1        \\        0 & 1 & 1 & 0 & 0 & 1 & 1        \\        0 & 0 & 0 & 1 & 1 & 1 & 1      \end{bmatrix}     \begin{bmatrix}        \beta \\ 0 \\ \vdots \\ 0      \end{bmatrix} \end{equation*}

By contrast, if \Delta p_1 = \Delta p_2 = \Delta p_3 = \beta, then people must value walk-in closets more than they did a year ago.
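
Here’s a minimal numerical sketch of this decoding logic, with \beta normalized to 1 and the columns ordered as in Equation (3):

```python
import numpy as np

# Each column q = 1,...,7 of the design matrix is the 3-bit binary representation
# of q, matching the three houses described above.
X = np.array([[(q >> b) & 1 for q in range(1, 8)] for b in range(3)])

beta = 1.0                           # size of the preference shock (normalized to 1)
true_amenity = 7                     # the walk-in closet, say
dp = beta * X[:, true_amenity - 1]   # observed price surprises for the 3 houses

# With no noise, the 3 price surprises read off the shocked amenity in binary.
decoded = sum(int(dp[b] != 0) << b for b in range(3))
print(decoded)                       # 7
```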

Here’s the key point. The problem changes character at N_R = 3 observations. 3 sales is just enough information to answer 7 yes or no questions and rule out the possibility of no change: 7 = 2^3 - 1. N = 4 sales simply narrows your error bars around the exact value of \beta. N = 2 sales only allows you to distinguish between subsets of amenities. e.g., seeing just the first and second houses with unexpectedly high prices only tells you that people like either half-circle driveways or walk-in closets more… not which one.

Yet, the dimensionality in this toy example can be confusing. There is obviously something different about the problem at N_R = 3 observations, but there is still some information contained in the first N = 2 observations. e.g., even though you can’t tell exactly which attribute realized a shock, you can narrow down the list of possibilities to 2 attributes out of 7. If you just flipped a coin and guessed after seeing N = 2 transactions, you would have an error rate of 50{\scriptstyle \%}. This is no longer true in higher dimensions. i.e., even in the absence of any noise, seeing any fraction (1 - \alpha) \cdot N_R of the required observations for \alpha \in (0,1) will leave you with an error rate that is within a tiny neighborhood of 100{\scriptstyle \%} as the number of attributes gets large.

3. Non-Random Analysis

I start by exploring how the required number of observations, N_R, moves around as I increase the number of attributes in the setting where there is only K = 1 shock and the data matrix is non-random. Specifically, I look at the case where K = 1 and Q = 15. My goal is to build some intuition about what I should expect in the more complicated setting where the data \mathbf{X} is a random matrix. Here, in this simple setting, the ideal data matrix would be (4 \times 15)-dimensional and look like:

(4)   \begin{equation*} \underset{4 \times 15}{\mathbf{X}} =  \left[ \begin{matrix}  1 & 0 & 1 & 0 & 1 & 0 & 1  \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\  0 & 0 & 0 & 1 & 1 & 1 & 1 \\  0 & 0 & 0 & 0 & 0 & 0 & 0 \end{matrix} \ \ \ \begin{matrix} 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1  \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1  \end{matrix} \right] \end{equation*}

where each column of the data matrix corresponds to a number q=1,2,\ldots,15 in binary.

Let S(N) be a function that eats N observed price changes and spits out the set of possible preference changes that might explain the observed price changes. e.g., if traders only see the 1st transaction, then they can only place the shock in 1 of 2 sets containing 8 attributes each:

(5)   \begin{align*} S(1) &=  \begin{cases} \{ 1, 3, 5, 7, 9, 11, 13, 15 \}         &\text{if } \Delta p_1 = \beta \\ \{ \emptyset, 2, 4, 6, 8, 10, 12, 14 \} &\text{if } \Delta p_1 = 0 \end{cases} \end{align*}

The 2nd transaction then allows traders to split each of these 2 larger sets into 2 smaller ones and place the shock in a set of 4 possibilities:

(6)   \begin{align*} S(2) =  \begin{cases} \{ \emptyset, 4, 8, 12 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} 0 & 0 \end{bmatrix}^{\top} \\ \{ 1, 5, 9, 13 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} \beta & 0 \end{bmatrix}^{\top} \\ \{ 2, 6, 10, 14 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} 0 & \beta \end{bmatrix}^{\top}  \\ \{ 3, 7, 11, 15 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} \beta & \beta \end{bmatrix}^{\top}  \end{cases} \end{align*}

With the 3rd transaction, traders can tell that the actual shock is either of 2 possibilities:

(7)   \begin{align*} S(3) =  \begin{cases} \{ \emptyset, 8 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix}^{\top} \\ \{ 1, 9 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} \beta & 0 & 0\end{bmatrix}^{\top} \\ \{ 2, 10 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} 0 & \beta & 0 \end{bmatrix}^{\top}  \\ \{ 3, 11 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} \beta & \beta & 0 \end{bmatrix}^{\top}  \\ \{ 4, 12 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} 0 & 0 & \beta \end{bmatrix}^{\top} \\ \{ 5, 13 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} \beta & 0 & \beta \end{bmatrix}^{\top} \\ \{ 6, 14 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} 0 & \beta & \beta \end{bmatrix}^{\top}  \\ \{ 7, 15 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} \beta & \beta & \beta \end{bmatrix}^{\top}  \end{cases} \end{align*}

The N_R = 4th observation then closes the case against the offending attribute.
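
Here’s a minimal sketch that reproduces these candidate sets by brute force; q = 0 stands in for the \emptyset (no shock) case, and \beta is normalized to 1:

```python
import numpy as np

Q, N_R = 15, 4
# Column q of the ideal data matrix is the 4-bit binary representation of q.
X = np.array([[(q >> b) & 1 for q in range(1, Q + 1)] for b in range(N_R)])

def S(N, dp):
    """Set of candidate shocks (0 = no shock) consistent with the first N price changes."""
    candidates = set()
    for q in range(Q + 1):
        col = np.zeros(N) if q == 0 else X[:N, q - 1]
        if np.array_equal(col, dp[:N]):
            candidates.add(q)
    return candidates

# Reproduce the S(3) case where the first price change is beta and the rest are 0:
print(S(3, np.array([1, 0, 0, 0])))   # {1, 9}, matching Equation (7)
```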

Here’s the key observation. Only the difference (N_R - N) matters when computing the size of the output of S(N). If traders have seen N = (N_R - 1) transactions, then they can tell which subset of 2 = 2^1 attributes has realized a shock. If traders have seen N = (N_R - 2) transactions, then they can tell which subset of 4 = 2^2 attributes has realized a shock. If traders have seen N = (N_R - 3) observations, then they can tell which subset of 8 = 2^3 attributes has realized a shock. Thus, after seeing any number of observations N \leq N_R, traders can place the shock in a set of size 2^{N_R - N}. i.e., a trader has the same amount of information about which attribute has realized a shock in (i) a situation where N_R = 100 and he’s seen N = 99 transactions as in (ii) a situation where N_R = 3 and he’s seen N = 2 transactions.

The probability that traders select the correct attribute after seeing only N \leq N_R observations is given by 2^{-(N_R - N)} assuming uniform priors. Natural numbers are hard to work with analytically, so let’s suppose that traders observe some fraction of the required number of observations N_R. i.e., for some \alpha \in (0,1) traders see N = (1 - \alpha) \cdot N_R observations. We can then perform a change of variables 2^{- (N_R - N)} = e^{- \alpha \cdot \log(2) \cdot N_R} and answer the question: “How much does getting 1 additional observation improve traders’ error rate?”

(8)   \begin{align*} \frac{1}{N_R} \cdot \frac{d}{d\alpha}\left[ \, 2^{- (N_R - N)} \, \right] = - \log(2) \cdot e^{- \alpha \cdot \log(2) \cdot N_R} \end{align*}

I plot this statistic for N_R ranging from 100 to 800 below. When N_R = 100, a trader’s predictive power doesn’t start to improve until he sees 95 transactions (i.e., 95{\scriptstyle \%} of N_R); by contrast, when N_R= 800 a trader’s predictive power doesn’t start to improve until he’s seen N = 792 transactions (i.e., 99{\scriptstyle \%} of N_R). Here’s the punchline. As I scale up the original toy example from 7 attributes to 7 million attributes, traders effectively get 0 useful information about which attributes realized a shock until they come within a hair’s breadth of the signal recovery bound N_R. The opacity and recovery bounds are right on top of one another.

plot--prediction-error
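
For concreteness, here’s a sketch of the statistic in Equation (8); the 10^{-3} cutoff I use below to define where predictive power “starts to improve” is arbitrary, so the exact kick-in points depend on that choice:

```python
import numpy as np

def marginal_improvement(alpha, N_R):
    """(1/N_R) * d/d(alpha) of the success probability 2^{-(N_R - N)} with N = (1 - alpha) * N_R."""
    return -np.log(2) * np.exp(-alpha * np.log(2) * N_R)

for N_R in (100, 200, 400, 800):
    alpha = np.linspace(0.0, 1.0, 10_001)
    stat = np.abs(marginal_improvement(alpha, N_R))
    # N beyond which one more observation moves the success probability by more than the cutoff.
    kick_in = (1 - alpha[stat > 1e-3].max()) * N_R
    print(N_R, round(kick_in))
```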

4. Introducing Randomness

Previously, the matrix of attributes was strategically chosen so that the set of N observations that traders see would be as informative as possible. Now, I want to relax this assumption and allow the data matrix to be random with elements x_{n,q} \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0,1):

(9)   \begin{align*} \Delta p_n = \sum_{q=1}^Q \beta_q \cdot x_{n,q} + \epsilon_n \qquad \text{for each }  n = 1,2,\ldots,N \end{align*}

where \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_{\epsilon}^2) denotes idiosyncratic shocks affecting asset n in units of dollars. For a given triplet of integers (K,N,Q) with 0 < K < N < Q, I want to know whether solving the linear program:

(10)   \begin{align*} \widehat{\boldsymbol \beta} = \arg \min_{\boldsymbol \beta} \left\Vert {\boldsymbol \beta} \right\Vert_{\ell_1} \qquad \text{subject to} \qquad \mathbf{X} {\boldsymbol \beta} = \Delta \mathbf{p} \end{align*}

recovers the true \boldsymbol \beta when it is K-sparse. i.e., when \boldsymbol \beta has only K non-zero entries K = \sum_{q=1}^Q 1_{\{ \beta_q \neq 0 \}}. Since N < Q the linear system is underdetermined; however, if the level of sparsity is sufficiently high (i.e., K is sufficiently small), then there will be a unique solution with high probability.

First, I study the case where there is no noise (i.e., where \sigma_\epsilon^2 \searrow 0), and I ask: “What is the minimum number of observations needed to identify the true \boldsymbol \beta with probability (1 - \eta) for \eta \in (0,1) using the linear program in Equation (10)?” I remove the noise to make the inference problem as easy as possible for traders. Thus, the proposition below which characterizes this minimum number of observations gives a lower bound. I refer to this number of observations as the signal opacity bound and write it as N_O. The proposition shows that, whenever traders have seen N < N_O observations, I can make traders’ error rate arbitrarily bad (i.e., \eta \nearrow 1) by increasing the number of attributes (i.e., Q \nearrow \infty).

Proposition (Donoho and Tanner, 2009): Suppose \sfrac{K}{N} = \rho, \sfrac{N}{Q} = \delta, and N \geq N_0 with \rho, \delta \in (0,1). The linear program in Equation (10) will recover \boldsymbol \beta a fraction (1 - \eta) of the time whenever:

(11)   \begin{align*} N > 2 \cdot K \cdot \log\left( \frac{Q}{N} \right) \cdot \left( 1 - R(\eta;N,Q) \right)^{-1} \end{align*}

where R(\eta;N,Q) = 2 \cdot \sqrt{\sfrac{1}{N} \cdot \log\left( 4 \cdot \sfrac{(Q + 2)^6}{\eta} \right)}.
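
For concreteness, here’s a minimal sketch that treats the inequality in Equation (11) as the defining condition for the opacity bound and scans for the smallest N that satisfies it; the choice of \eta = 0.5 is arbitrary:

```python
import numpy as np

def opacity_bound(Q, K, eta=0.5, N_max=10**7):
    """Smallest N satisfying Equation (11): N > 2 K log(Q/N) (1 - R(eta; N, Q))^{-1}."""
    for N in range(K + 1, N_max):
        R = 2.0 * np.sqrt(np.log(4.0 * (Q + 2) ** 6 / eta) / N)
        if R < 1.0 and N > 2.0 * K * np.log(Q / N) / (1.0 - R):
            return N
    return None

# e.g., a market with 10,000 attributes, 10 of which realize a shock:
print(opacity_bound(Q=10_000, K=10))
```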

Next, I turn to the case where there is noise (i.e., where \sigma_\epsilon^2 > 0), and I ask: “How many observations do traders need to see in order to identify the true \boldsymbol \beta with probability 1 in an asymptotically large market using the linear program in Equation (10)?” Define traders’ error rate after seeing N observations as:

(12)   \begin{align*}   \mathrm{Err}[N] &= \frac{1}{{Q \choose K}} \cdot \sum_{\substack{\mathcal{K} \subseteq \mathcal{Q} \\ |\mathcal{K}| = K}} \mathrm{Pr}(\widehat{\boldsymbol \beta} \neq {\boldsymbol \beta}) \end{align*}

\mathrm{Pr}(\widehat{\boldsymbol \beta} \neq {\boldsymbol \beta}) denotes the probability that the linear program in Equation (10) chooses the wrong subset of attributes (i.e., makes an error) given the true support \mathcal{K} and averaging over not only the measurement noise, {\boldsymbol \epsilon}, but also the choice of the Gaussian attribute exposure matrix, \mathbf{X}. Traders’ error rate is the weighted average of these probabilities over every shock set of size K. Traders identify the true {\boldsymbol \beta} with probability 1 in an asymptotically large market if:

(13)   \begin{align*}   \lim_{\substack{K,N,Q \to \infty \\ \sfrac{K}{Q} \to 0}} \mathrm{Err}[N] &= 0 \end{align*}

Thus, the proposition below which characterizes this number of observations gives an upper bound of sorts. I refer to this number of observations as the signal recovery bound and write it as N_R. i.e., the proposition shows that, whenever traders have seen N > N_R observations, they will be able to recover \boldsymbol \beta almost surely no matter how large I make the market.

Proposition (Wainwright, 2009): Suppose K,N,Q \to \infty, \sfrac{K}{Q} \to 0, (N - K) \cdot \beta \to \infty, and \beta = \sfrac{1}{\sqrt{K}}. Then traders can identify the true {\boldsymbol \beta} with probability 1 in an asymptotically large market if, for some constant a > 0:

(14)   \begin{align*} N &> a \cdot K \cdot \log (\sfrac{Q}{K}) \end{align*}

The only cognitive constraint that traders face is that their selection rule must be computationally tractable. Under minimal assumptions, a convex optimization program is computationally tractable in the sense that the computational effort required to solve the problem to a given accuracy grows moderately with the dimensions of the problem. By contrast, Natarajan (1995) explicitly shows that \ell_0-constrained linear programming is NP-hard. This cognitive constraint is really weak in the sense that any selection rule that you might look up in an econometrics or statistics textbook (e.g., forward stepwise regression or LASSO) is going to be computationally tractable. After all, these rules have to be executed on computers.

5. Discussion

plot--opacity-vs-recovery-bound-absolute-gap

What is really interesting is that the signal opacity bound, N_O, and the signal recovery bound, N_R, basically sit right on top of one another when the market gets large just as you would expect from the analysis in Section 3. The figure above plots each bound on a log-log scale for varying levels of sparsity. It’s clear from the figure that the bounds are quite close. The figure below plots the relative gap between these 2 bounds:

(15)   \begin{align*} \frac{N_R - N_O}{N_R} \end{align*}

i.e., it plots how big the gap is relative to the size of the signal recovery bound N_R. For each level of sparsity, the gap is shrinking as I add more and more attributes. This is the same result as in the figure from Section 3: as the size of the market increases, traders learn next to nothing from each successive observation until they get within an inch of the signal recovery bound. The only difference here is that now there are an arbitrary number of shocks and the data matrix is random.

plot--opacity-vs-recovery-bound-relative-gap

Intra-Industry Lead-Lag Effect

1. Introduction

Hou (2007) documents a really interesting phenomenon in asset markets. Namely, if the largest securities in an industry as measured by market capitalization perform really well in the current week, then the smallest securities in that industry tend to do well in the subsequent 2 weeks. However, the reverse relationship does not hold. i.e., if the smallest securities in an industry do well in the current week, this tells you next to nothing about how the largest securities in that industry will do in the subsequent weeks. This effect has a characteristic time scale of 1 to 2 weeks and varies substantially across industries.

In this post, I replicate the main finding, provide some robustness checks, and then relate the result to the analysis in my paper Local Knowledge in Financial Markets (2014).

2. Data Description

I use monthly and daily CRSP data from June 1963 to December 2001 to recreate Hou (2007), Table 1. I also replicate the same results using a different industry classification system over the period from January 2000 to December 2013. I look at securities traded on the NYSE, AMEX, and NASDAQ stock exchanges. I restrict the sample to include only securities with share codes 10 or 11. i.e., I exclude things like ADRs, closed-end funds, and REITs. I calculate weekly returns by compounding daily returns between adjacent Wednesdays:

(1)   \begin{align*} \tilde{r}_{n,t} &= (1 + \tilde{r}_{n,\text{W}_{t-1}}) \cdot (1 + \tilde{r}_{n,\text{Th}_{t-1}}) \cdot (1 + \tilde{r}_{n,\text{F}_{t-1}}) \cdot (1 + \tilde{r}_{n,\text{M}_t}) \cdot (1 + \tilde{r}_{n,\text{Tu}_t}) - 1 \end{align*}
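
Concretely, here’s one way to build these Wednesday-to-Tuesday weekly returns with pandas; the DataFrame layout (columns permno, date, ret) is an assumption about how the CRSP daily file is pulled:

```python
import pandas as pd

def weekly_returns(daily: pd.DataFrame) -> pd.DataFrame:
    """Compound daily net returns into Wednesday-to-Tuesday weekly returns."""
    df = daily.copy()
    df["date"] = pd.to_datetime(df["date"])
    # Label each day with the week ending on the next Tuesday (W-TUE anchors weeks on Tuesdays).
    df["week"] = df["date"].dt.to_period("W-TUE")
    out = (
        df.groupby(["permno", "week"])["ret"]
          .apply(lambda r: (1.0 + r).prod() - 1.0)   # compound daily returns, then subtract 1
          .rename("ret_w")
          .reset_index()
    )
    return out
```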

I classify firms into industries in 2 different ways. In order to replicate Hou (2007), Table 1 I use the 12 industry classification system from Ken French’s website. This classification system is nice in the sense that it uses the SIC codes and can thus be extended back to the 1920s. However, the industry classification system that everyone in the financial industry uses is GICS codes. As a result, I also assign each firm to 1 of 24 different GICS industry groups.

I assign each firm to either an SIC or GICS industry based on its reported code in the monthly CRSP data as of the previous June. e.g., if I was looking at Apple, Inc in September 2005, then I would assign Apple to its industry as of June 2005; whereas, if I was looking at Apple in May 2005, then I would assign Apple to its industry as of June 2004. I use N_{i,y} to denote the number of securities in industry i in year y. In each of the figures below, I report the average number of firms in each industry on an annual basis over the sample period:

(2)   \begin{align*} \langle N_{i,y} \rangle &= \left\lfloor \frac{1}{Y} \cdot \sum_{y=1}^Y N_{i,y} \right\rfloor \end{align*}

e.g., when replicating the results in Hou (2007), Table 1 I compute the average number of firms using Y = 64 June observations.

Each June I also sort the securities in each industry i by their market cap. After sorting, I then construct an equally weighted portfolio of the largest 30{\scriptstyle \%} of stocks in each industry and the smallest 30{\scriptstyle \%} of stocks in each industry:

(3)   \begin{align*} \tilde{r}_{i,t}^B &= \frac{1}{N_{i,y_t}^{30\%}} \cdot \sum_{n=1}^{N_{i,y_t}^{30\%}} \tilde{r}_{n,t} \qquad \text{and} \qquad \tilde{r}_{i,t}^S = \frac{1}{N_{i,y_t} - N_{i,y_t}^{70\%}} \cdot \sum_{n=N_{i,y_t}^{70\%} + 1}^{N_{i,y_t}} \tilde{r}_{n,t} \end{align*}

In the analysis below, I look at the relationship between the weekly returns of these 2 portfolios over the subsequent year. Note that these are within-industry sorts. e.g., a stock in the “big” portfolio of the consumer durables industry might be small enough that it would land in the “small” portfolio if it were held to the telecommunications industry’s size cutoffs.
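
And here’s a rough sketch of the within-industry size sort itself; the column names (industry, year, mktcap_june, week, ret_w) are assumptions about how the merged data is laid out, and the full WRDS pull is in the gist linked below:

```python
import pandas as pd

def big_small_portfolios(stocks: pd.DataFrame) -> pd.DataFrame:
    """Equal-weighted weekly returns of the biggest and smallest 30% of stocks in each industry."""
    df = stocks.copy()
    # Size rank within each industry-year, based on market cap as of the previous June.
    pct = df.groupby(["industry", "year"])["mktcap_june"].rank(pct=True)
    df["bucket"] = pd.cut(pct, [0.0, 0.30, 0.70, 1.0], labels=["S", "M", "B"])
    ports = (
        df[df["bucket"].isin(["S", "B"])]
          .groupby(["industry", "week", "bucket"], observed=True)["ret_w"]
          .mean()
          .unstack("bucket")
    )
    return ports.rename(columns={"B": "r_big", "S": "r_small"})
```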

Here is the code I use to pull the data from WRDS and create the figure in Section 3 below: gist.

3. Hou (2007), Table 1

Table 1 in Hou (2007) reports the cross-autocorrelation of the big and small intra-industry portfolios defined above. To estimate these statistics, I first normalize the big and small portfolio weekly returns so that they each have a mean of 0 and a standard deviation of 1:

(4)   \begin{align*} \mu_B &= \mathrm{E}[\tilde{r}_{i,t}^B] \qquad \text{and} \qquad \sigma_B = \mathrm{StDev}[\tilde{r}_{i,t}^B] \\ \mu_S &= \mathrm{E}[\tilde{r}_{i,t}^S] \qquad \text{and} \qquad \sigma_S = \mathrm{StDev}[\tilde{r}_{i,t}^S] \\ r_{i,t}^B &= \frac{\tilde{r}_{i,t}^B - \mu_B}{\sigma_B} \qquad \text{and} \qquad r_{i,t}^S = \frac{\tilde{r}_{i,t}^S - \mu_S}{\sigma_S} \end{align*}

Then, to estimate the correlation between the returns of the big portfolio in week t and the subsequent returns of the small portfolio in week (t + l) I run the regression:

(5)   \begin{align*} r_{i,t+l}^S &= \beta(l) \cdot r_{i,t}^B + \epsilon_{i,t+l} \qquad \text{for } l = 0,1,2,\ldots,6 \end{align*}

and estimate \beta(l) = \mathrm{Cor}[r_{i,t}^B,r_{i,t+l}^S]. Similarly, to estimate the correlation between the returns of the small portfolio in week t and the subsequent returns of the big portfolio in week (t + l) I run the regression:

(6)   \begin{align*} r_{i,t+l}^B &= \gamma(l) \cdot r_{i,t}^S + \epsilon_{i,t+l} \qquad \text{for } l = 0,1,2,\ldots,6 \end{align*}

to estimate \gamma(l) = \mathrm{Cor}[r_{i,t}^S,r_{i,t+l}^B]. The advantage of this approach over estimating a simple correlation matrix is that you can read off the standard errors from the regression results rather than rely on asymptotic results.
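
Here’s a minimal sketch of these lead-lag regressions for a single industry, assuming r_big and r_small are the standardized weekly return series from Equation (4); swapping the two arguments gives the \gamma(l) estimates from Equation (6):

```python
import numpy as np
import statsmodels.api as sm

def lead_lag(r_big: np.ndarray, r_small: np.ndarray, max_lag: int = 6):
    """Estimate beta(l) from Equation (5) by regressing r_small(t + l) on r_big(t)."""
    rows = []
    for l in range(max_lag + 1):
        y = r_small[l:]                      # small-cap returns in week t + l
        x = r_big[: len(r_big) - l]          # big-cap returns in week t
        fit = sm.OLS(y, x).fit()             # no intercept: both series are standardized
        rows.append((l, fit.params[0], fit.bse[0]))
    return rows                              # (lag, beta_hat, standard error)
```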

hou-2007-table-1

The figure above gives the results of these regressions using data from January 1963 to December 2001. The solid blue and red lines give the point estimates for \beta(l) and \gamma(l), respectively, at lags of l=0,1,2,\ldots,6 weeks. The shaded regions around the solid lines are the 95{\scriptstyle \%} confidence intervals around these point estimates. e.g., the panel in the upper left-hand corner reports that, when the largest securities in the consumer non-durables industry realize a return that is 1 standard deviation above their mean in week t, the smallest securities in the consumer non-durables industry realize a return that is roughly 0.30 standard deviations above their mean in week (t+1). By contrast, the smallest consumer non-durables securities have no predictive power over the future returns of their larger cousins.

4. Robustness Checks

The above results are quite interesting, but no one really uses the Ken French industry classification system when trading. The industry standard is GICS. The figure below replicates these same results over the period from January 2000 to December 2013 using the GICS codes. The results are similar, but slightly less pronounced. This replication suggests that shocks to the largest securities in an industry take roughly 2 weeks to fully propagate out to the smallest securities in the same industry.

hou-2007-table-1--gics

An obvious follow-up question is: “Is there something special about the largest firms in an industry? Or, is this cross-autocorrelation a statistical effect?” One way to shed light on this question is to look at the predictive power of the largest 10{\scriptstyle \%} of securities in each industry as opposed to the largest 30{\scriptstyle \%}:

(7)   \begin{align*} \tilde{r}_{i,t}^B &= \frac{1}{N_{i,y_t}^{10\%}} \cdot \sum_{n=1}^{N_{i,y_t}^{10\%}} \tilde{r}_{n,t} \qquad \text{and} \qquad \tilde{r}_{i,t}^S = \frac{1}{N_{i,y_t} - N_{i,y_t}^{70\%}} \cdot \sum_{n=N_{i,y_t}^{70\%} + 1}^{N_{i,y_t}} \tilde{r}_{n,t} \end{align*}

If there is something fundamental about size, we should expect to see an even more pronounced disparity between the predictive power of the big and small portfolios. However, the figure below shows that looking at the predictive power of the really large firms (if anything) weakens the effect. It’s definitely not more pronounced.

hou-2007-table-1--gics-10pct

5. Conclusion

What’s going on here? If size isn’t the root explanation, what is? In my paper Local Knowledge in Financial Markets, I propose that the true culprit is not size, but rather the number of plausible shocks that might explain a firm’s returns. e.g., Apple might have really low stock returns in the current week for all sorts of reasons: bad product release, news about factory conditions, raw materials price shock, etc… Only some of these shocks will be relevant for other firms in the industry. It takes a while to parse Apple’s bad returns and figure out how you should extrapolate to other firms. By contrast, there are many fewer ways for a small firm’s returns to go very wrong in the space of a few days, and often the reason is firm-specific.

Investigation Bandwidth

1. Motivation

Time is dimensionless in modern asset pricing theory. e.g., the canonical Euler equation:

(1)   \begin{align*} P_t &= \widetilde{\mathrm{E}}_t[ \, P_{t+1} + D_{t+1} \, ] \end{align*}

says that the price of an asset at time t (i.e., P_t) is equal to the risk-adjusted expectation at time t (i.e., \widetilde{E}_t[\cdot]) of the price of the asset at time t+1 plus the risk-adjusted expectation of any dividends paid out by the asset at time t+1 (i.e., P_{t+1} + D_{t+1}). Yet, the theory never answers the question: “Plus 1 what?” Should we be thinking about seconds? Hours? Days? Years? Centuries? Millennia?

Why does this matter? An algorithmic trader adjusting his position each second worries about different risks than Warren Buffett who has a median holding period of decades. e.g., Buffett studies cash flows, dividends, and business plans. By contrast, the probability that a firm paying out a quarterly dividend happens to pay its dividend during any randomly chosen 1 second time interval is \sfrac{1}{1814400}. i.e., roughly the odds of picking a year at random since the time that the human and chimpanzee evolutionary lines diverged. Thus, if an algorithmic trader and Warren Buffett both looked at the exact same stock at the exact same time, then they would have to use different risk-adjusted expectations operators:

(2)   \begin{align*} P_t &= \begin{cases}  \widetilde{\mathrm{E}}^{\text{Alg}}_t[ \, P_{t+1{\scriptscriptstyle \mathrm{sec}}} \, ] &\text{from algorithmic trader's p.o.v.} \\ \widetilde{\mathrm{E}}^{\text{WB}}_t[ \, P_{t+1{\scriptscriptstyle \mathrm{qtr}}} + D_{t+1{\scriptscriptstyle \mathrm{qtr}}} \, ] &\text{from Warren Buffett's p.o.v.} \end{cases} \end{align*}

This note gives a simple economic model in which traders endogenously specialize in looking for information at a particular time scale and ignore predictability at vastly different time scales.

2. Simulation

I start with a simple numerical simulation that illustrates why traders at the daily horizon will ignore price patterns at vastly different frequencies. Suppose that Cisco’s stock returns are composed of a constant growth rate \mu = \sfrac{0.04}{(480 \cdot 252)}, a daily wobble \beta \cdot \sin(2 \cdot \pi \cdot t) with \beta = \sfrac{1}{(480 \cdot 252)}, and a white noise term \epsilon_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma^2) with \sigma= \sfrac{0.12}{\sqrt{480 \cdot 252}}:

(3)   \begin{align*} R_t &= \mu + \beta \cdot \sin(2 \cdot \pi \cdot t) + \epsilon_t, \quad \text{for} \quad t = \sfrac{1}{480}, \sfrac{2}{480}, \ldots, \sfrac{10079}{480}, \sfrac{10080}{480} \end{align*}

I consider a world where the clock ticks forward in 1 minute increments so that each tick represents \sfrac{1}{480}th of a trading day. The figure below shows a single sample path of Cisco’s return process over the course of a month.

plot--daily-wobble-plus-noise
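
Here’s a minimal sketch of the simulated return process in Equation (3):

```python
import numpy as np

# The return process in Equation (3): 480 one-minute ticks per day for 21 trading days.
rng = np.random.default_rng(0)
ticks_per_day, days_per_year = 480, 252
mu = 0.04 / (ticks_per_day * days_per_year)
beta = 1.0 / (ticks_per_day * days_per_year)
sigma = 0.12 / np.sqrt(ticks_per_day * days_per_year)

t = np.arange(1, 21 * ticks_per_day + 1) / ticks_per_day      # t = 1/480, ..., 10080/480
R = mu + beta * np.sin(2 * np.pi * t) + sigma * rng.standard_normal(t.size)
```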

What are the properties of this return process? First, the constant growth rate, \mu = \sfrac{0.04}{(480 \cdot 252)}, implies that Cisco has a 4{\scriptstyle \%} per year return on average. Second, the volatility of the noise component, \sigma= \sfrac{0.12}{\sqrt{480 \cdot 252}}, implies that the annualized volatility of Cisco’s returns is 12{\scriptstyle \%/\sqrt{\mathrm{Yr}}}. Finally, since:

(4)   \begin{align*} \frac{1}{2 \cdot \pi} \cdot \int_0^{2 \cdot \pi} |\sin(x)| \cdot dx &= \frac{2}{\pi} \end{align*}

the choice of \beta = \sfrac{1}{(480 \cdot 252)} means that (in a world with a 0{\scriptstyle \%} riskless rate) a trading strategy which is long Cisco stock in the morning and short Cisco stock in the afternoon picks up \beta \cdot |\sin(2 \cdot \pi \cdot t)| each minute, i.e., a simple return of roughly \sfrac{2}{\pi} \approx 64{\scriptstyle \%} per year which, once compounded, roughly doubles your money. i.e., this is a big daily wobble! If you start with \mathdollar 1 on the morning of January 1st, you end up with roughly \mathdollar 2 on the evening of December 31st, on average, by following this trading strategy. The figure below confirms this math by simulating 100 year-long realizations of this trading strategy’s returns.

plot--cum-trading-strategy-returns
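
And here’s a sketch of that trading strategy; the sign(sin(2 \cdot \pi \cdot t)) position rule is my reading of “long in the morning and short in the afternoon”, and compounding the resulting returns over a simulated year ends up close to doubling the initial dollar:

```python
import numpy as np

# Long-in-the-morning, short-in-the-afternoon strategy applied to the process in Equation (3),
# compounded over one year (parameters as above).
rng = np.random.default_rng(1)
ticks_per_day, days_per_year = 480, 252
mu = 0.04 / (ticks_per_day * days_per_year)
beta = 1.0 / (ticks_per_day * days_per_year)
sigma = 0.12 / np.sqrt(ticks_per_day * days_per_year)

t = np.arange(1, ticks_per_day * days_per_year + 1) / ticks_per_day
position = np.sign(np.sin(2 * np.pi * t))        # +1 in the morning, -1 in the afternoon
final_wealth = []
for _ in range(100):
    R = mu + beta * np.sin(2 * np.pi * t) + sigma * rng.standard_normal(t.size)
    final_wealth.append(np.prod(1.0 + position * R))
print(np.mean(final_wealth))                     # close to doubling the initial $1 on average
```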

3. Trader’s Problem

Suppose you didn’t know the exact frequency of the wobble in Cisco’s returns. The wobble is equally likely to have a frequency of anywhere from \sfrac{1}{252} cycles per day to 480 cycles per day. Using the last month’s worth of data, suppose you estimated the regressions specified below:

(5)   \begin{align*} R_t &= \hat{\mu} + \hat{\beta} \cdot \sin(2 \cdot \pi \cdot f \cdot t) + \hat{\gamma} \cdot \cos(2 \cdot \pi \cdot f \cdot t) + \hat{\epsilon}_t \quad \text{for each} \quad \sfrac{1}{252} < f < 480 \end{align*}

and identified the frequency, f_{\min}, which best fit the data:

(6)   \begin{align*} f_{\min} &= \arg \min_{\sfrac{1}{252} < f < 480} \left\{ \, \hat{\sigma}(f) \, \right\} \end{align*}
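
Here’s a minimal sketch of this grid search; the log-spaced grid in the commented-out usage line is an assumption about how finely you scan:

```python
import numpy as np

def best_fit_frequency(R: np.ndarray, t: np.ndarray, freqs: np.ndarray) -> float:
    """Equations (5) and (6): regress returns on a sin/cos pair at each candidate frequency
    and return the frequency with the smallest residual standard deviation."""
    best_f, best_sigma = None, np.inf
    for f in freqs:
        X = np.column_stack([
            np.ones_like(t),               # mu_hat
            np.sin(2 * np.pi * f * t),     # beta_hat
            np.cos(2 * np.pi * f * t),     # gamma_hat
        ])
        coef, *_ = np.linalg.lstsq(X, R, rcond=None)
        sigma = np.std(R - X @ coef)
        if sigma < best_sigma:
            best_f, best_sigma = f, sigma
    return best_f

# e.g., scan a log-spaced grid between 1/252 and 480 cycles per day:
# f_min = best_fit_frequency(R, t, np.logspace(np.log10(1 / 252), np.log10(480), 2_000))
```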

The figure below shows the empirical distribution of these best in-sample fit frequencies when the true frequency is a daily wobble. The figure reads: “A month’s worth of Cisco’s minute-by-minute returns is best fit by a wobble with a frequency of \sfrac{1}{1.01{\scriptstyle \mathrm{days}}} about 2{\scriptstyle \%} of the time when the true frequency is 1 cycle a day.”

plot--best-in-sample-fit-freq

Suppose that you notice that a wobble with a frequency of \sfrac{1}{1.01{\scriptstyle \mathrm{days}}} fits Cisco’s returns over the last month really well, but you also know that this is a noisy in-sample estimate. The true wobble could have a different frequency. If you can expend some cognitive effort to investigate alternate frequencies, how wide a bandwidth of frequencies should you investigate? Here’s where things get interesting. The figure above essentially says that you should never investigate frequencies outside of f_{\min} \pm 0.5 \cdot \sfrac{1}{21}, i.e., plus or minus half the width of the bell. The probability that a pattern in returns with a frequency outside this range is actually driving the results is nil!

4. Costs and Benefits

Again, suppose you’re a trader who’s noticed that there is a daily wobble in Cisco’s returns over the past month. i.e., using the past month’s data, you’ve estimated f_{\min} = \sfrac{1}{1{\scriptstyle \mathrm{day}}}. Just as before, it’s a big wobble. Implemented at the right time scale, f_\star, you know that this strategy of buying early and selling late will generate a R(f_\star) = 100{\scriptstyle \%/\mathrm{yr}} = 8.33{\scriptstyle \%/\mathrm{mon}} return. Nevertheless, you also know that f_{\min} isn’t necessarily the right frequency to invest in just because it had the lowest in-sample error over the last month. You don’t want to go to your MD and pitch a strategy only to have to adjust it a month later due to poor performance. Let’s say that it costs you \kappa dollars to investigate a range of \delta frequencies. If you investigate a particular range and f_\star is there, then you will discover f_\star with probability 1.

The question is then: “Which frequency buckets should you investigate?” First, are we losing anything by only searching \delta-sized increments? Well, we can tile the entire frequency range with tiny \delta increments as follows:

(7)   \begin{align*} 1 - \Delta(x,N) &= \sum_{n=0}^{N-1} \mathrm{Pr}\left[ \, x + n \cdot \delta \leq f_\star < x + (n + 1) \cdot \delta \, \middle| \, f_{\min} \, \right]  \end{align*}

i.e., starting at frequency x we can iteratively add N different increments of size \delta. If we start at a small enough frequency, x, and add enough increments, N, then we can tile as much of the entire domain as we like so that \Delta(x,N) is as small as we like.

Next, what are the benefits of discovering the correct time scale to invest in? If R(f_{\star}) denotes the returns to investing in a trading strategy at the correct time scale over the course of the next month, let:

(8)   \begin{align*} \mathrm{Corr}[R(f_{\star}),R(f_{\min})] &= C(f_{\star},f_{\min}) \end{align*}

denote the correlation between the returns of the strategy at the true frequency and the strategy at the best in-sample fit frequency. We know that C(f_{\star},f_{\star}) = 1 and that:

(9)   \begin{align*} \frac{dC(f_{\star},f_{\min})}{d|\log f_{\star} - \log f_{\min}|} < 0 \qquad \text{with} \qquad \lim_{|\log f_{\star} - \log f_{\min}| \to \infty} R(f_{\min}) = 0 \end{align*}

i.e., as f_{\min} gets farther and farther away from f_{\star}, your realized returns over the next month from a trading strategy implemented at horizon f_{\min} will become less and less correlated with the returns of the strategy implemented at f_{\star} and as a consequence shrink to 0. Thus, the benefit to discovering that the true frequency was not f_{\min} is given by (1 - C(f_\star,f_{\min})) \cdot R(f_{\star}).

Putting the pieces together, it’s clear that you should investigate a particular range of frequencies for a confounding explanation if the expected probability of finding f_{\star} there given the realized f_{\min} times the benefit of discovering the true f_{\star} in that range exceeds the search cost \kappa:

(10)   \begin{align*} \kappa &\leq \underbrace{\mathrm{Pr}\left[ \, x + n \cdot \delta \leq f_\star < x + (n + 1) \cdot \delta \, \middle| \, f_{\min} \, \right]}_{\substack{\text{Probability of finding $f_\star$ in a } \\ \text{particular range given observed $f_{\min}$.}}} \cdot \overbrace{(1 - C(f_\star,f_{\min})) \cdot R(f_{\star})}^{\substack{\text{Benefit of} \\ \text{discovery}}} \end{align*}

i.e., you’ll have a donut-shaped search pattern around f_{\min}. You won’t investigate frequencies that are really different from f_{\min} since the probability of finding f_{\star} there will be too low to justify the search costs. By contrast, you won’t investigate frequencies that are too similar to f_{\min} since the benefits to discovering this minuscule error don’t justify the costs, even though such tiny errors may be quite likely.
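
Here’s a toy numerical illustration of the donut. The Gaussian posterior over \log f_\star, the exponentially decaying correlation function, and the particular values of R(f_\star), \kappa, and \delta are all made up purely for illustration:

```python
import numpy as np
from scipy.stats import norm

# Toy version of Equation (10): a Gaussian posterior over log(f_star) centered on log(f_min)
# with std 0.25, a correlation C that decays exponentially in |log(f_star) - log(f_min)|,
# and made-up values for the monthly return R_star, the search cost kappa, and the width delta.
R_star, kappa, delta = 0.0833, 0.0005, 0.05

edges = np.arange(-2.0, 2.0, delta)                     # log-frequency distance from log(f_min)
prob = norm.cdf(edges + delta, scale=0.25) - norm.cdf(edges, scale=0.25)   # Pr[f_star in bucket]
corr = np.exp(-np.abs(edges + delta / 2))               # C evaluated at each bucket's midpoint
searched = prob * (1.0 - corr) * R_star >= kappa
print(edges[searched])                                  # a band on either side of zero: a donut
```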

5. Wrapping Up

I started with the question: “How can it be that an algorithmic trader and Warren Buffett worry about different patterns in the same price path?” In the analysis above I give one possible answer. If you see a tradable anomaly at a particular time scale (e.g., 1 wobble per day) over the past month, then the probability that this anomaly was caused by a data generating process with a much shorter or much longer frequency is essentially 0. I used only sine-wave-plus-noise processes above, but it seems like this assumption can be easily relaxed via results from, say, Freidlin and Wentzell.

The Secrets N Prices Keep

1. Introduction

Prices are signals about shocks to fundamentals. In a world where there are many stocks and lots of different kinds of shocks to fundamentals, traders are often more concerned with identifying exactly which shocks took place than the value of any particular asset. e.g., imagine you are a day trader. While you certainly care about changes in the fundamental value of Apple stock, you care much more about the size and location of the underlying shocks since you can profit from this information elsewhere. On one hand, if all firms based in California were hit with a positive shock, you might want to buy shares of Apple, Banana Republic, Costco, …, and Zero Skateboards stock. On the other hand, if all electronic equipment companies were hit with a positive shock, you might want to buy up Apple, Bose, Cisco Systems, …, and Zenith shares instead.

It turns out that there is a sharp phase change in traders’ ability to draw inferences about attribute-specific shocks from prices. i.e., when there have been fewer than N^\star transactions, you can’t tell exactly which shocks affected Apple’s fundamental value. Even if you knew that Apple had been hit by some shock, with fewer than N^\star observations you couldn’t tell whether it was a California-specific event or an electronic equipment-specific event. By contrast, when there have been more than N^\star transactions, you can figure out exactly which shocks have occurred. The additional (N - N^\star) transactions simply allow you to fine tune your beliefs about exactly how large the shocks were. The surprising result is that N^\star is a) independent of traders’ cognitive abilities and b) easily calculable via tools from the compressed sensing literature. See my earlier post for details.

This signal recovery bound is thus a novel constraint on the amount of information that real-world traders can extract from prices. Moreover, the bound gives a concrete meaning to the term “local knowledge”. e.g., shocks that haven’t yet manifested themselves in N^\star transactions are local in the sense that no one can spot them through prices. Anyone who knows of their existence must have found out via some other channel. To build intuition, this post gives 3 examples of this constraint in action.

2. Out-of-Town House Buyer

First I show where this signal recovery bound comes from. People spend lots of time looking for houses in different cities. e.g., see Trulia or my paper. Suppose you moved away from Chicago a year ago, and now you’re moving back and looking for a house. When studying a list of recent sales prices, you find yourself a bit surprised. People must have changed their preferences for 1 of 7 different amenities: ^{(1)}a 2 car garage, ^{(2)}a 3rd bedroom, ^{(3)}a half-circle driveway, ^{(4)}granite countertops, ^{(5)}energy efficient appliances, ^{(6)}central A/C, or ^{(7)}a walk-in closet. Having the mystery amenity raises the sale price by \beta > 0 dollars. You would know how preferences had evolved if you had lived in Chicago the whole time; however, in the absence of this local knowledge, how many sales would you need to see in order to figure out which of the 7 amenities mattered?

The answer is 3. How did I come up with this number? For ease of explanation, let’s normalize expected house prices to \mathrm{E}_{t-1}[p_{n,t}] = 0. Suppose you found one house with amenities \{1,3,5,7\}, a second house with amenities \{2, 3, 6, 7\}, and a third house with amenities \{4, 5, 6,7\}. The combination of prices for these 3 houses would reveal exactly which amenity had been shocked. i.e., if only the first house’s price was higher than expected, p_{1,t} \approx \beta, then Chicagoans must have changed their preferences for having a 2 car garage:

(1)   \begin{equation*} {\small  \begin{bmatrix} p_{1,t} \\ p_{2,t} \\ p_{3,t} \end{bmatrix}  = \begin{bmatrix} \beta \\ 0 \\ 0 \end{bmatrix}  =  \begin{bmatrix}  1 & 0 & 1 & 0 & 1 & 0 & 1  \\  0 & 1 & 1 & 0 & 0 & 1 & 1  \\  0 & 0 & 0 & 1 & 1 & 1 & 1  \end{bmatrix} \begin{bmatrix}  \beta \\ 0 \\ \vdots \\ 0  \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{bmatrix}  } \quad \text{with} \quad \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma^2), \, \beta \gg \sigma \end{equation*}

By contrast, if it was the case that p_{1,t} \approx \beta, p_{2,t} \approx \beta, and p_{3,t} \approx \beta, then you would know that people now value walk-in closets much more than they did a year ago.

Here is the key point. 3 sales is just enough information to answer 7 yes or no questions and rule out the possibility of no change:

(2)   \begin{align*}   7 = 2^3 - 1 \end{align*}

N = 4 sales simply narrows your error bars around the exact value of \beta. N = 2 sales only allows you to distinguish between subsets of amenities. e.g., seeing just the 1st and 2nd houses with unexpectedly high prices only tells you that people like either half-circle driveways or walk-in closets more. It doesn’t tell you which one. The problem changes character at N = N^\star(7,1) = 3. When you have seen fewer than N^\star = 3 sales, information about how preferences have changed is purely local knowledge. Prices can’t publicize this information. You must live and work in Chicago to learn it.

3. Industry Analyst’s Advantage

Next, I illustrate how this signal recovery bound acts like a cognitive constraint for would-be arbitrageurs. Suppose you’re a petroleum industry analyst. Through long, hard, caffeine-fueled nights of research you’ve discovered that oil companies such as Schlumberger, Halliburton, and Baker Hughes that have invested in hydraulic fracturing (a.k.a., “fracking”) are due for a big unexpected payout. This is really valuable information affecting only a few of the major oil companies. Many companies haven’t really invested in this technology, and they won’t be affected by the shock. How aggressively should you trade Schlumberger, Halliburton, and Baker Hughes? On one hand, you want to build up a large position in these stocks to take advantage of the future price increases that you know are going to happen. On the other hand, you don’t want to allow news of this shock to spill out to the rest of the market.

In the canonical Grossman and Stiglitz (1980)-type setup, the reason that would-be arbitrageurs can’t immediately infer your hard-earned information from prices is the existence of noise traders. They can’t be completely sure whether a sudden price movement is due to a) your informed trading or b) random noise trader demand. Here, I propose a new confound: the existence of many plausible shocks. e.g., suppose you start aggressively buying up shares of Schlumberger, Halliburton, and Baker Hughes stock. As an arbitrageur, I see the resulting gradual price increases in these 3 stocks and ask: “What should my next trade be?” Here’s where things get interesting. When there have been fewer than N^\star transactions in the petroleum industry, I can’t tell whether you are trading on a Houston, TX-specific shock or a fracking-specific shock since all 3 of these companies share both these attributes. I need to see at least N^\star observations in order to recognize the pattern you’re trading on.

petroleumindustry-search-subjects

The figure above gives a sense of the number of different kinds of shocks that affect the petroleum industry. It reads: “If you select a Wall Street Journal article on the petroleum industry over the period from 2011 to 2013 there is a 19{\scriptstyle \%} chance that ‘Oil sands’ is a listed descriptor and a 7{\scriptstyle \%} chance that ‘LNG’ (i.e., liquefied natural gas) is a listed descriptor.” Thus, oil stock price changes might be due to Q \gg 1 different shocks:

(3)   \begin{align*} \hat{p}_{n,t} &= p_{n,t} - \mathrm{E}_{t-1}[p_{n,t}] = \sum_{q=1}^Q \beta_{q,t} \cdot x_{n,q} + \epsilon_{n,t} \qquad \text{with} \qquad \epsilon_{n,t} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma^2) \end{align*}

where x_{n,q} denotes stock n‘s exposure to the qth attribute. e.g., in this example x_{n,q} = 1 if the company invested in fracking (i.e., like Schlumberger, Halliburton, and Baker Hughes) and x_{n,q}=0 if the company didn’t. What’s more, very few of the Q possible attributes matter each month. e.g., the plot below reads: “Only around 10{\scriptstyle \%} of all the descriptors in the Wall Street Journal articles about the petroleum industry over the period from January 2011 to December 2013 are used each month.” Thus, only K of the possible Q attributes appear to realize shocks each period:

(4)   \begin{align*} K &= \Vert {\boldsymbol \beta} \Vert_{\ell_0} = \sum_{q=1}^Q 1_{\{\beta_q \neq 0\}} \qquad \text{with} \qquad K \ll Q \end{align*}

Note that this calculation includes terms like ‘Crude oil prices’ which occur in roughly half of the articles, so the actual rate is likely much lower; ‘Crude oil prices’ is basically just a synonym for the industry.

petroleumindustry--fraction-of-search-subjects-mentioned-each-month

For simplicity, suppose that 10 attributes out of a possible 100 realized a shock in the previous period, and you discovered 1 of them. How long does your informational monopoly last? Using tools from Wainwright (2009) it’s easy to show that uninformed traders need at least:

(5)   \begin{align*} N^\star(100,10) \approx 10 \cdot \log(100 - 10) = 45  \end{align*}

observations to identify which 10 of the 100 possible payout-relevant attributes in the petroleum industry have realized a shock. If it takes you (…and other industry specialists like you) around 1 hour to materially increase your position, then you have roughly 5.6 = \sfrac{45}{8} days (i.e., around 1 trading week) to build up a position before the rest of the market catches on, assuming an 8-hour trading day.
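
The arithmetic, in code:

```python
import numpy as np

K, Q, hours_per_day = 10, 100, 8
N_star = K * np.log(Q - K)                  # ~45 transactions, as in Equation (5)
print(round(N_star), round(N_star / hours_per_day, 1))   # ~45 observations, ~5.6 trading days
```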

4. Asset Management Expertise

Finally, I show how there can be situations where you might not bother trying to learn from prices because there are too many plausible explanations to check out. In this world everyone specializes in acquiring local knowledge. Suppose you’re a wealthy investor, and I’m a broke asset manager with a trading strategy. I walk into your office, and I try to convince you to finance my strategy that has abnormal returns of r_t per month:

(6)   \begin{align*}   r_t &= \mu + \epsilon_t   \qquad \text{with} \qquad    \epsilon_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_{\epsilon}^2) \end{align*}

where \sigma_{\epsilon}^2 = 1{\scriptstyle \%} per month to make the algebra neat. For simplicity, suppose that there is no debate that \mu > 0. In return for running the trading strategy, I ask for fees amounting to a fraction f of the gross returns. Of course, I have to tell you a little bit about how the trading strategy works, so you can deduce that I’m taking on a position that is to some extent a currency carry trade and to some extent a short-volatility strategy. This narrows down the list a bit, but it still leaves a lot of possibilities. In the end, you know that I am using some combination of K = 2 out of Q = 100 possible strategies.

You have 2 options. On one hand, if you accept the terms of this offer and finance my strategy, you realize returns net of fees equal to:

(7)   \begin{align*}   (1 - f) \cdot \mu \cdot T + \sum_{t=1}^T \epsilon_t \end{align*}

This approach would net you an annualized Sharpe ratio of \text{SR}_{\text{mgr}} = \sqrt{12} \cdot (1 - f) \cdot \sfrac{\mu}{\sigma}. e.g., if I asked for a fee of f = 20{\scriptstyle \%}, and my strategy yielded a return of 2{\scriptstyle \%} per month, then your annualized Sharpe ratio net of my fees would be \text{SR}_{\text{mgr}} = 0.55.

On the other hand, you could always refuse my offer and try to back out which strategies I was following using the information you gained from our meeting. i.e., you know that my strategy involves using some combination of K=2 factors out of a universe of Q = 100 possibilities:

(8)   \begin{align*}   \mu &= \sum_{q=1}^{100} \beta_q \cdot x_{q,t}   \qquad \text{with} \qquad    \Vert {\boldsymbol \beta} \Vert_{\ell_0} = 2 \end{align*}

In order to deduce which strategies I was using as quickly as possible, you’d have to trade random portfolio combinations of these 100 different factors for:

(9)   \begin{align*}   T^\star(100,2) \approx 2 \cdot \log(100 - 2) = 9.17 \, {\scriptstyle \mathrm{months}} \end{align*}

Your Sharpe ratio during this period would be \text{SR}_{\text{w/o mgr}|\text{pre}} = 0, and afterwards you would earn the same Sharpe ratio as before without having to pay any fees to me:

(10)   \begin{align*}   \text{SR}_{\text{w/o mgr}|\text{post}} &= \sqrt{12} \cdot \left( \frac{0.02}{0.10} \right) = 0.69 \end{align*}

However, if you have to show your investors reports every year, it may not be worth it for you to reverse engineer my trading strategy. Your average Sharpe ratio during this period would be:

(11)   \begin{align*}   \text{SR}_{\text{w/o mgr}} &= \sfrac{9.17}{12} \cdot 0 + \sfrac{(12 - 9.17)}{12} \cdot 0.69 = 0.16 \end{align*}

which is well below the Sharpe ratio on the market portfolio. Thus, you may just want to pay my fees. Even though you could in principle back out which strategies I was using, it would take too long. Your investors would withdraw due to poor performance before you could capitalize on your newfound knowledge.
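
Here is the Sharpe-ratio arithmetic from Equations (7) through (11) in one place:

```python
import numpy as np

# mu = 2% per month, sigma = 10% per month, fee f = 20%, K = 2 strategies out of Q = 100.
mu, sigma, f = 0.02, 0.10, 0.20
Q, K = 100, 2

sr_with_manager = np.sqrt(12) * (1 - f) * mu / sigma          # ~0.55 net of fees
T_star = K * np.log(Q - K)                                    # ~9.17 months of reverse engineering
sr_after = np.sqrt(12) * mu / sigma                           # ~0.69 once you know the strategy
sr_without_manager = (T_star / 12) * 0.0 + ((12 - T_star) / 12) * sr_after   # ~0.16 over the year

print(round(sr_with_manager, 2), round(T_star, 2), round(sr_without_manager, 2))
```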

5. Discussion

To cement ideas, let’s think about what this result implies for a financial econometrician. We’ve known since the 1970s that there is a strong relationship between oil shocks and the rest of the economy. e.g., see Hamilton (1983), Lamont (1997), and Hamilton (2003). Imagine you’re now an econometrician, and you go back and pinpoint the exact hour when each fracking news shock occurred over the last 40 years. Using this information, you then run an event study which finds that petroleum stocks affected by each news shock display a positive cumulative abnormal return over the course of the following week. Would this be evidence of a market inefficiency? Are traders still under-reacting to oil shocks? No. Ex post event studies assume that traders know exactly what is and what isn’t important in real time. Non-petroleum industry specialists who didn’t lose sleep researching hydraulic fracturing have to parse out which shocks are relevant from prices alone. This takes time. In the interim, this knowledge is local.

How Quickly Can We Decipher Price Signals?

1. Introduction

There are many different attribute-specific shocks that might affect an asset’s fundamental value in any given period. e.g., the prices of all stocks held in model-driven long/short equity funds might suddenly plummet as happened in the Quant Meltdown of August 2007. Alternatively, new city parking regulations might raise the value of homes with a half circle driveway. Innovations in asset prices are signals containing 2 different kinds of information: a) which of these Q different shocks has taken place and b) how big each of them was.

It’s often a challenge for traders to answer question (a) in real time. e.g., Daniel (2009) notes that during the Quant Meltdown “markets appeared calm to non-quantitative investors… you could not tell that anything was happening without quant goggles.” This post asks the question: How many transactions do traders need to see in order to identify shocked attributes? The surprising result is that there is a well-defined and calculable answer to this question that is independent of traders’ cognitive abilities. Local knowledge is an unavoidable consequence of this location recovery bound.

2. Motivating Example

It’s easiest to see where this location recovery bound comes from via a short example. Suppose you moved away from Chicago a year ago, and now you’re moving back and looking for a house. When looking at a list of recent sales prices, you find yourself surprised. People must have changed their preferences for 1 of 7 different amenities: ^{(1)}a 2 car garage, ^{(2)}a 3rd bedroom, ^{(3)}a half-circle driveway, ^{(4)}granite countertops, ^{(5)}energy efficient appliances, ^{(6)}central A/C, or ^{(7)}a walk-in closet. Having the mystery amenity raises the sale price by \beta > 0 dollars. To be sure, you would know how preferences had evolved if you had lived in Chicago the whole time; however, in the absence of this local knowledge, how many sales would you need to see in order to figure out which of the 7 amenities mattered?

The answer is 3. Where does this number come from? For ease of explanation, let’s normalize the expected house prices to \mathrm{E}_{t-1}[p_{n,t}] = 0. Suppose you found one house with amenities \{1,3,5,7\}, a second house with amenities \{2, 3, 6, 7\}, and a third house with amenities \{4, 5, 6,7\}. The combination of prices for these 3 houses would reveal exactly which amenity had been shocked. i.e., if only the first house’s price was higher than expected, p_{1,t} \approx \beta, then Chicagoans must have changed their preferences for having a 2 car garage:

(1)   \begin{equation*} {\small  \begin{bmatrix} p_{1,t} \\ p_{2,t} \\ p_{3,t} \end{bmatrix}  = \begin{bmatrix} \beta \\ 0 \\ 0 \end{bmatrix}  =  \begin{bmatrix}  1 & 0 & 1 & 0 & 1 & 0 & 1  \\  0 & 1 & 1 & 0 & 0 & 1 & 1  \\  0 & 0 & 0 & 1 & 1 & 1 & 1  \end{bmatrix} \begin{bmatrix}  \beta \\ 0 \\ \vdots \\ 0  \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{bmatrix}  } \quad \text{with} \quad \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma^2), \, \beta \gg \sigma \end{equation*}

By contrast, if it was the case that p_{1,t} \approx \beta, p_{2,t} \approx \beta, and p_{3,t} \approx \beta, then you would know that people now value walk-in closets much more than they did a year ago.

Here is the key point. 3 sales is just enough information to answer 7 yes or no questions and rule out the possibility of no change:

(2)   \begin{align*}   7 = 2^3 - 1 \end{align*}

N = 4 sales simply narrows your error bars around the exact value of \beta. N = 2 sales only allows you to distinguish between subsets of amenities. e.g., seeing just the 1st and 2nd houses with unexpectedly high prices only tells you that people like either half-circle driveways or walk-in closets more. It doesn’t tell you which one. The problem changes character at N = N^\star(7,1) = 3… i.e., the location recovery bound.

3. Main Results

This section formalizes the intuition from the example above. Think about innovations in the price of asset n as the sum of a meaningful signal, f_n, and some noise, \epsilon_n:

(3)   \begin{align*} p_{n,t} - \mathrm{E}_{t-1}[p_{n,t}] &= f_n + \epsilon_n = \sum_{q=1}^Q \beta_q \cdot x_{n,q} + \epsilon_n \quad \text{with} \quad \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma^2) \end{align*}

where the signal can be decomposed into Q different attribute-specific shocks. In Equation (3) above, \beta_q \neq 0 denotes a shock of size |\beta_q| to the qth attribute and x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sfrac{1}{N}) denotes the extent to which asset n displays the qth attribute. Each of the data columns is normalized so that \sum_{n=1}^N \mathrm{Var}[x_{n,q}] = 1.

In general, when there are more attributes than shocks, K < Q, picking out exactly which K attributes have realized a shock is a combinatorially hard problem as discussed in Natarajan (1995). However, suppose you had an oracle which could bypass this hurdle and tell you exactly which attributes had realized a shock:

(4)   \begin{align*} \Vert \mathbf{f} - \hat{\mathbf{f}}^{\text{Oracle}} \Vert_{\ell_2} &= \inf_{\{\hat{\boldsymbol \beta} : \#[\beta_q \neq 0] \leq K\}} \, \Vert \mathbf{f} - \mathbf{X}\hat{\boldsymbol \beta} \Vert_{\ell_2} \end{align*}

In this world, your mean squared prediction error, \mathrm{MSE} = \frac{1}{N} \cdot \Vert \mathbf{p} - \hat{\mathbf{p}} \Vert_{\ell_2}^2, is given by:

(5)   \begin{align*} \mathrm{MSE}^{\text{Oracle}} & = \min_{0 \leq K \leq Q} \, \left\{ \, \frac{1}{N^{\text{Oracle}}} \cdot \Vert \mathbf{p} - \hat{\mathbf{p}}^{\text{Oracle}} \Vert_{\ell_2}^2 \, \right\} = \min_{0 \leq K \leq Q} \, \left\{ \, \frac{1}{N^{\text{Oracle}}} \cdot \Vert \mathbf{f} - \hat{\mathbf{f}}^{\text{Oracle}} \Vert_{\ell_2}^2 + \sigma^2 \, \right\} \end{align*}

where N^{\text{Oracle}} = N^{\text{Oracle}}(Q,K) = K denotes the number of observations necessary for your oracle. e.g., if each \beta_q \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{Bin}(\kappa), then \mathrm{MSE}^{\text{Oracle}} = \sigma^2 since there is only variation in the location of the shocks and not the size of the shocks.

It turns out that if each asset isn’t too redundant relative to the number of shocked attributes, then you can achieve a mean squared error that is within a log factor of the oracle’s mean squared error using many fewer observations than there are attributes, N \ll Q. e.g., suppose that you used a lasso estimator:

(6)   \begin{align*} \hat{\boldsymbol \beta}^{\text{Lasso}} &= \arg\min_{\hat{\boldsymbol \beta}} \, \left\{ \, \frac{1}{2} \cdot \Vert \mathbf{p} - \mathbf{X} \hat{\boldsymbol \beta} \Vert_{\ell_2}^2 + \lambda_{\ell_1} \cdot \sigma \cdot \Vert \hat{\boldsymbol \beta} \Vert_{\ell_1} \, \right\} \end{align*}

with \lambda_{\ell_1} = 2 \cdot \sqrt{2 \cdot \log Q}. Then, Candes and Davenport (2011) show that:

(7)   \begin{align*} \mathrm{MSE}^{\text{Lasso}} &\leq \gamma \cdot \inf_{0 \leq K \leq Q} \, \left\{ \, \frac{1}{K} \cdot \Vert \mathbf{f} - \hat{\mathbf{f}}^{\text{Oracle}} \Vert_{\ell_2}^2 + \log Q \cdot \sigma^2 \, \right\} \end{align*}

with probability 1 - 6 \cdot Q^{-2 \cdot \log 2} - Q^{-1} \cdot (2 \cdot \pi \cdot \log Q)^{-\sfrac{1}{2}} where \gamma > 0 is a small numerical constant. However, this paragraph is quite loose. i.e., what exactly does the condition that “each asset isn’t too redundant relative to the number of shocked attributes” mean? Exactly how many observations would you need to see if each asset’s attribute exposure is drawn as x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sfrac{1}{N})?
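
For concreteness, here’s a small simulated sketch of the lasso in Equation (6). Two caveats: the problem sizes below are arbitrary, and sklearn parameterizes the penalty per observation, so the \lambda_{\ell_1} \cdot \sigma penalty in Equation (6) maps to alpha = \lambda_{\ell_1} \cdot \sigma / N:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
Q, K, N, sigma = 1_000, 5, 200, 0.1

X = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, Q))       # x_{n,q} ~ N(0, 1/N)
beta = np.zeros(Q)
beta[rng.choice(Q, size=K, replace=False)] = 1.0          # K shocked attributes, unit-sized shocks
p = X @ beta + sigma * rng.normal(size=N)                  # price surprises

lam = 2.0 * np.sqrt(2.0 * np.log(Q))
fit = Lasso(alpha=lam * sigma / N, fit_intercept=False, max_iter=100_000).fit(X, p)

mse_lasso = np.mean((X @ beta - X @ fit.coef_) ** 2) + sigma ** 2   # estimation error + noise floor
mse_oracle = sigma ** 2    # noise floor: an oracle that knew the shocked attributes and their sizes
print(round(mse_lasso, 4), mse_oracle, round(np.log(Q) * sigma ** 2, 4))
```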

Here’s where things get really interesting. Wainwright (2009) shows that there is a sharp bound on the number of observations, N^\star = N^\star(Q,K), that you need to observe in order for \ell_1-type estimators like Lasso to succeed when attribute exposure is drawn iid Gaussian:

(8)   \begin{align*} N^\star(Q,K) &= \mathcal{O}\left[K \cdot \log(Q - K)\right] \end{align*}

with Q \to \infty, K \to \infty, and \sfrac{K}{Q} \to \kappa for some \kappa > 0. When traders see N < N^\star(Q,K) observations, picking out which attributes have realized a shock is an NP-hard problem; whereas, when they see N \geq N^\star(Q,K) observations, there exist efficient convex optimization algorithms that solve this problem. This result says how the N^\star = 3 location recovery bound from the motivating example generalizes to arbitrary numbers of attributes, Q, and shocks, K.

4. Just Identified

I conclude this post by discussing the non-sparse case. i.e., \boldsymbol \beta usually isn’t sparse in econometric textbooks à la Hayashi, Wooldridge, or Angrist and Pischke. When every one of the Q attributes matters, it’s easy to decide which attributes to pay attention to—i.e., all of them. In this situation the mean squared error for an oracle is the same as the mean squared error for mere mortals:

(9)   \begin{align*} \mathrm{MSE}^{\text{Oracle}} & = \frac{1}{Q} \cdot \Vert \mathbf{f} - \hat{\mathbf{f}}^{\text{Oracle}} \Vert_{\ell_2}^2 + \sigma^2 = \frac{1}{Q} \cdot \Vert \mathbf{f} - \hat{\mathbf{f}}^{\text{Mortal}} \Vert_{\ell_2}^2 + \sigma^2 = \mathrm{MSE}^{\text{Mortal}} \end{align*}

Does the location recovery bound disappear in this setting?

No. This is not the case. Indeed, the attribute selection bound corresponds to the usual N \geq Q requirement for identification. To see why, let’s return to the motivating example in Section 2, and consider the case where any combination of the 7 attributes could have realized a shock. This leaves us with 128 different shock combinations:

(10)   \begin{align*} 128 &= {7 \choose 0} + {7 \choose 1} + {7 \choose 2} + {7 \choose 3} + {7 \choose 4} + {7 \choose 5} + {7 \choose 6} + {7 \choose 7} \\ &= 1 + 7 + 21 + 35 + 35 + 21 + 7 + 1 \\ &= 2^7 \end{align*}

so that N^\star = 7 gives just enough information to identify which combination of shocks was realized. More generally, we have that for any number of attributes, Q:

(11)   \begin{align*} 2^Q &= \sum_{k=0}^Q {Q \choose k} \end{align*}

This gives an interesting information theoretic interpretation to the meaning of “just identified” that has nothing to do with linear algebra or the invertibility of a matrix.
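
And the counting identity is easy to sanity check:

```python
from math import comb

Q = 7
# Sum of binomial coefficients over all subset sizes equals 2^Q, as in Equations (10) and (11).
print(sum(comb(Q, k) for k in range(Q + 1)), 2 ** Q)   # 128 128
```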