Two Period Kyle (1985) Model

1. Motivation

This post shows how to solve for the equilibrium price impact and demand coefficients in a 2 period Kyle (1985)-type model where informed traders see a noisy signal about the fundamental value of a single asset. There are various other places where you can see how to solve this sort of model, e.g., Markus Brunnermeier’s class notes or Laura Veldkamp’s excellent textbook. Both of these sources solve the static 1 period model in closed form and then give the general T \geq 1 period form of the dynamic multi-period model. Any intuition that I can get from a dynamic model usually comes in the first 2 periods, so I find myself frequently working out the 2 period model explicitly. Here is that model.

2. Market description

I begin by outlining the market setting. Consider a world with 2 trading periods t = 1, 2 and a single asset whose fundamental value is given by:

(1)   \begin{align*} v \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0, \sigma_{v}^2) \end{align*}

in units of dollars per share. There are 2 kinds of agents: informed traders and noise traders. Both kinds of traders submit market orders to a group of market makers who see only the aggregate order flow, \Delta x_t, each period:

(2)   \begin{align*} \Delta x_t &= \Delta y_t + \Delta z_t \end{align*}

where \Delta y_t denotes the order flow from the informed traders and \Delta z_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0, \sigma_{\Delta z}^2) denotes the order flow from the noise traders. The market makers face perfect competition, so they have to set the price each period equal to their expectation of the fundamental value of the asset given aggregate demand:

(3)   \begin{align*} p_1 &= \mathrm{E}[v|\Delta x_1] \qquad \text{and} \qquad p_2 = \mathrm{E}[v|\Delta x_1, \Delta x_2] \end{align*}

Prior to the start of the first trading period, informed traders see an unbiased signal s about the asset’s fundamental value:

(4)   \begin{align*} s = v + \epsilon \qquad \text{where} \qquad \epsilon \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_{\epsilon}^2) \end{align*}

so that, conditional on v, s \sim \mathrm{N}(v,\sigma_{\epsilon}^2). In period 1, these traders choose the number of shares to demand from the market makers, \Delta y_1, to solve:

(5)   \begin{align*} \mathrm{H}_0 = \max_{\Delta y_1} \, \mathrm{E}\left[ \, (v - p_1) \cdot \Delta y_1 + \mathrm{H}_1 \, \middle| \, s \, \right] \end{align*}

where \mathrm{H}_{t-1} denotes their value function entering period t. Similarly, in period 2 these traders optimize:

(6)   \begin{align*} \mathrm{H}_1 = \max_{\Delta y_2} \, \mathrm{E} \left[ \, (v - p_2) \cdot \Delta y_2 \, \middle| \, s, \, p_1  \ \right] \end{align*}

The extra \mathrm{H}_1 term shows up in informed traders’ time t=1 optimization problem but not their time t=2 optimization problem because the model ends after the second trading period.

An equilibrium is a linear demand rule for the informed traders in each period:

(7)   \begin{align*}  \Delta y_t = \alpha_{t-1} + \beta_{t-1} \cdot s \end{align*}

and a linear market maker pricing rule in each period:

(8)   \begin{align*}  p_t = \kappa_{t-1} + \lambda_{t-1} \cdot \Delta x_t \end{align*}

such that given the demand rule in each period the pricing rule solves the market maker’s problem, and given the market maker pricing rule in each period the demand rule solves the trader’s problem.

3. Information and Updating

The informed traders need to update their beliefs about the fundamental value of the asset after observing their signal s. Using DeGroot (1969)-style updating, it’s possible to compute their posterior beliefs:

(9)   \begin{align*} \sigma_{v|s}^2 &= \left( \frac{\sigma_{\epsilon}^2}{\sigma_v^2 + \sigma_{\epsilon}^2} \right) \times \sigma_v^2 \qquad \text{and} \qquad \mu_{v|s} = \underbrace{\left( \frac{\sigma_v^2}{\sigma_v^2 + \sigma_{\epsilon}^2} \right)}_{\theta} \times s \end{align*}

After observing aggregate order flow in period t=1, market makers need to update their beliefs about the true value of the asset. Using the linearity of informed traders’ demand rule, we can rewrite the aggregate demand as a signal about the fundamental value as follows:

(10)   \begin{align*} \frac{\Delta x_1}{\beta_0} &= v + \left( \epsilon + \frac{\Delta z_1}{\beta_0} \right) \end{align*}

Note that both the signal error and noise trader demand cloud the market makers’ inference. Using the same DeGroot updating strategy, and letting \sigma_s^2 = \sigma_v^2 + \sigma_{\epsilon}^2 denote the unconditional variance of the informed traders’ signal, it’s possible to compute the market makers’ posterior beliefs about v as follows:

(11)   \begin{align*} \sigma_{v|\Delta x_1}^2 = \left( \frac{\beta_0^2 \cdot \sigma_{\epsilon}^2 + \sigma_{\Delta z}^2}{\beta_0^2 \cdot \sigma_s^2 + \sigma_{\Delta z}^2} \right) \times \sigma_v^2 \quad \text{and} \quad \mu_{v|\Delta x_1} = \left( \frac{\beta_0^2 \cdot \sigma_v^2}{\beta_0^2 \cdot \sigma_s^2 + \sigma_{\Delta z}^2} \right) \times \frac{\Delta x_1}{\beta_0} \end{align*}

It’s also possible to view the aggregate order flow in time t=1 as a signal about the informed traders’ signal rather than the fundamental value of the asset:

(12)   \begin{align*} \frac{\Delta x_1}{\beta_0} &= s + \frac{\Delta z_1}{\beta_0} \end{align*}

yielding posterior beliefs:

(13)   \begin{align*} \sigma_{s|\Delta x_1}^2 = \left( \frac{\sigma_{\Delta z}^2}{\sigma_{\Delta z}^2 + \beta_0^2 \cdot \sigma_s^2} \right) \times \sigma_s^2 \quad \text{and} \quad \mu_{s|\Delta x_1} = \left( \frac{\beta_0^2 \cdot \sigma_s^2}{\sigma_{\Delta z}^2 + \beta_0^2 \cdot \sigma_s^2} \right) \times \frac{\Delta x_1}{\beta_0} \end{align*}
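The conditional variances in equations (11) and (13) can be sanity-checked numerically. The sketch below is not from the original post: it writes down the covariances of (v, s, \Delta x_1) implied by s = v + \epsilon and \Delta x_1 = \beta_0 \cdot s + \Delta z_1, applies the standard Gaussian projection formulas, and compares the results with the closed forms. The function name and the parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def project_on_order_flow(sigma_v2, sigma_e2, sigma_z2, beta0):
    """Conditional variances and regression slopes of v and s given dx1.

    Assumes s = v + eps and dx1 = beta0 * s + dz1 with independent,
    mean-zero normal shocks, as in the market description.
    """
    sigma_s2 = sigma_v2 + sigma_e2                 # Var[s]
    var_dx = beta0**2 * sigma_s2 + sigma_z2        # Var[dx1]
    cov_v_dx = beta0 * sigma_v2                    # Cov[v, dx1]
    cov_s_dx = beta0 * sigma_s2                    # Cov[s, dx1]
    return {
        "var_v": sigma_v2 - cov_v_dx**2 / var_dx,  # Var[v | dx1]
        "var_s": sigma_s2 - cov_s_dx**2 / var_dx,  # Var[s | dx1]
        "slope_v": cov_v_dx / var_dx,              # E[v | dx1] = slope_v * dx1
        "slope_s": cov_s_dx / var_dx,              # E[s | dx1] = slope_s * dx1
    }

# Illustrative parameter values (assumptions, not taken from the post).
sigma_v2, sigma_e2, sigma_z2, beta0 = 1.0, 0.25, 1.0, 0.6
sigma_s2 = sigma_v2 + sigma_e2
denom = beta0**2 * sigma_s2 + sigma_z2

out = project_on_order_flow(sigma_v2, sigma_e2, sigma_z2, beta0)

# Closed-form conditional variances from equations (11) and (13).
var_v_closed = (beta0**2 * sigma_e2 + sigma_z2) / denom * sigma_v2
var_s_closed = sigma_z2 / denom * sigma_s2

print(np.isclose(out["var_v"], var_v_closed))
print(np.isclose(out["var_s"], var_s_closed))
```

The slope on \Delta x_1 in the projection of v is the ratio \mathrm{Cov}[v, \Delta x_1] / \mathrm{Var}[\Delta x_1], which is the same object that defines the period 1 price impact coefficient.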

4. Second Period Solution

With the market description and information sets in place, I can now solve the model by working backwards. Let’s start with the market makers’ time t=2 problem. Since the market maker faces perfect competition, the time t=1 price has to satisfy the condition:

(14)   \begin{align*} \mathrm{E}[v|\Delta x_1] &= p_1 \end{align*}

As a result, \kappa_0 = 0 and

(15)   \begin{align*} \kappa_1  &= \mathrm{E}[v|\Delta x_1] - \lambda_1 \cdot \mathrm{E}[\Delta x_2|\Delta x_1] = p_1 - \underbrace{\sfrac{1}{2} \cdot (\theta \cdot \mu_{s | \Delta x_1} - p_1)}_{=0} = p_1 \end{align*}

However, this is about all we can say without knowing more about how the informed traders behave.

Moving to the informed traders’ time t=2 problem, we see that they optimize over the size of their time t=2 market order with knowledge of their private signal, s, and the time t=1 price, p_1, as follows:

(16)   \begin{align*} \mathrm{H}_1 &= \max_{\Delta y_2} \ \mathrm{E} \left[ \, \left(v - \kappa_1 - \lambda_1 \cdot \Delta x_2 \right) \cdot \Delta y_2 \, \middle| \, s, p_1  \, \right] \end{align*}

Taking the first order condition yields an expression for their optimal time t=2 demand:

(17)   \begin{align*} \Delta y_2 &= \underbrace{- \, \frac{p_1}{2 \cdot \lambda_1}}_{\alpha_1} + \underbrace{\frac{\theta}{2 \cdot \lambda_1}}_{\beta_1} \cdot s \end{align*}

Informed traders place market orders in period t=2 that are linearly increasing in the size of their private signal; what’s more, if we hold the equilibrium value of \lambda_1 constant, they will trade more aggressively when they have a more accurate private signal (i.e., \sigma_{\epsilon}^2 \searrow 0).

If we now return to the market makers’ problem, we can partially solve for the price impact coefficient in period t=2:

(18)   \begin{align*} \lambda_1  &= \frac{\mathrm{Cov}[ \Delta x_2, v | \Delta x_1]}{\mathrm{Var}[ \Delta x_2| \Delta x_1]} = \frac{\mathrm{Cov}\left[ \, \alpha_1 + \beta_1 \cdot s + \Delta z_2, v \, \middle| \, \Delta x_1 \, \right]}{\mathrm{Var}\left[ \, \alpha_1 + \beta_1 \cdot s + \Delta z_2 \, \middle| \, \Delta x_1 \, \right]} = \frac{\beta_1 \cdot \sigma_{v|\Delta x_1}^2}{\beta_1^2 \cdot \sigma_{s|\Delta x_1}^2 + \sigma_{\Delta z}^2} \end{align*}

However, to go any further and solve for \sigma_{v|\Delta x_1}^2 or \sigma_{s|\Delta x_1}^2, we need to know how aggressively traders will act on their private information in period t=1… we need to know \beta_0.
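Even without \beta_0, equations (17) and (18) together pin down \lambda_1 as a function of the two conditional variances. Here is a short worked derivation, taking equation (18) as given and substituting \beta_1 = \sfrac{\theta}{(2 \cdot \lambda_1)}:

\begin{align*} \lambda_1 \cdot \left( \frac{\theta^2}{4 \cdot \lambda_1^2} \cdot \sigma_{s|\Delta x_1}^2 + \sigma_{\Delta z}^2 \right) &= \frac{\theta}{2 \cdot \lambda_1} \cdot \sigma_{v|\Delta x_1}^2 \\ \lambda_1^2 \cdot \sigma_{\Delta z}^2 &= \frac{\theta}{2} \cdot \left( \sigma_{v|\Delta x_1}^2 - \frac{\theta}{2} \cdot \sigma_{s|\Delta x_1}^2 \right) \\ \lambda_1 &= \frac{1}{\sigma_{\Delta z}} \cdot \sqrt{ \frac{\theta}{2} \cdot \left( \sigma_{v|\Delta x_1}^2 - \frac{\theta}{2} \cdot \sigma_{s|\Delta x_1}^2 \right) } \end{align*}

The second line multiplies both sides by \lambda_1 and collects terms. The third line takes the positive root, on the assumption that prices should rise with order flow.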

5. First Period Solution

To solve the informed traders’ time t=1 problem, I first make an educated guess about the functional form of their value function:

(19)   \begin{align*} \mathrm{E}[\mathrm{H}_1|s] &= \psi_1 + \omega_1 \cdot \left( \mu_{v|s} - p_1 \right)^2 \end{align*}

We can now solve for the time t=1 equilibrium parameter values by plugging in the linear price impact and demand coefficients to the informed traders’ optimization problem:

(20)   \begin{align*} \mathrm{H}_0 &= \max_{\Delta y_1} \, \mathrm{E}\left[ \, (v - p_1) \cdot \Delta y_1 + \psi_1 + \omega_1 \cdot \left( \theta \cdot s - p_1 \right)^2 \, \middle| \, s \, \right] \end{align*}

Taking the first order condition with respect to the informed traders’ time t=1 demand gives:

(21)   \begin{align*} 0 &= \mathrm{E}\left[ \, \left(v - 2 \cdot \lambda_0 \cdot \Delta y_1 - \lambda_0 \cdot \Delta z_1 \right)   - 2 \cdot \omega_1 \cdot \lambda_0 \cdot \left( \theta \cdot s - \lambda_0 \cdot \{ \Delta y_1 + \Delta z_1  \} \right) \, \middle| \, s \, \right] \end{align*}

Evaluating their expectation operator yields:

(22)   \begin{align*} 0 &= \theta \cdot s - 2 \cdot \lambda_0 \cdot \Delta y_1 - 2 \cdot \omega_1 \cdot \lambda_0 \cdot \left\{   \theta \cdot s - \lambda_0 \cdot \Delta y_1 \right\}  \end{align*}

Rearranging terms then gives the informed traders’ demand rule which is a linear function of the signal they got about the asset’s fundamental value:

(23)   \begin{align*} \Delta y_1 &= \frac{\theta}{2 \cdot \lambda_0} \cdot \left( \frac{1 - 2 \cdot \omega_1 \cdot \lambda_0}{1 - \omega_1 \cdot \lambda_0} \right) \cdot s \end{align*}

Finally, using the same projection formula as above, we can solve for the market makers’ price impact rule:

(24)   \begin{align*} \lambda_0 &= \frac{\mathrm{Cov}[ \Delta x_1, v]}{\mathrm{Var}[ \Delta x_1]} = \frac{\mathrm{Cov}[\alpha_0 + \beta_0 \cdot (v + \epsilon) + \Delta z_1, v]}{\mathrm{Var}[ \alpha_0 + \beta_0 \cdot s + \Delta z_1]} = \frac{\beta_0 \cdot \sigma_v^2}{\beta_0^2 \cdot \sigma_s^2 + \sigma_{\Delta z}^2} \end{align*}

6. Guess Verification

To wrap things up, let’s now check that my guess about the value function is consistent. Looking at the informed traders’ time t=2 problem, and substituting in the equilibrium coefficients we get:

(25)   \begin{align*} \mathrm{H}_1 &= \mathrm{E} \left[ \, \left(v - p_2 \right) \cdot \Delta y_2 \, \middle| \, s  \, \right] \\ &= \mathrm{E} \left[ \, \left(v - \left\{p_1 + \lambda_1 \cdot \left( \alpha_1 + \beta_1 \cdot s + \Delta z_2 \right) \right\}  \right) \times \left( \alpha_1 + \beta_1 \cdot s \right) \, \middle| \, s  \, \right] \end{align*}

Using the fact that \alpha_1 = -\sfrac{p_1}{(2 \cdot \lambda_1)} and \beta_1 = \sfrac{\theta}{(2 \cdot \lambda_1)} then leads to:

(26)   \begin{align*} \mathrm{H}_1 &= \mathrm{E} \left[ \, \frac{1}{2 \cdot \lambda_1} \times \left( \left\{ v - p_1 \right\} - \frac{1}{2} \cdot \left\{ \theta \cdot s - p_1 \right\} - \lambda_1 \cdot \Delta z_2 \right) \times \left( \theta \cdot s - p_1 \right) \, \middle| \, s  \, \right] \end{align*}

Adding and subtracting \mu_{v|s} = \theta \cdot s in the first term simplifies things even further:

(27)   \begin{align*} \mathrm{H}_1 &= \mathrm{E} \left[ \, \frac{1}{2 \cdot \lambda_1} \times \left( \left\{ v - \theta \cdot s \right\} + \frac{1}{2} \cdot \left\{ \theta \cdot s - p_1 \right\} - \lambda_1 \cdot \Delta z_2 \right) \times \left( \theta \cdot s - p_1 \right) \, \middle| \, s  \, \right] \end{align*}

Thus, informed traders’ continuation value is quadratic in the distance between their expectation of the fundamental value and the period t=1 price:

(28)   \begin{align*} \mathrm{H}_1 &= \text{Const.} + \underbrace{\frac{1}{4 \cdot \lambda_1}}_{\omega_1} \cdot \left( \mu_{v|s} - p_1 \right)^2 \end{align*}

which is consistent with the original linear quadratic guess. Boom.

7. Numerical Analysis

Given the analysis above, we could derive the correct values of all the other equilibrium coefficients if we knew the optimal \beta_0. To compute the equilibrium coefficient values, make an initial guess, \widehat{\beta}_0, and use this guess to compute the values of the other equilibrium coefficients:

(29)   \begin{align*} \widehat{\lambda}_0 &\leftarrow \frac{\widehat{\beta}_0 \cdot \sigma_v^2}{\widehat{\beta}_0^2 \cdot \sigma_s^2 + \sigma_{\Delta z}^2} \\ \widehat{\sigma}_{v|\Delta x_1}^2 &\leftarrow \left( \frac{\widehat{\beta}_0^2 \cdot \sigma_{\epsilon}^2 + \sigma_{\Delta z}^2}{\widehat{\beta}_0^2 \cdot \sigma_s^2 + \sigma_{\Delta z}^2} \right) \cdot \sigma_v^2 \\ \widehat{\sigma}_{s|\Delta x_1}^2 &\leftarrow \left( \frac{\sigma_{\Delta z}^2}{\widehat{\beta}_0^2 \cdot \sigma_s^2 + \sigma_{\Delta z}^2} \right) \cdot \sigma_s^2 \\ \widehat{\lambda}_1 &\leftarrow \frac{1}{\sigma_{\Delta z}} \cdot \sqrt{ \frac{\theta}{2} \cdot \left( \widehat{\sigma}_{v|\Delta x_1}^2 - \frac{\theta}{2} \cdot \widehat{\sigma}_{s|\Delta x_1}^2 \right) } \end{align*}

Then, just iterate on the initial guess numerically until you find that:

(30)   \begin{align*} \widehat{\beta}_0 &= \frac{\theta}{2 \cdot \widehat{\lambda}_0} \cdot \left( \frac{1 - 2 \cdot \widehat{\omega}_1 \cdot \widehat{\lambda}_0}{1 - \widehat{\omega}_1 \cdot \widehat{\lambda}_0} \right) \end{align*}

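since we know that \beta_0 must satisfy this condition in equilibrium. Here \widehat{\omega}_1 = \sfrac{1}{(4 \cdot \widehat{\lambda}_1)} is the value function coefficient implied by the current guess.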
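The iteration in equations (29) and (30) can be sketched in a few lines. This is a minimal illustration rather than the original post's code: it evaluates the formulas exactly as written above and uses a damped fixed-point update, where the damping weight, starting guess, tolerance, and parameter values are all assumptions of mine.

```python
import math

def implied_coefficients(beta0, sigma_v2, sigma_e2, sigma_z2):
    """Evaluate equation (29) and the right-hand side of (30) at a guess beta0."""
    sigma_s2 = sigma_v2 + sigma_e2
    theta = sigma_v2 / sigma_s2
    denom = beta0**2 * sigma_s2 + sigma_z2
    lam0 = beta0 * sigma_v2 / denom
    var_v = (beta0**2 * sigma_e2 + sigma_z2) / denom * sigma_v2
    var_s = sigma_z2 / denom * sigma_s2
    lam1 = math.sqrt(0.5 * theta * (var_v - 0.5 * theta * var_s) / sigma_z2)
    omega1 = 1.0 / (4.0 * lam1)                     # equation (28)
    beta0_implied = (theta / (2.0 * lam0)
                     * (1.0 - 2.0 * omega1 * lam0) / (1.0 - omega1 * lam0))
    return {
        "lam0": lam0, "lam1": lam1, "omega1": omega1,
        "beta1": theta / (2.0 * lam1),              # equation (17)
        "beta0_implied": beta0_implied,
    }

def solve_beta0(sigma_v2, sigma_e2, sigma_z2,
                beta0=0.5, damping=0.5, tol=1e-12, max_iter=10_000):
    """Damped fixed-point iteration on equation (30). Damping is an assumption."""
    for _ in range(max_iter):
        coef = implied_coefficients(beta0, sigma_v2, sigma_e2, sigma_z2)
        if abs(coef["beta0_implied"] - beta0) < tol:
            return beta0, coef
        beta0 = damping * beta0 + (1.0 - damping) * coef["beta0_implied"]
    raise RuntimeError("fixed-point iteration did not converge")

# Illustrative parameter values (assumptions, not taken from the post).
beta0_star, coef_star = solve_beta0(sigma_v2=1.0, sigma_e2=0.25, sigma_z2=1.0)
print(beta0_star, coef_star["lam0"], coef_star["lam1"], coef_star["beta1"])
```

Plain iteration without damping tends to oscillate around the solution because the right-hand side of equation (30) is decreasing in the guess, which is why the sketch averages the old and implied values. A bracketing root finder applied to the difference between the two sides of equation (30) would be a reasonable alternative.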

The figure below plots the coefficient values at various levels of noise trader demand and signal error for inspection. Here is the code. The informed traders are more aggressive when there is more noise trader demand (i.e., moving across panels from left to right) and in the second trading period (i.e., blue vs red). They trade less aggressively as their signal quality degrades (i.e., moving within panel from left to right).


Fano’s Inequality and Resource Allocation

1. Motivation

This post describes Fano’s inequality. It’s not a particularly complicated result. After all, it first shows up on page 33 of Cover and Thomas (1991). However, I recently ran across the result again for the first time in a while, and I realized it had an interesting asset pricing implication.

Roughly speaking, what does the inequality say? Suppose I need to make some decision, and you give me some news that helps me decide. Fano’s inequality gives a lower bound on the probability that I end up making the wrong choice as a function of my initial uncertainty and how informative your news was. What’s cool about the result is that it doesn’t place any restrictions on how I make my decision, i.e., it gives a lower bound on my best case error probability. If the bound is negative, then in principle I might be able to eliminate my decision error. If the bound is positive (i.e., binds), then there is no way for me to use the news you gave me to always make the right decision.

Now, back to asset pricing. We want accurate prices so that, in the words of Fama (1970), they can serve as “signals for resource allocation.” If we treat resource allocation as a discrete choice problem and prices as news, then Fano’s inequality applies and gives bounds on how effectively decision makers can use this information.

2. Notation

I start by laying out the notation. Imagine that a decision maker wants to predict the value of a random variable \widetilde{X} that can take on N possible values:

(1)   \begin{align*} \widetilde{X} \in \{ x_1,x_2,\ldots,x_N \} \end{align*}

e.g., you might think about the decision maker as a farmer and \widetilde{X} as the most profitable crop he can plant next fall. The probability that \widetilde{X} takes on each of the N values is given by:

(2)   \begin{align*} \mathrm{Pr}[\widetilde{X} = x_n] = p_n \end{align*}

Finally, I use the \mathrm{H}[\cdot] operator to denote the entropy of a random variable:

(3)   \begin{align*} \mathrm{H}[\widetilde{X}] &= - \sum_{n=1}^N p_n \cdot \log_2(p_n) \end{align*}

3. Main Result

Now, imagine that the farmer knows which crop currently has the highest futures price, \widetilde{Y}, and that this price signal is correlated with the correct choice of which crop to plant:

(4)   \begin{align*} \mathrm{Cor}[\widetilde{X},\widetilde{Y}] \neq 0 \end{align*}

The farmer could use this information to make an educated guess about the right crop to plant:

(5)   \begin{align*} f(\widetilde{Y}) \in \{ x_1, x_2, \ldots, x_N\} \end{align*}

e.g., his rule might be something simple like, “Plant the crop with the highest futures price today.” Or, it might be something more complicated like, “Plant the crop with the highest futures price today unless it’s corn in which case plant soy beans.” I am agnostic about what function f(\cdot) the farmer uses to turn price signals into crop decisions. Let \widetilde{Z} indicate whether or not he gets the decision wrong:

(6)   \begin{align*} \widetilde{Z} &= \begin{cases} 0 &\text{if } f(\widetilde{Y}) = \widetilde{X} \\ 1 &\text{else } \end{cases} \end{align*}

Fano’s inequality links the probability that the farmer makes the wrong crop choice, \mathrm{E}[\widetilde{Z}], to his remaining entropy after seeing the price signals, \mathrm{H}[\widetilde{X}|\widetilde{Y}]:

(7)   \begin{align*} 1 + \mathrm{E}[\widetilde{Z}] \cdot \log_2(N) \geq \mathrm{H}[\widetilde{X}|\widetilde{Y}] \end{align*}

4. Quick Proof

The result follows from applying the entropy chain rule in 2 different ways. Let’s think about the entropy of the joint distribution of errors and crop choices, (\widetilde{Z},\widetilde{X}), after the farmer sees the price signal, \widetilde{Y}. The entropy chain rule says that we can rewrite this quantity as:

(8)   \begin{align*} \mathrm{H}[\widetilde{Z},\widetilde{X}|\widetilde{Y}] &= \mathrm{H}[\widetilde{X}|\widetilde{Y}] + \underbrace{\mathrm{H}[\widetilde{Z}|\widetilde{X},\widetilde{Y}]}_{=0} \end{align*}

where the second term on the right-hand side is 0 since if you know the correct crop choice you will never make an error. Yet, we can also rewrite \mathrm{H}[\widetilde{Z},\widetilde{X}|\widetilde{Y}] as follows using the exact same chain rule:

(9)   \begin{align*} \mathrm{H}[\widetilde{Z},\widetilde{X}|\widetilde{Y}] &= \mathrm{H}[\widetilde{Z}|\widetilde{Y}] + \mathrm{H}[\widetilde{X}|\widetilde{Z},\widetilde{Y}] \end{align*}

It’s not like either \widetilde{Z} or \widetilde{X} has a privileged position in the joint distribution (\widetilde{Z},\widetilde{X})!

Applying the chain rule in 2 ways then leaves us with the equation:

(10)   \begin{align*} \mathrm{H}[\widetilde{Z}|\widetilde{Y}] + \mathrm{H}[\widetilde{X}|\widetilde{Z},\widetilde{Y}] & = \mathrm{H}[\widetilde{X}|\widetilde{Y}] \end{align*}

The first term on the left-hand side is bounded above by:

(11)   \begin{align*} \mathrm{H}[\widetilde{Z}|\widetilde{Y}] \leq \mathrm{H}[\widetilde{Z}] \leq 1 \end{align*}

since conditioning on a random variable weakly lowers entropy and a binary choice variable has at most 1 bit of information. Rewriting the second term on the left-hand side as follows:

(12)   \begin{align*} \mathrm{H}[\widetilde{X}|\widetilde{Z},\widetilde{Y}] &= \mathrm{Pr}[\widetilde{Z} = 0] \cdot \underbrace{\mathrm{H}[\widetilde{X}|\widetilde{Z} = 0,\widetilde{Y}]}_{=0} + \mathrm{Pr}[\widetilde{Z} = 1] \cdot \mathrm{H}[\widetilde{X}|\widetilde{Z} = 1,\widetilde{Y}] \end{align*}

then gives the desired result since the uniform distribution maximizes a discrete variable’s entropy:

(13)   \begin{align*} \mathrm{H}[\widetilde{X}|\widetilde{Z} = 1,\widetilde{Y}] \leq \log_2(N - 1) \leq \log_2(N) \end{align*}

5. Application

Now let’s consider an application. Suppose that the farmer can plant N=4 different crops: 1) corn, 2) wheat, 3) soy, and 4) rice. Let \widetilde{X} denote the most profitable of these crops to plant, and let \widetilde{Y} denote the crop with the highest current futures price. Suppose the choice and price variables have the following joint distribution:

(14)   \begin{align*} \bordermatrix{~   & x_1           & x_2           & x_3           & x_4           \cr               y_1 & \sfrac{1}{8}  & \sfrac{1}{16} & \sfrac{1}{32} & \sfrac{1}{32} \cr               y_2 & \sfrac{1}{16} & \sfrac{1}{8}  & \sfrac{1}{32} & \sfrac{1}{32} \cr               y_3 & \sfrac{1}{16} & \sfrac{1}{16} & \sfrac{1}{16} & \sfrac{1}{16} \cr               y_4 & \sfrac{1}{4}  & 0             & 0             & 0             \cr} \end{align*}

e.g., this table reads that 25{\scriptstyle \%} of the time rice has the highest futures contract price, and conditional on rice having the highest futures price the farmer should always plant corn the following year. In this world, the conditional entropy of the farmer’s decision after seeing the price signal is given by:

(15)   \begin{align*} \mathrm{H}[\widetilde{X}|\widetilde{Y}] &= \sum_{n=1}^4 \mathrm{Pr}[\widetilde{Y} = y_n] \cdot \mathrm{H}[\widetilde{X}|\widetilde{Y} = y_n] = \sfrac{11}{8} \end{align*}

in units of bits.

Here’s the punchline. If we rearrange Fano’s inequality to isolate the error rate on the left-hand side, we see that there is no way for the farmer to plant the right crop more than \sfrac{13}{16} \approx 81{\scriptstyle \%} of the time:

(16)   \begin{align*} \mathrm{E}[Z]  \geq \frac{\mathrm{H}[X|Y] - 1}{\log_2(N)} =    \frac{\sfrac{11}{8} - 1}{\log_2(4)}  =    \frac{3}{16} \end{align*}

What’s more, this result is independent of how the farmer incorporates the price information.
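The numbers in this application are easy to reproduce. The sketch below is my own illustration, not code from the original post: it computes the conditional entropy in equation (15) and the bound in equation (16) directly from the joint distribution in equation (14), and, as an extra comparison that I am adding, the error rate of one specific rule that always picks the most likely crop given the signal. The function name is hypothetical.

```python
import numpy as np

# Joint distribution from equation (14): rows are price signals y_1..y_4,
# columns are the most profitable crops x_1..x_4.
joint = np.array([
    [1/8,  1/16, 1/32, 1/32],
    [1/16, 1/8,  1/32, 1/32],
    [1/16, 1/16, 1/16, 1/16],
    [1/4,  0.0,  0.0,  0.0],
])

def conditional_entropy_bits(joint):
    """H[X|Y] in bits for a joint pmf with Y on rows and X on columns."""
    total = 0.0
    for row in joint:
        p_y = row.sum()                     # Pr[Y = y]
        cond = row[row > 0] / p_y           # Pr[X = x | Y = y], zero cells dropped
        total -= p_y * np.sum(cond * np.log2(cond))
    return total

h_x_given_y = conditional_entropy_bits(joint)
n_choices = joint.shape[1]

# Fano's inequality rearranged as in equation (16).
fano_bound = (h_x_given_y - 1.0) / np.log2(n_choices)

# Error rate of one particular rule: pick the most likely crop for each signal.
map_error = 1.0 - joint.max(axis=1).sum()

print(h_x_given_y, fano_bound, map_error)
```

Under this joint distribution the conditional entropy is \sfrac{11}{8} bits and the bound is \sfrac{3}{16}, matching the text. The most-likely-crop rule errs \sfrac{7}{16} of the time, so the bound holds with room to spare in this example.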

Wavelet Variance

1. Motivation

Imagine you’re a trader who’s about to put on a position for the next month. You want to hedge away the risk in this position associated with daily fluctuations in market returns. One way that you might do this would be to short the S&P 500 since E-mini contracts are some of the most liquid in the world.


But… how much of the variation in the index’s returns is due to fluctuations at the daily horizon? e.g., the blue line in the figure to the right shows the minute-by-minute price of the E-mini contract on May 6th, 2010 during the flash crash. Over the course of 4 minutes, the contract price fell 3{\scriptstyle \%}! It then rebounded back to nearly its original position over the next hour. Clearly, if most of the fluctuations in the E-mini S&P 500 contract value are due to shocks on the sub-hour time scale, this contract will do a poor job hedging away daily market risk.

This post demonstrates how to decompose the variance of a time series (e.g., the minute-by-minute returns on the E-mini) into horizon specific components using wavelets. i.e., using the wavelet variance estimator allows you to ask the questions: “How much of the variance is coming from fluctuations on the scale of 16 minutes? 1 hour? 1 day? 1 month?” I then investigate how this wavelet variance approach compares to other methods financial economists might employ such as auto-regressive functions and spectral analysis.

2. Wavelet Analysis

In order to explain how the wavelet variance estimator works, I first need to give a quick outline of how wavelets work. Wavelets allow you to decompose a signal into components that are independent in both the time and frequency domains. This outline will be as bare bones as possible. See Percival and Walden (2000) for an excellent overview of the topic.

Imagine you’ve got a time series of just T = 8 returns:

(1)   \begin{align*} \mathbf{r} = \begin{bmatrix} r_0 & r_1 & r_2 & r_3 & r_4 & r_5 & r_6 & r_7 \end{bmatrix}^{\top} \end{align*}

and assume for simplicity that these returns have mean \mathrm{E}[r_t] = \mu_r = 0. One thing that you might do with this time series is estimate a regression with time fixed effects: r_t = \sum_{t'=0}^7 \vartheta_{t'} \cdot 1_{\{\mathrm{Time}(r_t) = t'\}}. Here is another way to represent the same regression:

(2)   \begin{align*} \begin{bmatrix} r_0 \\ r_1 \\ r_2 \\ r_3 \\ r_4 \\ r_5 \\ r_6 \\ r_7 \end{bmatrix} &= \begin{pmatrix}  1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\  0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\  0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\  0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\  0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\  0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\  0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\  0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix} \begin{bmatrix} \vartheta_0 \\ \vartheta_1 \\ \vartheta_2 \\ \vartheta_3 \\ \vartheta_4 \\ \vartheta_5 \\ \vartheta_6 \\ \vartheta_7 \end{bmatrix} \end{align*}

It’s really a trivial projection since \vartheta_t = r_t. Call the projection matrix \mathbf{F} for “fixed effects” so that \mathbf{r} = \mathbf{F}{\boldsymbol \vartheta}.

Obviously, the above time fixed effect model would be a bit of a silly thing to estimate, but notice that the projection matrix \mathbf{F} has an interesting property. Namely, its columns are orthonormal:

(3)   \begin{align*}  \langle \mathbf{f}(t) | \mathbf{f}(t') \rangle = \begin{cases} 1 &\text{if } t = t' \\ 0 &\text{else } \end{cases} \end{align*}

It’s orthogonal because \langle \mathbf{f}(t) | \mathbf{f}(t') \rangle = 0 unless t = t'. This requirement implies that each column in the projection matrix is picking up different information about \mathbf{r}. It’s normal because \langle \mathbf{f}(t) | \mathbf{f}(t) \rangle is normalized to equal 1. This requirement implies that the projection matrix is leaving the magnitude of \mathbf{r} unchanged. The time fixed effects projection matrix, \mathbf{F}, compares each successive time period, but you can also think about using other orthonormal bases.

e.g., the Haar wavelet projection matrix compares how the 1st half of the time series differs from the 2nd half, how the 1st quarter differs from the 2nd quarter, how the 3rd quarter differs from the 4th quarter, how the 1st eighth differs from the 2nd eighth, and so on… For the 8 period return time series, let’s denote the columns of the wavelet projection matrix as:

(4)   \begin{align*} \mathbf{w}(3,0) &= \sfrac{1}{\sqrt{8}} \cdot \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix}^{\top} \\ \mathbf{w}(2,0) &= \sfrac{1}{\sqrt{8}} \cdot \begin{bmatrix} 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \end{bmatrix}^{\top} \\ \mathbf{w}(1,0) &= \sfrac{1}{\sqrt{4}} \cdot \begin{bmatrix} 1 & 1 & -1 & -1 & 0 & 0 & 0 & 0 \end{bmatrix}^{\top} \\ \mathbf{w}(1,1) &= \sfrac{1}{\sqrt{4}} \cdot \begin{bmatrix} 0 & 0 & 0 & 0 & 1 & 1 & -1 & -1 \end{bmatrix}^{\top} \\ \mathbf{w}(0,0) &= \sfrac{1}{\sqrt{2}} \cdot \begin{bmatrix} 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}^{\top} \\ \mathbf{w}(0,1) &= \sfrac{1}{\sqrt{2}} \cdot \begin{bmatrix} 0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 \end{bmatrix}^{\top} \\ \mathbf{w}(0,2) &= \sfrac{1}{\sqrt{2}} \cdot \begin{bmatrix} 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \end{bmatrix}^{\top} \\ \mathbf{w}(0,3) &= \sfrac{1}{\sqrt{2}} \cdot \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 \end{bmatrix}^{\top} \end{align*}

and simple inspection shows that these columns are also orthonormal:

(5)   \begin{align*}  \langle \mathbf{w}(h,i) | \mathbf{w}(h',i') \rangle = \begin{cases} 1 &\text{if } h = h', \; i = i' \\ 0 &\text{else } \end{cases} \end{align*}

Let’s look at a concrete example. Suppose that we want to project the vector:

(6)   \begin{align*} \mathbf{r} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}^{\top} \end{align*}

onto the wavelet basis:

(7)   \begin{align*} \begin{bmatrix} r_0 \\ r_1 \\ r_2 \\ r_3 \\ r_4 \\ r_5 \\ r_6 \\ r_7 \end{bmatrix} &= \begin{pmatrix}  \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{8}}  & \sfrac{1}{\sqrt{4}}  & 0 & \sfrac{1}{\sqrt{2}} & 0 & 0 & 0 \\  \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{8}}  & \sfrac{1}{\sqrt{4}}  & 0 & -\sfrac{1}{\sqrt{2}} & 0 & 0 & 0 \\  \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{8}}  & -\sfrac{1}{\sqrt{4}} & 0 & 0 & \sfrac{1}{\sqrt{2}} & 0 & 0 \\  \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{8}}  & -\sfrac{1}{\sqrt{4}} & 0 & 0 & -\sfrac{1}{\sqrt{2}} & 0 & 0 \\  \sfrac{1}{\sqrt{8}} & -\sfrac{1}{\sqrt{8}} & 0 & \sfrac{1}{\sqrt{4}}  & 0 & 0 & \sfrac{1}{\sqrt{2}} & 0 \\  \sfrac{1}{\sqrt{8}} & -\sfrac{1}{\sqrt{8}} & 0 & \sfrac{1}{\sqrt{4}}  & 0 & 0 & -\sfrac{1}{\sqrt{2}} & 0 \\  \sfrac{1}{\sqrt{8}} & -\sfrac{1}{\sqrt{8}} & 0 & -\sfrac{1}{\sqrt{4}} & 0 & 0 & 0 & \sfrac{1}{\sqrt{2}} \\  \sfrac{1}{\sqrt{8}} & -\sfrac{1}{\sqrt{8}} & 0 & -\sfrac{1}{\sqrt{4}} & 0 & 0 & 0 & -\sfrac{1}{\sqrt{2}} \end{pmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \\ \theta_4 \\ \theta_5 \\ \theta_6 \\ \theta_7 \end{bmatrix} \end{align*}

What would the wavelet coefficients {\boldsymbol \theta} look like? Well, a little trial and error shows that:

(8)   \begin{align*} {\boldsymbol \theta} = \begin{bmatrix} \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{4}} & 0 & \sfrac{1}{\sqrt{2}} & 0 & 0 & 0 \end{bmatrix}^{\top} \end{align*}

since this is the only combination of coefficients that satisfies both r_0 = 1:

(9)   \begin{align*} 1 &=  r_0 \\ &= \frac{1}{\sqrt{8}} \cdot w_0(3,0) + \frac{1}{\sqrt{8}} \cdot w_0(2,0) + \frac{1}{\sqrt{4}} \cdot w_0(1,0) + \frac{1}{\sqrt{2}} \cdot w_0(0,0) \\ &= \frac{1}{8} + \frac{1}{8} + \frac{1}{4} + \frac{1}{2} \end{align*}

and r_t = 0 for all t > 0.
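Rather than trial and error, the coefficients can be read off by projecting onto each column, since orthonormality implies {\boldsymbol \theta} = \mathbf{W}^{\top} \mathbf{r}. The sketch below is my own illustration (the function name `haar_basis` is hypothetical): it builds the 8 period Haar matrix column by column in the same order as equation (4), checks orthonormality, and recovers the coefficients in equation (8).

```python
import numpy as np

def haar_basis(T):
    """Haar projection matrix for a series of length T (a power of 2).

    Columns are ordered as in equation (4): the constant vector first,
    then the coarsest comparison, down to adjacent-period comparisons.
    Returns the matrix and a list of (h, i) labels, one per column.
    """
    H = T.bit_length() - 1                      # log2(T)
    cols = [np.ones(T) / np.sqrt(T)]            # w(H, 0): constant vector
    labels = [(H, 0)]
    for h in range(H - 1, -1, -1):
        block = 2**h                            # length of each comparison group
        for i in range(T // (2 * block)):
            w = np.zeros(T)
            start = 2 * block * i
            w[start:start + block] = 1.0
            w[start + block:start + 2 * block] = -1.0
            cols.append(w / np.sqrt(2 * block)) # scale to unit length
            labels.append((h, i))
    return np.column_stack(cols), labels

W, labels = haar_basis(8)

# Orthonormal columns: W'W is the identity matrix.
print(np.allclose(W.T @ W, np.eye(8)))

# Project the unit impulse from equation (6) onto the wavelet basis.
r = np.array([1.0, 0, 0, 0, 0, 0, 0, 0])
theta = W.T @ r
for label, coef in zip(labels, theta):
    print(label, round(float(coef), 4))

# The coefficients reconstruct the original series.
print(np.allclose(W @ theta, r))
```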

What’s cool about the wavelet projection is that the coefficients represent effects that are isolated in both the frequency and time domains. The index h=0,1,2,3 denotes the \log_2 length of the wavelet comparison groups. e.g., the 4 wavelets with h=0 compare 2^0 = 1 period increments: the 1st period to the 2nd period, the 3rd period to the 4th period, and so on… Similarly, the wavelets with h=1 compare 2^1 = 2 period increments: the 1st 2 periods to the 2nd 2 periods and the 3rd 2 periods to the 4th 2 periods. Thus, h captures the location of the coefficient in the frequency domain. The index i=0,\ldots,I_h - 1 signifies which comparison group at horizon h we are looking at. e.g., when h=0, there are I_0 = 4 = \sfrac{8}{2^{0+1}} different comparisons to be made. Thus, i captures the location of the coefficient in the time domain.

3. Wavelet Variance

With these basics in place, it’s now easy to define the wavelet variance of a time series. First, I massage the standard representation of a series’ variance a bit. The variance of our 8 term series is defined as:

(10)   \begin{align*}  \sigma_r^2 &= \frac{1}{T} \cdot \sum_t r_t^2  \end{align*}

since \mu_r = 0. Using the tools from the section above, let’s rewrite \mathbf{r} = \mathbf{W}{\boldsymbol \theta}. This means that the variance formula becomes:

(11)   \begin{align*}  \sigma_r^2 &= \frac{1}{T} \cdot \mathbf{r}^{\top} \mathbf{r} =  \frac{1}{T} \cdot \left( \mathbf{W} {\boldsymbol \theta} \right)^{\top} \left( \mathbf{W} {\boldsymbol \theta} \right) \end{align*}

But I know that \mathbf{W}^{\top} \mathbf{W} = \mathbf{I} since each of the columns is orthonormal. Thus:

(12)   \begin{align*}  \sigma_r^2 &= \frac{1}{T} \cdot {\boldsymbol \theta}^{\top} {\boldsymbol \theta} = \frac{1}{T} \cdot \sum_{h,i} \theta(h,i)^2 \end{align*}

This representation gives the variance of a series as an average of squared wavelet coefficients.

The sum of the squared wavelet coefficients at each horizon, h, is then an interesting object:

(13)   \begin{align*} V(h) &= \frac{1}{T} \cdot \sum_{i=0}^{I_h - 1} \theta(h,i)^2 \end{align*}

since V(h) denotes the portion of the total variance of the time series explained by comparing successive periods of length 2^h. I refer to V(h) as the wavelet variance of a series at horizon h. The sum of the wavelet variances at each horizon gives total variance:

(14)   \begin{align*} \sum_{h=0}^H V(h) &= \sigma_r^2 \end{align*}
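Here is a small check of the decomposition in equations (13) and (14). It is a sketch under my own assumptions: the 8 term series is made up (chosen to have mean zero), and `haar_basis` is a hypothetical helper that builds the Haar matrix in the same column order as equation (4).

```python
import numpy as np

def haar_basis(T):
    """Haar projection matrix for length T (a power of 2) with (h, i) labels."""
    H = T.bit_length() - 1                      # log2(T)
    cols = [np.ones(T) / np.sqrt(T)]            # w(H, 0): constant vector
    labels = [(H, 0)]
    for h in range(H - 1, -1, -1):
        block = 2**h
        for i in range(T // (2 * block)):
            w = np.zeros(T)
            start = 2 * block * i
            w[start:start + block] = 1.0
            w[start + block:start + 2 * block] = -1.0
            cols.append(w / np.sqrt(2 * block))
            labels.append((h, i))
    return np.column_stack(cols), labels

# A made-up mean-zero return series (an assumption for illustration).
r = np.array([1.0, -1.0, 2.0, 0.0, -2.0, 1.0, 0.0, -1.0])
T = len(r)

W, labels = haar_basis(T)
theta = W.T @ r                                 # wavelet coefficients

# Equation (13): group squared coefficients by horizon h and scale by 1/T.
V = {}
for (h, i), coef in zip(labels, theta):
    V[h] = V.get(h, 0.0) + coef**2 / T

total_variance = np.mean(r**2)                  # equation (10), since the mean is 0

for h in sorted(V):
    print(h, V[h])
print(sum(V.values()), total_variance)
```

Because the series has mean zero, the coefficient on the constant column is 0 and all of the variance is attributed to the comparison horizons h = 0, 1, 2.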

4. Numerical Example

Let’s take a look at how the wavelet variance of a time series behaves out in the wild. Here’s the code I used to create the figures. Specifically, let’s study the simulated data plotted below, which consists of 63 days of minute-by-minute return data with day-specific shocks:

(15)   \begin{align*} r_t &= \mu_{r,t} + \sigma_r \cdot \epsilon_t \qquad \text{with} \qquad \epsilon_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,1) \end{align*}

where the volatility of the process is given by \sigma_r = 0.01{\scriptstyle \mathrm{bp}/\sqrt{\mathrm{min}}} and there is a 5{\scriptstyle \%} probability of realizing a \mu_{r,t} = \pm 0.001{\scriptstyle \mathrm{bp}/\mathrm{min}} shock on any given day. The 4 days on which the data realized a shock are highlighted in red. These minute-by-minute figures amount to a 0{\scriptstyle \%/\mathrm{yr}} annualized return and a 31{\scriptstyle \%/\mathrm{yr}} annualized volatility.


The figure below then plots the wavelet coefficients, {\boldsymbol \theta}, at each horizon associated with this time series. A trading day is 6.5 \times 60 = 390{\scriptstyle \mathrm{min}}, so notice the spikes in the coefficient values in the h=6,7,8 panels near the day-specific shock dates corresponding to comparing successive 64, 128, and 256 minute intervals. The remaining variation in the coefficient levels comes from the underlying white noise process \epsilon_t. Because the break points in the wavelet projection affect the estimated coefficients, each data point in the plot actually represents the average of the coefficient estimates \theta_t(h,i) at a given point of time for all possible starting dates. See Percival and Walden (2000, Ch. 5) on the maximal overlap discrete wavelet transform for details.


Finally, I plot the \log of the wavelet variance at each horizon h for both the simulated return process (red) and a white noise process with an identical mean and variance (blue). Note that I’ve switched from \log_2 to \log_e on the x-axis here, so a spike in the amount of variance at h=6 corresponds to a spike in the amount of variance explained by successive e^{6} \approx 400{\scriptstyle \mathrm{min}} increments. This is exactly what you’d expect for day-specific shocks which have a duration of 390{\scriptstyle \mathrm{min}} as indicated by the vertical gray line. The wavelet variance of an appropriately scaled white noise process gives a nice comparison group. To see why, note that for covariance stationary processes like white noise, the wavelet variance at a particular horizon is related to the power spectrum as follows:

(16)   \begin{align*} V(h) &\approx 2 \cdot \int_{\sfrac{1}{2^{h+1}}}^{\sfrac{1}{2^h}} S(f) \cdot df \end{align*}

Thus, the wavelet variance of white noise should follow a power law with:

(17)   \begin{align*} V(h) &\propto 2^{-h} \end{align*}

giving a nice smooth reference point in plots.
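To spell out the step from (16) to (17), here is a short sketch assuming the white noise process has a flat spectrum, S(f) = S_0, where S_0 is just a label I am introducing for the constant:

\begin{align*} V(h) &\approx 2 \cdot \int_{2^{-(h+1)}}^{2^{-h}} S_0 \cdot df = 2 \cdot S_0 \cdot \left( 2^{-h} - 2^{-(h+1)} \right) = S_0 \cdot 2^{-h} \end{align*}

i.e., each 1 unit increase in the horizon halves the width of the frequency band and so halves the wavelet variance.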


5. Comparing Techniques

I conclude by considering how the wavelet variance statistic compares to other ways that a financial economist might look for horizon specific effects in data. I consider 2 alternatives: auto-regressive models and spectral density estimators. First, consider estimating the auto-regressive model below with lags \ell = 1,2,\ldots,L:

(18)   \begin{align*} r_t &= \sum_{\ell=1}^L C(\ell) \cdot r_{t-\ell} + \xi_t \qquad \text{where} \qquad \xi_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_{\xi}^2) \end{align*}

The left-most panel of the figure below reports the estimated values of C(\ell) for lags \ell = 1,2,\ldots,420 using the simulated data (red) as well as a scaled white noise process (blue). Just as before, the vertical grey line denotes the number of minutes in a day. There is no meaningful difference between the 2 sets of coefficients. The reason is that the day-specific shocks are asynchronous. They aren’t coming at regular intervals. Thus, no obvious lag structure can emerge from the data.


Next, let’s think about estimating the spectral density of \mathbf{r}. This turns out to be the exact same exercise as the auto-regressive model estimation in different clothing. As shown in an earlier post, it’s possible to flip back and forth between the coefficients of an \mathrm{AR}(L) process and its spectral density via the relationship:

(19)   \begin{align*} S(f) &= \frac{\sigma_{\xi}^2}{\left\vert \, 1 - \sum_{\ell=1}^L C(\ell) \cdot e^{-i \cdot 2 \cdot \pi \cdot f \cdot \ell} \, \right\vert^2} \end{align*}

This one-to-one mapping between the frequency domain and the time domain for covariance stationary processes is known as the Wiener–Khinchin theorem with \sigma_r^2 = \int_{-\sfrac{1}{2}}^{\sfrac{1}{2}} S(f) \cdot df. Thus, the spectral density plot just reflects the same random noise as the auto-regressive model coefficients because of the same issue with asynchrony. The most interesting features of the middle panel occur at really high frequencies which have nothing to do with the day-specific shocks.
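Here is a minimal Python sketch of the mapping from \mathrm{AR}(L) coefficients to a spectral density. The function name `ar_spectral_density` and its arguments are my own illustrative choices. The sketch takes the squared modulus of the complex denominator so that the density comes out real and non-negative.

```python
import cmath
import math


def ar_spectral_density(coeffs, sigma2, f):
    """Spectral density at frequency f (cycles per period) of an AR(L) process.

    coeffs[l - 1] plays the role of C(l) and sigma2 is the innovation variance.
    """
    transfer = 1 - sum(
        c * cmath.exp(-2j * math.pi * f * lag)
        for lag, c in enumerate(coeffs, start=1)
    )
    return sigma2 / abs(transfer) ** 2


if __name__ == "__main__":
    # With no lags the process is white noise, so the density is flat.
    print([ar_spectral_density([], 1.0, f) for f in (0.0, 0.25, 0.5)])
    # A positive AR(1) coefficient tilts mass toward low frequencies.
    print([ar_spectral_density([0.5], 1.0, f) for f in (0.0, 0.25, 0.5)])
```

If the estimated C(\ell) values are just noise around 0, as in the left-most panel, then this function returns something close to a flat line, which is why the two plots carry the same information.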

Here’s the punchline. The wavelet variance is the only estimator of the 3 which can identify horizon-specific contributions to a time series’ variance which are not stationary.

WSJ Article Subject Tags

1. Motivation

This post investigates the distribution of subject tags for Wall Street Journal articles that mention S&P 500 companies. e.g., a December 2009 article entitled, When Even Your Phone Tells You You’re Drunk, It’s Time to Call a Taxi, about a new iPhone app that alerted you when you were too drunk to drive had the metadata to the right. The subject tags are essentially article keywords. I collect every article that references an S&P 500 company over the period from 01/01/2008 to 12/31/2012. This post is an appendix to my paper, Local Knowledge in Financial Markets.

I find that there is substantial heterogeneity in how many different topics people write about when discussing a company even after controlling for the number of total articles. e.g., there were 87 articles in the WSJ referencing Garmin (GRMN) and 81 articles referencing Sprint (S); however, while there were only 87 different subject tags used in the articles about Garmin, there were 716 different subject tags used in the articles about Sprint! This finding is consistent with the idea that some firms face a much wider array of shocks than others. i.e., the width of the market matters.

2. Data Collection

The data are hand-collected from the ProQuest newspaper archive by an RA. The data collection process for an example company, Agilent Technologies (A), is summarized in the 3 figures below. First, we searched for each company included in the S&P 500 from 01/01/2008 to 12/31/2012 [list]. Then, after each query, we restricted the results to articles found in the WSJ. Finally, we downloaded the articles and metadata in HTML format.

After the RA collected all of the data, I used a Python script to parse the resulting HTML files into a form I could manage in R. Roughly 4000 of the downloaded articles were duplicates resulting from the WSJ publishing the same article in different editions. I identify these observations by checking for articles published on the same day with identical word counts about the same companies. I tried using Selenium to automate the data collection process, but the ProQuest web interface proved too finicky.
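The duplicate check described above can be sketched in a few lines of Python. This is only an illustration and not the script I actually ran. The field names `date`, `word_count`, and `tickers` are hypothetical stand-ins for whatever the parsed HTML files contain.

```python
def drop_duplicate_articles(articles):
    """Keep the first article seen for each (date, word count, companies) key.

    Each article is a dict with hypothetical keys "date", "word_count",
    and "tickers" (the S&P 500 companies it references).
    """
    seen = set()
    unique = []
    for art in articles:
        # frozenset makes the key insensitive to the order of the tickers.
        key = (art["date"], art["word_count"], frozenset(art["tickers"]))
        if key not in seen:
            seen.add(key)
            unique.append(art)
    return unique
```

Two genuinely different articles could share a date, word count, and company list, so this rule is a heuristic rather than an exact match on the article text.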

3. Summary Statistics

My data set contains 106{\scriptstyle \mathrm{k}} articles over 5 years about 542 companies. Many articles reference multiple S&P 500 companies. The figure below plots the total number of articles in the database per month. There is a steady downward trend. The first part of the sample was the height of the financial crisis, so this makes sense. As markets have calmed down, journalists have devoted fewer articles to corporate news relative to other things such as politics and sports.


Articles are not evenly distributed across companies as shown by the figure below. While the median company is only referenced in 21 articles over the sample period, the 5 most popular companies (United Parcel Service [UPS], Apple [AAPL], Goldman Sachs [GS], Citibank [C], and Ford [F]) are all referenced in at least 1922 different articles apiece. By comparison, the least popular 1{\scriptstyle \%} of companies are mentioned in only 1 article in 5 years.


Counting subject tags is a bit less straightforward than counting articles. I do not count tags that are specific to the WSJ rather than the company. e.g., tags containing “(wsj)” flagging daily features like “Abreast of the market (wsj).” I also remove missing subjects. It’s worth pointing out that sometimes the metadata for an article doesn’t contain any subject information. After restrictions, the data contain 10{\scriptstyle \mathrm{k}} unique subject tags.

The distribution of subject tag counts per month is similar to that of article counts as shown in the figure below but with a less pronounced downward trend. To create this figure, I count the number of unique subject tags used each month. e.g., if “technology shock” is used 2 times in Jan 2008, then this counts as 1 of the 1591 tags used in this month; whereas, if “technology shock” is then used again on Feb 1st 2008, then I count this 3rd observation towards the total in February. Thus, the sum of the points in the time series will exceed 10{\scriptstyle \mathrm{k}}. Also, note that different articles can have identical subject tags.
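The monthly counting rule can be sketched as follows. This is a minimal illustration with hypothetical field names (`date` as a "YYYY-MM-DD" string and `tags` as a list of strings), not the code behind the figure.

```python
from collections import defaultdict


def unique_tags_per_month(articles):
    """Count the number of distinct subject tags used in each calendar month."""
    tags_by_month = defaultdict(set)
    for art in articles:
        month = art["date"][:7]  # "YYYY-MM"
        tags_by_month[month].update(art["tags"])
    return {month: len(tags) for month, tags in sorted(tags_by_month.items())}


if __name__ == "__main__":
    sample = [
        {"date": "2008-01-03", "tags": ["technology shock"]},
        {"date": "2008-01-20", "tags": ["technology shock", "earnings"]},
        {"date": "2008-02-01", "tags": ["technology shock"]},
    ]
    print(unique_tags_per_month(sample))
```

In the sample, “technology shock” counts once towards January even though it appears twice, and then once more towards February.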


As shown in the figure below, the distribution of subject tags used to describe articles about each company is less skewed than the actual article count for each company. There are 179 different subject tags used in the 21 articles about the median S&P 500 company during the sample period. The most tagged companies have 10 times as many subjects as the median firm; whereas, the most written about companies are referenced in 100 times as many articles as the median firm.


4. Articles per Tag

In order for the distribution of tags per company to be less skewed than the distribution of articles per company, it’s got to be the case that some tags are used in lots of articles. This is exactly what’s going on in the data. The figure below shows that the median subject tag is used in only 3 articles and the bottom 25{\scriptstyle \%} of tags are used in only 1 article; however, the top 1{\scriptstyle \%} of tags are used in 466 articles or more. e.g., there are roughly 100 tags out of the 10{\scriptstyle \mathrm{k}} unique subject tags in my data set that are used 500 times or more. Likewise, there are well over 3000 that are used only once!


This fact strongly supports the intuition that companies, even huge companies like those in the S&P 500, are constantly hit with new and different shocks. Traders have to figure out which aspect of the company matters. This is clearly not an easy problem to solve. Lots of ideas are thrown around. Many of them must be either short lived or wrong. Roughly 1 out of every 4 topics worth discussing is only worth discussing once.

5. Coverage Depth

I conclude this post by looking at the variation in the number of subject tags across firms with a similar number of articles. e.g., I want to know if there are pairs of firms which journalists spend roughly the same amount of time talking about, but which get covered in very different ways. It turns out there are. The Garmin and Sprint example from the introduction is one such case. The figure below shows that there are many more. i.e., it shows that companies that are referenced in more articles also have more subject tag descriptors, but conditional on the number of articles there is still a lot of variation. The plot is on a \log_{10} \times \log_{10} scale, so a 1 tick vertical movement means a factor of 10 difference between the number of tags for 2 firms with similar article counts. Looking at the figure, it’s clear that this sort of variation is the norm.


Randomized Market Trials

1. Motivation

How much can traders learn from past price signals? It depends on what kind of assets sell. Suppose that returns are (in part) a function of K = \Vert {\boldsymbol \alpha} \Vert_{\ell_0} different feature-specific shocks:

(1)   \begin{align*} r_n &= \sum_{q=1}^Q \alpha_q \cdot x_{n,q} + \epsilon_n \qquad \text{with} \qquad \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_{\epsilon}^2) \end{align*}

If {\boldsymbol \alpha} is identifiable, then different values of {\boldsymbol \alpha} have to produce different values of r_n. This is only the case if assets are sufficiently different from one another. e.g., consider the analogy to randomized control trials. In an RCT, randomizing which subjects get thrown in the treatment and control groups makes it exceptionally unlikely that, say, all the people in the treatment group will by chance happen to all have some other common trait that actually explains their outcomes. Similarly, randomizing which assets get sold makes it exceptionally unlikely that 2 different choices of {\boldsymbol \alpha} and {\boldsymbol \alpha}' can explain the observed returns.

This post sketches a quick model relating this problem to housing prices. To illustrate, imagine N = 4 houses have sold at a discount in a neighborhood that looks like this:


The shock might reflect a structural change in the vacation home market whereby there is less disposable income to buy high end units—i.e., a permanent shift. Alternatively, the shock might have been due to a couple of out-of-town second house buyers needing to sell quickly—i.e., a transient effect. The houses in the picture above are all vacation homes of a similar quality with owners living in LA. Since there is so little variation across units, both these explanations are observationally equivalent. Thus, the asset composition affects how informative prices are in an important way. The main empirical prediction is that in places with less variation in housing amenities, there should be more price momentum since it’s harder to distinguish between noise and amenity-specific value shocks.

2. Toy Model

Suppose you’ve seen N sales in the area. Most of the prices looked just about right, but some of the houses sold for a bit more than you would have expected and some sold for a bit less than you would have expected. You’re trying to decide whether or not to buy the (N+1)th house if the transaction costs are \mathdollar c today:

(2)   \begin{align*} U &= \max_{\{\text{Buy},\text{Don't}\}} \left\{ \, \mathrm{E}\left[ r_{N+1} \right] - \frac{\gamma}{2} \cdot \mathrm{Var}\left[ r_{N+1} \right] - c, \, 0 \, \right\} \end{align*}

You will buy the house if your risk adjusted expectation of its future returns exceeds the transaction costs, \mathrm{E}[r_{N+1}] - \sfrac{\gamma}{2} \cdot \mathrm{Var}[r_{N+1}] \geq c.

This problem hinges on your ability to estimate {\boldsymbol \alpha}. What’s the best you could ever hope to do? Well, suppose you knew which K features mattered ahead of time and the elements of \mathbf{X} were given by x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sfrac{1}{K}). In this setting, your average estimation error per relevant feature is given by:

(3)   \begin{align*} \Omega^\star = \mathrm{E}\left[ \, \frac{1}{K} \cdot \sum_{q=1}^Q \left( \widehat{\alpha}_q - \alpha_q \right)^2 \, \right] &= \frac{K \cdot \sigma_{\epsilon}^2}{N} \end{align*}

i.e., it’s as if you ran an OLS regression of the N price changes on the K relevant columns of \mathbf{X}. You will buy the house if:

(4)   \begin{align*} \mathbf{x}_{N+1}^{\top} \widehat{\boldsymbol \alpha} - \frac{\gamma}{2} \cdot \left( \frac{K + N}{N}  \right) \cdot \sigma_{\epsilon}^2 &\geq c \end{align*}
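The decision rule in equation (4) is easy to write down in code. Here is a minimal Python sketch. The function name `should_buy` and the numbers in the usage example are purely illustrative assumptions, not calibrated values.

```python
def should_buy(x_new, alpha_hat, gamma, K, N, sigma_eps, cost):
    """Evaluate the buy rule in equation (4) for the (N+1)th house.

    x_new and alpha_hat hold the K relevant features and their estimated
    coefficients. gamma is risk aversion and cost is the transaction cost c.
    """
    expected_return = sum(x * a for x, a in zip(x_new, alpha_hat))
    # Residual variance plus estimation error: ((K + N) / N) * sigma_eps^2.
    variance = (K + N) / N * sigma_eps ** 2
    return expected_return - 0.5 * gamma * variance >= cost


if __name__ == "__main__":
    # Illustrative numbers only.
    print(should_buy([0.5, 0.5], [0.1, 0.1], gamma=2.0, K=2, N=100,
                     sigma_eps=0.05, cost=0.05))
```

Holding everything else fixed, a smaller N raises the variance term and pushes the left-hand side down, so you need a larger expected return before buying.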

In the real world, however, you generally don’t know which K features are important ahead of time and each house’s amenities are not taken as an iid draw. Instead, you must solve an \ell_1-type inference problem:

(5)   \begin{align*} \widehat{\boldsymbol \alpha} &= \arg \min_{\boldsymbol \alpha} \sum_{n=1}^N \left( r_n - \mathbf{x}_n^{\top} {\boldsymbol \alpha} \right)^2 \qquad \text{s.t.} \qquad \left\Vert {\boldsymbol \alpha} \right\Vert_{\ell_1} \leq \lambda \cdot \sigma_{\epsilon} \end{align*}

with a correlated measurement matrix, \mathbf{X}, using something like LASSO. In this setting, you face feature selection risk. i.e., you might focus on the wrong causal explanation for the past price movements. If \Omega^{\perp} denotes your estimation error when each of the elements x_{n,q} are drawn independently and \Omega denotes your estimation error in the general case when \rho(x_{n,q},x_{n',q}) \neq 0, then:

(6)   \begin{align*} \Omega^{\star} \leq \Omega^{\perp} \leq \Omega \end{align*}

Since your estimate of \widehat{\boldsymbol \alpha} is unbiased, feature selection risk will simply increase \mathrm{Var}[r_{N+1}] making it less likely that you will buy the house in this stylized model:

(7)   \begin{align*} \mathbf{x}_{N+1}^{\top} \widehat{\boldsymbol \alpha} - \frac{\gamma}{2} \cdot \left( K \cdot \Omega + \sigma_{\epsilon}^2 \right) &\geq c \end{align*}

More generally, it will make prices slower to respond to shocks and allow for momentum.

3. Matrix Coherence

Feature selection risk is worst when assets all have really correlated features. Let \mathbf{X} denote the (N \times Q)-dimensional measurement matrix containing all the features of the N houses that have already sold in the market:

(8)   \begin{align*} \mathbf{X} &= \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,Q} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,Q} \\ \vdots  & \vdots  & \ddots & \vdots  \\ x_{N,1} & x_{N,2} & \cdots & x_{N,Q} \\ \end{bmatrix} \end{align*}

Each row represents all of the features of the nth house, and each column represents the level to which the N assets display a single feature. Let \widetilde{\mathbf{x}}_q denote a unit-normed column from this measurement matrix:

(9)   \begin{align*} \widetilde{\mathbf{x}}_q &= \frac{\mathbf{x}_q}{\sqrt{\sum_{n=1}^N x_{n,q}^2}} \end{align*}

I use a measure of the coherence of \mathbf{X} to quantify the extent to which all of the assets in a market have similar features:

(10)   \begin{align*} \mu(\mathbf{X}) &= \max_{q \neq q'} \left\vert \left\langle \widetilde{\mathbf{x}}_q, \widetilde{\mathbf{x}}_{q'} \right\rangle \right\vert \end{align*}

e.g., the coherence of a matrix with x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sfrac{1}{N}) is roughly \sqrt{2 \cdot \log(Q)/N} corresponding to the red line in the figure below. As the correlation between elements in the same column increases, the coherence increases since different terms in the above inner product are less likely to cancel out.
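Here is a minimal Python sketch of the coherence calculation in equations (9) and (10). It takes \mathbf{X} as a list of rows and assumes no column is all zeros. The function name `coherence` is my own choice for illustration.

```python
import math


def coherence(X):
    """Largest absolute inner product between distinct unit-normed columns of X.

    X is a list of N rows, each a list of Q feature values. Assumes Q >= 2
    and that every column has at least one non-zero entry.
    """
    n_rows, n_cols = len(X), len(X[0])
    cols = []
    for q in range(n_cols):
        col = [X[n][q] for n in range(n_rows)]
        norm = math.sqrt(sum(v * v for v in col))
        cols.append([v / norm for v in col])
    return max(
        abs(sum(a * b for a, b in zip(cols[q], cols[p])))
        for q in range(n_cols)
        for p in range(q + 1, n_cols)
    )
```

A matrix with orthogonal columns has a coherence of 0, while a matrix with 2 proportional columns, such as a neighborhood where 2 amenities always come bundled together, has a coherence of 1.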


4. Selection Risk

There is a tight link between the severity of the selection risk and how correlated asset features are. Specifically, Ben-Haim, Eldar, and Elad (2010) show that if

(11)   \begin{align*} \alpha_{\min} \cdot \left( 1 - \{2 \cdot K - 1\} \cdot \mu(\mathbf{X}) \right) &\geq 2 \cdot \sigma_{\epsilon} \cdot \sqrt{2 \cdot (1 + \xi) \cdot \log(Q)} \end{align*}

for some \xi > 0, then:

(12)   \begin{align*} \sum_{q=1}^Q \left( \widehat{\alpha}_q - \alpha_q \right)^2 &\leq \frac{2 \cdot (1 + \xi)}{(1 - (K-1)\cdot \mu(\mathbf{X}))^2} \times K \cdot \sigma_{\epsilon}^2 \cdot \log(Q) = \Omega \end{align*}

with probability at least:

(13)   \begin{align*} 1 - Q^{-\xi} \cdot \left( \, \pi \cdot (1 + \xi) \cdot \log(Q) \, \right)^{-\sfrac{1}{2}} \end{align*}

where \alpha_{\min} = \min_{q \in \mathcal{K}} |\alpha_q|. Let’s plug in some numbers. If \alpha_{\min} = 0.10 and \sigma_{\epsilon} = 0.05, then the result means that \Vert \widehat{\boldsymbol \alpha} - {\boldsymbol \alpha} \Vert_{\ell_2}^2 is less than 0.185 \times K \cdot \log(Q) with probability \sfrac{3}{4}.
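To experiment with other parameter values, here is a minimal Python sketch that evaluates the condition in equation (11) and the bound in equation (12). The function name `recovery_bound` is my own, and the sketch just transcribes the 2 formulas above rather than anything from the Ben-Haim, Eldar, and Elad (2010) code.

```python
import math


def recovery_bound(alpha_min, sigma_eps, K, Q, mu, xi):
    """Return (condition_holds, bound) from equations (11) and (12).

    mu is the coherence of the measurement matrix and xi > 0 is the free
    parameter. Assumes (K - 1) * mu != 1 so the bound is well defined.
    """
    lhs = alpha_min * (1 - (2 * K - 1) * mu)
    rhs = 2 * sigma_eps * math.sqrt(2 * (1 + xi) * math.log(Q))
    bound = (
        2 * (1 + xi) / (1 - (K - 1) * mu) ** 2
        * K * sigma_eps ** 2 * math.log(Q)
    )
    return lhs >= rhs, bound
```

The first return value tells you whether the bound is guaranteed to apply at all. When the coherence is at or above \sfrac{1}{(2 \cdot K - 1)}, the left-hand side of (11) is non-positive and the condition fails for any positive noise level.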

There are a couple of things worth pointing out here. First, the recovery bounds only hold when \mathbf{X} is sufficiently incoherent:

(14)   \begin{align*} \mu(\mathbf{X}) < \frac{1}{2 \cdot K - 1} \end{align*}

i.e., when the assets are too similar, we can’t learn anything concrete about which amenity-specific shocks are driving the returns. Second, the free parameter \xi > 0 links the probability of seeing an error rate outside the bounds, p, to the number of amenities that houses have:

(15)   \begin{align*} \xi &\approx \frac{\log(\sfrac{1}{p}) - \frac{1}{2} \cdot \log\left[ \pi \cdot \log Q \right]}{\sfrac{1}{2} + \log(Q)} \end{align*}

If you want to lower this probability, you need to either use a larger constant or decrease the number of amenities. For \xi large enough we can effectively regard the error bounds as the variance. Importantly, this quantity is increasing in the coherence of the measurement matrix. i.e., when assets are more similar, I am less sure that I am drawing the correct conclusion from past returns.
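As a rough numerical check on the approximation in equation (15), here is a minimal Python sketch that compares it against the exact failure probability implied by equation (13). The function names `failure_probability` and `approx_xi`, and the values Q = 1000 and p = 0.01 in the usage example, are illustrative assumptions.

```python
import math


def failure_probability(Q, xi):
    """Probability of landing outside the bound: 1 minus expression (13)."""
    return Q ** (-xi) / math.sqrt(math.pi * (1 + xi) * math.log(Q))


def approx_xi(Q, p):
    """Approximate xi that delivers failure probability p, as in equation (15)."""
    numerator = math.log(1 / p) - 0.5 * math.log(math.pi * math.log(Q))
    return numerator / (0.5 + math.log(Q))


if __name__ == "__main__":
    xi = approx_xi(Q=1000, p=0.01)
    # The exact probability should be close to the target p.
    print(xi, failure_probability(1000, xi))
```

The approximation replaces \log(1 + \xi) with \xi, so it works best when \xi is small. Note that the formula returns a negative \xi when the target p is already larger than the failure probability at \xi = 0, in which case any \xi > 0 does the job.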

5. Empirical Predictions

The main empirical prediction is that in places with less variation in housing amenities, there should be more price momentum since it’s harder to distinguish between noise and amenity-specific value shocks. e.g., imagine studying the price paths of 2 neighborhoods, A and B, which have houses of the exact same value, \mathdollar v. In neighborhood A, each of the houses has a very different collection of amenities whose values sum to \mathdollar v; whereas, in neighborhood B, each of the houses has the exact same amenities whose values sum to \mathdollar v. e.g., you can think about neighborhood A as pre-war and neighborhood B as tract housing. The theory says that the price of houses in neighborhood B should respond more slowly to amenity-specific value shocks because houses have more correlated amenities, i.e., \Omega is larger. As a result, home prices in neighborhood B should also display more momentum, though this is not in the toy model above.