A Model of Hard-to-Diagnose Mispricings

1. Introduction

Important market events often have a variety of interpretations. For example, a recent Financial Times article outlined several different readings Facebook’s “feeble showing… in the weeks since its \mathdollar 16{\scriptstyle \mathrm{bn}} initial public offering”. “Maybe Morgan Stanley, which organized the IPO, got complacent. Maybe Facebook neglected to adapt its platform fully to the world of mobile devices. Maybe, if we are to believe the Los Angeles Times, the company, for all its 900{\scriptstyle \mathrm{m}} users, is ‘losing its cool’.” The article then tossed another hat into the ring. “Those explanations are wrong. There may be a simpler explanation: political risk… Facebook is less a revolution in technology than a revolution in property rights. It is to social life what enclosure was to grazing. Fed-up users might begin to question Facebook’s claim to full ownership of so much valuable personal information that they, the public, have generated.”

Whatever you think the right answer is, one thing is clear: traders can hold the exact same views for entirely different reasons. Moreover, while these views happen to line up for Facebook, they have wildly different implications for how a trader should behave in the rest of the market. For instance, if you think the poor performance was a result of Morgan Stanley’s hubris, then you should change the way you trade their upcoming IPOs. Alternatively, if you think the poor performance was a consequence of Facebook losing its cool, then you should change the way you trade Zynga. Finally, if you agree with the Financial Times reporter and think the poor performance was due to privacy concerns, then you should change the way you trade other companies, like Apple, which hoard users’ personal information.

Motivated by these observations, this post outlines an asset-pricing model where each asset has many plausibly relevant features, and, in order to turn a profit, arbitrageurs must diagnose which of these is relevant using past data.

2. Feature Space

I study a market with N = 4 assets. Let’s begin by looking at how I model each asset’s exposure to Q \gg 4 different features, each representing a different explanation for the asset’s performance. I use the indicator function, x_{n,q}, to capture whether or not asset n has exposure to the qth feature:

(1)   \begin{align*}   x_{n,q} &=   \begin{cases}     1 &\text{if asset $n$ has feature $q$}     \\     0 &\text{else}   \end{cases} \end{align*}

For example, both National Semiconductor and Sequans Communications are in the semiconductor industry, so x_{\text{NatlSemi},\text{SemiCond}} = 1 and x_{\text{Sequans},\text{SemiCond}} = 1; however, only National Semiconductor was involved in M&A rumors in Q1 2011, so x_{\text{NatlSemi},\text{M\&A}} = 1 but x_{\text{Sequans},\text{M\&A}} = 0. Feature exposures are common knowledge. Everyone knows each value in the (N \times Q)-dimensional matrix \mathbf{X}, so there is no uncertainty about whether or not National Semiconductor belongs to the semiconductor industry. Each asset’s fundamental value stems from its exposure to exactly half of the Q \gg 1 different payout-relevant features.
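To make the bookkeeping concrete, here’s a minimal sketch in Python of how you might store the exposure matrix \mathbf{X}. The asset and feature labels are just the illustrative ones from above (with a hypothetical Ohio exposure thrown in), not part of the model, and I use Q = 4 features for readability even though the model assumes Q \gg N.

```python
import numpy as np

# Illustrative labels only; the model just needs Q >> N.
assets   = ["NatlSemi", "Sequans", "Target", "BigLots"]   # n = 0, ..., N-1
features = ["SemiCond", "M&A", "BigBox", "Ohio"]          # q = 0, ..., Q-1

# X[n, q] = 1 if asset n has feature q, else 0 (Equation 1).
X = np.array([
    [1, 1, 0, 0],   # NatlSemi: semiconductor industry, M&A rumors
    [1, 0, 0, 0],   # Sequans:  semiconductor industry only
    [0, 0, 1, 0],   # Target:   big-box store
    [0, 0, 1, 1],   # BigLots:  big-box store, Ohio-based
], dtype=int)

# Feature exposures are common knowledge: everyone agrees on these entries.
assert X[assets.index("NatlSemi"), features.index("M&A")] == 1
assert X[assets.index("Sequans"),  features.index("M&A")] == 0
```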

Fundamental values have a sparse representation in this space of Q features. Only K of the Q possible features actually matter:

(2)   \begin{align*}   Q \gg N \geq K \end{align*}

There are enough observations, N, to estimate the value of the K feature-specific shocks using OLS if you knew ahead of time which features to analyze; however, there are many more possible features, Q, than observations. Without an oracle, OLS is an ill-posed problem in this setting. This sparseness assumption embodies the idea that financial markets are large and diverse, so finding the right trading opportunity is a needle-in-a-haystack type problem.

For analytical convenience, I study the case with N = 4 and K = 2 where 2 of the assets have exposure to 1 of the feature-specific shocks and 2 of the assets have exposure to the other feature-specific shock. For example, if there is a shock to all big-box stores and to all companies based in Ohio, then there are no superstores based in Ohio like Big Lots in the list of N = 4 assets. This is the simplest possible model in which the feature-specific average matters and every asset has exposure to the same number of shocks.

If only 2 of the Q features actually realize shocks and each shock affects a separate subset of 2 firms, then there are:

(3)   \begin{align*} H = \frac{1}{6} \cdot Q \cdot (Q - 1) < {Q \choose 2} \end{align*}

possible combinations of shocks. There are Q different shocks to choose from for the first shock, and only \sfrac{1}{6}th of the remaining (Q - 1) shocks will not overlap assets with the first shock. I index each combination with h = 1,2,\ldots,H where h_\star denotes the true set of shocked features. Let \mathcal{Q} denote the set of all features and \mathcal{K}_h denote the 2 features associated with index h. Nature selects which of the H combinations of 2 features realizes feature-specific shocks uniformly at random:

(4)   \begin{align*}   \mathrm{Pr}(\mathcal{K}_{h_\star} = \mathcal{K}_h) = \sfrac{1}{H} \end{align*}

prior to the start of trading in period t=1.
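Here’s a quick Monte Carlo sanity check of that \sfrac{1}{6} counting argument. The sketch assumes each feature’s exposure is a uniformly random pair of the N = 4 assets; Q = 200 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Q = 4, 200

# Each feature affects a random pair of the N = 4 assets.
pairs = [frozenset(rng.choice(N, size=2, replace=False)) for _ in range(Q)]

# Given a first shocked feature q, what fraction of the other features'
# exposures are disjoint from it (i.e., hit the complementary pair)?
frac = np.mean([pairs[q].isdisjoint(pairs[qp])
                for q in range(Q) for qp in range(Q) if q != qp])
print(frac)   # ~ 1/6, since only 1 of the 6 possible pairs is complementary
```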

3. Asset Structure

We just saw what might impact asset values. Let’s now examine how these features actually affect markets. I study a model where nature selects fundamental values, v_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_v^2), prior to the start of trading. And, these fundamental values are a function of 2 components: the particular feature-specific shock affecting each asset together with an idiosyncratic shock:

(5)   \begin{align*}   v_n &= \mathbf{x}_n^{\top}{\boldsymbol \beta} + \beta_{0,n} = \sum_{q=1}^Q \beta_q \cdot x_{n,q} + \beta_{0,n}   \qquad \text{with} \qquad   2 = \Vert {\boldsymbol \beta} \Vert_0 = \sum_{q=1}^Q 1_{\{\beta_q \neq 0\}} \end{align*}

where \beta_q denotes the extent to which the qth feature affects fundamental values:

(6)   \begin{align*}   \beta_q \overset{\scriptscriptstyle \mathrm{iid}}{\sim}    \begin{cases}     \mathrm{N}(0, \sigma_{\beta}^2) &\text{if } q \in \mathcal{K}_{h_\star}     \\     0 &\text{else}   \end{cases} \end{align*}

and \beta_{0,n} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0, \sigma_{\beta}^2) denotes the idiosyncratic shock. I use the 0th subscript to denote the idiosyncratic component of each stock for brevity, but always omit it when writing the (Q \times 1)-dimensional vector of feature-specific shocks, {\boldsymbol \beta}.

Each asset has exposure to only 1 of the feature-specific shocks since it has exposure to a random subset of \sfrac{1}{2} of all features. Thus, its fundamental volatility is given by:

(7)   \begin{align*}   \sigma_v^2 = 2 \cdot \sigma_\beta^2 \end{align*}

since both the feature-specific shocks and each asset’s idiosyncratic shock have variance \sigma_{\beta}^2. The main benefit of forcing each asset to have exposure to exactly 1 of the feature-specific shocks is that, under these conditions, every single one of the assets will have identical unconditional variance.
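A short simulation sketch of Equations (5) through (7), with a made-up value for \sigma_{\beta}, confirms this variance calculation:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_beta, n_sims = 1.0, 1_000_000   # hypothetical shock volatility

# v_n = (feature-specific shock) + (idiosyncratic shock), as in Equation (5).
beta_k  = rng.normal(0.0, sigma_beta, n_sims)
beta_0n = rng.normal(0.0, sigma_beta, n_sims)
v_n = beta_k + beta_0n

print(np.var(v_n), 2 * sigma_beta**2)   # both ~ 2, matching Equation (7)
```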

4. Naifs’ Objective Function

How does information about these feature-specific shocks gradually creep into prices? Naive asset-specific investors. There are 2 such investors studying each of the 4 stocks, one investor per shock. These so-called naifs choose how many shares to hold of a single asset, \theta_{n,t}^{(k)}, in order to maximize their mean-variance utility over end-of-market wealth:

(8)   \begin{align*}   \max_{\theta_{n,t}^{(k)} \in \mathrm{R}} \left\{ \, \mathrm{E}_{n,t}^{(k)}[w_{n,t}^{(k)}] - \frac{\gamma}{2} \cdot \mathrm{Var}_{n,t}^{(k)}[w_{n,t}^{(k)}] \, \right\} \quad \text{with} \quad w_{n,t}^{(k)} = (v_n - p_{n,t}) \cdot \theta_{n,t}^{(k)} \end{align*}

where \gamma > 0 is their risk-aversion parameter. The (k) superscript is necessary because there are 2 kinds of naifs trading each asset: one that has information about the feature-specific shock and one that has information about the idiosyncratic shock.

Naifs trading in each of the 4 assets see a private signal each period, \epsilon_{n,t}^{(k)}, about how a single shock affects their asset:

(9)   \begin{align*} \epsilon_{n,t}^{(k)} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(\beta_{k}, \sigma_{\epsilon}^2) \end{align*}

where k = 0 denotes a signal about stock n’s idiosyncratic component. For example, a naif studying Target Corp might get a signal about how the company’s fundamental value will rise due to an industry-specific supply-chain management innovation (big-box store feature-specific shock). The other naive asset-specific investor studying Target might then get a signal about how the company’s fundamental value will fall due to the unexpected death of their CEO (Target-specific idiosyncratic shock).

I make 3 key assumptions about how the naifs solve their optimization problem. First, I assume that these investors believe each period that they’ll hold their portfolio until the liquidating dividend at time t=2. Second, I assume that, while naifs see private signals about the size of a feature-specific shock, they do not generalize this information and apply it to other assets with this feature. For example, the naif who realized that Target’s fundamental value will rise due to the supply-chain innovation won’t use this information to reevaluate the correct price of Wal-Mart. Third, these naive investors do not condition on current or past asset prices when forming their expectations. To continue the example, this same investor studying Target won’t analyze the average returns of all big-box stores to get a better sense of how big a value shock the industry-specific supply-chain innovation really was.

All 3 of these assumptions are motivated by bounded rationality. A naive asset-specific investor must use all his concentration just to figure out the implications of his private signals. With no cognitive capacity left to spare, he can’t implement a more complex, dynamic, trading strategy (first assumption), extend his insight to other companies (second assumption), or use prices to form a more sophisticated forecast of the liquidating dividend value (third assumption). These naifs behave similarly to the newswatchers from Hong and Stein (1999) and also neglect correlations in a similar fashion to Eyster and Weizsacker (2010).

5. Baseline Equilibrium

We now have enough structure to characterize a Walrasian equilibrium with private valuations. When no market-wide arbitrageurs are present, the price of each asset is given by:

(10)   \begin{align*} p_{n,1} &= \frac{1}{2} \cdot \left( \frac{\sigma_{\beta}^2}{\sigma_{\beta}^2 + \sigma_{\epsilon}^2} \right) \cdot \left\{ \epsilon_{n,1}^{(0)} + \epsilon_{n,1}^{(k_n)} \right\} \\ p_{n,2} &= \frac{1}{2} \cdot \left( \frac{\sigma_{\beta}^2}{2 \cdot \sigma_{\beta}^2 + \sigma_{\epsilon}^2} \right) \cdot \sum_{t=1}^2 \left\{ \epsilon_{n,t}^{(0)} + \epsilon_{n,t}^{(k_n)} \right\} \end{align*}

where k_n denotes the index of the particular shock affecting the nth asset.
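As a sanity check on Equation (10), here’s a sketch that builds one asset’s prices from simulated naif signals. All the parameter and shock values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma_beta, sigma_eps = 1.0, 0.5        # hypothetical parameters
beta_feat, beta_idio = 0.8, -0.3        # hypothetical shock sizes

# Signals: k = 0 is the idiosyncratic naif, k = 1 the feature naif (Eq. 9).
eps = {(t, k): rng.normal([beta_idio, beta_feat][k], sigma_eps)
       for t in (1, 2) for k in (0, 1)}

# Equilibrium prices with no arbitrageurs present (Equation 10).
p1 = (0.5 * sigma_beta**2 / (sigma_beta**2 + sigma_eps**2)
      * (eps[(1, 0)] + eps[(1, 1)]))
p2 = (0.5 * sigma_beta**2 / (2 * sigma_beta**2 + sigma_eps**2)
      * sum(eps[(t, k)] for t in (1, 2) for k in (0, 1)))
print(p1, p2)
```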

What do these formulas mean for an arbitrageur? Suppose that the big-box store supply-chain innovation occurred and affected assets n=1,2. Naive asset-specific investors neglect the fact that they could use the average returns of assets in the big-box industry to refine their beliefs about the size of the shock. As an arbitrageur, you can profit from this neglected information by deducing the size of the shock from the industry average returns:

(11)   \begin{align*}   \widehat{\beta}_k = \frac{1}{2} \cdot \sum_{n = 1}^2 \Delta \tilde{p}_{n,1} &\sim \mathrm{N}\left( \beta_k, \, 2 \cdot \sigma_{\epsilon}^2 \right) \end{align*}

where \Delta \tilde{p}_{n,1} is given by:

(12)   \begin{align*}  \Delta \tilde{p}_{n,1} = 2 \cdot \left( \frac{\sigma_{\beta}^2 + \sigma_{\epsilon}^2}{\sigma_{\beta}^2}\right) \cdot \Delta p_{n,1} \end{align*}

Simply buy shares of the underpriced big-box stock whose p_{n,1} < \widehat{\beta}_k, and short shares of the overpriced big-box stock whose p_{n,1} > \widehat{\beta}_k.
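In code, this recipe might look like the following sketch, where the 2 period t=1 price changes are made-up inputs:

```python
import numpy as np

sigma_beta, sigma_eps = 1.0, 0.5        # hypothetical parameters
dp1 = np.array([0.12, 0.31])            # hypothetical price changes, n = 1, 2

# Undo the naifs' signal weighting (Equation 12)...
dp1_tilde = 2 * (sigma_beta**2 + sigma_eps**2) / sigma_beta**2 * dp1

# ...then average across the 2 shocked assets to estimate the shock (Eq. 11).
beta_hat = dp1_tilde.mean()

# Long the relatively underpriced asset, short the relatively overpriced one.
position = np.where(dp1_tilde < beta_hat, +1.0, -1.0)
print(beta_hat, position)
```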

Of course, in the real world, you wouldn’t have an oracle. You wouldn’t know ahead of time that the big-box store shock had occurred. Instead, you’d have to not only value the big-box store shock but also identify that the shock had occurred in the first place. Let’s now introduce arbitrageurs to the model and study this joint problem.

6. Arbitrageurs’ Objective Function

Arbitrageurs start out with no private information; however, unlike the naifs, they can observe all 4 asset returns in period t=1. They can then use this information to both value and identify feature-specific shocks, submitting market orders to maximize their risk-neutral utility over end-of-game wealth:

(13)   \begin{align*}     \max_{{\boldsymbol \theta}^{(a)} \in \mathrm{R}^4} \left\{ \, \mathrm{E}\left[ \, \sum_{n=1}^4 (v_n - p_{n,2}) \cdot \theta_n^{(a)} \, \middle| \, \widehat{\mathcal{K}} \, \right] \, \right\} \end{align*}

where \widehat{\mathcal{K}} is chosen as the model of the world that minimizes the arbitrageurs’ average prediction error over the assets’ fundamental values given the observed period t=1 prices, \Delta \tilde{\mathbf{p}}_1. In this model, much like that of Hong and Stein (1999), the naifs effectively serve as market makers.

Because there are more features than assets, Q \gg 4, arbitrageurs must engage in model selection a la Barberis, Shleifer, and Vishny (1998) or Hong, Stein, and Yu (2007). Choosing the right model of the world is their main challenge. It’s figuring out whether Facebook’s IPO failed due to Morgan Stanley’s complacency or due to under-appreciated political risks. If arbitrageurs knew which 2 features to analyze ahead of time, \mathcal{K}_{h_\star}, then their problem would be dramatically easier. It would be as if they had an oracle sitting on their shoulder interpreting market events for them. They would then be able to use the usual OLS techniques to form beliefs about the size of the 2 feature-specific shocks:

(14)   \begin{align*}   \widehat{\boldsymbol \beta}[\mathcal{K}_{h_\star}] &= \left( \mathbf{X}[\mathcal{K}_{h_\star}]^{\top}\mathbf{X}[\mathcal{K}_{h_\star}] \right)^{-1}\mathbf{X}[\mathcal{K}_{h_\star}]^{\top}\mathbf{y} \end{align*}

where \mathbf{X}[\mathcal{K}_h] is \mathbf{X} restricted to columns \mathcal{K}_h, and {\boldsymbol \beta}[\mathcal{K}_h] is {\boldsymbol \beta} restricted to rows \mathcal{K}_h. Note that there is no hat over the choice of feature-specific shocks, \mathcal{K}_{h_\star}; only the {\boldsymbol \beta} has a hat over it. With an oracle, only the exact values of the shocks are unknown.

By contrast, the market-wide arbitrageurs in this model have to use some thresholding rule to cull the number of potential features down to a manageable number. They have to both select \widehat{\mathcal{K}} and estimate \widehat{\boldsymbol \beta}[\widehat{\mathcal{K}}]. While this daunting real-time econometrics problem is new to the asset-pricing literature, researchers and traders confront this problem every single day. As Johnstone (2013) argues, this sort of behavior “is very common, even if much of the time it is conducted informally, or perhaps most often, unconsciously. Most empirical data analyses involve, at the exploration stage, some sort of search for large regression coefficients, correlations or variances, with only those that appear ‘large’, or ‘interesting’ being retained for reporting purposes, or in order to guide further analysis.”

7. Bayesian Inference

Let’s now turn our attention to how a fully-rational Bayesian arbitrageur with non-informative priors should select which features to use. Bayes’ rule tells us that the posterior probability of a particular combination of shocks, \mathrm{Pr}( \mathcal{K}_h | \Delta \tilde{\mathbf{p}}_{1} ), is proportional to the likelihood of observing the realized data given the combination, \mathrm{Pr}( \Delta \tilde{\mathbf{p}}_{1} | \mathcal{K}_h ), times the prior probability of Nature choosing the combination of shocks, \mathrm{Pr}( \mathcal{K}_h ):

(15)   \begin{align*}   \mathrm{Pr}( \mathcal{K}_h | \Delta \tilde{\mathbf{p}}_{1} )   \propto    \mathrm{Pr}( \Delta \tilde{\mathbf{p}}_{1} | \mathcal{K}_h ) \times \mathrm{Pr}( \mathcal{K}_h ) \end{align*}

So, this arbitrageur will select the collection of at most 2 features that maximizes the log-likelihood of the observed data:

(16)   \begin{align*}   \widehat{\mathcal{K}}    &=    \arg \max_{\mathcal{K} \subset \mathcal{Q}} \,    \left\{      \,      \log \mathrm{Pr}( \Delta \tilde{\mathbf{p}}_{1} | \mathcal{K})      \ \, \text{s.t.{}} \ \,     |\mathcal{K}| \leq 2     \,    \right\} \end{align*}

since each of the combinations of shocks is equally likely.

Why is there an inequality sign in Equation (16)? That is, why isn’t the constraint |\mathcal{K}| = 2? Because some of the elements in {\boldsymbol \beta}[\mathcal{K}_{h_\star}] will be small. After all, each is drawn from a Gaussian distribution. A fully-rational Bayesian arbitrageur will want to ignore some of the smaller elements in {\boldsymbol \beta}[\mathcal{K}_{h_\star}] since he faces overfitting risk. For instance, if all Houston-based firms realize a local tax shock that increases their realized returns to the tune of 0.25{\scriptstyle \%} per year, then it will be impossible for a market-wide arbitrageur to spot this shock. Firm-level volatility can exceed 40{\scriptstyle \%} per year. An arbitrageur trying to recover such a weak signal from amongst so much noise is more likely to overfit the observed data and draw the wrong inference.

Schwarz (1978) showed that fully Bayesian arbitrageurs in this setting should ignore all coefficients smaller than \beta_{\min} = \sigma_{\epsilon} \cdot \sqrt{2 \cdot \log(Q)}. This is the correct threshold for a Gaussian model in the following sense. Suppose that there were no shocks. That is, we had \mathcal{K} = \emptyset and v_n = \beta_{0,n} for each of the 4 assets. Then, we would like our estimator to tell us that there are no shocks with overwhelming probability:

(17)   \begin{align*} \mathrm{Pr}\left[ \max_{q \in \mathcal{Q}} | \langle \Delta \tilde{p}_{n,1} \rangle_q| > \beta_{\min} \right] &\leq \alpha  \end{align*}

where \alpha is an arbitrarily small number that is chosen in advance, and \langle \cdot \rangle_q denotes the average over the set of assets with exposure to feature q. This particular choice of \beta_{\min} comes from the fact that:

(18)   \begin{align*} \lim_{Q \to \infty} \frac{\max_{q \in \mathcal{Q}} \sfrac{| \langle \Delta \tilde{p}_{n,1} \rangle_q|}{\sigma_{\epsilon}}}{\sqrt{2 \cdot \log(Q)}} = 1 \end{align*}

almost surely for a Gaussian model.
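Putting the pieces together, here’s a sketch of the resulting hard-thresholding rule. It assumes the arbitrageur keeps at most the 2 largest feature-level averages that clear the \beta_{\min} cutoff, and that every feature has at least 1 exposed asset:

```python
import numpy as np

def select_features(dp1_tilde, X, sigma_eps):
    """Pick at most 2 features whose average rescaled price change clears
    the universal threshold beta_min = sigma_eps * sqrt(2 * log(Q))."""
    Q = X.shape[1]
    beta_min = sigma_eps * np.sqrt(2.0 * np.log(Q))
    # <dp1_tilde>_q: average over the assets exposed to feature q;
    # assumes each feature has at least one exposed asset.
    avg = np.array([dp1_tilde[X[:, q] == 1].mean() for q in range(Q)])
    biggest = np.argsort(-np.abs(avg))[:2]      # 2 largest in magnitude
    return [int(q) for q in biggest if abs(avg[q]) > beta_min]
```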

8. Equilibrium with Arbitrageurs

Let’s now wrap up by looking at the effect of these market-wide arbitrageurs on equilibrium asset prices. Prices in period t=1 will be the same as before since arbitrageurs have no information in the first period. As a result, they do not trade in period t=1. To solve for time t=2 prices as a function of arbitrageur demand, simply observe that market clearing implies:

(19)   \begin{align*} - \, \theta_n^{(a)} &= \sum_{k=0}^1 \frac{1}{\gamma} \cdot \frac{\mathrm{E}_{n,2}^{(k)}[v_n] - p_{n,2}}{\mathrm{Var}_{n,2}^{(k)}[v_n]} \end{align*}

Some simplification then yields:

(20)   \begin{align*} p_{n,2} &= \frac{1}{2} \cdot \overbrace{\left( \frac{\sigma_{\beta}^2}{2 \cdot \sigma_{\beta}^2 + \sigma_{\epsilon}^2} \right)}^{=B} \cdot \sum_{k=0}^1 \left\{ \epsilon_{n,1}^{(k)} + \epsilon_{n,2}^{(k)} \right\} + \overbrace{\gamma \cdot \sigma_{\beta}^2 \cdot \left( \frac{\sigma_{\beta}^2 + \sigma_{\epsilon}^2}{2 \cdot \sigma_{\beta}^2 + \sigma_{\epsilon}^2} \right)}^{=C} \cdot \theta_n^{(a)} \end{align*}

Thus, we can see that the price of each asset will be a weighted average of the signals that the naifs receive and the arbitrageurs’ demand. An asset’s price will be higher if the naifs get more positive asset-specific signals or if arbitrageurs demand more as a result of a more positive feature-specific signal.

Suppose that, after observing period t=1 returns, arbitrageurs believe that features \widehat{\mathcal{K}} have realized a shock. If they are using the Bayesian information criterion, this means that for each k \in \widehat{\mathcal{K}} the estimated \widehat{\beta}_k was larger than \beta_{\min} = \sigma_{\epsilon} \cdot \sqrt{2 \cdot \log(Q)}. It’s possible to write the arbitrageurs’ beliefs about the value of each asset as a linear combination of an asset-specific component, A_n, and the estimated feature-specific shock size, \widehat{\beta}_{k_n}:

(21)   \begin{align*} \mathrm{E}[ v_n | \widehat{\mathcal{K}}, \Delta \tilde{\mathbf{p}}_1] &= A_n + \widehat{\beta}_{k_n} \end{align*}

The asset-specific component, A_n, comes from the fact that, if arbitrageurs believe that an asset’s value is due in part to a feature-specific shock of size \widehat{\beta}_{k_n}, then they can use these beliefs to update their priors about the size of the asset’s idiosyncratic shock. Plugging this linear formula into arbitrageurs’ optimal portfolio holdings yields:

(22)   \begin{align*} \theta_n^{(a)} &= \frac{A_n}{2 \cdot C} + \left(\frac{1- B}{2 \cdot C}\right) \cdot \widehat{\beta}_{k_n} \end{align*}

where the coefficient on \widehat{\beta}_{k_n} can be simplified as follows:

(23)   \begin{align*} \frac{1- B}{2 \cdot C} = \frac{1 - \left( \frac{\sigma_{\beta}^2}{2 \cdot \sigma_{\beta}^2 + \sigma_{\epsilon}^2} \right)}{2 \cdot \gamma \cdot \sigma_{\beta}^2 \cdot \left( \frac{\sigma_{\beta}^2 + \sigma_{\epsilon}^2}{2 \cdot \sigma_{\beta}^2 + \sigma_{\epsilon}^2} \right)} = \frac{\frac{\sigma_{\beta}^2 + \sigma_{\epsilon}^2}{2 \cdot \sigma_{\beta}^2 + \sigma_{\epsilon}^2}}{2 \cdot \gamma \cdot \sigma_{\beta}^2 \cdot \left( \frac{\sigma_{\beta}^2 + \sigma_{\epsilon}^2}{2 \cdot \sigma_{\beta}^2 + \sigma_{\epsilon}^2} \right)} = \frac{1}{2} \cdot \frac{1}{\gamma \cdot \sigma_{\beta}^2} \end{align*}

This result implies that arbitrageurs decrease their demand for an asset with exposure to, say, a negative political-risk shock by 0.50 \times (\gamma \cdot \sigma_{\beta}^2)^{-1} shares for every \mathdollar 1 increase in the size of the shock.
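As a sketch, the demand rule in Equations (22) and (23) translates directly into code; the assertion at the end double-checks the simplification of the slope coefficient with arbitrary parameter values:

```python
import numpy as np

def arb_demand(A_n, beta_hat, gamma, sigma_beta, sigma_eps):
    """Arbitrageur demand for asset n (Equation 22)."""
    B = sigma_beta**2 / (2 * sigma_beta**2 + sigma_eps**2)
    C = (gamma * sigma_beta**2 * (sigma_beta**2 + sigma_eps**2)
         / (2 * sigma_beta**2 + sigma_eps**2))
    return A_n / (2 * C) + (1 - B) / (2 * C) * beta_hat

# The slope on beta_hat should simplify to 1 / (2 * gamma * sigma_beta**2),
# as in Equation (23).
gamma, sigma_beta, sigma_eps = 2.0, 1.0, 0.5
slope = arb_demand(0.0, 1.0, gamma, sigma_beta, sigma_eps)
assert np.isclose(slope, 1 / (2 * gamma * sigma_beta**2))
```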

The key implication of this model is that including a shocked feature in the arbitrageurs’ model of the world will yield a price shock of size:

(24)   \begin{align*} \mathrm{E}[p_{n,2}|\widehat{\mathcal{K}} = \mathcal{K}_{h_\star}] - \mathrm{E}[p_{n,2}|\widehat{\mathcal{K}} = \emptyset] &= \frac{1}{2} \cdot \left( \frac{\sigma_{\beta}^2 + \sigma_{\epsilon}^2}{2 \cdot \sigma_{\beta}^2 + \sigma_{\epsilon}^2} \right) \cdot \widehat{\beta}_{k_n} \end{align*}

For instance, if arbitrageurs were using Bayesian updating, then there would be a discontinuous jump in the effect of a political-risk shock on social media companies like Facebook as the size of the shock crossed the \beta_{\min} threshold.

Hong, Stein, and Yu (2007)

1. Motivation

It’s absolutely essential that people ignore most contingencies when making predictions in everyday life. Dennett (1984) makes this point quite colorfully by asking: “How is it that I can get myself a midnight snack? I suspect there is some leftover sliced turkey and mayonnaise in the fridge, and bread in the breadbox… and a bottle of beer in the fridge as well… I forthwith put the plan into action and it works! Big deal.” The punchline of the story is that in order to put the plan into action, Dennett actually needs to ignore a great number of hypotheses: “that mayonnaise doesn’t dissolve knives on contact, that a slice of bread is smaller than Mount Everest, and that opening the refrigerator doesn’t cause a nuclear holocaust in the kitchen.” If he didn’t ignore all of these possibilities, he’d never be able to get anything done.

In this note, I work through the asset-pricing model in Hong, Stein, and Yu (2007) which posits that traders use an overly simple model of the world to make predictions about future payouts. The model predicts that there will be sudden shifts in asset prices when traders switch mental models in the same way that there would be a sudden shift in your midnight snacking behavior if you switched mental models and started believing that an open refrigerator door led to armageddon. Thus, the authors refer to this setup as a model of simple forecasts and paradigm shifts.

2. Asset Structure

There is a single asset which pays out a dividend, D_t, at each point in time t = 0,1,2\ldots. This dividend payout is the sum of 3 components: component A, component B, and noise. Thus, I can write the dividend payout as:

(1)   \begin{align*} D_t &= A_t + B_t + \sigma_D \cdot \varepsilon_{D,t} \end{align*}

where \varepsilon_{D,t} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,1). For simplicity, suppose that components A and B both follow \mathrm{AR}(1) processes:

(2)   \begin{align*} A_{t+1} = \rho \cdot A_t + \sigma_A \cdot \varepsilon_{A,t+1} \qquad \text{and} \qquad B_{t+1} = \rho \cdot B_t + \sigma_B \cdot \varepsilon_{B,t+1} \end{align*}

with \rho \in (0,1) and \varepsilon_{A,t}, \varepsilon_{B,t} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,1). Thus, each of these variables has mean 0 and variance given by:

(3)   \begin{align*} \mathrm{Var}[A_t] = \frac{\sigma_A^2}{1 - \rho^2} \qquad \text{and} \qquad \mathrm{Var}[B_t] = \frac{\sigma_B^2}{1 - \rho^2} \end{align*}
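Here’s a quick simulation sketch of the dividend process in Equations (1) through (3), with hypothetical volatility and persistence parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
T, rho = 100_000, 0.9
sigma_A, sigma_B, sigma_D = 1.0, 1.0, 0.5   # hypothetical volatilities

A, B = np.zeros(T), np.zeros(T)
for t in range(T - 1):
    A[t + 1] = rho * A[t] + sigma_A * rng.normal()   # Equation (2)
    B[t + 1] = rho * B[t] + sigma_B * rng.normal()

D = A + B + sigma_D * rng.normal(size=T)             # Equation (1)
print(np.var(A), sigma_A**2 / (1 - rho**2))          # Equation (3): ~5.26
```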

Crucially, each period t traders can see both A_t and B_t as well as \varepsilon_{A,t+1} and \varepsilon_{B,t+1}. Thus, they know the next period’s realizations of A_{t+1} and B_{t+1} even if they choose not to use this information in their simple model. Define the parameter:

(4)   \begin{align*} \theta &= ( 1 + \delta - \rho )^{-1} \end{align*}

where \delta > 0 denotes traders’ discount rate. Then, a fully rational trader—i.e., someone who takes into consideration both A_t and B_t—with risk-neutral preferences would price this asset:

(5)   \begin{align*} P_t^R = V_t^R &= \theta \times (A_{t+1} +  B_{t+1}) \end{align*}

The price in this setting is just the discounted present value of the expected future dividend stream.

3. Benchmark Model

Let’s now consider a benchmark model where traders use an overly simplified model, but never update this model. Specifically, assume traders believe that dividends are determined by only component A and noise:

(6)   \begin{align*} D_t &= A_t + \sigma_D \cdot \varepsilon_{D,t} \end{align*}

i.e., they ignore the fact that B_t actually affects dividends in any way. Let M_t \in \{A,B\} denote the model that traders use to predict dividends. In this benchmark setting, traders believe that the true model will remain in state A forever:

(7)   \begin{align*} \mathrm{Pr}\left[ \, M_{t+1} = A \, \middle| \, M_t = A \, \right] &= 1 \end{align*}

Prices in this world are then given by:

(8)   \begin{align*} P_t^A &= V_t^A = \theta \times A_{t+1} \end{align*}

They are the discounted present value of the dividends implied by only component A.

This setup makes it easy to compute the dollar returns for the asset:

(9)   \begin{align*} R_t^A &= D_t + P_t^A - (1 + \delta) \cdot P_{t-1}^A \\ &= \theta \cdot \sigma_A \times \varepsilon_{A,t+1} + \left\{ \, B_t + \sigma_D \cdot \varepsilon_{D,t} \, \right\} \end{align*}

If I define the variable Z_t^A = B_t + \sigma_D \cdot \varepsilon_{D,t} representing the traders’ prediction error, then this formula becomes short and sweet:

(10)   \begin{align*} R_t^A &= \theta \cdot \sigma_A \times \varepsilon_{A,t+1} + Z_t^A \end{align*}

i.e., the returns to holding this asset are the discounted present value of the future innovations to component A plus the prediction error incurred by using only model A instead of the full model.

Asset returns will appear predictable to a more sophisticated trader who knows that both components A and B affect the asset’s dividends. The auto-covariance of the dollar returns is given by:

(11)   \begin{align*} \mathrm{Cov}\left[ R_t^A,R_{t-1}^A\right] &= \mathrm{Cov}\left[ B_t , B_{t-1}\right] = \rho \cdot \left( \frac{\sigma_B^2}{1 - \rho^2} \right) \end{align*}

Thus, there will be more persistence in asset returns when traders’ prediction error from not including component B is more persistent—i.e., when \rho is closer to 1.
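This auto-covariance is easy to verify by simulation. A sketch, again with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(4)
T, rho, sigma_B = 500_000, 0.9, 1.0

# The predictable part of R_t^A is traders' prediction error Z_t^A, whose
# persistence is inherited from component B (Equation 11).
B = np.zeros(T)
for t in range(T - 1):
    B[t + 1] = rho * B[t] + sigma_B * rng.normal()

print(np.cov(B[1:], B[:-1])[0, 1])         # sample auto-covariance
print(rho * sigma_B**2 / (1 - rho**2))     # theoretical value: ~4.74
```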

4. Belief Updating

Now, let’s move away from this benchmark model and consider the case where traders might switch between simple models. e.g., they might start out exclusively using component A to predict dividends, but then switch over to exclusively using component B after model A does a really bad job. Note that traders are wrong in both cases; however, switching models can still generate better predictions. e.g., think about switching over to model B when component B_t is really large and component A_t is close to 0. Because both A_t and B_t are positively auto-correlated, exclusively using model B will give higher fidelity predictions about the dividend level in the next few periods.

Let \pi_A denote traders’ belief that the true model will remain in state A next period given that it’s in state A now:

(12)   \begin{align*} \mathrm{Pr}\left[ \, M_{t+1} = A \, \middle| \, M_t = A \, \right] &= \pi_A \end{align*}

Similarly, let \pi_B denote traders’ belief that the true model will remain in state B next period given that it’s in state B now:

(13)   \begin{align*} \mathrm{Pr}\left[ \, M_{t+1} = B \, \middle| \, M_t = B \, \right] &= \pi_B \end{align*}

This setup means that, for instance, traders believe that the fraction of time the market spends in model A is given by:

(14)   \begin{align*} \frac{1 - \pi_B}{2 - \pi_A - \pi_B} \end{align*}

For simplicity, I assume a symmetric setting such that \pi_A = \pi_B = \pi \in (\sfrac{1}{2},1). This rule has to be consistent with the true transition probability of their beliefs in equilibrium; however, it’s important to emphasize that having any beliefs about \pi is in some sense wrong since components A and B always contribute to dividend payouts.

While traders always exclusively use either component A_t or component B_t to predict dividend payouts, somewhere in the dark recesses of their mind they have beliefs about when they should switch mental models. e.g., if you started making a midnight snack, you might not immediately know what to do when your first knife dissolved in the mayonnaise jar, but you wouldn’t ruin several knives in a row this way. Let f^A(D_t) denote traders’ beliefs about the distribution of dividends in period t given that they entered the period using only component A_t to predict dividend payouts:

(15)   \begin{align*} f^A(D_t) &= \frac{1}{\sigma_D} \cdot \phi\left( \frac{D_t - A_t}{\sigma_D} \right) =  \frac{1}{\sigma_D} \cdot \phi\left( \frac{1}{\sigma_D} \cdot Z_t^A \right) \end{align*}

Traders’ Bayesian posterior going into period (t + 1) about whether or not model A is still the correct model is then given by:

(16)   \begin{align*} Q_{t+1} &= \sfrac{1}{2} + (2 \cdot \pi - 1) \cdot (X_{t+1} - \sfrac{1}{2}) \end{align*}

The parameter \pi is just traders’ prior on the model-switching probability. The variable X_{t+1} is given by:

(17)   \begin{align*} X_{t+1} &= \frac{Q_t \cdot L_t}{1 - Q_t \cdot (1 - L_t)} \end{align*}

where L_t denotes the likelihood ratio:

(18)   \begin{align*} L_t &= \frac{f^A(D_t)}{f^B(D_t)} = \exp\left\{ \, - \, \frac{(Z_t^A)^2 - (Z_t^B)^2}{2 \cdot \sigma_D^2} \, \right\} \end{align*}

Note that this ratio is always non-negative, and it is decreasing in the difference |Z_t^A| - |Z_t^B|. i.e., traders tilt their beliefs toward model A after seeing that |Z_t^A| is smaller than |Z_t^B| and vice versa.
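One step of this updating scheme is just a few lines of code. Here’s a sketch implementing Equations (16) through (18):

```python
import numpy as np

def update_belief(Q_t, Z_A, Z_B, sigma_D, pi):
    """Traders' posterior that model A is still correct (Equations 16-18)."""
    # Likelihood ratio of model A relative to model B (Equation 18).
    L_t = np.exp(-(Z_A**2 - Z_B**2) / (2 * sigma_D**2))
    # Bayesian update (Equation 17)...
    X_next = Q_t * L_t / (1 - Q_t * (1 - L_t))
    # ...shrunk toward 1/2 by the prior switching probability (Equation 16).
    return 0.5 + (2 * pi - 1) * (X_next - 0.5)

# If model A's prediction error is smaller, beliefs tilt toward model A.
print(update_belief(Q_t=0.5, Z_A=0.1, Z_B=2.0, sigma_D=1.0, pi=0.9))  # > 0.5
```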

5. Model with Learning

From here on out, solving a model where traders learn from their past errors and switch between simplified mental models is quite straight-forward. Without loss of generality, let’s consider the case where traders enter period t using only component A to predict dividends. Then, traders’ model choice next period is given by:

(19)   \begin{align*} M_{t+1} &= \begin{cases} A &\text{if } Q_{t+1} \geq q \\ B &\text{else} \end{cases} \end{align*}

for q < \sfrac{1}{2}. e.g., if q = 0.05, then traders will continue to make forecasts exclusively with component A until it is rejected at the 5{\scriptstyle \%} confidence level. Once this happens, they will switch over to exclusively using component B. The smaller is q, the stronger is the degree of resistance to model change.

In this setup, there are then 2 different regimes to consider when computing returns: i) no shift (\mathit{NS}) and ii) shift (S). The returns in the no shift regime are the exact same as before:

(20)   \begin{align*} R_t^{\mathit{NS}} &= Z_t^A + \theta \cdot \sigma_A \times \varepsilon_{A,t+1} \end{align*}

since the traders ignore the possibility of there ever being another component B when using model A. The returns in the shift regime are more complicated:

(21)   \begin{align*} R_t^S &= Z_t^A + \theta \cdot \sigma_B \times \varepsilon_{B,t+1} + \rho \cdot \theta \times (B_t - A_t) \end{align*}

The returns when traders shift from model A to model B differ from the no shift regime because traders purge all current and lagged model A-information from prices and replace it with model B-information.

Two Period Kyle (1985) Model

1. Motivation

This post shows how to solve for the equilibrium price impact and demand coefficients in a 2 period Kyle (1985)-type model where informed traders see a noisy signal about the fundamental value of a single asset. There are various other places where you can see how to solve this sort of model. e.g., take a look at Markus Brunnermeier’s class notes or Laura Veldkamp’s excellent textbook. Both these sources solve the static 1 period model in closed form, and then give the general T \geq 1 period form of the dynamic multi-period model. Any intuition that I can get with a dynamic model usually comes in the first 2 periods, so I find myself frequently working out the 2 period model explicitly. Here is that model.

2. Market description

I begin by outlining the market setting. Consider a world with 2 trading periods t = 1, 2 and a single asset whose fundamental value is given by:

(1)   \begin{align*} v \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0, \sigma_{v}^2) \end{align*}

in units of dollars per share. There are 2 kinds of agents: informed traders and noise traders. Both kinds of traders submit market orders to a group of market makers who see only the aggregate order flow, \Delta x_t, each period:

(2)   \begin{align*} \Delta x_t &= \Delta y_t + \Delta z_t \end{align*}

where \Delta y_t denotes the order flow from the informed traders and \Delta z_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0, \sigma_{\Delta z}^2) denotes the order flow from the noise traders. The market makers face perfect competition, so they have to set the price each period equal to their expectation of the fundamental value of the asset given aggregate demand:

(3)   \begin{align*} p_1 &= \mathrm{E}[v|\Delta x_1] \qquad \text{and} \qquad p_2 = \mathrm{E}[v|\Delta x_1, \Delta x_2] \end{align*}

Prior to the start of the first trading period, informed traders see an unbiased signal s about the asset’s fundamental value:

(4)   \begin{align*} s = v + \epsilon \qquad \text{where} \qquad \epsilon \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_{\epsilon}^2) \end{align*}

so that s \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(v,\sigma_{\epsilon}^2). In period 1, these traders choose the number of shares to demand from the market maker, \Delta y_1, to solve:

(5)   \begin{align*} \mathrm{H}_0 = \max_{\Delta y_1} \, \mathrm{E}\left[ \, (v - p_1) \cdot \Delta y_1 + \mathrm{H}_1 \, \middle| \, s \, \right] \end{align*}

where \mathrm{H}_{t-1} denotes their value function entering period t. Similarly, in period 2 these traders optimize:

(6)   \begin{align*} \mathrm{H}_1 = \max_{\Delta y_2} \, \mathrm{E} \left[ \, (v - p_2) \cdot \Delta y_2 \, \middle| \, s, \, p_1 \, \right] \end{align*}

The extra H_1 term shows up in informed traders’ time t=1 optimization problem but not their time t=2 optimization problem because the model ends after the second trading period.

An equilibrium is a linear demand rule for the informed traders in each period:

(7)   \begin{align*}  \Delta y_t = \alpha_{t-1} + \beta_{t-1} \cdot s \end{align*}

and a linear market maker pricing rule in each period:

(8)   \begin{align*}  p_t = \kappa_{t-1} + \lambda_{t-1} \cdot \Delta x_t \end{align*}

such that given the demand rule in each period the pricing rule solves the market maker’s problem, and given the market maker pricing rule in each period the demand rule solves the trader’s problem.

3. Information and Updating

The informed traders need to update their beliefs about the fundamental value of the asset after observing their signal s. Using DeGroot (1969)-style updating, it’s possible to compute their posterior beliefs:

(9)   \begin{align*} \sigma_{v|s}^2 &= \left( \frac{\sigma_{\epsilon}^2}{\sigma_v^2 + \sigma_{\epsilon}^2} \right) \times \sigma_v^2 \qquad \text{and} \qquad \mu_{v|s} = \underbrace{\left( \frac{\sigma_v^2}{\sigma_v^2 + \sigma_{\epsilon}^2} \right)}_{\theta} \times s \end{align*}

After observing aggregate order flow in period t=1, market makers need to update their beliefs about the true value of the asset. Using the linearity of informed traders’ demand rule, we can rewrite the aggregate demand as a signal about the fundamental value as follows:

(10)   \begin{align*} \frac{\Delta x_1}{\beta_0} &= v + \left( \epsilon + \frac{\Delta z_1}{\beta_0} \right) \end{align*}

Note that both the signal error and noise trader demand cloud the market makers’ inference. Using the same DeGroot updating strategy, it’s possible to compute the market makers’ posterior beliefs about v as follows, writing \sigma_s^2 = \sigma_v^2 + \sigma_{\epsilon}^2 for the variance of the informed traders’ signal:

(11)   \begin{align*} \sigma_{v|\Delta x_1}^2 = \left( \frac{\beta_0^2 \cdot \sigma_{\epsilon}^2 + \sigma_{\Delta z}^2}{\beta_0^2 \cdot \sigma_s^2 + \sigma_{\Delta z}^2} \right) \times \sigma_v^2 \quad \text{and} \quad \mu_{v|\Delta x_1} = \left( \frac{\beta_0 \cdot \sigma_v^2}{\beta_0^2 \cdot \sigma_s^2 + \sigma_{\Delta z}^2} \right) \times \Delta x_1 \end{align*}

It’s also possible to view the aggregate order flow in time t=1 as a signal about the informed traders’ signal rather than the fundamental value of the asset:

(12)   \begin{align*} \frac{\Delta x_1}{\beta_0} &= s + \frac{\Delta z_1}{\beta_0} \end{align*}

yielding posterior beliefs:

(13)   \begin{align*} \sigma_{s|\Delta x_1}^2 = \left( \frac{\sigma_{\Delta z}^2}{\sigma_{\Delta z}^2 + \beta_0^2 \cdot \sigma_s^2} \right) \times \sigma_s^2 \quad \text{and} \quad \mu_{s|\Delta x_1} = \left( \frac{\beta_0 \cdot \sigma_s^2}{\sigma_{\Delta z}^2 + \beta_0^2 \cdot \sigma_s^2} \right) \times \Delta x_1 \end{align*}
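These posterior formulas are simple enough to code up directly. A sketch, with \sigma_s^2 = \sigma_v^2 + \sigma_{\epsilon}^2 as above:

```python
import numpy as np

def posterior_v_given_s(s, sigma_v, sigma_eps):
    """Informed traders' beliefs about v given s (Equation 9)."""
    theta = sigma_v**2 / (sigma_v**2 + sigma_eps**2)
    return theta * s, theta * sigma_eps**2            # mean, variance

def posterior_v_given_dx1(dx1, beta0, sigma_v, sigma_eps, sigma_dz):
    """Market makers' beliefs about v given order flow (Equation 11)."""
    sigma_s2 = sigma_v**2 + sigma_eps**2
    denom = beta0**2 * sigma_s2 + sigma_dz**2
    mu  = beta0 * sigma_v**2 / denom * dx1
    var = (beta0**2 * sigma_eps**2 + sigma_dz**2) / denom * sigma_v**2
    return mu, var
```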

4. Second Period Solution

With the market description and information sets in place, I can now solve the model by working backwards. Let’s start with the market makers’ time t=2 problem. Since the market maker faces perfect competition, the time t=1 price has to satisfy the condition:

(14)   \begin{align*} \mathrm{E}[v|\Delta x_1] &= p_1 \end{align*}

As a result, \kappa_0 = 0 and

(15)   \begin{align*} \kappa_1  &= \mathrm{E}[v|\Delta x_1] - \lambda_1 \cdot \mathrm{E}[\Delta x_2|\Delta x_1] = p_1 - \frac{1}{2} \cdot \underbrace{(\theta \cdot \mu_{s | \Delta x_1} - p_1)}_{=0} = p_1 \end{align*}

However, this is about all we can say without knowing more about how the informed traders behave.

Moving to the informed traders’ time t=2 problem, we see that they optimize over the size of their time t=2 market order with knowledge of their private signal, s, and the time t=1 price, p_1, as follows:

(16)   \begin{align*} \mathrm{H}_1 &= \max_{\Delta y_2} \ \mathrm{E} \left[ \, \left(v - \kappa_1 - \lambda_1 \cdot \Delta x_2 \right) \cdot \Delta y_2 \, \middle| \, s, p_1  \, \right] \end{align*}

Taking the first order condition yields an expression for their optimal time t=2 demand:

(17)   \begin{align*} \Delta y_2 &= \underbrace{- \, \frac{p_1}{2 \cdot \lambda_1}}_{\alpha_1} + \underbrace{\frac{\theta}{2 \cdot \lambda_1}}_{\beta_1} \cdot s \end{align*}

Informed traders place market orders in period t=2 that are linearly increasing in the size of their private signal; what’s more, if we hold the equilibrium value of \lambda_1 constant, they will trade more aggressively when they have a more accurate private signal (i.e., \sigma_{\epsilon}^2 \searrow 0).

If we now return to the market makers’ problem, we can partially solve for the price impact coefficient in period t=2:

(18)   \begin{align*} \lambda_1  &= \frac{\mathrm{Cov}[ \Delta x_2, v | \Delta x_1]}{\mathrm{Var}[ \Delta x_2| \Delta x_1]} = \frac{\mathrm{Cov}\left[ \, \alpha_1 + \beta_1 \cdot s + \Delta z_2, v \, \middle| \, \Delta x_1 \, \right]}{\mathrm{Var}\left[ \, \alpha_1 + \beta_1 \cdot s + \Delta z_2 \, \middle| \, \Delta x_1 \, \right]} = \frac{\beta_1 \cdot \sigma_{v|\Delta x_1}^2}{\beta_1^2 \cdot \sigma_{s|\Delta x_1}^2 + \sigma_{\Delta z}^2} \end{align*}

However, to go any further and solve for \sigma_{v|\Delta x_1}^2 or \sigma_{s|\Delta x_1}^2, we need to know how aggressively traders will act on their private information in period t=1… we need to know \beta_0.

5. First Period Solution

To solve the informed traders’ time t=1 problem, I first make an educated guess about the functional form of their value function:

(19)   \begin{align*} \mathrm{E}[\mathrm{H}_1|s] &= \psi_1 + \omega_1 \cdot \left( \mu_{v|s} - p_1 \right)^2 \end{align*}

We can now solve for the time t=1 equilibrium parameter values by plugging in the linear price impact and demand coefficients to the informed traders’ optimization problem:

(20)   \begin{align*} \mathrm{H}_0 &= \max_{\Delta y_1} \, \mathrm{E}\left[ \, (v - p_1) \cdot \Delta y_1 + \psi_1 + \omega_1 \cdot \left( \theta \cdot s - p_1 \right)^2 \, \middle| \, s \, \right] \end{align*}

Taking the first order condition with respect to the informed traders’ time t=1 demand gives:

(21)   \begin{align*} 0 &= \mathrm{E}\left[ \, \left(v - 2 \cdot \lambda_0 \cdot \Delta y_1 - \lambda_0 \cdot \Delta z_1 \right)   - 2 \cdot \omega_1 \cdot \lambda_0 \cdot \left( \theta \cdot s - \lambda_0 \cdot \{ \Delta y_1 + \Delta z_1  \} \right) \, \middle| \, s \, \right] \end{align*}

Evaluating their expectation operator yields:

(22)   \begin{align*} 0 &= \theta \cdot s - 2 \cdot \lambda_0 \cdot \Delta y_1 - 2 \cdot \omega_1 \cdot \lambda_0 \cdot \left\{   \theta \cdot s - \lambda_0 \cdot \Delta y_1 \right\}  \end{align*}

Rearranging terms then gives the informed traders’ demand rule which is a linear function of the signal they got about the asset’s fundamental value:

(23)   \begin{align*} \Delta y_1 &= \frac{\theta}{2 \cdot \lambda_0} \cdot \left( \frac{1 - 2 \cdot \omega_1 \cdot \lambda_0}{1 - \omega_1 \cdot \lambda_0} \right) \cdot s \end{align*}

Finally, using the same projection formula as above, we can solve for the market makers’ price impact rule:

(24)   \begin{align*} \lambda_0 &= \frac{\mathrm{Cov}[ \Delta x_1, v]}{\mathrm{Var}[ \Delta x_1]} = \frac{\mathrm{Cov}[\alpha_0 + \beta_0 \cdot (v + \epsilon) + \Delta z_1, v]}{\mathrm{Var}[ \alpha_0 + \beta_0 \cdot s + \Delta z_1]} = \frac{\beta_0 \cdot \sigma_v^2}{\beta_0^2 \cdot \sigma_s^2 + \sigma_{\Delta z}^2} \end{align*}

6. Guess Verification

To wrap things up, let’s now check that my guess about the value function is consistent. Looking at the informed traders’ time t=2 problem, and substituting in the equilibrium coefficients we get:

(25)   \begin{align*} \mathrm{H}_1 &= \mathrm{E} \left[ \, \left(v - p_2 \right) \cdot \Delta y_2 \, \middle| \, s  \, \right] \\ &= \mathrm{E} \left[ \, \left(v - \left\{p_1 + \lambda_1 \cdot \left( \alpha_1 + \beta_1 \cdot s + \Delta z_2 \right) \right\}  \right) \times \left( \alpha_1 + \beta_1 \cdot s \right) \, \middle| \, s  \, \right] \end{align*}

Using the fact that \alpha_1 = -\sfrac{p_1}{(2 \cdot \lambda_1)} and \beta_1 = \sfrac{\theta}{(2 \cdot \lambda_1)} then leads to:

(26)   \begin{align*} \mathrm{H}_1 &= \mathrm{E} \left[ \, \frac{1}{2 \cdot \lambda_1} \times \left( \left\{ v - p_1 \right\} - \frac{1}{2} \cdot \left\{ \theta \cdot s - p_1 \right\} - \lambda_1 \cdot \Delta z_2 \right) \times \left( \theta \cdot s - p_1 \right) \, \middle| \, s  \, \right] \end{align*}

Adding and subtracting \mu_{v|s} = \theta \cdot s in the first term simplifies things even further:

(27)   \begin{align*} \mathrm{H}_1 &= \mathrm{E} \left[ \, \frac{1}{2 \cdot \lambda_1} \times \left( \left\{ v - \theta \cdot s \right\} + \frac{1}{2} \cdot \left\{ \theta \cdot s - p_1 \right\} - \lambda_1 \cdot \Delta z_2 \right) \times \left( \theta \cdot s - p_1 \right) \, \middle| \, s  \, \right] \end{align*}

Thus, informed traders’ continuation value is quadratic in the distance between their expectation of the fundamental value and the period t=1 price:

(28)   \begin{align*} \mathrm{H}_1 &= \text{Const.} + \underbrace{\frac{1}{4 \cdot \lambda_1}}_{\omega_1} \cdot \left( \mu_{v|s} - p_1 \right)^2 \end{align*}

which is consistent with the original linear quadratic guess. Boom.

7. Numerical Analysis

Given the analysis above, we could derive the correct values of all the other equilibrium coefficients if we knew the optimal \beta_0. To compute the equilibrium coefficient values, make an initial guess, \widehat{\beta}_0, and use this guess to compute the values of the other equilibrium coefficients:

(29)   \begin{align*} \widehat{\lambda}_0 &\leftarrow \frac{\widehat{\beta}_0 \cdot \sigma_v^2}{\widehat{\beta}_0^2 \cdot \sigma_s^2 + \sigma_{\Delta z}^2} \\ \widehat{\sigma}_{v|\Delta x_1}^2 &\leftarrow \left( \frac{\widehat{\beta}_0^2 \cdot \sigma_{\epsilon}^2 + \sigma_{\Delta z}^2}{\widehat{\beta}_0^2 \cdot \sigma_s^2 + \sigma_{\Delta z}^2} \right) \cdot \sigma_v^2 \\ \widehat{\sigma}_{s|\Delta x_1}^2 &\leftarrow \left( \frac{\sigma_{\Delta z}^2}{\widehat{\beta}_0^2 \cdot \sigma_s^2 + \sigma_{\Delta z}^2} \right) \cdot \sigma_s^2 \\ \widehat{\lambda}_1 &\leftarrow \frac{1}{\sigma_{\Delta z}} \cdot \sqrt{ \frac{\theta}{2} \cdot \left( \widehat{\sigma}_{v|\Delta x_1}^2 - \frac{\theta}{2} \cdot \widehat{\sigma}_{s|\Delta x_1}^2 \right) } \end{align*}

Then, just iterate on the initial guess numerically until you find that:

(30)   \begin{align*} \widehat{\beta}_0 &= \frac{\theta}{2 \cdot \widehat{\lambda}_0} \cdot \left( \frac{1 - 2 \cdot \widehat{\omega}_1 \cdot \widehat{\lambda}_0}{1 - \widehat{\omega}_1 \cdot \widehat{\lambda}_0} \right) \end{align*}

since we know that \beta_0 must satisfy this condition in equilibrium.
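Here’s a minimal sketch of this fixed-point iteration, with hypothetical primitives and a damped update for stability; it is just an illustration of the recipe in Equations (29) and (30), not the linked code below:

```python
import numpy as np

# Hypothetical primitives.
sigma_v, sigma_eps, sigma_dz = 1.0, 0.5, 1.0
sigma_s2 = sigma_v**2 + sigma_eps**2
theta = sigma_v**2 / sigma_s2

b0 = 1.0                                    # initial guess for beta_0
for _ in range(1_000):
    denom = b0**2 * sigma_s2 + sigma_dz**2
    lam0  = b0 * sigma_v**2 / denom                               # Eq. (29)
    var_v = (b0**2 * sigma_eps**2 + sigma_dz**2) / denom * sigma_v**2
    var_s = sigma_dz**2 / denom * sigma_s2
    lam1  = np.sqrt(theta / 2 * (var_v - theta / 2 * var_s)) / sigma_dz
    om1   = 1 / (4 * lam1)                  # omega_1 from Equation (28)
    b0_target = theta / (2 * lam0) * (1 - 2 * om1 * lam0) / (1 - om1 * lam0)
    b0 = 0.5 * b0 + 0.5 * b0_target         # damped update toward Eq. (30)

print(b0, lam0, lam1)                       # equilibrium coefficients
```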

The figure below plots the coefficient values at various levels of noise trader demand and signal error for inspection. Here is the code. The informed traders are more aggressive when there is more noise trader demand (i.e., moving across panels from left to right) and in the second trading period (i.e., blue vs red). They trade less aggressively as their signal quality degrades (i.e., moving within panel from left to right).

[Figure: 2-period Kyle model solution, 11 Aug 2014]

Fano’s Inequality and Resource Allocation

1. Motivation

This post describes Fano’s inequality. It’s not a particularly complicated result. After all, it first shows up on page 33 of Cover and Thomas (1991). However, I recently ran across the result again for the first time in a while, and I realized it had an interesting asset pricing implication.

Roughly speaking, what does the inequality say? Suppose I need to make some decision, and you give me some news that helps me decide. Fano’s inequality gives a lower bound on the probability that I end up making the wrong choice as a function of my initial uncertainty and how informative your news was. What’s cool about the result is that it doesn’t place any restrictions on how I make my decision. i.e., it gives a lower bound on my best-case error probability. If the bound is negative, then in principle I might be able to eliminate my decision error. If the bound is positive (i.e., binds), then there is no way for me to use the news you gave me to always make the right decision.

Now, back to asset pricing. We want accurate prices so that, in the words of Fama (1970), they can serve as “signals for resource allocation.” If we treat resource allocation as a discrete choice problem and prices as news, then Fano’s inequality applies and gives bounds on how effectively decision makers can use this information.

2. Notation

I start by laying out the notation. Imagine that a decision maker wants to predict the value of a random variable \widetilde{X} that can take on N possible values:

(1)   \begin{align*} \widetilde{X} \in \{ x_1,x_2,\ldots,x_N \} \end{align*}

e.g., you might think about the decision maker as a farmer and \widetilde{X} as the most profitable crop he can plant next fall. The probability that \widetilde{X} takes on each of the N values is given by:

(2)   \begin{align*} \mathrm{Pr}[\widetilde{X} = x_n] = p_n \end{align*}

Finally, I use the \mathrm{H}[\cdot] operator to denote the entropy of a random variable:

(3)   \begin{align*} \mathrm{H}[\widetilde{X}] &= - \sum_{n=1}^N p_n \cdot \log_2(p_n) \end{align*}
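Since everything below is measured in bits, here’s a small helper sketch for computing the entropy of a discrete distribution:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy, in bits, of a discrete distribution (Equation 3)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                    # use the convention 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))

print(entropy_bits([1/4, 1/4, 1/4, 1/4]))   # 2.0 bits: uniform 4-way choice
```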

3. Main Result

Now, imagine that the farmer knows which crop currently has the highest futures price, \widetilde{Y}, and that this price signal is correlated with the correct choice of which crop to plant:

(4)   \begin{align*} \mathrm{Cor}[\widetilde{X},\widetilde{Y}] \neq 0 \end{align*}

The farmer could use this information to make an educated guess about the right crop to plant:

(5)   \begin{align*} f(\widetilde{Y}) \in \{ x_1, x_2, \ldots, x_N\} \end{align*}

e.g., his rule might be something simple like, “Plant the crop with the highest futures price today.” Or, it might be something more complicated like, “Plant the crop with the highest futures price today unless it’s corn in which case plant soy beans.” I am agnostic about what function f(\cdot) the farmer uses to turn price signals into crop decisions. Let \widetilde{Z} denote whether or not he got the decision right though:

(6)   \begin{align*} \widetilde{Z} &= \begin{cases} 0 &\text{if } f(\widetilde{Y}) = \widetilde{X} \\ 1 &\text{else } \end{cases} \end{align*}

Fano’s inequality links the probability that the farmer makes the wrong crop choice, \mathrm{E}[\widetilde{Z}], to his remaining entropy after seeing the price signals, \mathrm{H}[\widetilde{X}|\widetilde{Y}]:

(7)   \begin{align*} 1 + \mathrm{E}[\widetilde{Z}] \cdot \log_2(N) \geq \mathrm{H}[\widetilde{X}|\widetilde{Y}] \end{align*}

4. Quick Proof

The result follows from applying the entropy chain rule in 2 different ways. Let’s think about the entropy of the joint distribution of errors and crop choices, (\widetilde{Z},\widetilde{X}), after the farmer sees the price signal, \widetilde{Y}. The entropy chain rule says that we can rewrite this quantity as:

(8)   \begin{align*} \mathrm{H}[\widetilde{Z},\widetilde{X}|\widetilde{Y}] &= \mathrm{H}[\widetilde{X}|\widetilde{Y}] + \underbrace{\mathrm{H}[\widetilde{Z}|\widetilde{X},\widetilde{Y}]}_{=0} \end{align*}

where the second term on the right-hand side is 0 since if you know the correct crop choice you will never make an error. Yet, we can also rewrite \mathrm{H}[\widetilde{Z},\widetilde{X}|\widetilde{Y}] as follows using the exact same chain rule:

(9)   \begin{align*} \mathrm{H}[\widetilde{Z},\widetilde{X}|\widetilde{Y}] &= \mathrm{H}[\widetilde{Z}|\widetilde{Y}] + \mathrm{H}[\widetilde{X}|\widetilde{Z},\widetilde{Y}] \end{align*}

It’s not like either \widetilde{Z} or \widetilde{X} has a privileged position in the joint distribution (\widetilde{Z},\widetilde{X})!

Applying the chain rule in 2 ways then leaves us with the equation:

(10)   \begin{align*} \mathrm{H}[\widetilde{Z}|\widetilde{Y}] + \mathrm{H}[\widetilde{X}|\widetilde{Z},\widetilde{Y}] & = \mathrm{H}[\widetilde{X}|\widetilde{Y}] \end{align*}

The first term on the left-hand side is bounded above by:

(11)   \begin{align*} \mathrm{H}[\widetilde{Z}|\widetilde{Y}] \leq \mathrm{H}[\widetilde{Z}] \leq 1 \end{align*}

since conditioning on a random variable weakly lowers entropy and a binary choice variable has at most 1 bit of information. Rewriting the second term on the left-hand side as follows:

(12)   \begin{align*} \mathrm{H}[\widetilde{X}|\widetilde{Z},\widetilde{Y}] &= \mathrm{Pr}[\widetilde{Z} = 0] \cdot \underbrace{\mathrm{H}[\widetilde{X}|\widetilde{Z} = 0,\widetilde{Y}]}_{=0} + \mathrm{Pr}[\widetilde{Z} = 1] \cdot \mathrm{H}[\widetilde{X}|\widetilde{Z} = 1,\widetilde{Y}] \end{align*}

then gives the desired result since the uniform distribution maximizes a discrete variable’s entropy:

(13)   \begin{align*} \mathrm{H}[\widetilde{X}|\widetilde{Z} = 1,\widetilde{Y}] \leq \log_2(N - 1) \leq \log_2(N) \end{align*}

5. Application

Now let’s consider an application. Suppose that the farmer can plant N=4 different crops: 1) corn, 2) wheat, 3) soy, and 4) rice. Let \widetilde{X} denote the most profitable of these crops to plant, and let \widetilde{Y} denote the crop with the highest current futures price. Suppose the choice and price variables have the following joint distribution:

(14)   \begin{align*} \bordermatrix{~   & x_1           & x_2           & x_3           & x_4           \cr               y_1 & \sfrac{1}{8}  & \sfrac{1}{16} & \sfrac{1}{32} & \sfrac{1}{32} \cr               y_2 & \sfrac{1}{16} & \sfrac{1}{8}  & \sfrac{1}{32} & \sfrac{1}{32} \cr               y_3 & \sfrac{1}{16} & \sfrac{1}{16} & \sfrac{1}{16} & \sfrac{1}{16} \cr               y_4 & \sfrac{1}{4}  & 0             & 0             & 0             \cr} \end{align*}

e.g., this table reads that 25{\scriptstyle \%} of the time rice has the highest futures contract price, and conditional on rice having the highest futures price the farmer should always plant corn the following year. In this world, the conditional entropy of the farmer’s decision after seeing the price signal is given by:

(15)   \begin{align*} \mathrm{H}[\widetilde{X}|\widetilde{Y}] &= \sum_{n=1}^4 \mathrm{Pr}[\widetilde{Y} = y_n] \cdot \mathrm{H}[\widetilde{X}|\widetilde{Y} = y_n] = \sfrac{11}{8} \end{align*}

in units of bits.

Here’s the punchline. If we rearrange Fano’s inequality to isolate the error rate on the left-hand side, we see that there is no way for the farmer to plant the right crop more than \sfrac{13}{16} \approx 81{\scriptstyle \%} of the time:

(16)   \begin{align*} \mathrm{E}[Z]  \geq \frac{\mathrm{H}[X|Y] - 1}{\log_2(N)} =    \frac{\sfrac{11}{8} - 1}{\log_2(4)}  =    \frac{3}{16} \end{align*}

What’s more, this result is independent of how the farmer incorporates the price information.
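The whole example fits in a few lines of code. A sketch reproducing Equations (14) through (16):

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint distribution Pr[Y = y_m, X = x_n] from Equation (14).
P = np.array([
    [1/8,  1/16, 1/32, 1/32],
    [1/16, 1/8,  1/32, 1/32],
    [1/16, 1/16, 1/16, 1/16],
    [1/4,  0.0,  0.0,  0.0 ],
])

p_y = P.sum(axis=1)
H_X_given_Y = sum(p_y[m] * entropy_bits(P[m] / p_y[m]) for m in range(4))
print(H_X_given_Y)                        # 11/8 bits (Equation 15)
print((H_X_given_Y - 1) / np.log2(4))     # Fano bound: 3/16 (Equation 16)
```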

Wavelet Variance

1. Motivation

Imagine you’re a trader who’s about to put on a position for the next month. You want to hedge away the risk in this position associated with daily fluctuations in market returns. One way that you might do this would be to short the S&P 500 since E-mini contracts are some of the most liquid in the world.

[Figure: S&P 500 E-mini price and volume, 24 Jul 2014]

[Figure: minute-by-minute E-mini S&P 500 price on May 6th, 2010]

But… how much of the variation in the index’s returns is due to fluctuations at the daily horizon? e.g., the blue line in the figure to the right shows the minute-by-minute price of the E-mini contract on May 6th, 2010 during the flash crash. Over the course of 4 minutes, the contract price fell 3{\scriptstyle \%}! It then rebounded back to nearly its original position over the next hour. Clearly, if most of the fluctuations in the E-mini S&P 500 contract’s value are due to shocks on the sub-hour time scale, this contract will do a poor job hedging away daily market risk.

This post demonstrates how to decompose the variance of a time series (e.g., the minute-by-minute returns on the E-mini) into horizon-specific components using wavelets. i.e., using the wavelet variance estimator allows you to ask the questions: “How much of the variance is coming from fluctuations on the scale of 16 minutes? 1 hour? 1 day? 1 month?” I then investigate how this wavelet variance approach compares to other methods financial economists might employ such as auto-regressions and spectral analysis.

2. Wavelet Analysis

In order to explain how the wavelet variance estimator works, I first need to give a quick outline of how wavelets work. Wavelets allow you to decompose a signal into components that are independent in both the time and frequency domains. This outline will be as bare bones as possible. See Percival and Walden (2000) for an excellent overview of the topic.

Imagine you’ve got a time series of just T = 8 returns:

(1)   \begin{align*} \mathbf{r} = \begin{bmatrix} r_0 & r_1 & r_2 & r_3 & r_4 & r_5 & r_6 & r_7 \end{bmatrix}^{\top} \end{align*}

and assume for simplicity that these returns have mean \mathrm{E}[r_t] = \mu_r = 0. One thing that you might do with this time series is estimate a regression with time fixed effects: r_t = \sum_{t'=0}^7 \vartheta_{t'} \cdot 1_{\{\mathrm{Time}(r_t) = t'\}}. Here is another way to represent the same regression:

(2)   \begin{align*} \begin{bmatrix} r_0 \\ r_1 \\ r_2 \\ r_3 \\ r_4 \\ r_5 \\ r_6 \\ r_7 \end{bmatrix} &= \begin{pmatrix}  1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\  0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\  0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\  0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\  0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\  0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\  0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\  0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix} \begin{bmatrix} \vartheta_0 \\ \vartheta_1 \\ \vartheta_2 \\ \vartheta_3 \\ \vartheta_4 \\ \vartheta_5 \\ \vartheta_6 \\ \vartheta_7 \end{bmatrix} \end{align*}

It’s really a trivial projection since \vartheta_t = r_t. Call the projection matrix \mathbf{F} for “fixed effects” so that \mathbf{r} = \mathbf{F}{\boldsymbol \vartheta}.

Obviously, the above time fixed effects model would be a bit of a silly thing to estimate, but notice that the projection matrix \mathbf{F} has an interesting property. Namely, its columns are orthonormal:

(3)   \begin{align*}  \langle \mathbf{f}(t) | \mathbf{f}(t') \rangle = \begin{cases} 1 &\text{if } t = t' \\ 0 &\text{else } \end{cases} \end{align*}

It’s orthogonal because \langle \mathbf{f}(t) | \mathbf{f}(t') \rangle = 0 unless t = t'. This requirement implies that each column in the projection matrix is picking up different information about \mathbf{r}. It’s normal because \langle \mathbf{f}(t) | \mathbf{f}(t) \rangle is normalized to equal 1. This requirement implies that the projection matrix leaves the magnitude of \mathbf{r} unchanged. The time fixed effects projection matrix, \mathbf{F}, simply picks out each time period separately, but you can also think about using other orthonormal bases.

e.g., the Haar wavelet projection matrix compares how the 1st half of the time series differs from the 2nd half, how the 1st quarter differs from the 2nd quarter, how the 3rd quarter differs from the 4th quarter, how the 1st eighth differs from the 2nd eighth, and so on… For the 8 period return time series, let’s denote the columns of the wavelet projection matrix as:

(4)   \begin{align*} \mathbf{w}(3,0) &= \sfrac{1}{\sqrt{8}} \cdot \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix}^{\top} \\ \mathbf{w}(2,0) &= \sfrac{1}{\sqrt{8}} \cdot \begin{bmatrix} 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \end{bmatrix}^{\top} \\ \mathbf{w}(1,0) &= \sfrac{1}{\sqrt{4}} \cdot \begin{bmatrix} 1 & 1 & -1 & -1 & 0 & 0 & 0 & 0 \end{bmatrix}^{\top} \\ \mathbf{w}(1,1) &= \sfrac{1}{\sqrt{4}} \cdot \begin{bmatrix} 0 & 0 & 0 & 0 & 1 & 1 & -1 & -1 \end{bmatrix}^{\top} \\ \mathbf{w}(0,0) &= \sfrac{1}{\sqrt{2}} \cdot \begin{bmatrix} 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}^{\top} \\ \mathbf{w}(0,1) &= \sfrac{1}{\sqrt{2}} \cdot \begin{bmatrix} 0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 \end{bmatrix}^{\top} \\ \mathbf{w}(0,2) &= \sfrac{1}{\sqrt{2}} \cdot \begin{bmatrix} 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \end{bmatrix}^{\top} \\ \mathbf{w}(0,3) &= \sfrac{1}{\sqrt{2}} \cdot \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 \end{bmatrix}^{\top} \end{align*}

and simple inspection shows that these columns are orthonormal:

(5)   \begin{align*}  \langle \mathbf{w}(h,i) | \mathbf{w}(h',i') \rangle = \begin{cases} 1 &\text{if } h = h', \; i = i' \\ 0 &\text{else } \end{cases} \end{align*}

Let’s look at a concrete example. Suppose that we want to project the vector:

(6)   \begin{align*} \mathbf{r} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}^{\top} \end{align*}

onto the wavelet basis:

(7)   \begin{align*} \begin{bmatrix} r_0 \\ r_1 \\ r_2 \\ r_3 \\ r_4 \\ r_5 \\ r_6 \\ r_7 \end{bmatrix} &= \begin{pmatrix}  \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{8}}  & \sfrac{1}{\sqrt{4}}  & 0 & \sfrac{1}{\sqrt{2}} & 0 & 0 & 0 \\  \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{8}}  & \sfrac{1}{\sqrt{4}}  & 0 & -\sfrac{1}{\sqrt{2}} & 0 & 0 & 0 \\  \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{8}}  & -\sfrac{1}{\sqrt{4}} & 0 & 0 & \sfrac{1}{\sqrt{2}} & 0 & 0 \\  \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{8}}  & -\sfrac{1}{\sqrt{4}} & 0 & 0 & -\sfrac{1}{\sqrt{2}} & 0 & 0 \\  \sfrac{1}{\sqrt{8}} & -\sfrac{1}{\sqrt{8}} & 0 & \sfrac{1}{\sqrt{4}}  & 0 & 0 & \sfrac{1}{\sqrt{2}} & 0 \\  \sfrac{1}{\sqrt{8}} & -\sfrac{1}{\sqrt{8}} & 0 & \sfrac{1}{\sqrt{4}}  & 0 & 0 & -\sfrac{1}{\sqrt{2}} & 0 \\  \sfrac{1}{\sqrt{8}} & -\sfrac{1}{\sqrt{8}} & 0 & -\sfrac{1}{\sqrt{4}} & 0 & 0 & 0 & \sfrac{1}{\sqrt{2}} \\  \sfrac{1}{\sqrt{8}} & -\sfrac{1}{\sqrt{8}} & 0 & -\sfrac{1}{\sqrt{4}} & 0 & 0 & 0 & -\sfrac{1}{\sqrt{2}} \end{pmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \\ \theta_4 \\ \theta_5 \\ \theta_6 \\ \theta_7 \end{bmatrix} \end{align*}

What would the wavelet coefficients {\boldsymbol \theta} look like? Well, a little trial and error shows that:

(8)   \begin{align*} {\boldsymbol \theta} = \begin{bmatrix} \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{4}} & 0 & \sfrac{1}{\sqrt{2}} & 0 & 0 & 0 \end{bmatrix}^{\top} \end{align*}

since this is the only combination of coefficients that satisfies both r_0 = 1:

(9)   \begin{align*} 1 &=  r_0 \\ &= \frac{1}{\sqrt{8}} \cdot w_0(3,0) + \frac{1}{\sqrt{8}} \cdot w_0(2,0) + \frac{1}{\sqrt{4}} \cdot w_0(1,0) + \frac{1}{\sqrt{2}} \cdot w_0(0,0) \\ &= \frac{1}{8} + \frac{1}{8} + \frac{1}{4} + \frac{1}{2} \end{align*}

and r_t = 0 for all t > 0.
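To make this concrete, here’s a minimal numpy sketch of my own (not from the original post) that builds the Haar projection matrix from Equation (4), confirms the orthonormality condition in Equation (5), and recovers the coefficients in Equation (8) without any trial and error, since orthonormality implies {\boldsymbol \theta} = \mathbf{W}^{\top}\mathbf{r}:

```python
import numpy as np

# Columns of the Haar projection matrix W, ordered as in Equation (7):
# w(3,0), w(2,0), w(1,0), w(1,1), w(0,0), w(0,1), w(0,2), w(0,3).
W = np.column_stack([
    np.array([1,  1,  1,  1,  1,  1,  1,  1]) / np.sqrt(8),
    np.array([1,  1,  1,  1, -1, -1, -1, -1]) / np.sqrt(8),
    np.array([1,  1, -1, -1,  0,  0,  0,  0]) / np.sqrt(4),
    np.array([0,  0,  0,  0,  1,  1, -1, -1]) / np.sqrt(4),
    np.array([1, -1,  0,  0,  0,  0,  0,  0]) / np.sqrt(2),
    np.array([0,  0,  1, -1,  0,  0,  0,  0]) / np.sqrt(2),
    np.array([0,  0,  0,  0,  1, -1,  0,  0]) / np.sqrt(2),
    np.array([0,  0,  0,  0,  0,  0,  1, -1]) / np.sqrt(2),
])

assert np.allclose(W.T @ W, np.eye(8))    # Equation (5): the basis is orthonormal

r = np.array([1.0, 0, 0, 0, 0, 0, 0, 0])  # the vector from Equation (6)
theta = W.T @ r                           # project onto the wavelet basis
print(theta)  # [1/sqrt(8), 1/sqrt(8), 1/sqrt(4), 0, 1/sqrt(2), 0, 0, 0]
assert np.allclose(W @ theta, r)          # Equation (7): r = W theta
```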

What’s cool about the wavelet projection is that the coefficients represent effects that are isolated in both the frequency and time domains. The index h=0,1,2,3 denotes the \log_2 length of the wavelet comparison groups. e.g., the 4 wavelets with h=0 compare 2^0 = 1 period increments: the 1st period to the 2nd period, the 3rd period to the 4th period, and so on… Similarly, the wavelets with h=1 compare 2^1 = 2 period increments: the 1st 2 periods to the 2nd 2 periods and the 3rd 2 periods to the 4th 2 periods. Thus, the h captures the location of the coefficient in the frequency domain. The index i=0,\ldots,I_h-1 signifies which comparison group at horizon h we are looking at, where I_h = \sfrac{8}{2^{h+1}} is the number of comparisons at that horizon. e.g., when h=0, there are I_0 = 4 different comparisons to be made. Thus, the i captures the location of the coefficient in the time domain.

3. Wavelet Variance

With these basics in place, it’s now easy to define the wavelet variance of a time series. First, I massage the standard representation of a series’ variance a bit. The variance of our 8 term series is defined as:

(10)   \begin{align*}  \sigma_r^2 &= \frac{1}{T} \cdot \sum_t r_t^2  \end{align*}

since \mu_r = 0. Using the tools from the section above, let’s rewrite \mathbf{r} = \mathbf{W}{\boldsymbol \theta}. This means that the variance formula becomes:

(11)   \begin{align*}  \sigma_r^2 &= \frac{1}{T} \cdot \mathbf{r}^{\top} \mathbf{r} =  \frac{1}{T} \cdot \left( \mathbf{W} {\boldsymbol \theta} \right)^{\top} \left( \mathbf{W} {\boldsymbol \theta} \right) \end{align*}

But I know that \mathbf{W}^{\top} \mathbf{W} = \mathbf{I} since the columns of \mathbf{W} are orthonormal. Thus:

(12)   \begin{align*}  \sigma_r^2 &= \frac{1}{T} \cdot {\boldsymbol \theta}^{\top} {\boldsymbol \theta} = \frac{1}{T} \cdot \sum_{h,i} \theta(h,i)^2 \end{align*}

This representation gives the variance of a series as an average of squared wavelet coefficients.

The sum of the squared wavelet coefficients at each horizon, h, is then an interesting object:

(13)   \begin{align*} V(h) &= \frac{1}{T} \cdot \sum_{i=0}^{I_h - 1} \theta(h,i)^2 \end{align*}

since V(h) captures the portion of the total variance of the time series that comes from comparing successive periods of length 2^h. I refer to V(h) as the wavelet variance of a series at horizon h. The sum of the wavelet variances at each horizon gives the total variance:

(14)   \begin{align*} \sum_{h=0}^H V(h) &= \sigma_r^2 \end{align*}
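Continuing the numpy sketch from above (so \mathbf{W} is already defined), here’s how Equations (12) through (14) look in code for an arbitrary demeaned series; the horizon groupings below just hard-code which columns of \mathbf{W} belong to each h:

```python
# Group the squared wavelet coefficients by horizon. Column 0 of W is the
# scaling vector w(3,0); its coefficient vanishes once the series is demeaned.
T = 8
rng = np.random.default_rng(0)
r = rng.standard_normal(T)
r -= r.mean()                  # enforce mu_r = 0
theta = W.T @ r                # W from the previous sketch

horizons = {0: [4, 5, 6, 7], 1: [2, 3], 2: [1]}   # column indices per h
V = {h: (theta[cols] ** 2).sum() / T for h, cols in horizons.items()}

# Equation (14): the wavelet variances sum to the total variance.
assert np.allclose(sum(V.values()), (r ** 2).mean())
```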

4. Numerical Example

Let’s take a look at how the wavelet variance of a time series behaves out in the wild. Here’s the code I used to create the figures. Specifically, let’s study the simulated data plotted below, which consists of 63 days of minute-by-minute return data with day-specific shocks:

(15)   \begin{align*} r_t &= \mu_{r,t} + \sigma_r \cdot \epsilon_t \qquad \text{with} \qquad \epsilon_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,1) \end{align*}

where the volatility of the process is given by \sigma_r = 10{\scriptstyle \mathrm{bp}/\sqrt{\mathrm{min}}} and there is a 5{\scriptstyle \%} probability of realizing a \mu_{r,t} = \pm 1{\scriptstyle \mathrm{bp}/\mathrm{min}} shock on any given day. The 4 days on which the data realized a shock are highlighted in red. These minute-by-minute figures amount to a 0{\scriptstyle \%/\mathrm{yr}} annualized return and a 31{\scriptstyle \%/\mathrm{yr}} annualized volatility, since \sqrt{252 \times 390} \approx 313 and 10{\scriptstyle \mathrm{bp}} \times 313 \approx 31{\scriptstyle \%}.
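Here’s a sketch of how you might simulate Equation (15). The parameter values follow my reading of the units above, and the seed and implementation details are my own rather than the post’s original code:

```python
import numpy as np

rng = np.random.default_rng(42)
n_days, mins_per_day = 63, 390
sigma_r = 0.0010                              # 10bp per sqrt-minute

# Each day independently draws a +/- 1bp/min drift with 5% probability.
shock = rng.random(n_days) < 0.05
sign = rng.choice([-1.0, 1.0], size=n_days)
mu_day = shock * sign * 0.0001                # 1bp = 0.0001

mu = np.repeat(mu_day, mins_per_day)          # minute-level drift
r = mu + sigma_r * rng.standard_normal(n_days * mins_per_day)

print(r.std() * np.sqrt(252 * mins_per_day))  # roughly 0.31, i.e., 31%/yr
```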

plot--why-use-wavelet-variance--daily-shocks--time-series--25jul2014

The figure below then plots the wavelet coefficients, {\boldsymbol \theta}, at each horizon associated with this time series. A trading day is 6.5 \times 60 = 390{\scriptstyle \mathrm{min}}, so notice the spikes in the coefficient values in the h=6,7,8 panels near the day-specific shock dates, corresponding to comparisons of successive 64, 128, and 256 minute intervals. The remaining variation in the coefficient levels comes from the underlying white noise process \epsilon_t. Because the break points in the wavelet projection affect the estimated coefficients, each data point in the plot actually represents the average of the coefficient estimates \theta_t(h,i) at a given point in time for all possible starting dates. See Percival and Walden (2000, Ch. 5) on the maximal overlap discrete wavelet transform for details.

plot--why-use-wavelet-variance--daily-shocks--wavelet-coefficients--25jul2014

Finally, I plot the \log of the wavelet variance at each horizon h for both the simulated return process (red) and a white noise process with an identical mean and variance (blue). Note that I’ve switched from \log_2 to \log_e on the x-axis here, so a spike in the amount of variance at h=6 corresponds to a spike in the amount of variance explained by successive e^{6} \approx 400{\scriptstyle \mathrm{min}} increments. This is exactly what you’d expect for day-specific shocks, which have a duration of 390{\scriptstyle \mathrm{min}}, as indicated by the vertical gray line. The wavelet variance of an appropriately scaled white noise process gives a nice comparison group. To see why, note that for covariance stationary processes like white noise, the wavelet variance at a particular horizon is related to the power spectrum as follows:

(16)   \begin{align*} V(h) &\approx 2 \cdot \int_{\sfrac{1}{2^{h+1}}}^{\sfrac{1}{2^h}} S(f) \cdot df \end{align*}

Thus, since white noise has a flat spectrum, S(f) = \sigma^2, the integral above scales like the width of the frequency band, \sfrac{1}{2^{h+1}}, and the wavelet variance of white noise should follow a power law with:

(17)   \begin{align*} V(h) &\propto 2^{-h} \end{align*}

giving a nice smooth reference point in plots.
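This power law is easy to verify numerically. Below is a short sketch of my own that computes plain Haar wavelet coefficients on simulated white noise, a simplified stand-in for the maximal overlap estimator used in the figures. The overall constant depends on where you start indexing h, but the halving pattern is what matters:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 2 ** 14
x = rng.standard_normal(T)   # unit-variance white noise

for h in range(6):
    width = 2 ** (h + 1)     # each Haar wavelet at horizon h spans 2^{h+1} periods
    blocks = x[: (T // width) * width].reshape(-1, width)
    half = width // 2
    # Haar coefficient: normalized difference between successive half-blocks.
    theta = (blocks[:, :half].sum(axis=1) - blocks[:, half:].sum(axis=1)) / np.sqrt(width)
    print(h, (theta ** 2).sum() / T)   # halves each time h increases: ~0.5, 0.25, 0.125, ...
```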

plot--why-use-wavelet-variance--daily-shocks--wavelet-variance--25jul2014

5. Comparing Techniques

I conclude by considering how the wavelet variance statistic compares to other ways that a financial economist might look for horizon-specific effects in the data. I consider 2 alternatives: auto-regressive models and spectral density estimators. First, consider estimating the auto-regressive model below with lags \ell = 1,2,\ldots,L:

(18)   \begin{align*} r_t &= \sum_{\ell=1}^L C(\ell) \cdot r_{t-\ell} + \xi_t \qquad \text{where} \qquad \xi_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_{\xi}^2) \end{align*}

The left-most panel of the figure below reports the estimated values of C(\ell) for lags \ell = 1,2,\ldots,420 using the simulated data (red) as well as a scaled white noise process (blue). Just as before, the vertical gray line denotes the number of minutes in a day. There is no meaningful difference between the 2 sets of coefficients. The reason is that the day-specific shocks are asynchronous. They aren’t coming at regular intervals, so no obvious lag structure can emerge from the data.
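For concreteness, here’s one way you might run this regression. This is a sketch, not the post’s code, using plain OLS on the simulated series r from the earlier block:

```python
import numpy as np

# Estimate C(1..L) in Equation (18) by OLS: regress r_t on its own L lags.
# (r is the simulated minute-by-minute series from the earlier sketch.)
L = 420
y = r[L:]
X = np.column_stack([r[L - l : -l] for l in range(1, L + 1)])
C, *_ = np.linalg.lstsq(X, y, rcond=None)

print(C[:5])   # hovers near zero: the asynchronous shocks leave no lag structure
```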

plot--why-use-wavelet-variance--daily-shocks--analysis--25jul2014

Next, let’s think about estimating the spectral density of \mathbf{r}. This turns out to be the exact same exercise as the auto-regressive model estimation in different clothing. As shown in an earlier post, it’s possible to flip back and forth between the coefficients of an \mathrm{AR}(L) process and its spectral density via the relationship:

(19)   \begin{align*} S(f) &= \frac{\sigma_{\xi}^2}{\left| \, 1 - \sum_{\ell=1}^L C(\ell) \cdot e^{-i \cdot 2 \cdot \pi \cdot f \cdot \ell} \, \right|^2} \end{align*}

This one-to-one mapping between the frequency domain and the time domain for covariance stationary processes is known as the Wiener–Khinchin theorem, with \sigma_r^2 = \int_{-\sfrac{1}{2}}^{\sfrac{1}{2}} S(f) \cdot df. Thus, the spectral density plot just reflects the same random noise as the auto-regressive model coefficients because of the same asynchrony issue. The most interesting features of the middle panel occur at really high frequencies, which have nothing to do with the day-specific shocks.
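Continuing the sketch above, here’s how you’d map those fitted AR coefficients into a spectral density via Equation (19); note the squared modulus of the complex AR polynomial in the denominator:

```python
# Innovation variance from the AR fit, then Equation (19) on a grid of
# frequencies f in cycles per minute.
sigma2_xi = np.var(y - X @ C)

f = np.linspace(0.001, 0.5, 500)
ell = np.arange(1, L + 1)
ar_poly = 1 - (C * np.exp(-1j * 2 * np.pi * np.outer(f, ell))).sum(axis=1)
S = sigma2_xi / np.abs(ar_poly) ** 2   # flat, up to noise, for white-noise-like data
```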

Here’s the punchline. The wavelet variance is the only estimator of the 3 that can identify horizon-specific contributions to a time series’ variance when those contributions are not stationary.