WSJ Article Subject Tags

1. Motivation

Screen Shot 2014-07-18 at 5.44.43 PM

This post investigates the distribution of subject tags for Wall Street Journal articles that mention S&P 500 companies. It is an appendix to my paper, Local Knowledge in Financial Markets. The subject tags are essentially article keywords. e.g., a December 2009 article entitled “When Even Your Phone Tells You You’re Drunk, It’s Time to Call a Taxi,” about a new iPhone app that alerted you when you were too drunk to drive, had the meta data shown to the right. I collect every article that references an S&P 500 company over the period from 01/01/2008 to 12/31/2012.

I find that there is substantial heterogeneity in how many different topics people write about when discussing a company even after controlling for the number of total articles. e.g., there were 87 articles in the WSJ referencing Garmin (GRMN) and 81 articles referencing Sprint (S); however, while there were only 87 different subject tags used in the articles about Garmin, there were 716 different subject tags used in the articles about Sprint! This finding is consistent with the idea that some firms face a much wider array of shocks than others. i.e., the width of the market matters.

2. Data Collection

The data are hand-collected from the ProQuest newspaper archive by an RA. The data collection process for an example company, Agilent Technologies (A), is summarized in the 3 figures below. First, we searched for each company included in the S&P 500 from 01/01/2008 to 12/31/2012 [list]. Then, after each query, we restricted the results to articles found in the WSJ. Finally, we downloaded the articles and meta data in HTML format.

After the RA collected all of the data, I used a Python script to parse the resulting HTML files into a form I could manage in R. Roughly 4000 of the downloaded articles were duplicates resulting from the WSJ publishing the same article in different editions. I identify these observations by checking for articles published on the same day with identical word counts about the same companies. I tried using Selenium to automate the data collection process, but the ProQuest web interface proved too finicky.
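
The duplicate check described above can be sketched in a few lines of Python. This is only an illustration under assumed inputs, not the script used for this post: the record fields `date`, `word_count`, and `tickers` are hypothetical names for the parsed meta data.

```python
def drop_duplicates(articles):
    """Keep the first article for each (date, word count, companies) key."""
    seen = set()
    unique = []
    for article in articles:
        # frozenset makes the company list hashable and order-insensitive.
        key = (article["date"], article["word_count"], frozenset(article["tickers"]))
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique


# Hypothetical records: the second one mimics a reprint in another edition.
articles = [
    {"date": "2009-12-01", "word_count": 812, "tickers": ["AAPL"]},
    {"date": "2009-12-01", "word_count": 812, "tickers": ["AAPL"]},
    {"date": "2009-12-01", "word_count": 640, "tickers": ["AAPL", "S"]},
]
print(len(drop_duplicates(articles)))
```

Matching on word counts is a heuristic, so two genuinely different articles with the same date, length, and companies would be merged by this rule.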

3. Summary Statistics

My data set contains 106{\scriptstyle \mathrm{k}} articles over 5 years about 542 companies. Many articles reference multiple S&P 500 companies. The figure below plots the total number of articles in the database per month. There is a steady downward trend. The first part of the sample was the height of the financial crisis, so this makes sense. As markets have calmed down, journalists have devoted fewer articles to corporate news relative to other things such as politics and sports.

plot--wsj-articles-about-sp500-companies-per-month--18jul2014

Articles are not evenly distributed across companies as shown by the figure below. While the median company is only referenced in 21 articles over the sample period, the 5 most popular companies (United Parcel Service [UPS], Apple [AAPL], Goldman Sachs [GS], Citibank [C], and Ford [F]) are all referenced in at least 1922 different articles apiece. By comparison, the least popular 1{\scriptstyle \%} of companies are mentioned in only 1 article in 5 years.

plot--articles-per-firm--18jul2014

Counting subject tags is a bit less straightforward than counting articles. I do not count tags that are specific to the WSJ rather than the company. e.g., tags containing “(wsj)” flagging daily features like “Abreast of the market (wsj).” I also remove missing subjects, since the meta data for an article sometimes doesn’t contain any subject information. After restrictions, the data contain 10{\scriptstyle \mathrm{k}} unique subject tags.

The distribution of subject tag counts per month is similar to that of article counts as shown in the figure below but with a less pronounced downward trend. To create this figure, I count the number of unique subject tags used each month. e.g., if “technology shock” is used 2 times in Jan 2008, then this counts as 1 of the 1591 tags used in this month; whereas, if “technology shock” is then used again on Feb 1st 2008, then I count this 3rd observation towards the total in February. Thus, the sum of the points in the time series will exceed 10{\scriptstyle \mathrm{k}}. Also, note that different articles can have identical subject tags.

plot--wsj-subjects-about-sp500-companies-per-month--18jul2014
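
The monthly counting rule can be sketched as follows. This is a minimal illustration rather than the code behind the figure, and the field names `date` and `subjects` are assumptions about how the parsed articles are stored.

```python
from collections import defaultdict


def unique_tags_per_month(articles):
    """Count the distinct subject tags used in each calendar month."""
    tags_by_month = defaultdict(set)
    for article in articles:
        month = article["date"][:7]  # "YYYY-MM" prefix of an ISO date string
        tags_by_month[month].update(article["subjects"])
    return {month: len(tags) for month, tags in sorted(tags_by_month.items())}


# Hypothetical records mirroring the "technology shock" example in the text.
articles = [
    {"date": "2008-01-03", "subjects": ["technology shock", "earnings"]},
    {"date": "2008-01-17", "subjects": ["technology shock"]},
    {"date": "2008-02-01", "subjects": ["technology shock"]},
]
counts = unique_tags_per_month(articles)
print(counts)
```

Because a tag is counted once in every month in which it appears, the monthly counts add up to more than the number of unique tags in the whole sample.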

As shown in the figure below, the distribution of subject tags used to describe articles about each company is less skewed than the actual article count for each company. There are 179 different subject tags used in the 21 articles about the median S&P 500 company during the sample period. The most tagged companies have 10 times as many subjects as the median firm; whereas, the most written about companies are referenced in 100 times as many articles as the median firm.

plot--subject-tags-per-firm--18jul2014

4. Articles per Tag

In order for the distribution of tags per company to be less skewed than the distribution of articles per company, it’s got to be the case that some tags are used in lots of articles. This is exactly what’s going on in the data. The figure below shows that the median subject tag is used in only 3 articles and the bottom 25{\scriptstyle \%} of tags are used in only 1 article; however, the top 1{\scriptstyle \%} of tags are used in 466 articles or more. e.g., there are roughly 100 tags out of the 10{\scriptstyle \mathrm{k}} unique subject tags in my data set that are used 500 times or more. Likewise, there are well over 3000 that are used only once!

plot--articles-per-subject-tag--18jul2014
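
For readers who want to reproduce this kind of tabulation, here is a rough sketch of how articles per tag could be counted. The `subjects` field and the sample records are hypothetical, and each tag is counted at most once per article.

```python
from collections import Counter


def articles_per_tag(articles):
    """Number of articles in which each subject tag appears."""
    counts = Counter()
    for article in articles:
        counts.update(set(article["subjects"]))  # one count per article
    return counts


# Hypothetical records for illustration only.
articles = [
    {"subjects": ["earnings", "technology shock"]},
    {"subjects": ["earnings"]},
    {"subjects": ["earnings", "product recall"]},
]
counts = articles_per_tag(articles)
used_once = sum(1 for c in counts.values() if c == 1)
print(counts.most_common(1), used_once / len(counts))
```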

This fact strongly supports the intuition that companies, even huge companies like those in the S&P 500, are constantly hit with new and different shocks. Traders have to figure out which aspect of the company matters. This is clearly not an easy problem to solve. Lots of ideas are thrown around. Many of them must be either short lived or wrong. Roughly 1 out of every 4 topics worth discussing is only worth discussing once.

5. Coverage Depth

I conclude this post by looking at the variation in the number of subject tags across firms with a similar number of articles. e.g., I want to know if there are pairs of firms which journalists spend roughly the same amount of time talking about, but which get covered in very different ways. It turns out there are. The Garmin and Sprint example from the introduction is one such case. The figure below shows that there are many more. i.e., it shows that companies that are referenced in more articles also have more subject tag descriptors, but conditional on the number of articles there is still a lot of variation. The plot is on a \log_{10} \times \log_{10} scale, so a 1 tick vertical movement means a factor of 10 difference between the number of tags for 2 firms with similar article counts. Looking at the figure, it’s clear that this sort of variation is the norm.

plot--articles-vs-subjects--18jul2014
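
The two quantities on the axes of this plot could be computed per firm along the following lines. This is a sketch under assumed inputs: the fields `tickers` and `subjects`, and the sample records, are hypothetical, and each ticker is assumed to appear at most once per article.

```python
from collections import defaultdict


def coverage_by_firm(articles):
    """Map each ticker to (number of articles, number of unique subject tags)."""
    n_articles = defaultdict(int)
    tags = defaultdict(set)
    for article in articles:
        # An article that references several companies counts for each of them.
        for ticker in article["tickers"]:
            n_articles[ticker] += 1
            tags[ticker].update(article["subjects"])
    return {t: (n_articles[t], len(tags[t])) for t in n_articles}


# Hypothetical records: two firms with the same article count but different breadth.
articles = [
    {"tickers": ["GRMN"], "subjects": ["gps", "earnings"]},
    {"tickers": ["GRMN"], "subjects": ["gps"]},
    {"tickers": ["S"], "subjects": ["mergers", "earnings"]},
    {"tickers": ["S", "AAPL"], "subjects": ["smartphones", "pricing"]},
]
coverage = coverage_by_firm(articles)
print(coverage)
```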

Randomized Market Trials

1. Motivation

How much can traders learn from past price signals? It depends on what kind of assets sell. Suppose that returns are (in part) a function of K = \Vert {\boldsymbol \alpha} \Vert_{\ell_0} different feature-specific shocks:

(1)   \begin{align*} r_n &= \sum_{q=1}^Q \alpha_q \cdot x_{n,q} + \epsilon_n \qquad \text{with} \qquad \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_{\epsilon}^2) \end{align*}

If {\boldsymbol \alpha} is identifiable, then different values of {\boldsymbol \alpha} have to produce different values of r_n. This is only the case if assets are sufficiently different from one another. e.g., consider the analogy to randomized control trials. In an RCT, randomizing which subjects get thrown in the treatment and control groups makes it exceptionally unlikely that, say, all the people in the treatment group will by chance all have some other common trait that actually explains their outcomes. Similarly, randomizing which assets get sold makes it exceptionally unlikely that 2 different choices of {\boldsymbol \alpha} and {\boldsymbol \alpha}' can explain the observed returns.

This post sketches a quick model relating this problem to housing prices. To illustrate, imagine N = 4 houses have sold at a discount in a neighborhood that looks like this:

tract-housing

The shock might reflect a structural change in the vacation home market whereby there is less disposable income to buy high end units—i.e., a permanent shift. Alternatively, the shock might have been due to a couple of out-of-town second house buyers needing to sell quickly—i.e., a transient effect. The houses in the picture above are all vacation homes of a similar quality with owners living in LA. Since there is so little variation across units, both these explanations are observationally equivalent. Thus, the asset composition affects how informative prices are in an important way. The main empirical prediction is that in places with less variation in housing amenities, there should be more price momentum since it’s harder to distinguish between noise and amenity-specific value shocks.

2. Toy Model

Suppose you’ve seen N sales in the area. Most of the prices looked just about right, but some of the houses sold for a bit more than you would have expected and some sold for a bit less than you would have expected. You’re trying to decide whether or not to buy the (N+1)th house if the transaction costs are \mathdollar c today:

(2)   \begin{align*} U &= \max_{\{\text{Buy},\text{Don't}\}} \left\{ \, \mathrm{E}\left[ r_{N+1} \right] - \frac{\gamma}{2} \cdot \mathrm{Var}\left[ r_{N+1} \right] - c, \, 0 \, \right\} \end{align*}

You will buy the house if your risk adjusted expectation of its future returns exceeds the transaction costs, \mathrm{E}[r_{N+1}] - \sfrac{\gamma}{2} \cdot \mathrm{Var}[r_{N+1}] \geq c.

This problem hinges on your ability to estimate {\boldsymbol \alpha}. What’s the best you could ever hope to do? Well, suppose you knew which K features mattered ahead of time and the elements of \mathbf{X} were given by x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sfrac{1}{K}). In this setting, your average estimation error per relevant feature is given by:

(3)   \begin{align*} \Omega^\star = \mathrm{E}\left[ \, \frac{1}{K} \cdot \sum_{q=1}^Q \left( \widehat{\alpha}_q - \alpha_q \right)^2 \, \right] &= \frac{K \cdot \sigma_{\epsilon}^2}{N} \end{align*}

i.e., it’s as if you ran an OLS regression of the N price changes on the K relevant columns of \mathbf{X}. You will buy the house if:

(4)   \begin{align*} \mathbf{x}_{N+1}^{\top} \widehat{\boldsymbol \alpha} - \frac{\gamma}{2} \cdot \left( \frac{K + N}{N}  \right) \cdot \sigma_{\epsilon}^2 &\geq c \end{align*}
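
To make the decision rule in Equation (4) concrete, here is a small sketch. The function name and all of the numbers are illustrative assumptions; `x_new` is taken to hold only the K relevant features of the (N+1)th house.

```python
def buy_house(x_new, alpha_hat, gamma, sigma_eps, n_sales, cost):
    """Equation (4): buy if the risk-adjusted expected return covers the cost."""
    k = len(alpha_hat)
    expected_return = sum(x * a for x, a in zip(x_new, alpha_hat))
    # Predictive variance: estimation error plus noise, ((K + N) / N) * sigma^2.
    predictive_var = (k + n_sales) / n_sales * sigma_eps ** 2
    return expected_return - 0.5 * gamma * predictive_var >= cost


# Hypothetical inputs: K = 2 relevant features and N = 20 past sales.
x_new = [0.5, 0.5]
alpha_hat = [0.10, 0.06]
print(buy_house(x_new, alpha_hat, gamma=2.0, sigma_eps=0.05, n_sales=20, cost=0.05))
print(buy_house(x_new, alpha_hat, gamma=2.0, sigma_eps=0.05, n_sales=20, cost=0.08))
```

With these inputs the expected return is 0.08 and the variance penalty is small, so the outcome is driven almost entirely by the transaction cost.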

In the real world, however, you generally don’t know which K features are important ahead of time and each house’s amenities are not taken as an iid draw. Instead, you must solve an \ell_1-type inference problem:

(5)   \begin{align*} \widehat{\boldsymbol \alpha} &= \arg \min_{\boldsymbol \alpha} \sum_{n=1}^N \left( r_n - \mathbf{x}_n^{\top} {\boldsymbol \alpha} \right)^2 \qquad \text{s.t.} \qquad \left\Vert {\boldsymbol \alpha} \right\Vert_{\ell_1} \leq \lambda \cdot \sigma_{\epsilon} \end{align*}

with a correlated measurement matrix, \mathbf{X}, using something like LASSO. In this setting, you face feature selection risk. i.e., you might focus on the wrong causal explanation for the past price movements. If \Omega^{\perp} denotes your estimation error when each of the elements x_{n,q} are drawn independently and \Omega denotes your estimation error in the general case when \rho(x_{n,q},x_{n',q}) \neq 0, then:

(6)   \begin{align*} \Omega^{\star} \leq \Omega^{\perp} \leq \Omega \end{align*}

Since your estimate of \widehat{\boldsymbol \alpha} is unbiased, feature selection risk will simply increase \mathrm{Var}[r_{N+1}] making it less likely that you will buy the house in this stylized model:

(7)   \begin{align*} \mathbf{x}_{N+1}^{\top} \widehat{\boldsymbol \alpha} - \frac{\gamma}{2} \cdot \left( K \cdot \Omega + \sigma_{\epsilon}^2 \right) &\geq c \end{align*}

More generally, it will make prices slower to respond to shocks and allow for momentum.

3. Matrix Coherence

Feature selection risk is worst when assets all have really correlated features. Let \mathbf{X} denote the (N \times Q)-dimensional measurement matrix containing all the features of the N houses that have already sold in the market:

(8)   \begin{align*} \mathbf{X} &= \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,Q} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,Q} \\ \vdots  & \vdots  & \ddots & \vdots  \\ x_{N,1} & x_{N,2} & \cdots & x_{N,Q} \\ \end{bmatrix} \end{align*}

Each row represents all of the features of the nth house, and each column represents the level to which the N assets display a single feature. Let \widetilde{\mathbf{x}}_q denote a unit-normed column from this measurement matrix:

(9)   \begin{align*} \widetilde{\mathbf{x}}_q &= \frac{\mathbf{x}_q}{\sqrt{\sum_{n=1}^N x_{n,q}^2}} \end{align*}

I use a measure of the coherence of \mathbf{X} to quantify the extent to which all of the assets in a market have similar features.

(10)   \begin{align*} \mu(\mathbf{X}) &= \max_{q \neq q'} \left\vert \left\langle \widetilde{\mathbf{x}}_q, \widetilde{\mathbf{x}}_{q'} \right\rangle \right\vert \end{align*}

e.g., the coherence of a matrix with x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sfrac{1}{N}) is roughly \sqrt{2 \cdot \log(Q)/N} corresponding to the red line in the figure below. As the correlation between elements in the same column increases, the coherence increases since different terms in the above inner product are less likely to cancel out.

plot--mutual-coherence-gaussian-matrix--15jul2014
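
The coherence in Equation (10) is easy to compute directly. The sketch below uses only the Python standard library and is not the code behind the figure; the dimensions N = 200 and Q = 50 and the random seed are arbitrary choices, and the matrix is assumed to have no all-zero columns.

```python
import math
import random


def coherence(X):
    """Equation (10): largest absolute inner product between unit-normed columns."""
    n_rows, n_cols = len(X), len(X[0])
    cols = []
    for q in range(n_cols):
        col = [X[n][q] for n in range(n_rows)]
        norm = math.sqrt(sum(v * v for v in col))  # Equation (9) denominator
        cols.append([v / norm for v in col])
    mu = 0.0
    for q in range(n_cols):
        for p in range(q + 1, n_cols):
            inner = abs(sum(a * b for a, b in zip(cols[q], cols[p])))
            mu = max(mu, inner)
    return mu


random.seed(0)
N, Q = 200, 50
X = [[random.gauss(0.0, math.sqrt(1.0 / N)) for _ in range(Q)] for _ in range(N)]
# Compare the simulated coherence with the rough approximation from the text.
print(coherence(X), math.sqrt(2 * math.log(Q) / N))
```

Two perfectly collinear columns give a coherence of 1, and orthogonal columns give a coherence of 0, which are the two extremes of the measure.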

4. Selection Risk

There is a tight link between the severity of the selection risk and how correlated asset features are. Specifically, Ben-Haim, Eldar, and Elad (2010) show that if

(11)   \begin{align*} \alpha_{\min} \cdot \left( 1 - \{2 \cdot K - 1\} \cdot \mu(\mathbf{X}) \right) &\geq 2 \cdot \sigma_{\epsilon} \cdot \sqrt{2 \cdot (1 + \xi) \cdot \log(Q)} \end{align*}

for some \xi > 0, then:

(12)   \begin{align*} \sum_{q=1}^Q \left( \widehat{\alpha}_q - \alpha_q \right)^2 &\leq \frac{2 \cdot (1 + \xi)}{(1 - (K-1)\cdot \mu(\mathbf{X}))^2} \times K \cdot \sigma_{\epsilon}^2 \cdot \log(Q) = \Omega \end{align*}

with probability at least:

(13)   \begin{align*} 1 - Q^{-\xi} \cdot \left( \, \pi \cdot (1 + \xi) \cdot \log(Q) \, \right)^{-\sfrac{1}{2}} \end{align*}

where \alpha_{\min} = \min_{q \in \mathcal{K}} |\alpha_q| and \mathcal{K} denotes the set of K relevant features. Let’s plug in some numbers. If \alpha_{\min} = 0.10 and \sigma_{\epsilon} = 0.05, then the result means that \Vert \widehat{\boldsymbol \alpha} - {\boldsymbol \alpha} \Vert_{\ell_2}^2 is less than 0.185 \times K \cdot \log(Q) with probability \sfrac{3}{4}.
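
Equations (11) through (13) can be evaluated mechanically, as in the sketch below. The function names are mine, and the inputs in the example (\alpha_{\min} = 1, K = 3, Q = 100, \mu(\mathbf{X}) = 0.05, \xi = 0.5) are purely illustrative values chosen so that the condition in Equation (11) holds; they are not estimates from any data.

```python
import math


def condition_holds(alpha_min, k, q, mu, sigma_eps, xi):
    """Equation (11): the smallest coefficient must be large relative to the noise."""
    lhs = alpha_min * (1 - (2 * k - 1) * mu)
    rhs = 2 * sigma_eps * math.sqrt(2 * (1 + xi) * math.log(q))
    return lhs >= rhs


def recovery_bound(k, q, mu, sigma_eps, xi):
    """Equation (12): upper bound on the squared estimation error."""
    scale = 2 * (1 + xi) / (1 - (k - 1) * mu) ** 2
    return scale * k * sigma_eps ** 2 * math.log(q)


def success_probability(q, xi):
    """Equation (13): probability that the bound holds."""
    return 1 - q ** (-xi) / math.sqrt(math.pi * (1 + xi) * math.log(q))


# Illustrative (hypothetical) parameter values.
params = dict(k=3, q=100, mu=0.05, sigma_eps=0.05, xi=0.5)
print(condition_holds(alpha_min=1.0, **params))
print(recovery_bound(**params))
print(success_probability(q=100, xi=0.5))
```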

There are a couple of things worth pointing out here. First, the recovery bounds only hold when \mathbf{X} is sufficiently incoherent:

(14)   \begin{align*} \mu(\mathbf{X}) < \frac{1}{2 \cdot K - 1} \end{align*}

i.e., when the assets are too similar, we can’t learn anything concrete about which amenity-specific shocks are driving the returns. Second, the free parameter \xi > 0 links the probability of seeing an error rate outside the bounds, p, to the number of amenities that houses have:

(15)   \begin{align*} \xi &\approx \frac{\log(\sfrac{1}{p}) - \frac{1}{2} \cdot \log\left[ \pi \cdot \log Q \right]}{\sfrac{1}{2} + \log(Q)} \end{align*}

If you want to lower this probability, you need to either use a larger constant or decrease the number of amenities. For \xi large enough we can effectively regard the error bounds as the variance. Importantly, this quantity is increasing in the coherence of the measurement matrix. i.e., when assets are more similar, I am less sure that I am drawing the correct conclusion from past returns.
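
The approximation in Equation (15) comes from taking logs of the failure probability in Equation (13) and replacing \log(1 + \xi) with \xi. A quick numerical check is sketched below; the choices p = 0.05 and Q = 100 are arbitrary illustrative values.

```python
import math


def xi_from_p(p, q):
    """Equation (15): approximate xi implied by a target failure probability p."""
    numerator = math.log(1 / p) - 0.5 * math.log(math.pi * math.log(q))
    return numerator / (0.5 + math.log(q))


def failure_probability(q, xi):
    """One minus Equation (13): chance that the error falls outside the bound."""
    return q ** (-xi) / math.sqrt(math.pi * (1 + xi) * math.log(q))


xi = xi_from_p(p=0.05, q=100)
# Plugging the approximate xi back in should return something close to p.
print(xi, failure_probability(q=100, xi=xi))
```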

5. Empirical Predictions

The main empirical prediction is that in places with less variation in housing amenities, there should be more price momentum since it’s harder to distinguish between noise and amenity-specific value shocks. e.g., imagine studying the price paths of 2 neighborhoods, A and B, which have houses of the exact same value, \mathdollar v. In neighborhood A, each of the houses has a very different collection of amenities whose values sum to \mathdollar v; whereas, in neighborhood B, each of the houses has the exact same amenities whose values sum to \mathdollar v. e.g., you can think about neighborhood A as pre-war and neighborhood B as tract housing. The theory says that the price of houses in neighborhood B should respond more slowly to amenity-specific value shocks because houses have more correlated amenities (i.e., \Omega is larger). As a result, home prices in neighborhood B should also display more momentum… though this is not in the toy model above.

Notes: Ang, Hodrick, Xing, and Zhang (2006)

1. Introduction

In this post I work through the main results in Ang, Hodrick, Xing, and Zhang (2006) which shows not only that i) stocks with more exposure to changes in aggregate volatility have lower average excess returns, but also that ii) stocks with more idiosyncratic volatility relative to the Fama and French (1993) 3 factor model have lower excess returns. The first result is consistent with existing asset pricing theories; whereas, the second result is at odds with almost any mainstream asset pricing theory you might write down. Idiosyncratic risk should not be priced. This paper together with Campbell, Lettau, Malkiel, and Xu (2001) (see my earlier post) set off an investigation into the role of idiosyncratic risk in determining asset prices. One possibility is that idiosyncratic risk is just a proxy for exposure to aggregate risk. i.e., perhaps it’s the firms with the highest exposure to aggregate return volatility that also have the highest idiosyncratic volatility. Interestingly, Ang, Hodrick, Xing, and Zhang (2006) show that this is not the case via a double sort on both aggregate and idiosyncratic volatility exposure giving evidence that these are 2 separate risk factors. The code I use to replicate the results in Ang, Hodrick, Xing, and Zhang (2006) and create the figures can be found here.

2. Theoretical Motivation

The discount factor view of asset pricing says that:

(1)   \begin{align*} 0 = \mathrm{E}[m \cdot r_n] \quad \text{for all } n=1,2,\ldots,N \end{align*}

where \mathrm{E}[\cdot] denotes the expectation operator, m denotes the stochastic discount factor, and r_n denotes asset n’s excess return. Equation (1) reads: “In the absence of margin requirements and transactions costs, it costs you \mathdollar 0 today to borrow at the riskless rate, buy a stock, and hold the position for 1 period.” Asset pricing theories explain why average excess returns, \mathrm{E}[r_n], vary across assets even though they all have the same price today by construction (see my earlier post).

Suppose each asset’s excess returns are a function of a risk factor x, \mathrm{R}_n(x), and noise, z_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_z^2):

(2)   \begin{align*} r_n  &= \mathrm{R}_n(x) + z_n \\ &= \mathrm{R}_n(\mu_x) + \mathrm{R}_n'(\mu_x) \cdot (x - \mu_x) + \frac{1}{2} \cdot \mathrm{R}_n''(\mu_x) \cdot (x - \mu_x)^2 + \text{``h.o.t.''} + z_n \\ &\approx \alpha_n + \beta_n \cdot (x - \mu_x) + \frac{\gamma_n}{2} \cdot (x - \mu_x)^2 + z_n \end{align*}

where I assume for simplicity that the only risk factor is the value-weighted excess return on the market so that \mu_x \approx 6{\scriptstyle \%/\mathrm{yr}} and \sigma_x \approx 16{\scriptstyle \%/\mathrm{yr}}. I use a Taylor expansion to linearize the function \mathrm{R}_n(x) around the point x = \mu_x and assume \mathrm{O}(x - \mu_x)^3 terms are negligible so \mathrm{E}[r_n] = \alpha_n + \sfrac{\gamma_n}{2} \cdot \sigma_x^2 and \mathrm{Var}[r_n] = \beta_n^2 \cdot \sigma_x^2 + \sigma_z^2. This means that if the excess return on the market is \sfrac{16{\scriptstyle \%}}{\sqrt{252}} \approx 1{\scriptstyle \%/\mathrm{day}} larger than expected, then asset n’s expected excess returns will be \beta_n{\scriptstyle \%} larger.
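
The two moments quoted in the paragraph above follow from taking expectations of the last line of Equation (2). As a sketch of the calculation, assuming that x and z_n are independent, that \mathrm{E}[z_n] = 0, and that the contribution of the quadratic term to the variance is small enough to ignore:

\begin{align*} \mathrm{E}[r_n] &\approx \alpha_n + \beta_n \cdot \mathrm{E}[x - \mu_x] + \frac{\gamma_n}{2} \cdot \mathrm{E}[(x - \mu_x)^2] + \mathrm{E}[z_n] = \alpha_n + \frac{\gamma_n}{2} \cdot \sigma_x^2 \\ \mathrm{Var}[r_n] &\approx \beta_n^2 \cdot \mathrm{Var}[x - \mu_x] + \mathrm{Var}[z_n] = \beta_n^2 \cdot \sigma_x^2 + \sigma_z^2 \end{align*}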

Any asset pricing theory says that each asset’s expected excess return should be proportional to how much the asset comoves with the risk factor, x:

(3)   \begin{align*} \mathrm{E}[r_n]  = \alpha_n + \frac{\gamma_n}{2} \cdot \sigma_x^2 =  \underbrace{\text{Constant} \times \beta_n}_{\text{Predicted}} \end{align*}

where the constant of proportionality, \text{Constant} = c \cdot (\sfrac{\mathrm{Var}[m]}{\mathrm{E}[m]}), depends on the exact asset pricing model. Equation (3) says that if you ran a regression of each stock’s excess returns on the aggregate risk factor:

(4)   \begin{align*} r_{n,t} = \widehat{\alpha}_n + \widehat{\beta}_n \cdot x_t + \mathit{Error}_{n,t} \end{align*}

then the estimated intercept for each stock should be:

(5)   \begin{align*} \widehat{\alpha}_n = \frac{\gamma_n}{2} \cdot \sigma_x^2 - \beta_n \cdot \mu_x \end{align*}

Thus, each stock’s average excess returns may well be related to its exposure to aggregate volatility since \sigma_x shows up in the expression for \widehat{\alpha}_n; however, idiosyncratic volatility, \sigma_z, better not be priced since it shows up nowhere above.

3. Aggregate Volatility

Ang, Hodrick, Xing, and Zhang (2006) show that stocks with more exposure to aggregate volatility have lower average excess returns. i.e., that the coefficient \gamma_n < 0. The authors actually look at each stock’s exposure to changes in aggregate volatility. To see how this changes the math, consider rewriting the intercept above as:

(6)   \begin{align*} \widehat{\alpha}_n = \mathrm{A}_n(\Delta \sigma_x) &= \alpha_n + \frac{\gamma_n}{2} \cdot \left(\langle \sigma_x \rangle + \Delta \sigma_x \right)^2 \end{align*}

Using this formulation, we can look at how perturbing \mathrm{A}_n(\Delta \sigma_x) around its mean with some small \Delta \sigma_x will impact the estimated intercept:

(7)   \begin{align*} \mathrm{A}_n(\Delta \sigma_x) &= \mathrm{A}_n(0) + \mathrm{A}_n'(0) \cdot \Delta \sigma_x + \cdots \\ &\approx \left[ \alpha_n + \frac{\gamma_n}{2} \cdot \langle\sigma_x\rangle^2 \right] + \gamma_n \cdot \langle\sigma_x\rangle \cdot \Delta \sigma_x \end{align*}

Since \langle \sigma_x \rangle > 0 by definition, (\sfrac{\gamma_n}{2}) \cdot \langle \sigma_x \rangle^2 and \gamma_n \cdot \langle \sigma_x \rangle will have the same sign. Thus, testing for whether exposure to changes in aggregate volatility is priced is tantamount to testing for whether exposure to aggregate volatility is priced.

The authors proceed in 5 steps. First, they calculate the changes in aggregate volatility time series using changes in the daily options implied volatility:

(8)   \begin{align*} \Delta \sigma_{x,d+1} = \mathit{VXO}_{d+1} - \mathit{VXO}_d \qquad \text{with} \qquad  \mathrm{E}[\Delta \sigma_{x,d+1}] = 0.01{\scriptstyle \%}, \, \mathrm{StD}[\Delta \sigma_{x,d+1}] = 2.65{\scriptstyle \%} \end{align*}

If the VXO is 4.33{\scriptstyle \%}, then options markets expect the S&P 100 to move up or down 4.33{\scriptstyle \%} over the next 30 calendar days. The authors use the VXO contract price rather than the VIX contract price because it has a longer time series dating back to 1986. The only difference between the 2 contracts is that the VXO quotes the options implied volatility on the S&P 100; whereas, the VIX quotes the options implied volatility on the S&P 500. Daily changes in the 2 contract prices have a correlation of 0.81 over the sample period from January 1986 to December 2012 as shown in the figure below.

plot--vix-vs-vxo-daily-data--04may2014

Second, the authors compute each stock’s exposure to changes in aggregate volatility by running a regression for each stock n \in \{1,2,\ldots,N\} using the daily data in month (m-1):

(9)   \begin{align*} r_{n,d} = \widehat{\alpha}_n + \widehat{\beta}_{n} \cdot x_d + \widehat{\gamma}_{n} \cdot \Delta \sigma_{x,d} + \mathit{Error}_{n,d} \end{align*}

Estimated coefficients are related to underlying deep parameters by:

(10)   \begin{align*} \widehat{\alpha}_n &= \frac{\gamma_n}{2} \cdot \langle \sigma_x \rangle^2 - \beta_n \cdot \mu_x \\ \widehat{\beta}_n &= \beta_n \\ \widehat{\gamma}_n &= \gamma_n \cdot \langle \sigma_x \rangle \end{align*}

The daily market excess return, x_d, is the excess return on the CRSP value-weighted market index. I include AMEX, NYSE, and NASDAQ stocks with \geq 17 daily observations in month (m-1) in my universe of N stocks.
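
The stock-by-stock regression in Equation (9) can be sketched with a small least-squares routine. This is an illustration under assumed inputs and not the replication code: the function names are hypothetical, the solver uses plain normal equations from the standard library, and the example data are constructed by hand so that the true coefficients are known.

```python
import math


def ols(y, X):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def volatility_exposure(returns, market, d_vol, min_obs=17):
    """Estimate (alpha, beta, gamma) in Equation (9) for one stock-month."""
    if len(returns) < min_obs:
        return None  # too few daily observations in the month
    X = [[1.0, x, dv] for x, dv in zip(market, d_vol)]
    return ols(returns, X)


# Hand-built data for 20 trading days with known coefficients and no noise.
market = [0.01 * math.sin(d) for d in range(1, 21)]
d_vol = [0.02 * math.cos(1.7 * d) for d in range(1, 21)]
returns = [0.001 + 1.2 * x - 0.5 * dv for x, dv in zip(market, d_vol)]
alpha, beta, gamma = volatility_exposure(returns, market, d_vol)
print(alpha, beta, gamma)
```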

plot--aggregate-volatility-portfolio-cumulative-returns

Third, the authors sort the N stocks satisfying the data constraints in month (m-1) into 5 value-weighted portfolios based on their estimated \widehat{\gamma}_{n}. Note that because the factor \langle \sigma_x \rangle is common to all stocks in month (m-1), this sort effectively organizes stocks by their true exposure to aggregate volatility, \gamma_n. For each portfolio j \in \{\text{L},2,3,4,\text{H}\} with j = \text{L} denoting the stocks with the lowest aggregate volatility exposure and j = \text{H} denoting the stocks with the highest aggregate volatility exposure, the authors then calculate the daily portfolio returns in month m. The figure above shows the cumulative returns to each of these 5 portfolios. It reads that if you invested \mathdollar 1 in the low aggregate volatility exposure portfolio in January 1986, then you would have over \mathdollar 200 more in December 2012 than if you had invested that same \mathdollar 1 in the high aggregate volatility exposure portfolio. What’s more, each portfolio’s exposure to the excess return on the market is not explaining its performance. The figure below reports the estimated intercepts for each j \in \{\text{L},2,3,4,\text{H}\} from the regression:

(11)   \begin{align*} r_{j,m} = \widehat{\alpha}_j + \widehat{\beta}_j \cdot x_m + \mathit{Error}_{j,m} \end{align*}

and indicates that abnormal returns are decreasing in the portfolio’s exposure to aggregate volatility.

plot--ahxz06-table-1--capm-alphas
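
The sorting step can be sketched as follows. The field names `gamma_hat`, `mkt_cap`, and `ret` are hypothetical, and splitting the ranked list into 5 equal-count groups is my simplifying assumption about how the quintile breakpoints are set.

```python
def quintile_portfolios(stocks):
    """Split stocks into 5 groups by gamma_hat, from lowest (L) to highest (H)."""
    ranked = sorted(stocks, key=lambda s: s["gamma_hat"])
    n = len(ranked)
    return [ranked[i * n // 5:(i + 1) * n // 5] for i in range(5)]


def value_weighted_return(portfolio):
    """Portfolio return with weights proportional to market capitalization."""
    total_cap = sum(s["mkt_cap"] for s in portfolio)
    return sum(s["mkt_cap"] * s["ret"] for s in portfolio) / total_cap


# Hypothetical universe of 10 stocks, listed in reverse order on purpose.
stocks = [{"gamma_hat": float(i), "mkt_cap": 1.0 + i, "ret": 0.01 * i}
          for i in reversed(range(10))]
portfolios = quintile_portfolios(stocks)
print([value_weighted_return(p) for p in portfolios])
```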

Fourth, in order to test whether the spread in portfolio abnormal returns is actually explained by contemporaneous exposure to aggregate volatility, the authors then create an aggregate volatility factor mimicking portfolio. They estimate the regression below using the daily excess returns on each of the 5 aggregate volatility exposure portfolios in each month m:

(12)   \begin{align*} \Delta \sigma_{x,d} = \widehat{\kappa} + \sum_{j=\text{L}}^{\text{H}} \widehat{\lambda}_{j} \cdot r_{j,d} + \mathit{Error}_d \end{align*}

and store the parameter estimates for \begin{bmatrix} \widehat{\lambda}_{\text{L}} & \widehat{\lambda}_2 & \widehat{\lambda}_3 & \widehat{\lambda}_4 & \widehat{\lambda}_{\text{H}} \end{bmatrix}^{\top}. They then define the factor mimicking portfolio return at the daily horizon in month m as:

(13)   \begin{align*}  f_d = \sum_{j=\text{L}}^{\text{H}} \widehat{\lambda}_{j} \cdot r_{j,d} \end{align*}

The figure below plots the factor mimicking portfolio returns against the underlying changes in aggregate volatility at the monthly level. The 2 data series line up relatively closely; however, the factor mimicking portfolio is much too volatile during crises such as Black Monday in 1987.

plot--aggregate-volatility-factor

Fifth and finally, the authors check whether or not the returns on each of the 5 aggregate volatility portfolios are positively correlated with contemporaneous movements in the aggregate volatility factor mimicking portfolio at the monthly horizon. To do this, they cumulate up daily excess returns on the factor mimicking portfolio and the aggregate volatility exposure sorted portfolios to get monthly returns:

(14)   \begin{align*} f_m &= \sum_{d=1}^{22} f_d \\ r_{j,m} &= \sum_{d=1}^{22} r_{j,d} \quad \text{for all } j \in \{\text{L},2,3,4,\text{H}\} \end{align*}
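
Equations (13) and (14) amount to a weighted sum across portfolios followed by a sum across days. A minimal sketch, in which the \widehat{\lambda}_j values and the 3 days of returns are made-up numbers rather than estimates:

```python
def factor_returns(lambdas, daily_portfolio_returns):
    """Equation (13): f_d is the lambda-weighted sum of the 5 portfolio returns."""
    return [sum(lam * r for lam, r in zip(lambdas, day))
            for day in daily_portfolio_returns]


def cumulate(daily_returns):
    """Equation (14): monthly return as the sum of the daily returns."""
    return sum(daily_returns)


lambdas = [-0.4, -0.2, 0.0, 0.2, 0.4]  # hypothetical estimates from Equation (12)
daily = [  # each row holds one day's returns on portfolios L, 2, 3, 4, H
    [0.010, 0.010, 0.010, 0.010, 0.020],
    [-0.020, -0.010, 0.000, 0.010, 0.020],
    [0.000, 0.000, 0.000, 0.000, 0.000],
]
f_d = factor_returns(lambdas, daily)
f_m = cumulate(f_d)
r_m = [cumulate([day[j] for day in daily]) for j in range(5)]
print(f_m, r_m)
```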

Then, they run the regression below at a monthly horizon over the full sample:

(15)   \begin{align*} r_{j,m} = \widehat{\zeta}_j + \widehat{\eta}_j \cdot x_m  + \widehat{\theta}_j \cdot f_m + \mathit{Error}_{j,m} \end{align*}

I report the estimated \widehat{\theta}_j coefficients in the figure below. Consistent with the idea that exposure to aggregate volatility is driving the disparate excess returns of the 5 test portfolios, I find that each portfolio loads positively on monthly movements in the factor mimicking portfolio.

plot--ahxz06-table-1--factor-loadings

4. Idiosyncratic Volatility

Ang, Hodrick, Xing, and Zhang (2006) also show that stocks with more idiosyncratic volatility have lower average excess returns. This should not be true under the standard theory outlined in Section 2 above. To measure idiosyncratic volatility, the authors run the regression below at the daily level in month (m-1) for each stock n = 1,2,\ldots,N:

(16)   \begin{align*} r_{n,d} = \widehat{\alpha}_n + \widehat{\boldsymbol \beta}_n^{\top} \cdot \mathbf{x}_d + \mathit{Error}_{n,d} \end{align*}

where the risk factors are the excess return on the value weighted market portfolio, the excess return on a size portfolio, and the excess return on a value portfolio as dictated by Fama and French (1993):

(17)   \begin{align*} \mathbf{x}_d^{\top} = \begin{bmatrix} r_{\mathrm{Mkt},d} & r_{\mathrm{SmB},d} & r_{\mathrm{HmL},d} \end{bmatrix} \end{align*}

For each stock listed on the AMEX, NYSE, or NASDAQ stock exchange with \geq 17 daily observations in month (m-1), the authors then calculate the measure of idiosyncratic volatility below:

(18)   \begin{align*} \sigma_{z,n} &= \mathrm{StD}[\mathit{Error}_{n,d}] \end{align*}

plot--idiosyncratic-volatility-portfolio-cumulative-returns

The authors sort the N stocks satisfying the data constraints in month (m-1) into 5 value-weighted portfolios based on their estimated \sigma_{z,n} values. The figure above reports the cumulative returns to these 5 test portfolios. The figure reads that if you invested \mathdollar 1 in the low idiosyncratic volatility portfolio in January 1963, then you would have over \mathdollar 100 more in December 2012 than if you had invested in the high idiosyncratic volatility portfolio. The figure below reports the estimated abnormal returns, \widehat{\alpha}_j, for each of the idiosyncratic volatility portfolios over the full sample and confirms that the poor performance of the high idiosyncratic volatility portfolio cannot be explained by exposure to common risk factors.

plot--ahxz06-table-6--capm-alphas

5. Are They Related?

I conclude by discussing the obvious follow-up question: “Are these 2 phenomena related?” After all, it could be the case that the firms with the highest exposure to aggregate return volatility also have the highest idiosyncratic volatility and vice versa. Ang, Hodrick, Xing, and Zhang (2006) show that this is not the case via a double sort. i.e., they show that within each aggregate volatility exposure portfolio, the stocks with the lowest idiosyncratic volatility outperform the stocks with the highest idiosyncratic volatility. Similarly, they show that within each idiosyncratic volatility portfolio, the stocks with the lowest aggregate volatility exposure outperform the stocks with the highest aggregate volatility exposure. Thus, the motivation driving investors to pay a premium for stocks with high aggregate volatility exposure is different from the motivation driving investors to pay a premium for stocks with high idiosyncratic volatility.

plot--r2-portfolios--capm-alphas

Indeed, you can pretty much guess this fact from the cumulative return plots in Sections 3 and 4 where the red lines denoting the low exposure portfolios behave in completely different ways. e.g., the low aggregate volatility exposure portfolio returns behave more or less like the high aggregate volatility exposure portfolio returns but with a higher mean. By contrast, the low idiosyncratic volatility portfolio returns are a much different time series with dramatically less volatility. Interestingly, if the authors sort on total volatility in month (m-1) rather than idiosyncratic volatility, then the results are identical; however, the results do not carry through if you sort on R^2 in month (m-1). e.g., suppose you ran the same regression at the daily level in month (m-1) for each stock n = 1,2,\ldots,N:

(19)   \begin{align*} r_{n,d} = \widehat{\alpha}_n + \widehat{\boldsymbol \beta}_n^{\top} \cdot \mathbf{x}_d + \mathit{Error}_{n,d} \end{align*}

where the risk factors are the excess return on the value weighted market portfolio, the excess return on a size portfolio, and the excess return on a value portfolio as dictated by Fama and French (1993):

(20)   \begin{align*} \mathbf{x}_d^{\top} = \begin{bmatrix} r_{\mathrm{Mkt},d} & r_{\mathrm{SmB},d} & r_{\mathrm{HmL},d} \end{bmatrix} \end{align*}

Then, for each stock you computed the R^2 statistic measuring the fraction of the total variation in each stock’s excess returns that is explained by movements in the risk factors:

(21)   \begin{align*} R^2 &= 1 - \frac{\sum_{d=1}^{22}(r_{n,d} - \{\widehat{\alpha}_n + \widehat{\boldsymbol \beta}_n^{\top} \mathbf{x}_d\})^2}{\sum_{d=1}^{22}(r_{n,d} - \langle r_{n,d} \rangle)^2} \end{align*}

If you group stocks into 5 portfolios based on their R^2 over the previous month, the figure above shows that there is no monotonic spread in the abnormal returns. Thus, the idiosyncratic volatility results seem to be more about volatility and less about the explanatory power of the Fama and French (1993) factors.
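The R^2 calculation in Equations (19) through (21) can be sketched in a few lines of Python. This is only an illustration on simulated data, not the authors' code: the function name `monthly_r2`, the 22-day month, and all of the simulated factor returns and loadings are my own assumptions.

```python
import numpy as np

def monthly_r2(r, X):
    """R^2 from regressing one stock's daily excess returns on the factors.

    r : (D,) daily excess returns for one stock in month (m-1)
    X : (D, 3) daily factor returns [Mkt, SmB, HmL]
    """
    D = len(r)
    design = np.column_stack([np.ones(D), X])       # intercept + 3 factors
    coef, *_ = np.linalg.lstsq(design, r, rcond=None)
    resid = r - design @ coef                       # regression errors
    tss = np.sum((r - r.mean()) ** 2)               # total variation
    return 1.0 - np.sum(resid ** 2) / tss

# Simulated month with 22 trading days (hypothetical numbers).
rng = np.random.default_rng(0)
X = rng.normal(0.0, 0.01, size=(22, 3))
beta = np.array([1.0, 0.5, -0.3])
r_exact = 0.001 + X @ beta                          # no idiosyncratic noise
r_noisy = r_exact + rng.normal(0.0, 0.02, size=22)  # plus idiosyncratic noise

r2_exact = monthly_r2(r_exact, X)
r2_noisy = monthly_r2(r_noisy, X)
```

A stock whose returns are driven entirely by the factors has an R^2 of 1, while adding idiosyncratic noise pulls the statistic toward 0. Sorting on this number is therefore a sort on explanatory power rather than on the level of volatility.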

Using the Cross-Section of Returns

1. Introduction

The empirical content of the discount factor view of asset pricing can all be derived from the equation below:

(1)   \begin{align*} 0 = \mathrm{E}[m \cdot r_n] \quad \text{for all } n=1,2,\ldots,N \end{align*}

where m denotes the prevailing stochastic discount factor and r_n denotes an asset’s excess return. Equation (1) reads: “In the absence of margin requirements and transactions costs, it costs you \mathdollar 0 today to borrow at the riskless rate, buy a stock, and hold the position for 1 period.” The question is then why average excess returns, \mathrm{E}[r_n], vary across the N assets even though they all have the same price today by construction.

The answer hinges on the behavior of the stochastic discount factor, m, in Equation (1). What is this thing? Everyone knows that it is better to have \mathdollar 1 today than \mathdollar 1 tomorrow, and the present value of an asset that pays out \mathdollar 1 tomorrow is called the discount factor. Sometimes important stuff will happen in the next 24 hours that changes how awesome it is to have an additional \mathdollar 1 tomorrow. As a result, the realized discount factor is a random variable each period (i.e., follows a stochastic process). e.g., if agents have utility, \mathrm{U}_0 = \mathrm{E}_0 \sum_{t \geq 0} e^{-\rho \cdot t} \cdot c_t^{1-\theta}, then the stochastic discount factor is m = e^{-\rho - \theta \cdot \Delta \log c} and the stuff (i.e., risk factor) is changes in log consumption.

asset-pricing-theory

An asset pricing model is a machine which takes as inputs a) each agent’s preferences, b) each agent’s information, and c) a list of the relevant risk factors affecting how agents discount the future and produces a stochastic discount factor as its output. In this post, I show how to test an asset pricing model using the cross-section of asset returns. i.e., by linking how average excess returns vary across assets to each asset’s exposure to the risk factors governing the behavior of the stochastic discount factor.

2. Theoretical Predictions

The key to massaging Equation (1) into a form that can be taken to the data is noticing that for any 2 random variables u and v, the following identity holds:

(2)   \begin{equation*}  \mathrm{E}[u\cdot v] = \mathrm{Cov}[u,v] + \mathrm{E}[u] \cdot \mathrm{E}[v] \end{equation*}

Thus, if I let u denote the stochastic discount factor and v denote any of the N excess returns, I can link the expected excess return to holding an asset to its covariance with the stochastic discount factor:

(3)   \begin{align*} \mathrm{E}[r_n] &= \frac{\mathrm{Cov}[m, r_n]}{\mathrm{Var}[m]} \cdot \left( - \frac{\mathrm{Var}[m]}{\mathrm{E}[m]} \right) \end{align*}

The first term is dimensionless and represents the amount of exposure asset n has to the risk factor x. The second term has dimension \sfrac{1}{\Delta t}, is common across all assets, and represents the price of exposure to the risk factor x since it has the same units as the expected return \mathrm{E}[r_n]. Asset pricing theories say that each asset’s expected return should be proportional to the market-wide prices of risk where the constant of proportionality is the asset’s “exposure” to that risk factor.
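The step from Equation (1) to Equation (3) is easy to check numerically. The sketch below builds a hypothetical economy with 1000 equally likely states, forces \mathrm{E}[m \cdot r] = 0 to hold exactly, and then confirms that the expected excess return equals exposure times price. All of the numbers and the helper name `cov` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
S = 1000                                    # equally likely states of the world

# A strictly positive discount factor and a raw payoff that co-moves with it.
m = np.exp(-0.02 - 2.0 * rng.normal(0.01, 0.02, S))
payoff = rng.normal(0.05, 0.2, S) - 2.0 * (m - m.mean())

# Shift the payoff by a constant so that E[m * r] = 0 holds exactly.
r = payoff - np.mean(m * payoff) / np.mean(m)

def cov(u, v):
    """Population covariance over the equally likely states."""
    return np.mean((u - u.mean()) * (v - v.mean()))

exposure = cov(m, r) / cov(m, m)            # Cov[m, r] / Var[m]
price = -cov(m, m) / np.mean(m)             # -Var[m] / E[m]
expected_return = np.mean(r)
```

Here `expected_return` and `exposure * price` agree up to floating point error, which is just the identity for \mathrm{E}[u \cdot v] applied to the pricing equation.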

What does “exposure” mean here? To answer this question I need to put a bit more structure on the stochastic discount factor, m, and the excess return, r_n. I remain agnostic about which asset pricing model actually governs returns and which risk factors affect discount rates, but to avoid writing out lots of messy matrices I do assume that there is only a single factor, x, with \mathrm{E}[x] = \mu_x and \mathrm{Var}[x] = \sigma_x^2. I then write the stochastic discount factor as the sum of a function of x, \mathrm{M}(x), and some noise, y \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_y^2):

(4)   \begin{align*} m &= \mathrm{M}(x) + y \\ &= \mathrm{M}(\mu_x) + \mathrm{M}'(\mu_x) \cdot (x - \mu_x) + \frac{1}{2} \cdot \mathrm{M}''(\mu_x) \cdot (x - \mu_x)^2 + \text{``h.o.t.''} + y \\ &\approx \phi + \chi \cdot (x - \mu_x) + \frac{\psi}{2} \cdot (x - \mu_x)^2 + y \end{align*}

where I use a Taylor expansion to linearize the function \mathrm{M}(x) around the point x = \mu_x and assume terms of order \mathrm{O}(x - \mu_x)^3 are negligible so that \mathrm{E}[m] = \phi + \sfrac{\psi}{2} \cdot \sigma_x^2 and \mathrm{Var}[m] = \chi^2 \cdot \sigma_x^2 + \sigma_y^2. This means that if the risk factor is \sigma_x larger than expected, (x - \mu_x) = \sigma_x, then agents value having an additional \mathdollar 1 tomorrow \chi \cdot \sigma_x more than usual. Similarly, suppose each excess return is the sum of an asset-specific function of x, \mathrm{R}_n(x), and some asset-specific noise, z_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_z^2):

(5)   \begin{align*} r_n  &= \mathrm{R}_n(x) + z_n \\ &= \mathrm{R}_n(\mu_x) + \mathrm{R}_n'(\mu_x) \cdot (x - \mu_x) + \frac{1}{2} \cdot \mathrm{R}_n''(\mu_x) \cdot (x - \mu_x)^2 + \text{``h.o.t.''} + z_n \\ &\approx \alpha_n + \beta_n \cdot (x - \mu_x) + \frac{\gamma_n}{2} \cdot (x - \mu_x)^2 + z_n \end{align*}

where I use a Taylor expansion to linearize the function \mathrm{R}_n(x) around the point x = \mu_x and assume \mathrm{O}(x - \mu_x)^3 terms are negligible so that \mathrm{E}[r_n] = \alpha_n + \sfrac{\gamma_n}{2} \cdot \sigma_x^2 and \mathrm{Var}[r_n] = \beta_n^2 \cdot \sigma_x^2 + \sigma_z^2. This means that if the risk factor is \sigma_x larger than expected, (x - \mu_x) = \sigma_x, then asset n‘s realized excess returns will be \beta_n \cdot \sigma_x larger than average.

Plugging Equations (4) and (5) into Equation (3) then shows exactly what “exposure” to the risk factor means:

(6)   \begin{equation*} \begin{split} \mathrm{E}[r_n] &= \frac{\mathrm{Cov}[m,r_n]}{\mathrm{Var}[m]} \cdot \left( - \, \frac{\mathrm{Var}[m]}{\mathrm{E}[m]} \right) \\ &= \frac{\chi \cdot \beta_n \cdot \sigma_x^2}{\chi^2 \cdot \sigma_x^2 + \sigma_y^2} \cdot \left( - \, \frac{\chi^2 \cdot \sigma_x^2 + \sigma_y^2}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} \right) \\ &= - \, \left( \frac{\chi \cdot \sigma_x^2}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} \right) \cdot \beta_n \\ &= \text{Constant} \times \beta_n \end{split} \end{equation*}

Each asset’s exposure to the risk factor x is summarized by the coefficient \beta_n. Assets which have higher realized returns when the risk factor is high (have a large \beta_n) will have lower average returns (high prices) since these assets are good hedges against the risk factor. i.e., these assets look like insurance. Equation (1)’s empirical content is then that an asset’s average excess returns, \langle r_n \rangle, is proportional to its exposure to the risk factor, \beta_n, where the constant of proportionality is the same for all assets:

(7)   \begin{align*} \mathrm{E}[r_n]  = \underbrace{\alpha_n + \frac{\gamma_n}{2} \cdot \sigma_x^2}_{\text{Realized } \langle r_n \rangle} =  \underbrace{- \, \left( \frac{\chi \cdot \sigma_x^2}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} \right) \cdot \beta_n}_{\text{Predicted}} \end{align*}

By letting y,z_n \searrow 0 we can interpret this relationship as a realization of the first Hansen-Jagannathan bound:

(8)   \begin{align*} \frac{\mathrm{StD}[m_{t+1}]}{\mathrm{E}[m_{t+1}]} = \frac{\chi \cdot \sigma_x}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} = \left| \frac{\alpha_n + \frac{\gamma_n}{2} \cdot \sigma_x^2}{\beta_n \cdot \sigma_x} \right| = \left| \frac{\mathrm{E}[r_{n,t+1}]}{\mathrm{StD}[r_{n,t+1}]} \right| \end{align*}

3. Empirical Strategy

To test Equation (7), an econometrician has to estimate (2 \cdot N + 2) unknown parameters:

(9)   \begin{align*} \widehat{\boldsymbol \theta} = \begin{bmatrix} \widehat{\mu}_x & \widehat{\alpha}_1 & \cdots & \widehat{\alpha}_N & \widehat{\beta}_1 & \cdots & \widehat{\beta}_N & \widehat{\lambda} \end{bmatrix}^{\top} \end{align*}

using T periods of observations. i.e., 2 parameters for each asset (its average excess returns and its factor exposure) as well as 2 market-wide parameters (the risk factor mean and the market price of risk). There are (3 \cdot N + 1) moment equations with which to estimate these parameters via GMM so that the system is over-identified whenever there are N > 1 assets:

(10)   \begin{align*} \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}  &= \mathrm{E}[\mathrm{G}(\widehat{\boldsymbol \theta};\mathbf{r}_t,x_t)] = \mathrm{E} \begin{bmatrix} x_t - \widehat{\mu}_x \\ r_{1,t} - \left\{ \widehat{\alpha}_1 + \widehat{\beta}_1 \cdot (x_t - \widehat{\mu}_x) \right\} \\ \vdots \\ r_{N,t} - \left\{ \widehat{\alpha}_N + \widehat{\beta}_N \cdot (x_t - \widehat{\mu}_x) \right\} \\ \left( r_{1,t} - \left\{ \widehat{\alpha}_1 + \widehat{\beta}_1 \cdot (x_t - \widehat{\mu}_x) \right\} \right) \cdot (x_t - \widehat{\mu}_x) \\ \vdots \\ \left( r_{N,t} - \left\{ \widehat{\alpha}_N + \widehat{\beta}_N \cdot (x_t - \widehat{\mu}_x) \right\} \right) \cdot (x_t - \widehat{\mu}_x) \\ r_{1,t} - \widehat{\beta}_1 \cdot \widehat{\lambda} \\ \vdots \\ r_{N,t} - \widehat{\beta}_N \cdot \widehat{\lambda} \end{bmatrix} \end{align*}

The first equation pins down the mean of the factor x. The following (2 \cdot N) equations identify the \{\widehat{\alpha}_n,\widehat{\beta}_n\}_{n \in N} parameters governing the relationship between the risk factor and each asset’s excess returns. The final N equations pin down the market price of risk, \widehat{\lambda}, for exposure to the risk factor x. A risk is “priced” if \widehat{\lambda} \neq 0.
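To make the role of each block of moment conditions concrete, here is a rough two-step sketch on simulated data. It solves the first (2 \cdot N + 1) conditions exactly and then fits the single parameter \widehat{\lambda} to the final N conditions by least squares, which corresponds to an identity weighting matrix. This is an assumption made for simplicity: efficient GMM would stack all of the moments and weight them jointly. The function name `estimate_moments` and the simulated parameter values are hypothetical.

```python
import numpy as np

def estimate_moments(r, x):
    """Two-step sketch of the moment conditions in Equation (10).

    r : (T, N) excess returns, x : (T,) risk factor realizations.
    Returns (mu_x, alpha, beta, lam).
    """
    mu_x = x.mean()                        # first moment condition
    xd = x - mu_x                          # demeaned factor
    beta = (xd @ r) / (xd @ xd)            # slope conditions, one per asset
    alpha = r.mean(axis=0)                 # intercept conditions (xd has mean 0)
    # The final N conditions share the single parameter lam. With identity
    # weighting the best fit is a no-intercept regression of the average
    # returns on the estimated betas.
    lam = (beta @ alpha) / (beta @ beta)
    return mu_x, alpha, beta, lam

# Simulated data where the pricing restriction holds by construction.
rng = np.random.default_rng(0)
T, N = 5000, 10
true_beta = np.linspace(0.5, 1.5, N)
true_lam = 0.5
x = rng.normal(2.0, 1.0, T)
z = rng.normal(0.0, 0.5, (T, N))
r = true_beta * true_lam + np.outer(x - 2.0, true_beta) + z

mu_x, alpha, beta, lam = estimate_moments(r, x)
```

With a long sample the estimates land close to the values used in the simulation, and the risk is “priced” in the sense that `lam` is well away from 0.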

Note that this empirical strategy doesn’t pin down every single one of the parameters governing the relationship between the stochastic discount factor and each asset’s excess returns. e.g., the parameter estimates \widehat{\alpha}_n and \widehat{\lambda} are composites of several deep parameters:

(11)   \begin{align*} \widehat{\alpha}_n &= \alpha_n + \frac{\gamma_n}{2} \cdot \sigma_x^2 \\ \widehat{\lambda} &= - \, \left( \frac{\chi \cdot \sigma_x^2}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} \right) \end{align*}

The underlying parameters \alpha_n and \gamma_n as well as \phi, \chi, and \psi are not identifiable from this approach since they satisfy conservation laws which leave the estimates for \widehat{\alpha}_n and \widehat{\lambda} unchanged:

(12)   \begin{align*} \frac{\partial \widehat{\alpha}_n}{\partial \alpha_n} \cdot \Delta \alpha_n + \frac{\partial \widehat{\alpha}_n}{\partial \gamma_n} \cdot \Delta \gamma_n = 0 &= \Delta \alpha_n + \frac{\sigma_x^2}{2} \cdot \Delta \gamma_n \\ \frac{\partial \widehat{\lambda}}{\partial \phi} \cdot \Delta \phi + \frac{\partial \widehat{\lambda}}{\partial \chi} \cdot \Delta \chi + \frac{\partial \widehat{\lambda}}{\partial \psi} \cdot \Delta \psi = 0 &= \left( \frac{\chi}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} \right) \cdot \{\Delta \phi + \frac{\sigma_x^2}{2} \cdot \Delta \psi\} - \Delta \chi \end{align*}

e.g., if you increase \alpha_n by \epsilon \approx 0^+ and decrease \gamma_n by \frac{2}{\sigma_x^2} \cdot \epsilon, then the estimate of \widehat{\alpha}_n remains unchanged.
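A quick numerical check of this non-identification, using made-up parameter values. The first pair of calls moves \alpha_n and \gamma_n along the conservation law in Equation (12). The second pair scales \phi, \chi, and \psi by a common factor, which is one direction that satisfies the second conservation law. The function names are hypothetical.

```python
def alpha_hat(alpha, gamma, sigma_x):
    """Composite intercept from Equation (11)."""
    return alpha + 0.5 * gamma * sigma_x ** 2

def lambda_hat(phi, chi, psi, sigma_x):
    """Composite price of risk from Equation (11)."""
    return -chi * sigma_x ** 2 / (phi + 0.5 * psi * sigma_x ** 2)

sigma_x, eps = 0.2, 1e-3

# Raise alpha_n by eps and lower gamma_n by (2 / sigma_x^2) * eps.
base_a = alpha_hat(0.01, 0.5, sigma_x)
moved_a = alpha_hat(0.01 + eps, 0.5 - 2.0 * eps / sigma_x ** 2, sigma_x)

# Double phi, chi, and psi together.
base_l = lambda_hat(0.98, 1.5, 0.4, sigma_x)
moved_l = lambda_hat(2 * 0.98, 2 * 1.5, 2 * 0.4, sigma_x)
```

Both composites are unchanged, so no amount of data on \widehat{\alpha}_n and \widehat{\lambda} alone can separate the underlying parameters.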

4. Time Scale Considerations

There is a hidden assumption floating around behind the empirical strategy outlined in Section 3 above. Namely, that each asset’s factor exposure is constant and the market price of risk is constant. In practice, this is surely not the case as is documented in Jagannathan and Wang (1996) and Lewellen and Nagel (2006). OK… so constant factor exposures and prices of risk are an approximation. Fine. How good/bad an approximation is it? e.g., Fama and MacBeth (1973) use rolling T = 60 month windows to estimate each asset’s \widehat{\beta}_n. Is this too long a window relative to how much factor exposures vary over time? Alternatively, should we be using a longer window to more accurately pin down these parameters? It turns out that the estimation strategy gives some guidance about the relationship between the optimal estimation window and parameter persistence which I discuss below.

First, I model the evolution of the true parameters. To test an asset pricing model using the cross-section of excess returns, we are interested in knowing whether or not \widehat{\lambda} = 0. Suppose the true market price of risk, \lambda, follows a random walk:

(13)   \begin{align*} \lambda_T = \lambda + \sum_{t=1}^T l_t \end{align*}

where l_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_l^2) so that the final \lambda_T is a random variable with distribution:

(14)   \begin{align*}  \lambda_T \sim \mathrm{N}(\lambda, T \cdot \sigma_l^2) \end{align*}

Second, I note that the estimation strategy outlined in Section 3 above gives a signal, \widehat{\lambda}, about the average market price of risk with distribution:

(15)   \begin{align*} \widehat{\lambda} \sim \mathrm{N}\left(\lambda, \sfrac{\sigma_s^2}{T}\right) \end{align*}

where s_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_s^2) denotes estimation error from the GMM procedure. There is an additional complication to consider. Namely, if the true market price of risk is floating around during the estimation period, it will add additional noise to the parameter estimates and increase \sigma_s^2. To keep things simple, suppose that nature sets the market price of risk to \lambda at the beginning of the estimation sample and it remains constant during the estimation period. Then, \lambda_T is revealed at the end of time T and prevails afterwards. This means that the derivations below will be inequalities due to the underestimate of \sigma_s^2.

What I really care about is the distance between the true \lambda_T at the end of the sample which governs the market going forward and the GMM estimate of \widehat{\lambda}. Thus, I should choose my sample period length, T, to minimize:

(16)   \begin{align*} T  = \arg \min_{T \geq 0} \mathrm{E}\left[ (\lambda_T - \widehat{\lambda})^2 \right] = \arg \min_{T \geq 0} \mathrm{E}\left[ (\lambda_T - \lambda)^2 + (\lambda - \widehat{\lambda})^2 \right] \end{align*}

As a result, to find the optimal T I take the first order condition:

(17)   \begin{align*} 0 = \frac{d}{dT} \left[ T \cdot \sigma_l^2 + \left(\frac{1}{\sigma_{\lambda}^2} + \frac{T}{\sigma_s^2} \right)^{-1} \right] \end{align*}

where \sigma_{\lambda}^2 denotes the variance of my priors about the market price of risk governing the estimation sample \lambda. The solution to this equation defines the window length, T, which optimally trades off the benefit of getting a more precise estimate of \lambda with the cost of decreasing the relevance of this estimate due to the evolution of \lambda_T.

GMM maps \sigma_s^2 onto a parameter of the underlying model. To keep things simple, suppose there is only 1 asset and 4 unknown parameters:

(18)   \begin{align*} \widehat{\boldsymbol \theta} = \begin{bmatrix} \widehat{\mu}_x & \widehat{\alpha} & \widehat{\beta} & \widehat{\lambda} \end{bmatrix}^{\top} \end{align*}

so that the system of estimation equations reduces to:

(19)   \begin{align*} \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}  &= \mathrm{E}[\mathrm{G}(\widehat{\boldsymbol \theta};r_t,x_t)] = \mathrm{E} \begin{bmatrix} x_t - \widehat{\mu}_x \\ r_t - \left\{ \widehat{\alpha} + \widehat{\beta} \cdot (x_t - \widehat{\mu}_x) \right\} \\ \left( r_t - \left\{ \widehat{\alpha} + \widehat{\beta} \cdot (x_t - \widehat{\mu}_x) \right\} \right) \cdot (x_t - \widehat{\mu}_x) \\ r_t - \widehat{\beta} \cdot \widehat{\lambda} \end{bmatrix} \end{align*}

This assumption means that I don’t have to consider how learning about one asset affects my beliefs about another asset. In this world, if x_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(\mu_x,\sigma_x^2), then GMM reduces to OLS and \sigma_s^2 = \sfrac{\sigma_z^2}{\beta_n^2} since:

(20)   \begin{align*} r_{n,t} = \beta_n \cdot \lambda  + \beta_n \cdot (x_t - \mu_x) + z_{n,t} \end{align*}

Evaluating the first order condition then gives:

(21)   \begin{align*} 0 = \sigma_l^2 - \left(\frac{1}{\sigma_{\lambda}^2} + \frac{T}{\sfrac{\sigma_z^2}{\beta_n^2}} \right)^{-2} \cdot \frac{1}{\sfrac{\sigma_z^2}{\beta_n^2}} \end{align*}

Solving for T yields:

(22)   \begin{align*} T &\geq \max\left\{ \, 0, \, \frac{\sigma_z}{\beta_n \cdot \sigma_l} - \frac{\sigma_z^2}{\beta_n^2 \cdot \sigma_{\lambda}^2} \, \right\} \end{align*}

Let’s plug in some values to make sure this formula makes sense. First, notice that if the market price of risk is constant, \lambda_T = \lambda, then \sigma_l = 0 and you should pick T = \infty or as large as possible. Second, notice that if you already know the true \lambda, then \sigma_{\lambda}^2 = 0 and you should pick T = 0. Finally, notice that if the test asset has no exposure to the risk factor, \beta_n = 0, then the equation is undefined since any window length gives you the same amount of information—i.e., none.
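The same sanity checks can be run in code. The sketch below evaluates the interior solution of the first order condition, floored at 0 since a window length cannot be negative, and then plugs the answer back into Equation (21). The function names and the parameter values are hypothetical.

```python
import math

def optimal_window(sigma_z, beta, sigma_l, sigma_lam):
    """Window length solving the first order condition, floored at 0."""
    if beta == 0:
        raise ValueError("no factor exposure: every window is uninformative")
    if sigma_lam == 0:
        return 0.0                       # the price of risk is already known
    if sigma_l == 0:
        return math.inf                  # the price of risk never moves
    sigma_s = sigma_z / abs(beta)        # sigma_s^2 = sigma_z^2 / beta^2
    return max(0.0, sigma_s / sigma_l - sigma_s ** 2 / sigma_lam ** 2)

def foc(T, sigma_z, beta, sigma_l, sigma_lam):
    """Right-hand side of Equation (21) evaluated at window length T."""
    s2 = (sigma_z / beta) ** 2
    return sigma_l ** 2 - (1.0 / sigma_lam ** 2 + T / s2) ** -2 / s2

T_star = optimal_window(sigma_z=1.0, beta=1.0, sigma_l=0.01, sigma_lam=1.0)
residual = foc(T_star, sigma_z=1.0, beta=1.0, sigma_l=0.01, sigma_lam=1.0)
```

With these numbers the interior solution is 100 - 1 = 99 periods and the first order condition evaluates to 0 there. A more volatile price of risk (larger \sigma_l) shortens the window, while noisier returns relative to exposure (larger \sfrac{\sigma_z}{\beta_n}) lengthen it until the prior term takes over.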

Phase Change in High-Dimensional Inference

1. Introduction

In my paper Local Knowledge in Financial Markets (2014), I study a problem where assets have Q \gg 1 different attributes and traders try to identify which K \ll Q of these attributes matter via price changes:

(1)   \begin{align*} \Delta p_n &= p_n - \mathrm{E}[p_n] = \sum_{q=1}^Q \beta_q \cdot x_{n,q} + \epsilon_n \qquad \text{where} \qquad K = \Vert {\boldsymbol \beta} \Vert_{\ell_0} = \sum_{q=1}^Q 1_{\{ \beta_q \neq 0 \}} \notag \end{align*}

where each asset’s exposure to a given attribute is x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,1) and the noise is \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma^2). In the limit as K,N,Q \to \infty, \sfrac{K}{Q} \to 0, (N - K) \cdot \beta \to \infty, and \beta = \sfrac{1}{\sqrt{K}}, there exist both a signal opacity bound, N_O, and a signal recovery bound, N_R:

(2)   \begin{align*} N_O \sim K \cdot \log \left( \frac{Q}{N_O} \right) \qquad \text{and} \qquad N_R \sim K \cdot \log \left( \frac{Q}{K} \right) \notag \end{align*}

with N_O \leq N_R in units of transactions. I explain what I mean by “\sim” in Section 4 below. These 2 thresholds separate the regions where traders are arbitrarily bad at identifying the shocked attributes (i.e., N < N_O) from the regions where traders can almost surely identify the shocked attributes (i.e., N > N_R). i.e., if traders have seen fewer than N_O transactions, then they have no idea which shocks took place; whereas, if traders have seen more than N_R transactions, then they can pinpoint exactly which shocks took place.

In this post, I show that the signal opacity and recovery bounds become arbitrarily close in a large market. The analysis in this post primarily builds on work done in Donoho and Tanner (2009) and Wainwright (2009).
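To get a feel for the 2 bounds in Equation (2), the sketch below evaluates them with every proportionality constant set to 1. That choice is purely an assumption for illustration, since “\sim” leaves the constants unspecified. The opacity bound appears on both sides of its own definition, so I solve for it by bisection, which requires Q > e \cdot K. The function names are hypothetical.

```python
import math

def recovery_bound(K, Q):
    """N_R = K * log(Q / K), with the constant set to 1 (assumption)."""
    return K * math.log(Q / K)

def opacity_bound(K, Q, tol=1e-10):
    """Solve N = K * log(Q / N) for N by bisection on the interval (K, Q).

    g(N) = N - K * log(Q / N) is increasing in N, negative at N = K when
    Q > e * K, and positive at N = Q.
    """
    lo, hi = float(K), float(Q)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid - K * math.log(Q / mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

K, Q = 10, 10 ** 6
N_R = recovery_bound(K, Q)
N_O = opacity_bound(K, Q)
```

For K = 10 shocks among Q = 1 million attributes this gives an opacity bound a little below the recovery bound, consistent with N_O \leq N_R.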

2. Motivating Example

This sort of inference problem pops up all the time in financial settings. Suppose you moved away from Chicago a year ago, and now you’re moving back and looking for a house. When studying a list of recent sales prices, you find yourself a bit surprised. People seem to have changed their preferences for 1 of 7 different amenities: ^{(1)}a 2 car garage, ^{(2)}a third bedroom, ^{(3)}a half-circle driveway, ^{(4)}granite countertops, ^{(5)}energy efficient appliances, ^{(6)}central A/C, or ^{(7)}a walk-in closet. The mystery amenity is raising the sale price of some houses by \beta > 0 dollars. How many sales do you need to see in order to figure out which of the 7 amenities realized the shock?

The answer is 3. How did I arrive at this number? Suppose you found one house with amenities \{1,3,5,7\}, a second house with amenities \{2, 3, 6, 7\}, and a third house with amenities \{4, 5, 6,7\}. The combination of the price changes for these 3 houses reveals exactly which amenity has been shocked. i.e., if only the first house’s price was too high, \Delta p_1 = p_1 - \mathrm{E}[p_1] = \beta, then Chicagoans must have changed their preferences for 2 car garages:

(3)   \begin{equation*}     \begin{bmatrix} \Delta p_1 \\ \Delta p_2 \\ \Delta p_3 \end{bmatrix}      =     \begin{bmatrix} \beta \\ 0 \\ 0 \end{bmatrix}      =      \begin{bmatrix}        1 & 0 & 1 & 0 & 1 & 0 & 1        \\        0 & 1 & 1 & 0 & 0 & 1 & 1        \\        0 & 0 & 0 & 1 & 1 & 1 & 1      \end{bmatrix}     \begin{bmatrix}        \beta \\ 0 \\ \vdots \\ 0      \end{bmatrix} \end{equation*}

By contrast, if \Delta p_1 = \Delta p_2 = \Delta p_3 = \beta, then people must value walk-in closets more than they did a year ago.

Here’s the key point. The problem changes character at N_R = 3 observations. 3 sales is just enough information to answer 7 yes or no questions and rule out the possibility of no change: 7 = 2^3 - 1. N = 4 sales simply narrows your error bars around the exact value of \beta. N = 2 sales only allows you to distinguish between subsets of amenities. e.g., seeing just the first and second houses with unexpectedly high prices only tells you that people like either half-circle driveways or walk-in closets more… not which one.
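The decoding logic in this example can be written out directly. The sketch below assumes exactly 1 shocked amenity and no noise, as in the text. The function name `candidates` is hypothetical.

```python
# Rows are the 3 houses; columns are amenities 1..7 (1 = house has the amenity).
X = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def candidates(surprised):
    """Amenities consistent with the observed sales.

    surprised[i] is True if house i+1 sold for more than expected and
    False if it sold at the expected price. Only the first len(surprised)
    houses are treated as observed.
    """
    n = len(surprised)
    return {
        q + 1
        for q in range(7)
        if all(bool(X[i][q]) == surprised[i] for i in range(n))
    }

garage = candidates([True, False, False])   # only house 1 is too expensive
closet = candidates([True, True, True])     # all 3 houses are too expensive
ambiguous = candidates([True, True])        # only 2 sales observed
```

With all 3 sales the answer is a single amenity (`{1}` or `{7}` in the 2 cases above), while with only the first 2 sales the shock can be placed no more precisely than `{3, 7}`.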

Yet, the dimensionality in this toy example can be confusing. There is obviously something different about the problem at N_R = 3 observations, but there is still some information contained in the first N = 2 observations. e.g., even though you can’t tell exactly which attribute realized a shock, you can narrow down the list of possibilities to 2 attributes out of 7. If you just flipped a coin and guessed after seeing N = 2 transactions, you would have an error rate of 50{\scriptstyle \%}. This is no longer true in higher dimensions. i.e., even in the absence of any noise, seeing any fraction (1 - \alpha) \cdot N_R of the required observations for \alpha \in (0,1) will leave you with an error rate that is within a tiny neighborhood of 100{\scriptstyle \%} as the number of attributes gets large.

3. Non-Random Analysis

I start by exploring how the required number of observations, N_R, moves around as I increase the number of attributes in the setting where there is only K = 1 shock and the data matrix is non-random. Specifically, I look at the case where K = 1 and Q = 15. My goal is to build some intuition about what I should expect in the more complicated setting where the data \mathbf{X} is a random matrix. Here, in this simple setting, the ideal data matrix would be (4 \times 15)-dimensional and look like:

(4)   \begin{equation*} \underset{4 \times 15}{\mathbf{X}} =  \left[ \begin{matrix}  1 & 0 & 1 & 0 & 1 & 0 & 1  \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\  0 & 0 & 0 & 1 & 1 & 1 & 1 \\  0 & 0 & 0 & 0 & 0 & 0 & 0 \end{matrix} \ \ \ \begin{matrix} 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1  \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1  \end{matrix} \right] \end{equation*}

where each column of the data matrix corresponds to a number q=1,2,\ldots,15 in binary.

Let S(N) be a function that eats N observed price changes and spits out the set of possible preference changes that might explain the observed price changes. e.g., if traders only see the 1st transaction, then they can only place the shock in 1 of 2 sets containing 8 attributes each:

(5)   \begin{align*} S(1) &=  \begin{cases} \{ 1, 3, 5, 7, 9, 11, 13, 15 \}         &\text{if } \Delta p_1 = \beta \\ \{ \emptyset, 2, 4, 6, 8, 10, 12, 14 \} &\text{if } \Delta p_1 = 0 \end{cases} \end{align*}

The 2nd transaction then allows traders to split each of these 2 larger sets into 2 smaller ones and place the shock in a set of 4 possibilities:

(6)   \begin{align*} S(2) =  \begin{cases} \{ \emptyset, 4, 8, 12 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} 0 & 0 \end{bmatrix}^{\top} \\ \{ 1, 5, 9, 13 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} \beta & 0 \end{bmatrix}^{\top} \\ \{ 2, 6, 10, 14 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} 0 & \beta \end{bmatrix}^{\top}  \\ \{ 3, 7, 11, 15 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} \beta & \beta \end{bmatrix}^{\top}  \end{cases} \end{align*}

With the 3rd transaction, traders can tell that the actual shock is 1 of 2 possibilities:

(7)   \begin{align*} S(3) =  \begin{cases} \{ \emptyset, 8 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix}^{\top} \\ \{ 1, 9 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} \beta & 0 & 0\end{bmatrix}^{\top} \\ \{ 2, 10 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} 0 & \beta & 0 \end{bmatrix}^{\top}  \\ \{ 3, 11 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} \beta & \beta & 0 \end{bmatrix}^{\top}  \\ \{ 4, 12 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} 0 & 0 & \beta \end{bmatrix}^{\top} \\ \{ 5, 13 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} \beta & 0 & \beta \end{bmatrix}^{\top} \\ \{ 6, 14 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} 0 & \beta & \beta \end{bmatrix}^{\top}  \\ \{ 7, 15 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} \beta & \beta & \beta \end{bmatrix}^{\top}  \end{cases} \end{align*}

The N_R = 4th observation then closes the case against the offending attribute.

Here’s the key observation. Only the difference (N_R - N) matters when computing the size of the output of S(N). If traders have seen N = (N_R - 1) transactions, then they can tell which subset of 2 = 2^1 attributes has realized a shock. If traders have seen N = (N_R - 2) transactions, then they can tell which subset of 4 = 2^2 attributes has realized a shock. If traders have seen N = (N_R - 3) observations, then they can tell which subset of 8 = 2^3 attributes has realized a shock. Thus, after seeing any number of observations N \leq N_R, traders can place the shock in a set of size 2^{N_R - N}. i.e., a trader has the same amount of information about which attribute has realized a shock in (i) a situation where N_R = 100 and he’s seen N = 99 transactions as in (ii) a situation where N_R = 3 and he’s seen N = 2 transactions.
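The halving pattern is easy to verify for any N_R by building the binary data matrix programmatically. In the sketch below, 0 stands in for \emptyset (no shock), and the helper `design_column` is a hypothetical name for the rule that column q holds the binary digits of q.

```python
def design_column(q, n_rows):
    """Binary digits of q, least significant bit first (row 1 = bit 0)."""
    return [(q >> i) & 1 for i in range(n_rows)]

def S(observed, n_required):
    """Candidate shocks consistent with the price changes seen so far.

    observed[i] is 1 if the (i+1)th price change equals beta and 0 if it
    equals 0. The candidate 0 represents the empty set, i.e., no shock.
    """
    n = len(observed)
    return {
        q
        for q in range(2 ** n_required)
        if design_column(q, n_required)[:n] == list(observed)
    }

first = S([1], 4)           # 1st transaction only, price too high
second = S([0, 0], 4)       # first 2 transactions, both as expected
third = S([1, 1, 0], 4)     # first 3 transactions

# Size of the candidate set for every N between 0 and N_R = 10.
sizes = [len(S([0] * n, 10)) for n in range(11)]
```

The first 3 results reproduce the corresponding cases in Equations (5), (6), and (7), and `sizes` runs through 2^{10}, 2^{9}, and so on down to 1.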

The probability that traders select the correct attribute after seeing only N \leq N_R observations is given by 2^{-(N_R - N)} assuming uniform priors. Natural numbers are hard to work with analytically, so let’s suppose that traders observe some fraction of the required number of observations N_R. i.e., for some \alpha \in (0,1) traders see N = (1 - \alpha) \cdot N_R observations. We can then perform a change of variables 2^{- (N_R - N)} = e^{- \alpha \cdot \log(2) \cdot N_R} and answer the question: “How much does getting 1 additional observation improve traders’ error rate?”

(8)   \begin{align*} \frac{1}{N_R} \cdot \frac{d}{d\alpha}\left[ \, 2^{- (N_R - N)} \, \right] = - \log(2) \cdot e^{- \alpha \cdot \log(2) \cdot N_R} \end{align*}

I plot this statistic for N_R ranging from 100 to 800 below. When N_R = 100, a trader’s predictive power doesn’t start to improve until he sees 95 transactions (i.e., 95{\scriptstyle \%} of N_R); by contrast, when N_R= 800 a trader’s predictive power doesn’t start to improve until he’s seen N = 792 transactions (i.e., 99{\scriptstyle \%} of N_R). Here’s the punchline. As I scale up the original toy example from 7 attributes to 7 million attributes, traders effectively get 0 useful information about which attributes realized a shock until they come within a hair’s breadth of the signal recovery bound N_R. The opacity and recovery bounds are right on top of one another.

plot--prediction-error
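The numbers quoted above follow from a couple of one-line functions. This is just a sketch of the formulas in the text under uniform priors. The function names are hypothetical.

```python
import math

def success_probability(n_seen, n_required):
    """Chance of guessing the shocked attribute under uniform priors."""
    return 2.0 ** (-(n_required - n_seen))

def marginal_gain(alpha, n_required):
    """Equation (8): (1 / N_R) times the derivative with respect to alpha."""
    return -math.log(2) * math.exp(-alpha * math.log(2) * n_required)

p_small = success_probability(95, 100)     # 5 observations short of N_R = 100
p_large = success_probability(792, 800)    # 8 observations short of N_R = 800
p_half = success_probability(50, 100)      # halfway to N_R = 100
```

Being 5 observations short leaves a 1 in 32 chance of guessing correctly and being 8 short leaves 1 in 256, whereas a trader who has seen half of the required 100 observations has a chance below 10^{-15}.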

4. Introducing Randomness

Previously, the matrix of attributes was strategically chosen so that the set of N observations that traders see would be as informative as possible. Now, I want to relax this assumption and allow the data matrix to be random with elements x_{n,q} \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0,1):

(9)   \begin{align*} \Delta p_n = \sum_{q=1}^Q \beta_q \cdot x_{n,q} + \epsilon_n \qquad \text{for each }  n = 1,2,\ldots,N \end{align*}

where \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_{\epsilon}^2) denotes idiosyncratic shocks affecting asset n in units of dollars. For a given triplet of integers (K,N,Q) with 0 < K < N < Q, I want to know whether solving the linear program:

(10)   \begin{align*} \widehat{\boldsymbol \beta} = \arg \min_{\boldsymbol \beta} \left\Vert {\boldsymbol \beta} \right\Vert_{\ell_1} \qquad \text{subject to} \qquad \mathbf{X} {\boldsymbol \beta} = \Delta \mathbf{p} \end{align*}

recovers the true \boldsymbol \beta when it is K-sparse. i.e., when \boldsymbol \beta has only K non-zero entries K = \sum_{q=1}^Q 1_{\{ \beta_q \neq 0 \}}. Since N < Q the linear system is underdetermined; however, if the level of sparsity is sufficiently high (i.e., K is sufficiently small), then there will be a unique solution with high probability.
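For concreteness, here is one way the \ell_1 program in Equation (10) could be solved in the noiseless case, using SciPy’s `linprog` and the standard trick of splitting {\boldsymbol \beta} into positive and negative parts. The function name `l1_recover`, the problem sizes, and the positions of the shocked attributes are all assumptions chosen so that recovery is comfortably within reach.

```python
import numpy as np
from scipy.optimize import linprog

def l1_recover(X, dp):
    """Solve min ||beta||_1 subject to X @ beta = dp as a linear program.

    Writes beta = u - v with u, v >= 0, so that the objective becomes
    sum(u) + sum(v) and the constraint becomes [X, -X] @ [u, v] = dp.
    """
    N, Q = X.shape
    cost = np.ones(2 * Q)
    A_eq = np.hstack([X, -X])
    res = linprog(cost, A_eq=A_eq, b_eq=dp, bounds=(0, None))
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:Q] - res.x[Q:]

rng = np.random.default_rng(0)
K, N, Q = 2, 30, 40                      # 0 < K < N < Q
X = rng.normal(size=(N, Q))              # Gaussian attribute exposures
beta = np.zeros(Q)
beta[[3, 17]] = 1.0 / np.sqrt(K)         # K shocked attributes
beta_hat = l1_recover(X, X @ beta)       # noiseless price changes
```

Even though there are only 30 equations for 40 unknowns, the program returns the 2-sparse vector that generated the price changes in this draw.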

First, I study the case where there is no noise (i.e., where \sigma_\epsilon^2 \searrow 0), and I ask: “What is the minimum number of observations needed to identify the true \boldsymbol \beta with probability (1 - \eta) for \eta \in (0,1) using the linear program in Equation (10)?” I remove the noise to make the inference problem as easy as possible for traders. Thus, the proposition below which characterizes this minimum number of observations gives a lower bound. I refer to this number of observations as the signal opacity bound and write it as N_O. The proposition shows that, whenever traders have seen N < N_O observations, I can make traders’ error rate arbitrarily bad (i.e., \eta \nearrow 1) by increasing the number of attributes (i.e., Q \nearrow \infty).

Proposition (Donoho and Tanner, 2009): Suppose \sfrac{K}{N} = \rho, \sfrac{N}{Q} = \delta, and N \geq N_0 with \rho, \delta \in (0,1). The linear program in Equation (10) will recover \boldsymbol \beta a fraction (1 - \eta) of the time whenever:

(11)   \begin{align*} N > 2 \cdot K \cdot \log\left( \frac{Q}{N} \right) \cdot \left( 1 - R(\eta;N,Q) \right)^{-1} \end{align*}

where R(\eta;N,Q) = 2 \cdot \sqrt{\sfrac{1}{N} \cdot \log\left( 4 \cdot \sfrac{(Q + 2)^6}{\eta} \right)}.
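The right-hand side of Equation (11) is straightforward to evaluate. The sketch below does so for one hypothetical market. The function names and the parameter values are my own, and the formula is only informative when R(\eta;N,Q) < 1.

```python
import math

def dt_remainder(eta, N, Q):
    """R(eta; N, Q) from the proposition."""
    return 2.0 * math.sqrt(math.log(4.0 * (Q + 2) ** 6 / eta) / N)

def dt_required_observations(K, N, Q, eta):
    """Right-hand side of Equation (11)."""
    R = dt_remainder(eta, N, Q)
    if R >= 1.0:
        raise ValueError("R >= 1: the finite-sample correction is uninformative")
    return 2.0 * K * math.log(Q / N) / (1.0 - R)

R = dt_remainder(eta=0.1, N=10_000, Q=100_000)
rhs = dt_required_observations(K=100, N=10_000, Q=100_000, eta=0.1)
```

With K = 100 shocks, N = 10,000 observations, and Q = 100,000 attributes the correction term is roughly 0.17 and the right-hand side is roughly 555, so the inequality holds with room to spare at \eta = 0.1.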

Next, I turn to the case where there is noise (i.e., where \sigma_\epsilon^2 > 0), and I ask: “How many observations do traders need to see in order to identify the true \boldsymbol \beta with probability 1 in an asymptotically large market using the linear program in Equation (10)?” Define traders’ error rate after seeing N observations as:

(12)   \begin{align*}   \mathrm{Err}[N] &= \frac{1}{{Q \choose K}} \cdot \sum_{\substack{\mathcal{K} \in \mathcal{Q} \\ |\mathcal{K}| = K}} \mathrm{Pr}(\widehat{\boldsymbol \beta} \neq {\boldsymbol \beta}) \end{align*}

where \mathrm{Pr}(\widehat{\boldsymbol \beta} \neq {\boldsymbol \beta}) denotes the probability that the linear program in Equation (10) chooses the wrong subset of attributes (i.e., makes an error) given the true support \mathcal{K} and averaging over not only the measurement noise, {\boldsymbol \epsilon}, but also the choice of the Gaussian attribute exposure matrix, \mathbf{X}. Traders’ error rate is the weighted average of these probabilities over every shock set of size K. Traders identify the true {\boldsymbol \beta} with probability 1 in an asymptotically large market if:

(13)   \begin{align*}   \lim_{\substack{K,N,Q \to \infty \\ \sfrac{K}{Q} \to 0}} \mathrm{Err}[N] &= 0 \end{align*}

Thus, the proposition below, which characterizes this number of observations, gives an upper bound of sorts. I refer to this number of observations as the signal recovery bound and write it as N_R. i.e., the proposition shows that, whenever traders have seen N > N_R observations, they will be able to recover \boldsymbol \beta almost surely no matter how large I make the market.

Proposition (Wainwright, 2009): Suppose K,N,Q \to \infty, \sfrac{K}{Q} \to 0, (N - K) \cdot \beta \to \infty, and \beta = \sfrac{1}{\sqrt{K}}. Then traders can identify the true {\boldsymbol \beta} with probability 1 in an asymptotically large market if, for some constant a > 0:

(14)   \begin{align*} N &> a \cdot K \cdot \log (\sfrac{Q}{K}) \end{align*}
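To get a feel for how the right-hand side of Equation (14) scales, here is a minimal sketch that evaluates it. The proposition only says "some constant a > 0", so the default `a = 2.0` below is a placeholder assumption rather than a value from the proposition, and the function names are hypothetical.

```python
import math


def recovery_threshold(k, q, a=2.0):
    """Right-hand side of Equation (14): a * K * log(Q / K).

    Assumption: a = 2.0 is an illustrative placeholder; the proposition
    only guarantees that some constant a > 0 works.
    """
    if not 0 < k < q:
        raise ValueError("need 0 < K < Q")
    if a <= 0:
        raise ValueError("need a > 0")
    return a * k * math.log(q / k)


def exceeds_recovery_threshold(n, k, q, a=2.0):
    """True if N > a * K * log(Q / K)."""
    return n > recovery_threshold(k, q, a)
```

Holding K fixed, doubling Q raises the threshold by only a \cdot K \cdot \log 2, so the number of observations needed grows logarithmically in the number of attributes.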

The only cognitive constraint that traders face is that their selection rule must be computationally tractable. Under minimal assumptions, a convex optimization program is computationally tractable in the sense that the computational effort required to solve the problem to a given accuracy grows moderately with the dimensions of the problem. By contrast, Natarajan (1995) explicitly shows that \ell_0 constrained linear programming is NP-hard. This cognitive constraint is really weak in the sense that any selection rule that you might look up in an econometrics or statistics textbook (e.g., forward stepwise regression or LASSO) is going to be computationally tractable. After all, they have to be executed on computers.

5. Discussion

plot--opacity-vs-recovery-bound-absolute-gap

What is really interesting is that the signal opacity bound, N_O, and the signal recovery bound, N_R, basically sit right on top of one another when the market gets large, just as you would expect from the analysis in Section 3. The figure above plots each bound on a log-log scale for varying levels of sparsity. It’s clear from the figure that the bounds are quite close. The figure below plots the relative gap between these 2 bounds:

(15)   \begin{align*} \frac{N_R - N_O}{N_R} \end{align*}

i.e., it plots how big the gap is relative to the size of the signal recovery bound N_R. For each level of sparsity, the gap is shrinking as I add more and more attributes. This is the same result as in the figure from Section 3: as the size of the market increases, traders learn next to nothing from each successive observation until they get within an inch of the signal recovery bound. The only difference here is that now there are an arbitrary number of shocks and the data matrix is random.

plot--opacity-vs-recovery-bound-relative-gap
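The relative gap in Equation (15) can be sketched numerically under two simplifying assumptions that are mine rather than taken from the figures: N_O is approximated by the solution to N = 2 \cdot K \cdot \log(\sfrac{Q}{N}), which drops the finite-sample term R from Equation (11), and N_R is set to a \cdot K \cdot \log(\sfrac{Q}{K}) with a placeholder constant a = 2. The function names are hypothetical, and the constants behind the plotted bounds may differ.

```python
import math


def asymptotic_opacity_bound(k, q, tol=1e-9):
    """Solve N = 2 * K * log(Q / N) for N by bisection.

    Assumption: this ignores the finite-sample term R in Equation (11).
    """
    # f(N) = N - 2K log(Q/N) is increasing in N, negative near 0, and equal to Q at N = Q.
    lo, hi = 1e-12, float(q)
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if mid - 2.0 * k * math.log(q / mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def recovery_bound(k, q, a=2.0):
    """a * K * log(Q / K); a = 2.0 is an illustrative placeholder."""
    return a * k * math.log(q / k)


def relative_gap(k, q, a=2.0):
    """(N_R - N_O) / N_R as in Equation (15), under the assumptions above."""
    n_r = recovery_bound(k, q, a)
    n_o = asymptotic_opacity_bound(k, q)
    return (n_r - n_o) / n_r


# Relative gap for K = 10 shocks as the number of attributes Q grows.
gaps = [relative_gap(10, q) for q in (10**3, 10**6, 10**9)]
```

Under these assumptions the entries of `gaps` decline as Q increases, which matches the qualitative pattern described above; the levels themselves depend on the placeholder constant.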