Wavelet Variance

1. Motivation

Imagine you’re a trader who’s about to put on a position for the next month. You want to hedge away the risk in this position associated with daily fluctuations in market returns. One way you might do this would be to short the S&P 500, since E-mini futures contracts are among the most liquid in the world.

plot--sp500-price-volume--24jul2014

Flash_Crash

But… how much of the variation in the index’s returns is due to fluctuations at the daily horizon? e.g., the blue line in the figure to the right shows the minute-by-minute price of the E-mini contract on May 6th, 2010 during the flash crash. Over the course of 4 minutes, the contract price fell 3{\scriptstyle \%}! It then rebounded back to nearly its original level over the next hour. Clearly, if most of the fluctuations in the E-mini S&P 500 contract value are due to shocks on the sub-hour time scale, this contract will do a poor job hedging away daily market risk.

This post demonstrates how to decompose the variance of a time series (e.g., the minute-by-minute returns on the E-mini) into horizon-specific components using wavelets. i.e., using the wavelet variance estimator allows you to ask the questions: “How much of the variance is coming from fluctuations on the scale of 16 minutes? 1 hour? 1 day? 1 month?” I then investigate how this wavelet variance approach compares to other methods financial economists might employ such as auto-regressive models and spectral analysis.

2. Wavelet Analysis

In order to explain how the wavelet variance estimator works, I first need to give a quick outline of how wavelets work. Wavelets allow you to decompose a signal into components that are independent in both the time and frequency domains. This outline will be as bare bones as possible. See Percival and Walden (2000) for an excellent overview of the topic.

Imagine you’ve got a time series of just T = 8 returns:

(1)   \begin{align*} \mathbf{r} = \begin{bmatrix} r_0 & r_1 & r_2 & r_3 & r_4 & r_5 & r_6 & r_7 \end{bmatrix}^{\top} \end{align*}

and assume for simplicity that these returns have mean \mathrm{E}[r_t] = \mu_r = 0. One thing that you might do with this time series is estimate a regression with time fixed effects: r_t = \sum_{t'=0}^7 \vartheta_{t'} \cdot 1_{\{\mathrm{Time}(r_t) = t'\}}. Here is another way to represent the same regression:

(2)   \begin{align*} \begin{bmatrix} r_0 \\ r_1 \\ r_2 \\ r_3 \\ r_4 \\ r_5 \\ r_6 \\ r_7 \end{bmatrix} &= \begin{pmatrix}  1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\  0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\  0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\  0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\  0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\  0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\  0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\  0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix} \begin{bmatrix} \vartheta_0 \\ \vartheta_1 \\ \vartheta_2 \\ \vartheta_3 \\ \vartheta_4 \\ \vartheta_5 \\ \vartheta_6 \\ \vartheta_7 \end{bmatrix} \end{align*}

It’s really a trivial projection since \vartheta_t = r_t. Call the projection matrix \mathbf{F} for “fixed effects” so that \mathbf{r} = \mathbf{F}{\boldsymbol \vartheta}.

Obviously, the above time fixed effect model would be a bit of a silly thing to estimate, but notice that the projection matrix \mathbf{F} has an interesting property. Namely, each column is orthonormal:

(3)   \begin{align*}  \langle \mathbf{f}(t) | \mathbf{f}(t') \rangle = \begin{cases} 1 &\text{if } t = t' \\ 0 &\text{else } \end{cases} \end{align*}

It’s orthogonal because \langle \mathbf{f}(t) | \mathbf{f}(t') \rangle = 0 unless t = t'. This requirement implies that each column in the projection matrix is picking up different information about \mathbf{r}. It’s normal because \langle \mathbf{f}(t) | \mathbf{f}(t) \rangle is normalized to equal 1. This requirement implies that the projection matrix is leaving the magnitude of \mathbf{r} unchanged. The time fixed effects projection matrix, \mathbf{F}, compares each successive time period, but you can also think about using other orthonormal bases.

e.g., the Haar wavelet projection matrix compares how the 1st half of the time series differs from the 2nd half, how the 1st quarter differs from the 2nd quarter, how the 3rd quarter differs from the 4th quarter, how the 1st eighth differs from the 2nd eighth, and so on… For the 8 period return time series, let’s denote the columns of the wavelet projection matrix as:

(4)   \begin{align*} \mathbf{w}(3,0) &= \sfrac{1}{\sqrt{8}} \cdot \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix}^{\top} \\ \mathbf{w}(2,0) &= \sfrac{1}{\sqrt{8}} \cdot \begin{bmatrix} 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \end{bmatrix}^{\top} \\ \mathbf{w}(1,0) &= \sfrac{1}{\sqrt{4}} \cdot \begin{bmatrix} 1 & 1 & -1 & -1 & 0 & 0 & 0 & 0 \end{bmatrix}^{\top} \\ \mathbf{w}(1,1) &= \sfrac{1}{\sqrt{4}} \cdot \begin{bmatrix} 0 & 0 & 0 & 0 & 1 & 1 & -1 & -1 \end{bmatrix}^{\top} \\ \mathbf{w}(0,0) &= \sfrac{1}{\sqrt{2}} \cdot \begin{bmatrix} 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}^{\top} \\ \mathbf{w}(0,1) &= \sfrac{1}{\sqrt{2}} \cdot \begin{bmatrix} 0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 \end{bmatrix}^{\top} \\ \mathbf{w}(0,2) &= \sfrac{1}{\sqrt{2}} \cdot \begin{bmatrix} 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \end{bmatrix}^{\top} \\ \mathbf{w}(0,3) &= \sfrac{1}{\sqrt{2}} \cdot \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 \end{bmatrix}^{\top} \end{align*}

and simple inspection shows that each column is orthonormal:

(5)   \begin{align*}  \langle \mathbf{w}(h,i) | \mathbf{w}(h',i') \rangle = \begin{cases} 1 &\text{if } h = h', \; i = i' \\ 0 &\text{else } \end{cases} \end{align*}

Let’s look at a concrete example. Suppose that we want to project the vector:

(6)   \begin{align*} \mathbf{r} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}^{\top} \end{align*}

onto the wavelet basis:

(7)   \begin{align*} \begin{bmatrix} r_0 \\ r_1 \\ r_2 \\ r_3 \\ r_4 \\ r_5 \\ r_6 \\ r_7 \end{bmatrix} &= \begin{pmatrix}  \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{8}}  & \sfrac{1}{\sqrt{4}}  & 0 & \sfrac{1}{\sqrt{2}} & 0 & 0 & 0 \\  \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{8}}  & \sfrac{1}{\sqrt{4}}  & 0 & -\sfrac{1}{\sqrt{2}} & 0 & 0 & 0 \\  \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{8}}  & -\sfrac{1}{\sqrt{4}} & 0 & 0 & \sfrac{1}{\sqrt{2}} & 0 & 0 \\  \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{8}}  & -\sfrac{1}{\sqrt{4}} & 0 & 0 & -\sfrac{1}{\sqrt{2}} & 0 & 0 \\  \sfrac{1}{\sqrt{8}} & -\sfrac{1}{\sqrt{8}} & 0 & \sfrac{1}{\sqrt{4}}  & 0 & 0 & \sfrac{1}{\sqrt{2}} & 0 \\  \sfrac{1}{\sqrt{8}} & -\sfrac{1}{\sqrt{8}} & 0 & \sfrac{1}{\sqrt{4}}  & 0 & 0 & -\sfrac{1}{\sqrt{2}} & 0 \\  \sfrac{1}{\sqrt{8}} & -\sfrac{1}{\sqrt{8}} & 0 & -\sfrac{1}{\sqrt{4}} & 0 & 0 & 0 & \sfrac{1}{\sqrt{2}} \\  \sfrac{1}{\sqrt{8}} & -\sfrac{1}{\sqrt{8}} & 0 & -\sfrac{1}{\sqrt{4}} & 0 & 0 & 0 & -\sfrac{1}{\sqrt{2}} \end{pmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \\ \theta_4 \\ \theta_5 \\ \theta_6 \\ \theta_7 \end{bmatrix} \end{align*}

What would the wavelet coefficients {\boldsymbol \theta} look like? Well, a little trial and error shows that:

(8)   \begin{align*} {\boldsymbol \theta} = \begin{bmatrix} \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{4}} & 0 & \sfrac{1}{\sqrt{2}} & 0 & 0 & 0 \end{bmatrix}^{\top} \end{align*}

since this is the only combination of coefficients that satisfies both r_0 = 1:

(9)   \begin{align*} 1 &=  r_0 \\ &= \frac{1}{\sqrt{8}} \cdot w_0(3,0) + \frac{1}{\sqrt{8}} \cdot w_0(2,0) + \frac{1}{\sqrt{4}} \cdot w_0(1,0) + \frac{1}{\sqrt{2}} \cdot w_0(0,0) \\ &= \frac{1}{8} + \frac{1}{8} + \frac{1}{4} + \frac{1}{2} \end{align*}

and r_t = 0 for all t > 0.

What’s cool about the wavelet projection is that the coefficients represent effects that are isolated in both the frequency and time domains. The index h=0,1,2,3 denotes the \log_2 length of the wavelet comparison groups. e.g. the 4 wavelets with h=0 compare 2^0 = 1 period increments: the 1st period to the 2nd period, the 3rd period to the 4th period, and so on… Similarly, the wavelets with h=1 compare 2^1 = 2 period increments: the 1st 2 periods to the 2nd 2 periods and the 3rd 2 periods to the 4th 2 periods. Thus, the h captures the location of the coefficient in the frequency domain. The index i=0,\ldots,I_h-1 signifies which comparison group at horizon h we are looking at. e.g., when h=0, there are I_0 = 4 = \sfrac{8}{2^{0+1}} different comparisons to be made. Thus, the i captures the location of the coefficient in the time domain.
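To make the projection concrete, here is a minimal sketch in Python/numpy (my own illustration; it is not the code behind the original figures) that builds the 8 Haar basis vectors from Equation (4), checks the orthonormality condition in Equation (5), and recovers the coefficients in Equation (8) for the example vector above.

```python
import numpy as np

# Columns of the Haar projection matrix W for T = 8, in the order
# w(3,0), w(2,0), w(1,0), w(1,1), w(0,0), w(0,1), w(0,2), w(0,3).
W = np.column_stack([
    np.array([ 1,  1,  1,  1,  1,  1,  1,  1]) / np.sqrt(8),
    np.array([ 1,  1,  1,  1, -1, -1, -1, -1]) / np.sqrt(8),
    np.array([ 1,  1, -1, -1,  0,  0,  0,  0]) / np.sqrt(4),
    np.array([ 0,  0,  0,  0,  1,  1, -1, -1]) / np.sqrt(4),
    np.array([ 1, -1,  0,  0,  0,  0,  0,  0]) / np.sqrt(2),
    np.array([ 0,  0,  1, -1,  0,  0,  0,  0]) / np.sqrt(2),
    np.array([ 0,  0,  0,  0,  1, -1,  0,  0]) / np.sqrt(2),
    np.array([ 0,  0,  0,  0,  0,  0,  1, -1]) / np.sqrt(2),
])

# Orthonormality: W'W should be the identity matrix (Equation (5)).
assert np.allclose(W.T @ W, np.eye(8))

# Project r = e_0 onto the basis. Because W is orthonormal, theta = W'r.
r = np.array([1, 0, 0, 0, 0, 0, 0, 0], dtype=float)
theta = W.T @ r
print(theta)   # [1/sqrt(8), 1/sqrt(8), 1/sqrt(4), 0, 1/sqrt(2), 0, 0, 0]

# And W @ theta recovers r exactly.
assert np.allclose(W @ theta, r)
```

Because the basis is orthonormal, the coefficients are just inner products, {\boldsymbol \theta} = \mathbf{W}^{\top}\mathbf{r}, so no trial and error is actually needed.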

3. Wavelet Variance

With these basics in place, it’s now easy to define the wavelet variance of a time series. First, I massage the standard representation of a series’ variance a bit. The variance of our 8 term series is defined as:

(10)   \begin{align*}  \sigma_r^2 &= \frac{1}{T} \cdot \sum_t r_t^2  \end{align*}

since \mu_r = 0. Using the tools from the section above, let’s rewrite \mathbf{r} = \mathbf{W}{\boldsymbol \theta}. This means that the variance formula becomes:

(11)   \begin{align*}  \sigma_r^2 &= \frac{1}{T} \cdot \mathbf{r}^{\top} \mathbf{r} =  \frac{1}{T} \cdot \left( \mathbf{W} {\boldsymbol \theta} \right)^{\top} \left( \mathbf{W} {\boldsymbol \theta} \right) \end{align*}

But I know that \mathbf{W}^{\top} \mathbf{W} = \mathbf{I} since each of the columns is orthonormal. Thus:

(12)   \begin{align*}  \sigma_r^2 &= \frac{1}{T} \cdot {\boldsymbol \theta}^{\top} {\boldsymbol \theta} = \frac{1}{T} \cdot \sum_{h,i} \theta(h,i)^2 \end{align*}

This representation gives the variance of a series as an average of squared wavelet coefficients.

The sum of the squared wavelet coefficients at each horizon, h, is then an interesting object:

(13)   \begin{align*} V(h) &= \frac{1}{T} \cdot \sum_{i=0}^{I_h - 1} \theta(h,i)^2 \end{align*}

since V(h) denotes the portion of the total variance of the time series explained by comparing successive periods of length 2^h. I refer to V(h) as the wavelet variance of a series at horizon h. The sum of the wavelet variances at each horizon gives the total variance:

(14)   \begin{align*} \sum_{h=0}^H V(h) &= \sigma_r^2 \end{align*}
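As a quick numerical sanity check of Equations (13) and (14), the sketch below (Python/numpy, my own illustration) computes the Haar coefficients of a mean-zero series of length T = 2^H by recursive pairwise differencing and verifies that the horizon-specific variances V(h) sum to \sigma_r^2.

```python
import numpy as np

def haar_dwt(x):
    """Return {h: detail coefficients at horizon h} for len(x) = 2^H.
    The highest key holds the single scaling (overall average) coefficient."""
    a, h, coeffs = np.asarray(x, dtype=float), 0, {}
    while len(a) > 1:
        coeffs[h] = (a[0::2] - a[1::2]) / np.sqrt(2)   # differences: theta(h, i)
        a = (a[0::2] + a[1::2]) / np.sqrt(2)           # running averages
        h += 1
    coeffs[h] = a                                      # scaling coefficient
    return coeffs

def wavelet_variance(x):
    """V(h) = (1/T) * sum_i theta(h, i)^2 for each horizon h."""
    T, coeffs = len(x), haar_dwt(x)
    return {h: np.sum(c ** 2) / T for h, c in coeffs.items()}

# Sanity check on a mean-zero series: the V(h) sum to the total variance.
rng = np.random.default_rng(42)
r = rng.standard_normal(1024)
r -= r.mean()                        # impose mu_r = 0 as in the text
V = wavelet_variance(r)
assert np.isclose(sum(V.values()), np.mean(r ** 2))
```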

4. Numerical Example

Let’s take a look at how the wavelet variance of a time series behaves out in the wild. The code I used to create the figures can be found here. Specifically, let’s study the simulated data plotted below, which consists of 63 days of minute-by-minute return data with day-specific shocks:

(15)   \begin{align*} r_t &= \mu_{r,t} + \sigma_r \cdot \epsilon_t \qquad \text{with} \qquad \epsilon_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,1) \end{align*}

where the volatility of the process is given by \sigma_r = 0.01{\scriptstyle \mathrm{bp}/\sqrt{\mathrm{min}}} and there is a 5{\scriptstyle \%} probability of realizing a \mu_{r,t} = \pm 0.001{\scriptstyle \mathrm{bp}/\mathrm{min}} shock on any given day. The 4 days on which the data realized a shock are highlighted in red. These minute-by-minute figures amount to a 0{\scriptstyle \%/\mathrm{yr}} annualized return and a 31{\scriptstyle \%/\mathrm{yr}} annualized volatility.

plot--why-use-wavelet-variance--daily-shocks--time-series--25jul2014
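For reference, simulating a series like the one plotted above is straightforward. The sketch below is my own Python/numpy illustration, with the parameter values taken at face value from the text and the units left as quoted: it draws a day-level drift that is zero on most days and \pm the shock size on roughly 5{\scriptstyle \%} of days, then adds Gaussian minute-level noise as in Equation (15).

```python
import numpy as np

rng = np.random.default_rng(0)

n_days, mins_per_day = 63, 390             # 63 trading days of minute-by-minute returns
sigma_r = 0.01                             # per-minute noise volatility (units as quoted in the text)
shock_prob, shock_size = 0.05, 0.001       # 5% chance of a +/- day-specific drift shock

# Draw one drift per day: 0 with probability 0.95, +/- shock_size otherwise.
has_shock = rng.random(n_days) < shock_prob
signs = rng.choice([-1.0, 1.0], size=n_days)
mu_day = np.where(has_shock, signs * shock_size, 0.0)

# Minute-by-minute returns: r_t = mu_{r,t} + sigma_r * eps_t.
mu_t = np.repeat(mu_day, mins_per_day)
r = mu_t + sigma_r * rng.standard_normal(n_days * mins_per_day)
print(r.shape, has_shock.sum())            # (24570,) and the realized number of shock days
```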

The figure below then plots the wavelet coefficients, {\boldsymbol \theta}, at each horizon associated with this time series. A trading day is 6.5 \times 60 = 390{\scriptstyle \mathrm{min}}, so notice the spikes in the coefficient values in the h=6,7,8 panels near the day-specific shock dates, corresponding to comparisons of successive 64-, 128-, and 256-minute intervals. The remaining variation in the coefficient levels comes from the underlying white noise process \epsilon_t. Because the break points in the wavelet projection affect the estimated coefficients, each data point in the plot actually represents the average of the coefficient estimates \theta_t(h,i) at a given point in time for all possible starting points. See Percival and Walden (2000, Ch. 5) on the maximal overlap discrete wavelet transform for details.

plot--why-use-wavelet-variance--daily-shocks--wavelet-coefficients--25jul2014

Finally, I plot the \log of the wavelet variance at each horizon h for both the simulated return process (red) and a white noise process with an identical mean and variance (blue). Note that I’ve switched from \log_2 to \log_e on the x-axis here, so a spike in the amount of variance at h=6 corresponds to a spike in the amount of variance explained by successive e^{6} \approx 400{\scriptstyle \mathrm{min}} increments. This is exactly what you’d expect for day-specific shocks which have a duration of 390{\scriptstyle \mathrm{min}} as indicated by the vertical gray line. The wavelet variance of an appropriately scaled white noise process gives a nice comparison group. To see why, note that for covariance stationary processes like white noise, the wavelet variance at a particular horizon is related to the power spectrum as follows:

(16)   \begin{align*} V(h) &\approx 2 \cdot \int_{\sfrac{1}{2^{h+1}}}^{\sfrac{1}{2^h}} S(f) \cdot df \end{align*}

Thus, the wavelet variance of white noise should follow a power law with:

(17)   \begin{align*} V(h) &\propto 2^{-h} \end{align*}

giving a nice smooth reference point in plots.

plot--why-use-wavelet-variance--daily-shocks--wavelet-variance--25jul2014

5. Comparing Techniques

I conclude by considering how the wavelet variance statistic compares to other ways that a financial economist might look for horizon-specific effects in data. I consider 2 alternatives: auto-regressive models and spectral density estimators. First, consider estimating the auto-regressive model below with lags \ell = 1,2,\ldots,L:

(18)   \begin{align*} r_t &= \sum_{\ell=1}^L C(\ell) \cdot r_{t-\ell} + \xi_t \qquad \text{where} \qquad \xi_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_{\xi}^2) \end{align*}

The left-most panel of the figure below reports the estimated values of C(\ell) for lags \ell = 1,2,\ldots,420 using the simulated data (red) as well as a scaled white noise process (blue). Just as before, the vertical grey line denotes the number of minutes in a day. There is no meaningful difference between the 2 sets of coefficients. The reason is that the day-specific shocks are asynchronous. They aren’t coming at regular intervals. Thus, no obvious lag structure can emerge from the data.
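For completeness, estimating the C(\ell) in Equation (18) is just OLS on a lagged design matrix. The sketch below is a hedged Python/numpy illustration, not the code behind the figure; the figure itself uses L = 420 lags on the minute-level data, which works the same way but is slower.

```python
import numpy as np

def fit_ar_ols(r, L):
    """Estimate C(1), ..., C(L) in r_t = sum_l C(l) r_{t-l} + xi_t by OLS
    (no intercept, matching the zero-mean setup in the text)."""
    r = np.asarray(r, dtype=float)
    T = len(r)
    # Lagged design matrix: column l-1 holds r_{t-l} for t = L, ..., T-1.
    X = np.column_stack([r[L - l:T - l] for l in range(1, L + 1)])
    y = r[L:]
    C, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ C
    return C, resid.var()

# Example on white noise: the estimated C(l) should all be close to zero.
rng = np.random.default_rng(1)
C_hat, sigma2_xi = fit_ar_ols(rng.standard_normal(5000), L=30)
print(np.round(C_hat[:5], 3), round(sigma2_xi, 3))
```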

plot--why-use-wavelet-variance--daily-shocks--analysis--25jul2014

Next, let’s think about estimating the spectral density of \mathbf{r}. This turns out to be the exact same exercise as the auto-regressive model estimation in different clothing. As shown in an earlier post, it’s possible to flip back and forth between the coefficients of an \mathrm{AR}(L) process and its spectral density via the relationship:

(19)   \begin{align*} S(f) &= \frac{\sigma_{\xi}^2}{\left\vert \, 1 - \sum_{\ell=1}^L C(\ell) \cdot e^{-i \cdot 2 \cdot \pi \cdot f \cdot \ell} \, \right\vert^2} \end{align*}

This one-to-one mapping between the frequency domain and the time domain for covariance stationary processes is known as the Wiener–Khinchin theorem with \sigma_x^2 = \int_{-\sfrac{1}{2}}^{\sfrac{1}{2}} S(f) \cdot df. Thus, the spectral density plot just reflects the same random noise as the auto-regressive model coefficients because of the same issue with asynchrony. The most interesting features of the middle panel occur at really high frequencies which have nothing to do with the day-specific shocks.
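To make Equation (19) concrete, here is a small Python/numpy sketch (my own illustration, with frequencies measured in cycles per period) that evaluates the AR spectral density and checks the Wiener–Khinchin relation numerically for an AR(1) example.

```python
import numpy as np

def ar_spectrum(C, sigma2_xi, freqs):
    """S(f) = sigma_xi^2 / |1 - sum_l C(l) exp(-i 2 pi f l)|^2  (Equation (19))."""
    C = np.asarray(C, dtype=float)
    lags = np.arange(1, len(C) + 1)
    transfer = 1.0 - np.exp(-2j * np.pi * np.outer(freqs, lags)) @ C
    return sigma2_xi / np.abs(transfer) ** 2

# Worked example: an AR(1) with C(1) = 0.5 and sigma_xi^2 = 1 has variance
# sigma_x^2 = 1 / (1 - 0.5^2) = 4/3. Averaging S(f) over the unit-length band
# [-1/2, 1/2] approximates the integral and should recover that number.
freqs = np.linspace(-0.5, 0.5, 20001)
S = ar_spectrum([0.5], 1.0, freqs)
print(round(S.mean(), 4))   # approximately 1.3333
```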

Here’s the punchline: of the 3 estimators, the wavelet variance is the only one that can identify horizon-specific contributions to a time series’ variance when those contributions are not stationary.

WSJ Article Subject Tags

1. Motivation

[Screenshot: WSJ article meta data with subject tags]

This post investigates the distribution of subject tags for Wall Street Journal articles that mention S&P 500 companies. e.g., a December 2009 article entitled “When Even Your Phone Tells You You’re Drunk, It’s Time to Call a Taxi,” about a new iPhone app that alerted you when you were too drunk to drive, had the meta data shown to the right. The subject tags are essentially article keywords. I collect every article that references an S&P 500 company over the period from 01/01/2008 to 12/31/2012. This post is an appendix to my paper, Local Knowledge in Financial Markets.

I find that there is substantial heterogeneity in how many different topics people write about when discussing a company even after controlling for the number of total articles. e.g., there were 87 articles in the WSJ referencing Garmin (GRMN) and 81 articles referencing Sprint (S); however, while there were only 87 different subject tags used in the articles about Garmin, there were 716 different subject tags used in the articles about Sprint! This finding is consistent with the idea that some firms face a much wider array of shocks than others. i.e., the width of the market matters.

2. Data Collection

The data are hand-collected from the ProQuest newspaper archive by an RA. The data collection process for an example company, Agilent Technologies (A), is summarized in the 3 figures below. First, we searched for each company included in the S&P 500 from 01/01/2008 to 12/31/2012 [list]. Then, after each query, we restricted the results to articles found in the WSJ. Finally, we downloaded the articles and meta data in HTML format.

After the RA collected all of the data, I used a Python script to parse the resulting HTML files into a form I could manage in R. Roughly 4000 of the downloaded articles were duplicates resulting from the WSJ publishing the same article in different editions. I identify these observations by checking for articles published on the same day with identical word counts about the same companies. I tried using Selenium to automate the data collection process, but the ProQuest web interface proved too finicky.
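The de-duplication rule above is easy to express in pandas. The sketch below is purely illustrative; the column names are hypothetical stand-ins for whatever the parsing script actually produces.

```python
import pandas as pd

# Hypothetical layout of the parsed article meta data; the real column names
# produced by the parsing script may differ.
articles = pd.DataFrame({
    "company":    ["GRMN", "GRMN", "S", "S"],
    "date":       ["2009-12-01", "2009-12-01", "2009-12-01", "2009-12-02"],
    "word_count": [512, 512, 804, 804],
    "subjects":   ["gps; apps", "gps; apps", "telecom", "telecom"],
})

# Drop same-day duplicates about the same company with identical word counts,
# i.e., the same article published in different WSJ editions.
deduped = articles.drop_duplicates(subset=["company", "date", "word_count"])
print(len(articles), "->", len(deduped))
```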

3. Summary Statistics

My data set contains 106{\scriptstyle \mathrm{k}} articles over 5 years about 542 companies. Many articles reference multiple S&P 500 companies. The figure below plots the total number of articles in the database per month. There is a steady downward trend. The first part of the sample was the height of the financial crisis, so this makes sense. As markets have calmed down, journalists have devoted fewer articles to corporate news relative to other things such as politics and sports.

plot--wsj-articles-about-sp500-companies-per-month--18jul2014

Articles are not evenly distributed across companies as shown by the figure below. While the median company is only referenced in 21 articles over the sample period, the 5 most popular companies (United Parcel Service [UPS], Apple [AAPL], Goldman Sachs [GS], Citibank [C], and Ford [F]) are all referenced in at least 1922 different articles apiece. By comparison, the least popular 1{\scriptstyle \%} of companies are mentioned in only 1 article in 5 years.

plot--articles-per-firm--18jul2014

Counting subject tags is a bit less straightforward than counting articles. I do not count tags that are specific to the WSJ rather than the company. e.g., tags containing “(wsj)” flagging daily features like “Abreast of the market (wsj).” I also remove missing subjects. It’s worth pointing out that sometimes the meta data for an article doesn’t contain any subject information. After these restrictions, the data contain 10{\scriptstyle \mathrm{k}} unique subject tags.

The distribution of subject tag counts per month is similar to that of article counts as shown in the figure below but with a less pronounced downward trend. To create this figure, I count the number of unique subject tags used each month. e.g., if “technology shock” is used 2 times in Jan 2008, then this counts as 1 of the 1591 tags used in this month; whereas, if “technology shock” is then used again on Feb 1st 2008, then I count this 3rd observation towards the total in February. Thus, the sum of the points in the time series will exceed 10{\scriptstyle \mathrm{k}}. Also, note that different articles can have identical subject tags.
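In pandas terms, this monthly counting convention is a group-by on calendar month followed by a count of unique tags. The sketch below uses made-up rows that mirror the “technology shock” example (the column names are hypothetical).

```python
import pandas as pd

# Hypothetical long-format table with one row per (article, subject tag) pair.
tags = pd.DataFrame({
    "date":    pd.to_datetime(["2008-01-05", "2008-01-20", "2008-02-01"]),
    "subject": ["technology shock", "technology shock", "technology shock"],
})

# Unique tags per month: repeated uses within a month count once, but the same
# tag appearing in a later month counts again toward that month's total.
tags["month"] = tags["date"].dt.to_period("M")
tags_per_month = tags.groupby("month")["subject"].nunique()
print(tags_per_month)   # Jan 2008: 1 unique tag, Feb 2008: 1 unique tag
```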

plot--wsj-subjects-about-sp500-companies-per-month--18jul2014

As shown in the figure below, the distribution of subject tags used to describe articles about each company is less skewed than the actual article count for each company. There are 179 different subject tags used in the 21 articles about the median S&P 500 company during the sample period. The most tagged companies have 10 times as many subjects as the median firm; whereas, the most written about companies are referenced in 100 times as many articles as the median firm.

plot--subject-tags-per-firm--18jul2014

4. Articles per Tag

In order for the distribution of tags per company to be less skewed than the distribution of articles per company, it’s got to be the case that some tags are used in lots of articles. This is exactly what’s going on in the data. The figure below shows that the median subject tag is used in only 3 articles and the bottom 25{\scriptstyle \%} of tags are used in only 1 article; however, the top 1{\scriptstyle \%} of tags are used in 466 articles or more. e.g., there are roughly 100 tags out of the 10{\scriptstyle \mathrm{k}} unique subject tags in my data set that are used 500 times or more. Likewise, there are well over 3000 that are used only once!

plot--articles-per-subject-tag--18jul2014

This fact strongly supports the intuition that companies, even huge companies like those in the S&P 500, are constantly hit with new and different shocks. Traders have to figure out which aspect of the company matters. This is clearly not an easy problem to solve. Lots of ideas are thrown around. Many of them must be either short-lived or wrong. Roughly 1 out of every 4 topics worth discussing is only worth discussing once.

5. Coverage Depth

I conclude this post by looking at the variation in the number of subject tags across firms with a similar number of articles. e.g., I want to know if there are pairs of firms which journalists spend roughly the same amount of time talking about, but which get covered in very different ways. It turns out there are. The Garmin and Sprint example from the introduction is one such case. The figure below shows that there are many more. i.e., it shows that companies that are referenced in more articles also have more subject tag descriptors, but conditional on the number of articles there is still a lot of variation. The plot is on a \log_{10} \times \log_{10} scale, so a 1 tick vertical movement means a factor of 10 difference between the number of tags for 2 firms with similar article counts. Looking at the figure, it’s clear that this sort of variation is the norm.

plot--articles-vs-subjects--18jul2014

Randomized Market Trials

1. Motivation

How much can traders learn from past price signals? It depends on what kinds of assets are being sold. Suppose that returns are (in part) a function of K = \Vert {\boldsymbol \alpha} \Vert_{\ell_0} different feature-specific shocks:

(1)   \begin{align*} r_n &= \sum_{q=1}^Q \alpha_q \cdot x_{n,q} + \epsilon_n \qquad \text{with} \qquad \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_{\epsilon}^2) \end{align*}

If {\boldsymbol \alpha} is identifiable, then different values of {\boldsymbol \alpha} have to produce different values of r_n. This is only the case if assets are sufficiently different from one another. e.g., consider the analogy to randomized control trials. In an RCT, randomizing which subjects get thrown in the treatment and control groups makes it exceptionally unlikely that, say, all the people in the treatment group will by chance happen to share some other common trait that actually explains their outcomes. Similarly, randomizing which assets get sold makes it exceptionally unlikely that 2 different choices of {\boldsymbol \alpha} and {\boldsymbol \alpha}' can explain the observed returns.

This post sketches a quick model relating this problem to housing prices. To illustrate, imagine N = 4 houses have sold at a discount in a neighborhood that looks like this:

tract-housing

The shock might reflect a structural change in the vacation home market whereby there is less disposable income to buy high-end units—i.e., a permanent shift. Alternatively, the shock might have been due to a couple of out-of-town second-home buyers needing to sell quickly—i.e., a transient effect. The houses in the picture above are all vacation homes of a similar quality with owners living in LA. Since there is so little variation across units, both these explanations are observationally equivalent. Thus, the asset composition affects how informative prices are in an important way. The main empirical prediction is that in places with less variation in housing amenities, there should be more price momentum since it’s harder to distinguish between noise and amenity-specific value shocks.

2. Toy Model

Suppose you’ve seen N sales in the area. Most of the prices looked just about right, but some of the houses sold for a bit more than you would have expected and some sold for a bit less than you would have expected. You’re trying to decide whether or not to buy the (N+1)th house if the transaction costs are \mathdollar c today:

(2)   \begin{align*} U &= \max_{\{\text{Buy},\text{Don't}\}} \left\{ \, \mathrm{E}\left[ r_{N+1} \right] - \frac{\gamma}{2} \cdot \mathrm{Var}\left[ r_{N+1} \right] - c, \, 0 \, \right\} \end{align*}

You will buy the house if your risk adjusted expectation of its future returns exceeds the transaction costs, \mathrm{E}[r_{N+1}] - \sfrac{\gamma}{2} \cdot \mathrm{Var}[r_{N+1}] \geq c.

This problem hinges on your ability to estimate {\boldsymbol \alpha}. What’s the best you could ever hope to do? Well, suppose you knew which K features mattered ahead of time and the elements of \mathbf{X} were given by x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sfrac{1}{K}). In this setting, your average estimation error per relevant feature is given by:

(3)   \begin{align*} \Omega^\star = \mathrm{E}\left[ \, \frac{1}{K} \cdot \sum_{q=1}^Q \left( \widehat{\alpha}_q - \alpha_q \right)^2 \, \right] &= \frac{K \cdot \sigma_{\epsilon}^2}{N} \end{align*}

i.e., it’s as if you ran an OLS regression of the N price changes on the K relevant columns of \mathbf{X}. You will buy the house if:

(4)   \begin{align*} \mathbf{x}_{N+1}^{\top} \widehat{\boldsymbol \alpha} - \frac{\gamma}{2} \cdot \left( \frac{K + N}{N}  \right) \cdot \sigma_{\epsilon}^2 &\geq c \end{align*}

In the real world, however, you generally don’t know which K features are important ahead of time and each house’s amenities are not taken as an iid draw. Instead, you must solve an \ell_1-type inference problem:

(5)   \begin{align*} \widehat{\boldsymbol \alpha} &= \arg \min_{\boldsymbol \alpha} \sum_{n=1}^N \left( r_n - \mathbf{x}_n^{\top} {\boldsymbol \alpha} \right)^2 \qquad \text{s.t.} \qquad \left\Vert {\boldsymbol \alpha} \right\Vert_{\ell_1} \leq \lambda \cdot \sigma_{\epsilon} \end{align*}

with a correlated measurement matrix, \mathbf{X}, using something like LASSO. In this setting, you face feature selection risk. i.e., you might focus on the wrong causal explanation for the past price movements. If \Omega^{\perp} denotes your estimation error when the elements x_{n,q} are drawn independently and \Omega denotes your estimation error in the general case when \rho(x_{n,q},x_{n',q}) \neq 0, then:

(6)   \begin{align*} \Omega^{\star} \leq \Omega^{\perp} \leq \Omega \end{align*}

Since your estimate of \widehat{\boldsymbol \alpha} is unbiased, feature selection risk will simply increase \mathrm{Var}[r_{N+1}] making it less likely that you will buy the house in this stylized model:

(7)   \begin{align*} \mathbf{x}_{N+1}^{\top} \widehat{\boldsymbol \alpha} - \frac{\gamma}{2} \cdot \left( K \cdot \Omega + \sigma_{\epsilon}^2 \right) &\geq c \end{align*}

More generally, it will make prices slower to respond to shocks and allow for momentum.
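As a rough illustration of the estimation problem in Equation (5), the sketch below simulates a sparse {\boldsymbol \alpha} with independent features and fits it with scikit-learn’s LASSO, which solves the penalized (Lagrangian) counterpart of the constrained \ell_1 problem; all parameter choices here are illustrative rather than calibrated.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, Q, K = 100, 200, 3                  # past sales, candidate features, true sparsity
sigma_eps = 0.05

alpha_true = np.zeros(Q)
alpha_true[:K] = 0.5                   # illustrative feature-specific shocks

X = rng.standard_normal((N, Q)) / np.sqrt(N)   # roughly unit-norm, independent columns
r = X @ alpha_true + sigma_eps * rng.standard_normal(N)

# sklearn's Lasso solves the penalized form of the constrained l1 problem in
# Equation (5); the penalty weight below is an illustrative choice, not tuned.
fit = Lasso(alpha=0.003, fit_intercept=False).fit(X, r)
alpha_hat = fit.coef_

support = np.flatnonzero(alpha_hat)             # features the fit "blames"
error = np.sum((alpha_hat - alpha_true) ** 2)   # realized estimation error
print(support, round(error, 4))
```

With independent columns the selected support typically concentrates on the true features; re-running the sketch after making the columns of \mathbf{X} highly correlated is a quick way to see the feature selection risk described above.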

3. Matrix Coherence

Feature selection risk is worst when assets all have really correlated features. Let \mathbf{X} denote the (N \times Q)-dimensional measurement matrix containing all the features of the N houses that have already sold in the market:

(8)   \begin{align*} \mathbf{X} &= \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,Q} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,Q} \\ \vdots  & \vdots  & \ddots & \vdots  \\ x_{N,1} & x_{N,2} & \cdots & x_{N,Q} \\ \end{bmatrix} \end{align*}

Each row represents all of the features of the nth house, and each column represents the level to which the N assets display a single feature. Let \widetilde{\mathbf{x}}_q denote a unit-normed column from this measurement matrix:

(9)   \begin{align*} \widetilde{\mathbf{x}}_q &= \frac{\mathbf{x}_q}{\sqrt{\sum_{n=1}^N x_{n,q}^2}} \end{align*}

I use a measure of the coherence of \mathbf{X} to quantify the extent to which all of the assets in a market have similar features:

(10)   \begin{align*} \mu(\mathbf{X}) &= \max_{q \neq q'} \left\vert \left\langle \widetilde{\mathbf{x}}_q, \widetilde{\mathbf{x}}_{q'} \right\rangle \right\vert \end{align*}

e.g., the coherence of a matrix with x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sfrac{1}{N}) is roughly \sqrt{2 \cdot \log(Q)/N} corresponding to the red line in the figure below. As the correlation between elements in the same column increases, the coherence increases since different terms in the above cross-product are less likely to cancel out.
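Computing \mu(\mathbf{X}) from Equations (9) and (10) takes only a few lines. The sketch below (my own Python/numpy illustration) normalizes the columns, takes the largest off-diagonal entry of the resulting Gram matrix in absolute value, and compares it to the \sqrt{2 \cdot \log(Q)/N} reference level for an independent Gaussian matrix.

```python
import numpy as np

def coherence(X):
    """mu(X) = max over q != q' of |<x_q_tilde, x_q'_tilde>| (Equations (9)-(10))."""
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)   # unit-norm each column
    G = np.abs(Xn.T @ Xn)                               # all pairwise inner products
    np.fill_diagonal(G, 0.0)                            # ignore the q = q' terms
    return G.max()

# Independent Gaussian features: the coherence is on the order of sqrt(2 log(Q)/N),
# the reference line in the figure below.
rng = np.random.default_rng(0)
N, Q = 500, 100
X = rng.standard_normal((N, Q)) / np.sqrt(N)
print(round(coherence(X), 3), round(np.sqrt(2 * np.log(Q) / N), 3))
```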

plot--mutual-coherence-gaussian-matrix--15jul2014

4. Selection Risk

There is a tight link between the severity of the selection risk and how correlated asset features are. Specifically, Ben-Haim, Eldar, and Elad (2010) show that if

(11)   \begin{align*} \alpha_{\min} \cdot \left( 1 - \{2 \cdot K - 1\} \cdot \mu(\mathbf{X}) \right) &\geq 2 \cdot \sigma_{\epsilon} \cdot \sqrt{2 \cdot (1 + \xi) \cdot \log(Q)} \end{align*}

for some \xi > 0, then:

(12)   \begin{align*} \sum_{q=1}^Q \left( \widehat{\alpha}_q - \alpha_q \right)^2 &\leq \frac{2 \cdot (1 + \xi)}{(1 - (K-1)\cdot \mu(\mathbf{X}))^2} \times K \cdot \sigma_{\epsilon}^2 \cdot \log(Q) = \Omega \end{align*}

with probability at least:

(13)   \begin{align*} 1 - Q^{-\xi} \cdot \left( \, \pi \cdot (1 + \xi) \cdot \log(Q) \, \right)^{-\sfrac{1}{2}} \end{align*}

where \alpha_{\min} = \min_{q \in \mathcal{K}} |\alpha_q|. Let’s plug in some numbers. If \alpha_{\min} = 0.10 and \sigma_{\epsilon} = 0.05, then the result means that \Vert \widehat{\boldsymbol \alpha} - {\boldsymbol \alpha} \Vert_{\ell_2}^2 is less than 0.185 \times K \cdot \log(Q) with probability \sfrac{3}{4}.

There are a couple of things worth pointing out here. First, the recovery bounds only hold when \mathbf{X} is sufficiently incoherent:

(14)   \begin{align*} \mu(\mathbf{X}) < \frac{1}{2 \cdot K - 1} \end{align*}

i.e., when the assets are too similar, we can’t learn anything concrete about which amenity-specific shocks are driving the returns. Second, the free parameter \xi > 0 links the probability of seeing an error rate outside the bounds, p, to the number of amenities that houses have:

(15)   \begin{align*} \xi &\approx \frac{\log(\sfrac{1}{p}) - \frac{1}{2} \cdot \log\left[ \pi \cdot \log Q \right]}{\sfrac{1}{2} + \log(Q)} \end{align*}

If you want to lower this probability, you need to either use a larger value of \xi or decrease the number of amenities. For \xi large enough we can effectively regard the error bound as the variance. Importantly, this quantity is increasing in the coherence of the measurement matrix. i.e., when assets are more similar, I am less sure that I am drawing the correct conclusion from past returns.

5. Empirical Predictions

The main empirical prediction is that in places with less variation in housing amenities, there should be more price momentum since it’s harder to distinguish between noise and amenity-specific value shocks. e.g., imagine studying the price paths of 2 neighborhoods, A and B, which have houses of the exact same value, \mathdollar v. In neighborhood A, each of the houses has a very different collection of amenities whose values sum to \mathdollar v; whereas, in neighborhood B, each of the houses has the exact same amenities whose values sum to \mathdollar v. e.g., you can think about neighborhood A as pre-war and neighborhood B as tract housing. The theory says that the price of houses in neighborhood B should respond more slowly to amenity-specific value shocks because houses have more correlated amenities—i.e., \Omega is larger. As a result, home prices in neighborhood B should also display more momentum… though this is not in the toy model above.

Notes: Ang, Hodrick, Xing, and Zhang (2006)

1. Introduction

In this post I work through the main results in Ang, Hodrick, Xing, and Zhang (2006) which shows not only that i) stocks with more exposure to changes in aggregate volatility have lower average excess returns, but also that ii) stocks with more idiosyncratic volatility relative to the Fama and French (1993) 3 factor model have lower excess returns. The first result is consistent with existing asset pricing theories; whereas, the second result is at odds with almost any mainstream asset pricing theory you might write down. Idiosyncratic risk should not be priced. This paper together with Campbell, Lettau, Malkiel, and Xu (2001) (see my earlier post) set off an investigation into the role of idiosyncratic risk in determining asset prices. One possibility is that idiosyncratic risk is just a proxy for exposure to aggregate risk. i.e., perhaps it’s the firms with the highest exposure to aggregate return volatility that also have the highest idiosyncratic volatility. Interestingly, Ang, Hodrick, Xing, and Zhang (2006) show that this is not the case via a double sort on both aggregate and idiosyncratic volatility exposure giving evidence that these are 2 separate risk factors. The code I use to replicate the results in Ang, Hodrick, Xing, and Zhang (2006) and create the figures can be found here.

2. Theoretical Motivation

The discount factor view of asset pricing says that:

(1)   \begin{align*} 0 = \mathrm{E}[m \cdot r_n] \quad \text{for all } n=1,2,\ldots,N \end{align*}

where \mathrm{E}(\cdot) denotes the expectation operator, m denotes the stochastic discount factor, and r_n denotes asset n‘s excess return. Equation (1) reads: “In the absence of margin requirements and transactions costs, it costs you \mathdollar 0 today to borrow at the riskless rate, buy a stock, and hold the position for 1 period.” Asset pricing theories explain why average excess returns, \mathrm{E}[r_n], vary across assets even though they all have the same price today by construction (see my earlier post).

Suppose each asset’s excess returns are a function of a risk factor x, \mathrm{R}_n(x), and noise, z_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_z^2):

(2)   \begin{align*} r_n  &= \mathrm{R}_n(x) + z_n \\ &= \mathrm{R}_n(\mu_x) + \mathrm{R}_n'(\mu_x) \cdot (x - \mu_x) + \frac{1}{2} \cdot \mathrm{R}_n''(\mu_x) \cdot (x - \mu_x)^2 + \text{``h.o.t.''} + z_n \\ &\approx \alpha_n + \beta_n \cdot (x - \mu_x) + \frac{\gamma_n}{2} \cdot (x - \mu_x)^2 + z_n \end{align*}

where I assume for simplicity that the only risk factor is the value-weighted excess return on the market so that \mu_x \approx 6{\scriptstyle \%/\mathrm{yr}} and \sigma_x \approx 16{\scriptstyle \%/\mathrm{yr}}. I use a Taylor expansion to linearize the function \mathrm{R}_n(x) around the point x = \mu_x and assume \mathrm{O}(x - \mu_x)^3 terms are negligible so \mathrm{E}[r_n] = \alpha_n + \sfrac{\gamma_n}{2} \cdot \sigma_x^2 and \mathrm{Var}[r_n] = \beta_n^2 \cdot \sigma_x^2 + \sigma_z^2. This means that if the excess return on the market is \sfrac{16{\scriptstyle \%}}{\sqrt{252}} \approx 1{\scriptstyle \%/\mathrm{day}} larger than expected, then asset n‘s realized excess return will be \beta_n{\scriptstyle \%} larger.

Any asset pricing theory says that each asset’s expected excess return should be proportional to how much the asset comoves with the risk factor, x:

(3)   \begin{align*} \mathrm{E}[r_n]  = \alpha_n + \frac{\gamma_n}{2} \cdot \sigma_x^2 =  \underbrace{\text{Constant} \times \beta_n}_{\text{Predicted}} \end{align*}

where the constant of proportionality, \text{Constant} = c \cdot (\sfrac{\mathrm{Var}[m]}{\mathrm{E}[m]}), depends on the exact asset pricing model. Equation (3) says that if you ran a regression of each stock’s excess returns on the aggregate risk factor:

(4)   \begin{align*} r_{n,t} = \widehat{\alpha}_n + \widehat{\beta}_n \cdot x_t + \mathit{Error}_{n,t} \end{align*}

then the estimated intercept for each stock should be:

(5)   \begin{align*} \widehat{\alpha}_n = \alpha_n + \frac{\gamma_n}{2} \cdot \sigma_x^2 - \beta_n \cdot \mu_x \end{align*}

Thus, each stock’s average excess returns may well be related to its exposure to aggregate volatility since \sigma_x shows up in the expression for \widehat{\alpha}_n; however, idiosyncratic volatility, \sigma_z, better not be priced since it shows up nowhere above.

3. Aggregate Volatility

Ang, Hodrick, Xing, and Zhang (2006) show that stocks with more exposure to aggregate volatility have lower average excess returns. i.e., that the coefficient \gamma_n < 0. The authors actually look at each stock’s exposure to changes in aggregate volatility. To see how this changes the math, consider rewriting the intercept above as:

(6)   \begin{align*} \widehat{\alpha}_n = \mathrm{A}_n(\Delta \sigma_x) &= \alpha_n + \frac{\gamma_n}{2} \cdot \left(\langle \sigma_x \rangle + \Delta \sigma_x \right)^2 \end{align*}

Using this formulation, we can look at how perturbing aggregate volatility around its long-run mean \langle \sigma_x \rangle by some small \Delta \sigma_x will impact the estimated intercept:

(7)   \begin{align*} \mathrm{A}_n(\Delta \sigma_x) &= \mathrm{A}_n(0) + \mathrm{A}_n'(0) \cdot \Delta \sigma_x + \cdots \\ &\approx \left[ \alpha_n + \frac{\gamma_n}{2} \cdot \langle\sigma_x\rangle^2 \right] + \gamma_n \cdot \langle\sigma_x\rangle \cdot \Delta \sigma_x \end{align*}

Since \langle \sigma_x \rangle > 0 by definition, (\sfrac{\gamma_n}{2}) \cdot \langle \sigma_x \rangle^2 and \gamma_n \cdot \langle \sigma_x \rangle will have the same sign. Thus, testing for whether exposure to changes in aggregate volatility is priced is tantamount to testing for whether exposure to aggregate volatility is priced.

The authors proceed in 5 steps. First, they calculate the changes in aggregate volatility time series using changes in the daily options implied volatility:

(8)   \begin{align*} \Delta \sigma_{x,d+1} = \mathit{VXO}_{d+1} - \mathit{VXO}_d \qquad \text{with} \qquad  \mathrm{E}[\Delta \sigma_{x,d+1}] = 0.01{\scriptstyle \%}, \, \mathrm{StD}[\Delta \sigma_{x,d+1}] = 2.65{\scriptstyle \%} \end{align*}

If the VXO is 4.33{\scriptstyle \%}, then options markets expect an annualized volatility of 4.33{\scriptstyle \%} for the S&P 100 over the next 30 calendar days. The authors use the VXO rather than the VIX because it has a longer time series dating back to 1986. The only difference between the 2 indices is that the VXO quotes the options implied volatility on the S&P 100; whereas, the VIX quotes the options implied volatility on the S&P 500. Daily changes in the 2 indices have a correlation of 0.81 over the sample period from January 1986 to December 2012 as shown in the figure below.

plot--vix-vs-vxo-daily-data--04may2014

Second, the authors compute each stock’s exposure to changes in aggregate volatility by running a regression for each stock n \in \{1,2,\ldots,N\} using the daily data in month (m-1):

(9)   \begin{align*} r_{n,d} = \widehat{\alpha}_n + \widehat{\beta}_{n} \cdot x_d + \widehat{\gamma}_{n} \cdot \Delta \sigma_{x,d} + \mathit{Error}_{n,d} \end{align*}

Estimated coefficients are related to underlying deep parameters by:

(10)   \begin{align*} \widehat{\alpha}_n &= \alpha_n + \frac{\gamma_n}{2} \cdot \langle \sigma_x \rangle^2 - \beta_n \cdot \mu_x \\ \widehat{\beta}_n &= \beta_n \\ \widehat{\gamma}_n &= \gamma_n \cdot \langle \sigma_x \rangle \end{align*}

The daily market excess return, x_d, is the excess return on the CRSP value-weighted market index. I include AMEX, NYSE, and NASDAQ stocks with \geq 17 daily observations in month (m-1) in my universe of N stocks.
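A per-stock, per-month version of the regression in Equation (9) is just an OLS fit on a constant, the market excess return, and the change in the VXO. The sketch below is my own Python/numpy illustration using simulated daily data rather than CRSP data; in the actual replication this calculation is repeated for every stock and every month.

```python
import numpy as np

def volatility_exposure(r_n, x, d_sigma):
    """OLS estimates of (alpha_n, beta_n, gamma_n) from Equation (9) using one
    month of daily data: r_{n,d} on a constant, x_d, and Delta sigma_{x,d}."""
    Z = np.column_stack([np.ones_like(x), x, d_sigma])
    coef, *_ = np.linalg.lstsq(Z, r_n, rcond=None)
    return coef                       # [alpha_hat, beta_hat, gamma_hat]

# Illustrative month of 21 trading days with made-up numbers.
rng = np.random.default_rng(0)
x = 0.0003 + 0.01 * rng.standard_normal(21)       # daily market excess returns
d_sigma = 0.0265 * rng.standard_normal(21)        # daily changes in the VXO
r_n = 0.0001 + 1.2 * x - 0.5 * d_sigma + 0.01 * rng.standard_normal(21)
alpha_hat, beta_hat, gamma_hat = volatility_exposure(r_n, x, d_sigma)
print(round(beta_hat, 2), round(gamma_hat, 2))
```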

plot--aggregate-volatility-portfolio-cumulative-returns

Third, the authors sort the N stocks satisfying the data constraints in month (m-1) into 5 value-weighted portfolios based on their estimated \widehat{\gamma}_{n}. Note that because the factor \langle \sigma_x \rangle is common to all stocks in month (m-1), this sort effectively organizes stocks by their true exposure to aggregate volatility, \gamma_n. For each portfolio j \in \{\text{L},2,3,4,\text{H}\} with j = \text{L} denoting the stocks with the lowest aggregate volatility exposure and j = \text{H} denoting the stocks with the highest aggregate volatility exposure, the authors then calculate the daily portfolio returns in month m. The figure above shows the cumulative returns to each of these 5 portfolios. It reads that if you invested \mathdollar 1 in the low aggregate volatility exposure portfolio in January 1986, then you would have over \mathdollar 200 more in December 2012 than if you had invested that same \mathdollar 1 in the high aggregate volatility exposure portfolio. What’s more, each portfolio’s exposure to the excess return on the market does not explain its performance. The figure below reports the estimated intercepts for each j \in \{\text{L},2,3,4,\text{H}\} from the regression:

(11)   \begin{align*} r_{j,m} = \widehat{\alpha}_j + \widehat{\beta}_j \cdot x_m + \mathit{Error}_{j,m} \end{align*}

and indicates that abnormal returns are decreasing in the portfolio’s exposure to aggregate volatility.

plot--ahxz06-table-1--capm-alphas

Fourth, in order to test whether the spread in portfolio abnormal returns is actually explained by contemporaneous exposure to aggregate volatility, the authors then create an aggregate volatility factor mimicking portfolio. They estimate the regression below using the daily excess returns on each of the 5 aggregate volatility exposure portfolios in each month m:

(12)   \begin{align*} \Delta \sigma_{x,d} = \widehat{\kappa} + \sum_{j=\text{L}}^{\text{H}} \widehat{\lambda}_{j} \cdot r_{j,d} + \mathit{Error}_d \end{align*}

and store the parameter estimates for \begin{bmatrix} \widehat{\lambda}_1 & \widehat{\lambda}_2 & \widehat{\lambda}_3 & \widehat{\lambda}_4 & \widehat{\lambda}_5 \end{bmatrix}^{\top}. They then define the factor mimicking portfolio return at daily horizon in month m as:

(13)   \begin{align*}  f_d = \sum_{j=\text{L}}^{\text{H}} \widehat{\lambda}_{j} \cdot r_{j,d} \end{align*}
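Mechanically, Equations (12) and (13) amount to one more OLS regression per month followed by a weighted sum of the 5 portfolio returns. The sketch below is my own Python/numpy illustration with simulated daily data standing in for the actual portfolio returns and VXO changes.

```python
import numpy as np

def mimicking_portfolio(d_sigma, R):
    """Regress Delta sigma_{x,d} on the 5 portfolio returns (Equation (12)) and
    return the weights lambda_hat plus the fitted factor f_d (Equation (13))."""
    Z = np.column_stack([np.ones(len(d_sigma)), R])     # constant + r_{L,d}, ..., r_{H,d}
    coef, *_ = np.linalg.lstsq(Z, d_sigma, rcond=None)
    lam = coef[1:]                                      # drop the intercept kappa_hat
    return lam, R @ lam

# Illustrative daily data for one month: 21 days x 5 volatility-sorted portfolios.
rng = np.random.default_rng(0)
R = 0.01 * rng.standard_normal((21, 5))
d_sigma = R @ np.array([-0.3, -0.1, 0.0, 0.2, 0.4]) + 0.02 * rng.standard_normal(21)
lam_hat, f = mimicking_portfolio(d_sigma, R)
print(np.round(lam_hat, 2), f.shape)
```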

The figure below plots the factor mimicking portfolio returns against the underlying changes in aggregate volatility at the monthly level. The 2 data series line up relatively closely; however, the factor mimicking portfolio is much too volatile during crises such as Black Monday in 1987.

plot--aggregate-volatility-factor

Fifth and finally, the authors check whether or not each of the 5 aggregate volatility exposure portfolios’ returns is positively correlated with contemporaneous movements in the aggregate volatility factor mimicking portfolio at the monthly horizon. To do this, they cumulate the daily excess returns on the factor mimicking portfolio and the aggregate volatility exposure sorted portfolios to get monthly returns:

(14)   \begin{align*} f_m &= \sum_{d=1}^{22} f_d \\ r_{j,m} &= \sum_{d=1}^{22} r_{j,d} \quad \text{for all } j \in \{\text{L},2,3,4,\text{H}\} \end{align*}

Then, they run the regression below at a monthly horizon over the full sample:

(15)   \begin{align*} r_{j,m} = \widehat{\zeta}_j + \widehat{\eta}_j \cdot x_m  + \widehat{\theta}_j \cdot f_m + \mathit{Error}_{j,m} \end{align*}

I report the estimated \widehat{\theta}_j coefficients in the figure below. Consistent with the idea that exposure to aggregate volatility is driving the disparate excess returns of the 5 test portfolios, I find that each portfolio loads positively on monthly movements in the factor mimicking portfolio.

plot--ahxz06-table-1--factor-loadings

4. Idiosyncratic Volatility

Ang, Hodrick, Xing, and Zhang (2006) also show that stocks with more idiosyncratic volatility have lower average excess returns. This should not be true under the standard theory outlined in Section 2 above. To measure idiosyncratic volatility, the authors run the regression below at the daily level in month (m-1) for each stock n = 1,2,\ldots,N:

(16)   \begin{align*} r_{n,d} = \widehat{\alpha}_n + \widehat{\boldsymbol \beta}_n^{\top} \cdot \mathbf{x}_d + \mathit{Error}_{n,d} \end{align*}

where the risk factors are the excess return on the value weighted market portfolio, the excess return on a size portfolio, and the excess return on a value portfolio as dictated by Fama and French (1993):

(17)   \begin{align*} \mathbf{x}_d^{\top} = \begin{bmatrix} r_{\mathrm{Mkt},d} & r_{\mathrm{SmB},d} & r_{\mathrm{HmL},d} \end{bmatrix} \end{align*}

For each stock listed on the AMEX, NYSE, or NASDAQ stock exchange with \geq 17 daily observations in month (m-1), the authors then calculate the measure of idiosyncratic volatility below:

(18)   \begin{align*} \sigma_{z,n} &= \mathrm{StD}[\mathit{Error}_{n,d}] \end{align*}
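The idiosyncratic volatility measure in Equation (18) is just the standard deviation of the residuals from the 3 factor regression in Equation (16). Here is a hedged Python/numpy sketch with simulated daily data in place of the actual CRSP and Fama-French series (the degrees-of-freedom adjustment is one common choice, not necessarily the authors’).

```python
import numpy as np

def idiosyncratic_volatility(r_n, factors):
    """Regress daily returns on the 3 Fama-French factors (Equation (16)) and
    return the standard deviation of the residuals (Equation (18))."""
    Z = np.column_stack([np.ones(len(r_n)), factors])   # constant + Mkt, SmB, HmL
    coef, *_ = np.linalg.lstsq(Z, r_n, rcond=None)
    resid = r_n - Z @ coef
    return resid.std(ddof=Z.shape[1])                   # adjust for estimated parameters

# Illustrative month: 21 days of made-up factor and stock returns.
rng = np.random.default_rng(0)
factors = 0.01 * rng.standard_normal((21, 3))           # [r_Mkt, r_SmB, r_HmL]
r_n = factors @ np.array([1.1, 0.3, -0.2]) + 0.02 * rng.standard_normal(21)
print(round(idiosyncratic_volatility(r_n, factors), 4)) # close to 0.02
```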

plot--idiosyncratic-volatility-portfolio-cumulative-returns

The authors sort the N stocks satisfying the data constraints in month (m-1) into 5 value-weighted portfolios based on their estimated \sigma_{z,n} values. The figure above reports the cumulative returns to these 5 test portfolios. The figure reads that if you invested \mathdollar 1 in the low idiosyncratic volatility portfolio in January 1963, then you would have over \mathdollar 100 more in December 2012 than if you had invested in the high idiosyncratic volatility portfolio. The figure below reports the estimated abnormal returns, \widehat{\alpha}_j, for each of the idiosyncratic volatility portfolios over the full sample and confirms that the poor performance of the high idiosyncratic volatility portfolio cannot be explained by exposure to common risk factors.

plot--ahxz06-table-6--capm-alphas

5. Are They Related?

I conclude by discussing the obvious follow-up question: “Are these 2 phenomena related?” After all, it could be the case that the firms with the highest exposure to aggregate return volatility also have the highest idiosyncratic volatility and vice versa. Ang, Hodrick, Xing, and Zhang (2006) show that this is not the case via a double sort. i.e., they show that within each aggregate volatility exposure portfolio, the stocks with the lowest idiosyncratic volatility outperform the stocks with the highest idiosyncratic volatility. Similarly, they show that within each idiosyncratic volatility portfolio, the stocks with the lowest aggregate volatility exposure outperform the stocks with the highest aggregate volatility exposure. Thus, the motivation driving investors to pay a premium for stocks with high aggregate volatility exposure is different from the motivation driving investors to pay a premium for stocks with high idiosyncratic volatility.

plot--r2-portfolios--capm-alphas

Indeed, you can pretty much guess this fact from the cumulative return plots in Sections 3 and 4 where the red lines denoting the low exposure portfolios behave in completely different ways. e.g., the low aggregate volatility exposure portfolio returns behave more or less like the high aggregate volatility exposure portfolio returns but with a higher mean. By contrast, the low idiosyncratic volatility portfolio returns are a much different time series with dramatically less volatility. Interestingly, if the authors sort on total volatility in month (m-1) rather than idiosyncratic volatility, then the results are identical; however, the results do not carry through if you sort on R^2 in month (m-1). e.g., suppose you ran the same regression at the daily level in month (m-1) for each stock n = 1,2,\ldots,N:

(19)   \begin{align*} r_{n,d} = \widehat{\alpha}_n + \widehat{\boldsymbol \beta}_n^{\top} \cdot \mathbf{x}_d + \mathit{Error}_{n,d} \end{align*}

where the risk factors are the excess return on the value weighted market portfolio, the excess return on a size portfolio, and the excess return on a value portfolio as dictated by Fama and French (1993):

(20)   \begin{align*} \mathbf{x}_d^{\top} = \begin{bmatrix} r_{\mathrm{Mkt},d} & r_{\mathrm{SmB},d} & r_{\mathrm{HmL},d} \end{bmatrix} \end{align*}

Then, for each stock you computed the R^2 statistic measuring the fraction of the total variation in each stock’s excess returns that is explained by movements in the risk factors:

(21)   \begin{align*} R^2 &= 1 - \frac{\sum_{d=1}^{22}(r_{n,d} - \{\widehat{\alpha}_n + \widehat{\boldsymbol \beta}_n^{\top} \mathbf{x}_d\})^2}{\sum_{d=1}^{22}(r_{n,d} - \langle r_{n,d} \rangle)^2} \end{align*}

If you group stocks into 5 portfolios based on their R^2 over the previous month, the figure above shows that there is no monotonic spread in the abnormal returns. Thus, the idiosyncratic volatility results seem to be more about volatility and less about the explanatory power of the Fama and French (1993) factors.

Using the Cross-Section of Returns

1. Introduction

The empirical content of the discount factor view of asset pricing can all be derived from the equation below:

(1)   \begin{align*} 0 = \mathrm{E}[m \cdot r_n] \quad \text{for all } n=1,2,\ldots,N \end{align*}

where m denotes the prevailing stochastic discount factor and r_n denotes an asset’s excess return. Equation (1) reads: “In the absence of margin requirements and transactions costs, it costs you \mathdollar 0 today to borrow at the riskless rate, buy a stock, and hold the position for 1 period.” The question is then why average excess returns, \mathrm{E}[r_n], vary across the N assets even though they all have the same price today by construction.

The answer hinges on the behavior of the stochastic discount factor, m, in Equation (1). What is this thing? Everyone knows that it is better to have \mathdollar 1 today than \mathdollar 1 tomorrow, and the present value of an asset that pays out \mathdollar 1 tomorrow is called the discount factor. Sometimes important stuff will happen in the next 24 hours that changes how awesome it is to have an additional \mathdollar 1 tomorrow. As a result, the realized discount factor is a random variable each period (i.e., it follows a stochastic process). e.g., if agents have utility, \mathrm{U}_0 = \mathrm{E}_0 \sum_{t \geq 0} e^{-\rho \cdot t} \cdot c_t^{1-\theta}, then the stochastic discount factor is m = e^{-\rho - \theta \cdot \Delta \log c} and the stuff (i.e., risk factor) is changes in log consumption.

asset-pricing-theory

An asset pricing model is a machine which takes as inputs a) each agent’s preferences, b) each agent’s information, and c) a list of the relevant risk factors affecting how agents discount the future and produces a stochastic discount factor as its output. In this post, I show how to test an asset pricing model using the cross-section of asset returns. i.e., by linking how average excess returns vary across assets to each asset’s exposure to the risk factors governing the behavior of the stochastic discount factor.

2. Theoretical Predictions

The key to massaging Equation (1) into a form that can be taken to the data is noticing that for any 2 random variables u and v, the following identity holds:

(2)   \begin{equation*}  \mathrm{E}[u\cdot v] = \mathrm{Cov}[u,v] + \mathrm{E}[u] \cdot \mathrm{E}[v] \end{equation*}

Thus, if I let u denote the stochastic discount factor and v denote any of the N excess returns, I can link the expected excess return from holding an asset to its covariance with the stochastic discount factor:

(3)   \begin{align*} \mathrm{E}[r_n] &= \frac{\mathrm{Cov}[m, r_n]}{\mathrm{Var}[m]} \cdot \left( - \frac{\mathrm{Var}[m]}{\mathrm{E}[m]} \right) \end{align*}

The first term is dimensionless and represents the amount of exposure asset n has to the risk factor x. The second term has dimension \sfrac{1}{\Delta t}, is common across all assets, and represents the price of exposure to the risk factor x since it has the same units as the expected return \mathrm{E}[r_n]. Asset pricing theories say that each asset’s expected return should be proportional to the market-wide price of risk where the constant of proportionality is the asset’s “exposure” to that risk factor.

What does “exposure” mean here? To answer this question I need to put a bit more structure on the stochastic discount factor, m, and the excess return, r_n. I remain agnostic about which asset pricing model actually governs returns and which risk factors affect discount rates, but to avoid writing out lots of messy matrices I do assume that there is only a single factor, x, with \mathrm{E}[x] = \mu_x and \mathrm{Var}[x] = \sigma_x^2. I then write the stochastic discount factor as the sum of a function of x, \mathrm{M}(x), and some noise, y \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_y^2):

(4)   \begin{align*} m &= \mathrm{M}(x) + y \\ &= \mathrm{M}(\mu_x) + \mathrm{M}'(\mu_x) \cdot (x - \mu_x) + \frac{1}{2} \cdot \mathrm{M}''(\mu_x) \cdot (x - \mu_x)^2 + \text{``h.o.t.''} + y \\ &\approx \phi + \chi \cdot (x - \mu_x) + \frac{\psi}{2} \cdot (x - \mu_x)^2 + y \end{align*}

where I use a Taylor expansion to linearize the function \mathrm{M}(x) around the point x = \mu_x and assume terms of order \mathrm{O}(x - \mu_x)^3 are negligible so that \mathrm{E}[m] = \phi + \sfrac{\psi}{2} \cdot \sigma_x^2 and \mathrm{Var}[m] = \chi^2 \cdot \sigma_x^2 + \sigma_y^2. This means that if the risk factor is \sigma_x larger than expected, (x - \mu_x) = \sigma_x, then agents value having an additional \mathdollar 1 tomorrow \chi \cdot \sigma_x more than usual. Similarly, suppose each excess return is the sum of an asset-specific function of x, \mathrm{R}_n(x), and some asset-specific noise, z_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_z^2):

(5)   \begin{align*} r_n  &= \mathrm{R}_n(x) + z_n \\ &= \mathrm{R}_n(\mu_x) + \mathrm{R}_n'(\mu_x) \cdot (x - \mu_x) + \frac{1}{2} \cdot \mathrm{R}_n''(\mu_x) \cdot (x - \mu_x)^2 + \text{``h.o.t.''} + z_n \\ &\approx \alpha_n + \beta_n \cdot (x - \mu_x) + \frac{\gamma_n}{2} \cdot (x - \mu_x)^2 + z_n \end{align*}

where I use a Taylor expansion to linearize the function \mathrm{R}_n(x) around the point x = \mu_x and assume \mathrm{O}(x - \mu_x)^3 terms are negligible so that \mathrm{E}[r_n] = \alpha_n + \sfrac{\gamma_n}{2} \cdot \sigma_x^2 and \mathrm{Var}[r_n] = \beta_n^2 \cdot \sigma_x^2 + \sigma_z^2. This means that if the risk factor is \sigma_x larger than expected, (x - \mu_x) = \sigma_x, then asset n‘s realized excess returns will be \beta_n \cdot \sigma_x larger than average.

Plugging Equations (4) and (5) into Equation (3) then shows exactly what “exposure” to the risk factor means:

(6)   \begin{equation*} \begin{split} \mathrm{E}[r_n] &= \frac{\mathrm{Cov}[m,r_n]}{\mathrm{Var}[m]} \cdot \left( - \, \frac{\mathrm{Var}[m]}{\mathrm{E}[m]} \right) \\ &= \frac{\chi \cdot \beta_n \cdot \sigma_x^2}{\chi^2 \cdot \sigma_x^2 + \sigma_y^2} \cdot \left( - \, \frac{\chi^2 \cdot \sigma_x^2 + \sigma_y^2}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} \right) \\ &= - \, \left( \frac{\chi \cdot \sigma_x^2}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} \right) \cdot \beta_n \\ &= \text{Constant} \times \beta_n \end{split} \end{equation*}

Each asset’s exposure to the risk factor x is summarized by the coefficient \beta_n. Assets which have higher realized returns when the risk factor is high (have a large \beta_n) will have lower average returns (high prices) since these assets are good hedges against the risk factor. i.e., these assets look like insurance. Equation (1)’s empirical content is then that an asset’s average excess returns, \langle r_n \rangle, is proportional to its exposure to the risk factor, \beta_n, where the constant of proportionality is the same for all assets:

(7)   \begin{align*} \mathrm{E}[r_n]  = \underbrace{\alpha_n + \frac{\gamma_n}{2} \cdot \sigma_x^2}_{\text{Realized } \langle r_n \rangle} =  \underbrace{- \, \left( \frac{\chi \cdot \sigma_x^2}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} \right) \cdot \beta_n}_{\text{Predicted}} \end{align*}

By letting y,z_n \searrow 0 we can interpret this relationship as a realization of the first Hansen-Jagannathan bound:

(8)   \begin{align*} \frac{\mathrm{StD}[m_{t+1}]}{\mathrm{E}[m_{t+1}]} = \frac{\chi \cdot \sigma_x}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} = \frac{\alpha_n + \frac{\gamma_n}{2} \cdot \sigma_x^2}{\beta_n \cdot \sigma_x} = \left| \frac{\mathrm{E}[r_{n,t+1}]}{\mathrm{StD}[r_{n,t+1}]} \right| \end{align*}

3. Empirical Strategy

To test Equation (7), an econometrician has to estimate (2 \cdot N + 2) unknown parameters:

(9)   \begin{align*} \widehat{\boldsymbol \theta} = \begin{bmatrix} \widehat{\mu}_x & \widehat{\alpha}_1 & \cdots & \widehat{\alpha}_N & \widehat{\beta}_1 & \cdots & \widehat{\beta}_N & \widehat{\lambda} \end{bmatrix}^{\top} \end{align*}

using T periods of observations. i.e., there are 2 parameters for each asset (its average excess return and its factor exposure) as well as 2 market-wide parameters (the mean of the risk factor and the market price of risk). There are (3 \cdot N + 1) moment conditions with which to estimate these parameters via GMM, so the system is over-identified whenever there are N > 1 assets:

(10)   \begin{align*} \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}  &= \mathrm{E}[\mathrm{G}(\widehat{\boldsymbol \theta};\mathbf{r}_t,x_t)] = \mathrm{E} \begin{bmatrix} x_t - \widehat{\mu}_x \\ r_{1,t} - \left\{ \widehat{\alpha}_1 + \widehat{\beta}_1 \cdot (x_t - \widehat{\mu}_x) \right\} \\ \vdots \\ r_{N,t} - \left\{ \widehat{\alpha}_N + \widehat{\beta}_N \cdot (x_t - \widehat{\mu}_x) \right\} \\ \left( r_{1,t} - \left\{ \widehat{\alpha}_1 + \widehat{\beta}_1 \cdot (x_t - \widehat{\mu}_x) \right\} \right) \cdot (x_t - \widehat{\mu}_x) \\ \vdots \\ \left( r_{N,t} - \left\{ \widehat{\alpha}_N + \widehat{\beta}_N \cdot (x_t - \widehat{\mu}_x) \right\} \right) \cdot (x_t - \widehat{\mu}_x) \\ r_{1,t} - \widehat{\beta}_1 \cdot \widehat{\lambda} \\ \vdots \\ r_{N,t} - \widehat{\beta}_N \cdot \widehat{\lambda} \end{bmatrix} \end{align*}

The first equation pins down the mean of the factor x. The following (2 \cdot N) equations identify the \{\widehat{\alpha}_n,\widehat{\beta}_n\}_{n=1}^{N} parameters governing the relationship between the risk factor and each asset’s excess returns. The final N equations pin down the market price of risk, \widehat{\lambda}, for exposure to the risk factor x. A risk is “priced” if \widehat{\lambda} \neq 0.
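Here is a minimal sketch of how you might take these moment conditions to data in Python/NumPy. It is a two-step sample analog rather than full one-step GMM: the first (2 \cdot N + 1) conditions are exactly identified, so they are solved by time-series means and OLS slopes, and the last N over-identified conditions are fit with an identity weighting matrix, which makes \widehat{\lambda} a cross-sectional regression of average excess returns on the estimated betas without an intercept. The simulated data (sample size, betas, noise volatilities) are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
T, N = 600, 10
mu_x, sigma_x, sigma_z, lam = 0.0, 0.2, 0.1, 0.05
beta = rng.uniform(0.5, 1.5, N)                # true factor exposures

x = rng.normal(mu_x, sigma_x, T)               # risk factor
z = rng.normal(0.0, sigma_z, (T, N))           # asset-specific noise
r = beta * lam + np.outer(x - mu_x, beta) + z  # T x N excess returns with E[r_n] = beta_n * lam

# Sample analogs of the moment conditions in Equation (10)
mu_hat = x.mean()                              # first condition
x_dm = x - mu_hat
beta_hat = x_dm @ (r - r.mean(axis=0)) / (x_dm @ x_dm)         # orthogonality conditions -> OLS slopes
alpha_hat = r.mean(axis=0) - beta_hat * x_dm.mean()            # = r.mean(axis=0) since x_dm has mean 0
lam_hat = (beta_hat @ r.mean(axis=0)) / (beta_hat @ beta_hat)  # last N conditions, identity weights

print(lam_hat)  # close to the true lam = 0.05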

Note that this empirical strategy doesn’t pin down every single one of the parameters governing the relationship between the stochastic discount factor and each asset’s excess returns. e.g., the parameter estimates \widehat{\alpha}_n and \widehat{\lambda} are composites of several deep parameters:

(11)   \begin{align*} \widehat{\alpha}_n &= \alpha_n + \frac{\gamma_n}{2} \cdot \sigma_x^2 \\ \widehat{\lambda} &= - \, \left( \frac{\chi \cdot \sigma_x^2}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} \right) \end{align*}

The underlying parameters \alpha_n and \gamma_n as well as \phi, \chi, and \psi are not separately identifiable from this approach since perturbations of them that satisfy the following conservation laws leave the estimates for \widehat{\alpha}_n and \widehat{\lambda} unchanged:

(12)   \begin{align*} \frac{\partial \widehat{\alpha}_n}{\partial \alpha_n} \cdot \Delta \alpha_n + \frac{\partial \widehat{\alpha}_n}{\partial \gamma_n} \cdot \Delta \gamma_n = 0 &= \Delta \alpha_n + \frac{\sigma_x^2}{2} \cdot \Delta \gamma_n \\ \frac{\partial \widehat{\lambda}}{\partial \phi} \cdot \Delta \phi + \frac{\partial \widehat{\lambda}}{\partial \chi} \cdot \Delta \chi + \frac{\partial \widehat{\lambda}}{\partial \psi} \cdot \Delta \psi = 0 &= \left( \frac{\chi}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} \right) \cdot \{\Delta \phi + \frac{\sigma_x^2}{2} \cdot \Delta \psi\} - \Delta \chi \end{align*}

e.g., if you increase \alpha_n by \epsilon \approx 0^+ and decrease \gamma_n by \frac{2}{\sigma_x^2} \cdot \epsilon, then the estimate of \widehat{\alpha}_n remains unchanged.
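Here is that perturbation checked numerically (arbitrary values for \sigma_x, \alpha_n, \gamma_n, and \epsilon):

# The perturbed (alpha_n, gamma_n) pair implies the same composite alpha_hat_n
sigma_x, alpha_n, gamma_n, eps = 0.2, 0.02, 0.30, 1e-3

alpha_hat_before = alpha_n + 0.5 * gamma_n * sigma_x ** 2
alpha_hat_after  = (alpha_n + eps) + 0.5 * (gamma_n - 2 * eps / sigma_x ** 2) * sigma_x ** 2

print(alpha_hat_before, alpha_hat_after)  # identical up to floating-point error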

4. Time Scale Considerations

There is a hidden assumption floating around behind the empirical strategy outlined in Section 3 above. Namely, that each asset’s factor exposure and the market price of risk are constant over time. In practice, this is surely not the case, as documented in Jagannathan and Wang (1996) and Lewellen and Nagel (2006). OK… so assuming constant factor exposures and prices of risk is an approximation. Fine. How good/bad an approximation is it? e.g., Fama and MacBeth (1973) use rolling T = 60 month windows to estimate each asset’s \widehat{\beta}_n. Is this too long a window relative to how much factor exposures vary over time? Alternatively, should we be using a longer window to more accurately pin down these parameters? It turns out that the estimation strategy gives some guidance about the relationship between the optimal estimation window and parameter persistence, which I discuss below.

First, I model the evolution of the true parameters. To test an asset pricing model using the cross-section of excess returns, we want to know whether or not the market price of risk is 0, i.e., whether \widehat{\lambda} is statistically distinguishable from 0. Suppose the true market price of risk starts out at \lambda and follows a random walk:

(13)   \begin{align*} \lambda_T = \lambda + \sum_{t=1}^T l_t \end{align*}

where l_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_l^2) so that the final \lambda_T is a random variable with distribution:

(14)   \begin{align*}  \lambda_T \sim \mathrm{N}(\lambda, T \cdot \sigma_l^2) \end{align*}

Second, I note that the estimation strategy outlined in Section 3 above gives a signal, \widehat{\lambda}, about the average market price of risk with distribution:

(15)   \begin{align*} \widehat{\lambda} \sim \mathrm{N}\left(\lambda, \sfrac{\sigma_s^2}{T}\right) \end{align*}

where s_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_s^2) denotes the per-period estimation error from the GMM procedure, so that \widehat{\lambda} = \lambda + \sfrac{1}{T} \cdot \sum_{t=1}^T s_t. There is an additional complication to consider. Namely, if the true market price of risk is floating around during the estimation period, this adds extra noise to the parameter estimates and increases \sigma_s^2. To keep things simple, suppose that nature sets the market price of risk to \lambda at the beginning of the estimation sample and that it remains constant during the estimation period. Then, \lambda_T is revealed at the end of period T and prevails afterwards. Because this simplification understates \sigma_s^2, the derivations below should be read as inequalities.

What I really care about is the distance between the true \lambda_T at the end of the sample, which governs the market going forward, and the GMM estimate \widehat{\lambda}. Thus, I should choose the sample length, T, to minimize the mean squared error (the cross term drops out of the expectation because the random-walk innovations, l_t, are independent of the estimation errors, s_t):

(16)   \begin{align*} T  = \arg \min_{T \geq 0} \mathrm{E}\left[ (\lambda_T - \widehat{\lambda})^2 \right] = \arg \min_{T \geq 0} \mathrm{E}\left[ (\lambda_T - \lambda)^2 + (\lambda - \widehat{\lambda})^2 \right] \end{align*}

The first term is just \mathrm{E}[(\lambda_T - \lambda)^2] = T \cdot \sigma_l^2, and the second term is the posterior variance of \lambda after updating on T noisy signals. As a result, to find the optimal T I take the first-order condition:

(17)   \begin{align*} 0 = \frac{d}{dT} \left[ T \cdot \sigma_l^2 + \left(\frac{1}{\sigma_{\lambda}^2} + \frac{T}{\sigma_s^2} \right)^{-1} \right] \end{align*}

where \sigma_{\lambda}^2 denotes the variance of my prior beliefs about \lambda, the market price of risk governing the estimation sample. The solution to this equation defines the window length, T, that optimally trades off the benefit of getting a more precise estimate of \lambda against the cost of that estimate becoming less relevant as \lambda_T drifts away.

The GMM procedure ties \sigma_s^2 to the parameters of the underlying model. To keep things simple, suppose there is only 1 asset and 4 unknown parameters:

(18)   \begin{align*} \widehat{\boldsymbol \theta} = \begin{bmatrix} \widehat{\mu}_x & \widehat{\alpha} & \widehat{\beta} & \widehat{\lambda} \end{bmatrix}^{\top} \end{align*}

so that the system of estimation equations reduces to:

(19)   \begin{align*} \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}  &= \mathrm{E}[\mathrm{G}(\widehat{\boldsymbol \theta};r_t,x_t)] = \mathrm{E} \begin{bmatrix} x_t - \widehat{\mu}_x \\ r_t - \left\{ \widehat{\alpha} + \widehat{\beta} \cdot (x_t - \widehat{\mu}_x) \right\} \\ \left( r_t - \left\{ \widehat{\alpha} + \widehat{\beta} \cdot (x_t - \widehat{\mu}_x) \right\} \right) \cdot (x_t - \widehat{\mu}_x) \\ r_t - \widehat{\beta} \cdot \widehat{\lambda} \end{bmatrix} \end{align*}

This assumption means that I don’t have to consider how learning about one asset affects my beliefs about another asset. In this world, if x_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(\mu_x,\sigma_x^2), then GMM reduces to OLS and \sigma_s^2 = \sfrac{\sigma_z^2}{\beta_n^2} since:

(20)   \begin{align*} r_{n,t} = \beta_n \cdot \lambda  + \beta_n \cdot (x_t - \mu_x) + z_{n,t} \end{align*}
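A small simulation makes the reduction to OLS concrete (Python/NumPy, made-up parameter values): solving the four sample moment conditions in Equation (19) amounts to taking the sample mean of x_t, running OLS of r_t on (x_t - \widehat{\mu}_x), and backing out \widehat{\lambda} from the last condition.

import numpy as np

rng = np.random.default_rng(1)
T = 10_000
mu_x, sigma_x, sigma_z = 0.0, 0.2, 0.1
beta_n, lam = 0.8, 0.05

x = rng.normal(mu_x, sigma_x, T)
z = rng.normal(0.0, sigma_z, T)
r = beta_n * lam + beta_n * (x - mu_x) + z          # Equation (20)

# Sample analogs of Equation (19)
mu_hat = x.mean()
x_dm = x - mu_hat
beta_hat = (x_dm @ (r - r.mean())) / (x_dm @ x_dm)  # OLS slope
alpha_hat = r.mean()                                # intercept, since x_dm has mean 0
lam_hat = r.mean() / beta_hat                       # last condition: E[r_t] = beta_hat * lam_hat

print(beta_hat, lam_hat)                            # close to the true 0.8 and 0.05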

Evaluating the first-order condition then gives:

(21)   \begin{align*} 0 = \sigma_l^2 - \left(\frac{1}{\sigma_{\lambda}^2} + \frac{T}{\sfrac{\sigma_z^2}{\beta_n^2}} \right)^{-2} \cdot \frac{1}{\sfrac{\sigma_z^2}{\beta_n^2}} \end{align*}

Solving for T yields:

(22)   \begin{align*} T &\geq \max\left\{ \, 0, \, \frac{\sigma_z}{\beta_n \cdot \sigma_l} - \frac{\sigma_z^2}{\beta_n^2 \cdot \sigma_{\lambda}^2} \, \right\} \end{align*}
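Here is a quick numeric cross-check of this solution (made-up values for \sigma_z, \beta_n, \sigma_l, and \sigma_{\lambda}): it compares the interior solution above against a brute-force grid search over the objective T \cdot \sigma_l^2 + (\sfrac{1}{\sigma_{\lambda}^2} + \sfrac{T}{\sigma_s^2})^{-1} with \sigma_s = \sfrac{\sigma_z}{\beta_n}.

import numpy as np

# Made-up values, purely for illustration
sigma_z, beta_n = 0.10, 0.80
sigma_l, sigma_lam = 0.005, 0.50

sigma_s = sigma_z / beta_n                           # per-period signal noise
T_closed = max(0.0, sigma_s / sigma_l - sigma_s ** 2 / sigma_lam ** 2)

T_grid = np.arange(0, 500)
mse = T_grid * sigma_l ** 2 + 1.0 / (1.0 / sigma_lam ** 2 + T_grid / sigma_s ** 2)
T_brute = T_grid[np.argmin(mse)]

print(T_closed, T_brute)  # both around T = 25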

Let’s plug in some values to make sure this formula makes sense. First, notice that if the market price of risk is constant, \lambda_T = \lambda, then \sigma_l = 0 and you should pick T = \infty or as large as possible. Second, notice that if you already know the true \lambda, then \sigma_{\lambda}^2 = 0 and you should pick T = 0. Finally, notice that if the test asset has no exposure to the risk factor, \beta_n = 0, then the equation is undefined since any window length gives you the same amount of information—i.e., none.