Effective Financial Theories

1. Introduction

One of the most astonishing things about financial markets is that there is interesting economics operating at so many different scales. Yet, no one would ever guess this fact by looking at standard asset pricing theory. To illustrate, take a look at the canonical Euler equation:

(1) $\begin{align*} p_{n,t} &= \mathrm{E}_t \left[ m_{t+1} \cdot \left(p_{n,t+1} + d_{n,t+1}\right) \right] \end{align*}$

Here, $p_{n,t}$ and $d_{n,t}$ denote the ex-dividend price and dividend payout of the $n$ th asset in the economy at time $t$ , $m_{t+1}$ denotes the prevailing stochastic discount factor, and $\mathrm{E}_t(\cdot)$ denotes the conditional expectations operator given time $t$ information. Equation (1) says that the price of the $n$ th asset in the current period, $t$ , is equal to the expected discounted value of the asset’s price and dividend payout in the following period, $(t+1)$ . At first glance this formulation seems perfectly sensible, but a closer look reveals two striking features:

Time is dimensionless. i.e., Equation (1) is written in sequence time, not wall-clock time. Each period could equally well represent a millisecond, an hour, a year, a millennium, or anything in between. We usually think of the stochastic discount factor, $m_{t+1}$, as a function of traders’ utility from aggregate consumption. Thus, as Cochrane (2001) points out, if “stocks go up between 12:00 and 1:00, it must be because (on average) we all decided to have a big lunch…. this seems silly.”
The total number of stocks doesn’t show up anywhere in Equation (1). Not only do traders have to know when there is a profitable arbitrage opportunity somewhere out there in the market, they also have to find out exactly where this opportunity is and deploy the necessary funds and expertise to exploit it. Where’s Waldo? puzzles are hard for a reason. Identifying and trading into arbitrage opportunities is a fundamentally different activity when searching through $10000$ rather than $10$ predictors. More is different. This is the key insight highlighted in Chinco (2012).

In this post, I start by writing down a simple statistical model of returns in Section 2 which allows for shocks at different time horizons and across asset groupings of various sizes. Then, in Sections 3 and 4, I show how shocks at vastly different scales are difficult for traders to spot (…let alone act on). Such shocks can look like noise to “distant” traders in a mathematically precise sense. In Section 5, I conclude with a discussion of these observations. The key takeaway is that financial theories do not necessarily need to be globally applicable to make effective local predictions. e.g., a theory governing the optimal behavior of a high frequency trader may not have any testable predictions at the quarterly investment horizon where institutional investors operate.

2. Statistical Model

I start by writing down a statistical model of returns that allows for shocks at different time scales and across asset groupings of different sizes. e.g., Apple’s stock returns might be simultaneously affected by not only bid-ask bounce at the $100{\scriptstyle \mathrm{ms}}$ investment horizon but also momentum at the $1{\scriptstyle \mathrm{mo}}$ investment horizon. Alternatively, at the $1{\scriptstyle \mathrm{qtr}}$ investment horizon, Apple might realize both an earnings announcement shock and a national economic shock felt by all US firms.

Let $\hbar$ denote the smallest investment horizon, so that all other time scales are indexed by an $A_h = 1,2,3,\ldots$ :

(2) $\begin{align*} h &= A_h \cdot \hbar \end{align*}$

For concreteness, you might think about $\hbar = (\mathrm{something}) \times 10^{-3}{\scriptstyle \mathrm{sec}}$ in modern asset markets. Thus, for a monthly investment horizon, $A_{\mathrm{month}} = (\mathrm{something}) \times 10^9$, meaning that asset market investment horizons span somewhere between $9$ and $11$ orders of magnitude from high frequency traders to buy and hold value investors. This is similar to the ratio of the diameter of the sun to the height of a human.

Source: Delphix.
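The orders of magnitude quoted above can be checked with a few lines of arithmetic. This is only a rough sketch: the choice of $\hbar = 1{\scriptstyle \mathrm{ms}}$, the 30-day month, and the approximate physical sizes are all assumptions made for illustration.

```python
import math

# Assumed values for illustration only.
HBAR_SECONDS = 1e-3                      # smallest investment horizon: 1 millisecond
SECONDS_PER_MONTH = 30 * 24 * 60 * 60    # a 30-day month

# A_month is the number of hbar-sized intervals in one month (Equation 2).
a_month = SECONDS_PER_MONTH / HBAR_SECONDS
orders_of_magnitude = math.log10(a_month)

# Approximate physical sizes, in meters, for the comparison in the text.
SUN_DIAMETER_M = 1.39e9
HUMAN_HEIGHT_M = 1.7
size_ratio = SUN_DIAMETER_M / HUMAN_HEIGHT_M

print(f"A_month ~ {a_month:.2e} ({orders_of_magnitude:.1f} orders of magnitude)")
print(f"sun diameter / human height ~ {size_ratio:.2e}")
```

Both ratios land near $10^9$ under these assumptions; a smaller $\hbar$ or a longer holding period pushes the span toward the upper end of the range quoted in the text.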

Let $r_n(t,h)$ denote the log price change of the $n$ th stock from time $t$ through time $(t + h)$ :

(3) $\begin{align*} r_n(t,h) &= \log p_n(t+h) - \log p_n(t) = \sum_{q=1}^Q \delta_q(t,h) \cdot x_{n,q} + \epsilon_n(t,h) \end{align*}$

where $x_{n,q} \in \{0,1\}$ denotes whether or not stock $n$ has attribute $q$, $\delta_q(t,h)$ denotes the mean growth rate in the price of all stocks with attribute $q$ from time $t$ through time $(t+h)$, and $\epsilon_n(t,h)$ denotes idiosyncratic noise in stock $n$’s percent return from time $t$ through time $(t+h)$. e.g., suppose that the mean growth rate of all technology stocks from January $1$st, 1999 through the end of January $31$st, 1999 was $120{\scriptstyle \%/\mathrm{yr}}$ or $10{\scriptstyle \%/\mathrm{mo}}$. Then, I would write that:

(4) $\begin{align*} \delta_{\mathrm{technology}}(\mathrm{Jan}1999,1{\scriptstyle \mathrm{mo}}) &= 0.10 \end{align*}$

and Intel (ticker INTC) would realize a $10{\scriptstyle \%/\mathrm{mo}}$ boost in its January, 1999 returns since:

(5) $\begin{align*} x_{\mathrm{INTC},\mathrm{technology}} = 1 \end{align*}$

The price shocks, $\delta_q(t,h)$ , take on the form:

(6) $\begin{align*} \delta_q(t,h) &= \sum_{a=0}^{A_h-1} \delta_q(t + a \cdot \hbar,\hbar) \quad \text{with} \quad \delta_q(t,\hbar) = \begin{cases} s_q &\text{w/ prob} \quad \frac{1}{2} \cdot \left( 1 - e^{- f_q \cdot \hbar} \right) \\ 0 &\text{w/ prob} \quad e^{- f_q \cdot \hbar} \\ - s_q &\text{w/ prob} \quad \frac{1}{2} \cdot \left( 1 - e^{- f_q \cdot \hbar} \right) \end{cases} \end{align*}$

The summation captures the idea that all shocks occur in a particular instant and then cumulate over time. e.g., there is a particular time interval, $\hbar$ , during which a news release hits the wire or a market order flashes across the screen. Changes over time intervals longer than $\hbar$ reflect the accumulation of changes across these tiny time intervals. The parameters $s_q$ and $f_q$ control the size and frequency of the $q$ th shock. Each attribute’s size parameter has units of percent per $\hbar$ , and the bigger the $s_q$ the bigger the impact of the $q$ th shock on the returns of all stocks with that attribute. Each attribute’s frequency parameter has units of shocks per $\hbar$ , and the bigger the $f_q$ the more often all stocks with attribute $q$ realize a shock of size $s_q$ . The idiosyncratic return noise is the summation of Gaussian shocks at each $\hbar$ interval:

(7) $\begin{align*} \epsilon_n(t,h) &= \sum_{a=0}^{A_h-1} \epsilon_n(t + a \cdot \hbar,\hbar) \quad \text{with} \quad \epsilon_n(t,\hbar) \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}\left( 0, \sigma_u \cdot \sqrt{\hbar}\right) \end{align*}$
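The model in Equations (3), (6), and (7) can be simulated directly. The sketch below assumes NumPy; the function name `simulate_returns` and every parameter value in the usage lines are hypothetical. It also reads the second argument of $\mathrm{N}(\cdot,\cdot)$ in Equation (7) as a standard deviation, which is an assumption.

```python
import numpy as np

def simulate_returns(x, s, f, sigma_u, hbar, a_h, rng):
    """Simulate r_n(t, h) for h = a_h * hbar, following Equations (3), (6), (7).

    x: (N, Q) array of 0/1 attribute indicators.
    s, f: length-Q arrays of shock sizes and shock frequencies.
    """
    n_stocks, n_attrs = x.shape
    # Probability that attribute q realizes a nonzero shock in one hbar interval.
    p_shock = 1.0 - np.exp(-f * hbar)
    # For each tiny interval: does a shock arrive, and with which sign?
    hit = rng.random((a_h, n_attrs)) < p_shock
    sign = rng.choice([-1.0, 1.0], size=(a_h, n_attrs))
    # Equation (6): shocks of size +/- s_q cumulate over the a_h intervals.
    delta = (hit * sign * s).sum(axis=0)
    # Equation (7): iid Gaussian noise per interval (std dev assumed to be
    # sigma_u * sqrt(hbar)), summed over the a_h intervals.
    eps = rng.normal(0.0, sigma_u * np.sqrt(hbar), size=(a_h, n_stocks)).sum(axis=0)
    # Equation (3): attribute shocks plus idiosyncratic noise.
    return x @ delta + eps, delta, eps

# Hypothetical example: 3 stocks, 2 attributes, 10,000 intervals of length hbar.
rng = np.random.default_rng(0)
x = np.array([[1, 0], [1, 1], [0, 1]])
s = np.array([0.01, 0.02])
f = np.array([5.0, 1.0])
r, delta, eps = simulate_returns(x, s, f, sigma_u=0.001, hbar=0.001, a_h=10_000, rng=rng)
print(r)
```

Because each $\delta_q(t,\hbar)$ is either $0$ or $\pm s_q$, the cumulated shock `delta[q]` is always an integer multiple of `s[q]`.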

3. Time Series

Very different financial theories can operate at vastly different time scales. e.g., attributes that are relevant at the millisecond time horizon will completely wash out by the monthly horizon and vice versa. In this section, I look at only the time series properties of one stock, so I suppress the $n$ subscript and write Equation (3) as:

(8) $\begin{align*} r(t,h) &= \sum_{a = 0}^{A_h-1} r(t+a \cdot \hbar,\hbar) = \sum_{q=1}^Q \delta_q(t,h) \cdot x_q + \epsilon(t,h) \end{align*}$

To see why, consider the problem of a value investor, Alice, operating at the monthly investment horizon. Suppose that she wants to know whether or not her arch nemesis Bill, a high frequency trader operating at the millisecond investment horizon, is actively trading in her asset. e.g., suppose that she is worried that Bill might have found some really clever new predictor that flits in and out of existence before she can take advantage of it. From Alice’s point of view, the random variable $\delta_q(t,\hbar)$ has the unconditional distribution:

(9) $\begin{align*} \begin{split} \mathrm{E}\left[ \delta_q(t,\hbar) \right] &= 0 \\ \mathrm{E}\left[ \delta_q(t,\hbar)^2 \right] &= \left( 1 - e^{- f_q \cdot \hbar} \right) \cdot s_q^2 = \sigma_q^2 \\ \mathrm{E}\left[ \left| \delta_q(t,\hbar) \right|^3 \right] &= \left( 1 - e^{- f_q \cdot \hbar} \right) \cdot s_q^3 = \rho_q \end{split} \end{align*}$

Let $F_{A_h}(x)$ denote the cumulative distribution function of $\delta_q(t,h)/(\sigma_q \cdot \sqrt{A_h})$. i.e., $F_{A_h}(x)$ governs the distribution of the normalized sum of the shocks that Bill sees over the length of each of Alice’s periods. Then, via the Berry-Esseen theorem we have that at the monthly investment horizon:

(10) $\begin{align*} \left| F_{A_h}(x) - \Phi(x) \right| &\leq \frac{0.7655 \cdot \rho_q}{\sigma_q^3 \cdot \sqrt{A_h}} = \frac{1}{\sqrt{A_h}} \cdot \left( \frac{0.7655}{\sqrt{1 - e^{- f_q \cdot \hbar}}} \right) = (\mathrm{something}) \times 10^{-5} \end{align*}$

Equation (10) says that the maximum vertical distance between the CDF of the normalized monthly sum of a variable fluctuating at the $\hbar$ time scale and the CDF of the standard normal distribution is on the order of one part in one hundred thousand.
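To get a feel for the size of the bound in Equation (10), here is a small numerical sketch. The function name `berry_esseen_bound` is hypothetical, and the parameter values (a $1{\scriptstyle \mathrm{ms}}$ smallest horizon, a 30-day month, and one expected shock per millisecond so that $f_q \cdot \hbar = 1$) are assumptions chosen only for illustration.

```python
import math

def berry_esseen_bound(f_q, hbar, a_h, c=0.7655):
    """Right-hand side of Equation (10): c / (sqrt(A_h) * sqrt(1 - exp(-f_q * hbar)))."""
    p_shock = 1.0 - math.exp(-f_q * hbar)  # chance of a nonzero shock per hbar interval
    return c / (math.sqrt(a_h) * math.sqrt(p_shock))

# Assumed values: hbar = 1 ms, a 30-day month, and f_q * hbar = 1.
hbar = 1e-3
a_month = 30 * 24 * 60 * 60 / hbar
bound = berry_esseen_bound(f_q=1000.0, hbar=hbar, a_h=a_month)
print(f"Berry-Esseen bound at the monthly horizon: {bound:.2e}")

# The bound loosens as shocks become rarer relative to hbar.
rare_bound = berry_esseen_bound(f_q=1e-3, hbar=hbar, a_h=a_month)
print(f"Bound when f_q * hbar = 1e-6: {rare_bound:.2e}")
```

Under these assumptions the bound is of order $10^{-5}$. The second call is a reminder that the formula depends on $f_q \cdot \hbar$: a shock that arrives only rarely at the $\hbar$ scale produces a much looser bound.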

This image shows the distance between the cumulative distribution functions of the standard normal distribution, $\Phi(x)$, and the empirical distribution, $F_{A_h}(x)$, as computed above.

There are a couple of ways to put this figure in perspective. First, note that trading strategies have to generate well above $0.5{\scriptstyle \%}$ abnormal returns per month in order to outpace trading costs. Second, note that Alice would need around $10^{10}{\scriptstyle \mathrm{mo}}$ of data to distinguish between a variable drawn from the standard normal distribution and $F_{A_h}(x)$ at this level of granularity via the Kolmogorov–Smirnov test. Thus, Bill’s behavior at the $\hbar$ investment horizon is effectively noise to Alice when looking only at monthly data. In order to figure out what Bill is doing, she has to stoop down to his investment horizon.
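The $10^{10}$ figure can be sanity-checked with the large-sample critical value of the one-sample Kolmogorov–Smirnov test, which rejects at the $5{\scriptstyle \%}$ level when the observed distance exceeds roughly $1.358/\sqrt{n}$. The sketch below inverts that relation; the function name `ks_required_samples` is hypothetical, and the target distance of $10^{-5}$ is taken from the order of magnitude in Equation (10).

```python
def ks_required_samples(distance, c_alpha=1.358):
    """Approximate sample size at which a KS distance of `distance` becomes detectable.

    Uses the asymptotic one-sample critical value D_crit ~ c_alpha / sqrt(n),
    with c_alpha = 1.358 corresponding to a 5% significance level.
    """
    return (c_alpha / distance) ** 2

# Each monthly observation is one draw, so n is measured in months.
months_needed = ks_required_samples(1e-5)
print(f"Months of data needed: {months_needed:.2e}")
```

This is only a rough rule of thumb, since it treats the Berry-Esseen upper bound as if it were the actual distance between the two distributions.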

4. Cross Section

In the same way that different financial theories can operate at different time scales, different financial theories can also operate at vastly different levels of aggregation. On one hand, this statement is a bit obvious. After all, modern financial theory is built on the idea of risk minimization through portfolio diversification, and traders talk about strategies being “market neutral”. On the other hand, diversification is not the only force at work. Financial markets have many assets and traders use a vast number of predictors. What’s more, only a few of these predictors are useful at any point in time. As Warren Buffett says, “If you want to shoot rare, fast-moving elephants, you should always carry a loaded gun.” Pulling the trigger is easy. Finding the elephant is hard. Traders face a difficult search problem when trying to parse new shocks.

Suppose that Alice is a value investor specializing in oil and gas stocks and now wants to figure out where her other arch nemesis, Charlie, is trading in her market. Even if she knows that he is trading at roughly her investment horizon, it may still be hard for her to spot his price impact due to the vast number of possible strategies that he could be employing. In this section I study the $1{\scriptstyle \mathrm{mo}}$ returns of $N$ stocks with $Q=7$ attributes:

(11) $\begin{align*} r_n &= \sum_{q=1}^7 \delta_q \cdot x_{n,q} + \epsilon_n \end{align*}$

where I suppress all the time horizon arguments since I am concerned with the cross-section. For simplicity, suppose that Alice knows that Charlie is making a bet on only $1$ of the $7$ attributes so that:

(12) $\begin{align*} 1 &= \Vert {\boldsymbol \delta} \Vert_{\ell_0} = \sum_{q=1}^7 1_{\{\delta_q \neq 0\}} \end{align*}$

where if $\delta_q \neq 0$ , then $\delta_q = s \gg \sigma_\epsilon$ for all $q =1,2,\ldots,7$ . e.g., Alice is worried that Charlie’s spotted the one way of sorting all oil and gas stocks so that all the stocks with that attribute (e.g., operations in the Chilean Andes) have high returns and all of the stocks without the attribute have low returns. How many stocks does Alice have to follow in order for her to spot the sorting rule—i.e., the non-zero entry in $({\boldsymbol \delta})_{7 \times 1}$ ?

It turns out that Alice only needs to examine $3$ stocks so long as she gets to pick exactly which ones:

Stock $1$ : Has attributes $1$ , $3$ , $5$ , $7$
Stock $2$ : Has attributes $2$ , $3$ , $6$ , $7$
Stock $3$ : Has attributes $4$ , $5$ , $6$ , $7$

The fact that Alice can identify the correct attribute even though she has fewer observations than possible attributes, $Q \gg N$ , is known as compressive sensing and was introduced by Candes and Tao (2005) and Donoho (2006). See Terry Tao’s blog post for an excellent introduction. For example, suppose that only the first stock had high returns of $r_1 \approx s$ :

(13) $\begin{align*} \underbrace{\begin{bmatrix} s \\ 0 \\ 0 \end{bmatrix}}_{(\mathbf{r})_{3 \times 1}} &\approx \underbrace{\begin{bmatrix} 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{bmatrix}}_{(\mathbf{X})_{3 \times 7}} \underbrace{\begin{bmatrix} s \\ 0 \\ \vdots \\ 0 \end{bmatrix}}_{({\boldsymbol \delta})_{7 \times 1}} + \underbrace{\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{bmatrix}}_{({\boldsymbol \epsilon})_{3 \times 1}} \end{align*}$

then Alice can be sure that Charlie has been sorting using the first of the $7$ stock attributes. The interesting part is that Alice can’t identify Charlie’s strategy using fewer than $N = 3$ stocks since:

(14) $\begin{align*} 7 = 2^3 - 1 \end{align*}$

i.e., $3$ stocks give Alice just enough combinations to answer $7$ yes or no questions.
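The decoding step in this example can be written out explicitly. In the sketch below, the matrix `X` is the one from Equation (13); the function name `identify_attribute`, the shock size `s`, and the noise level are hypothetical. The trick is that column $q$ of $\mathbf{X}$ is the binary expansion of $q$, so the pattern of which stocks received the boost spells out the attribute's index.

```python
import numpy as np

# Rows are the three stocks from the text; columns are attributes 1 through 7.
X = np.array([
    [1, 0, 1, 0, 1, 0, 1],   # Stock 1 has attributes 1, 3, 5, 7
    [0, 1, 1, 0, 0, 1, 1],   # Stock 2 has attributes 2, 3, 6, 7
    [0, 0, 0, 1, 1, 1, 1],   # Stock 3 has attributes 4, 5, 6, 7
])

def identify_attribute(returns, s):
    """Return the 1-based index of the single shocked attribute (0 if none)."""
    # A stock "got the boost" if its return is closer to s than to 0.
    bits = (np.asarray(returns) > s / 2).astype(int)
    # Stock 1 is the least significant bit, stock 3 the most significant.
    return int(bits @ np.array([1, 2, 4]))

# Hypothetical check: shock each attribute in turn and decode it from 3 returns.
rng = np.random.default_rng(0)
s = 0.10
for q in range(1, 8):
    delta = np.zeros(7)
    delta[q - 1] = s
    r = X @ delta + rng.normal(0.0, 0.001, size=3)  # small idiosyncratic noise
    print(q, identify_attribute(r, s))
```

With only $2$ stocks there would be $2^2 - 1 = 3$ distinguishable non-zero patterns, which is too few to separate $7$ attributes.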

What’s more, this result generalizes to the case where the data matrix, $\mathbf{X}$ , is stochastic rather than deterministic. i.e., in real life Alice can’t decide how many oil and gas stocks with each attribute are traded each period in order to make it easiest to decipher Charlie’s trading strategy. Donoho and Tanner (2009) show that in a world where $\mathbf{X}$ is a random matrix with Gaussian entries, $x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,1/N)$ , there is a maximum number of predictors, $K^*$ , above which it is impossible for Alice to spot $K > K^*$ relevant attributes from among $Q$ possibilities using only $N$ stocks given by:

(15) $\begin{align*} N &= 2 \cdot K^* \cdot \log(Q/N) \cdot (1 + \mathrm{o}(1)) \end{align*}$

This threshold is summarized in the figure below, replicated from Donoho and Stodden (2006). The $x$-axis runs from $0$ to $1$ and gives values for $N/Q$ summarizing the relative amount of data available to Alice. The $y$-axis also runs from $0$ to $1$ and gives values for $K/N$ summarizing the level of sparsity in the model. The underlying model is:

(16) $\begin{align*} r_n = \mathbf{x}_n {\boldsymbol \delta} + \epsilon_n \end{align*}$

where $\epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0, 1/20)$, ${\boldsymbol \delta}$ is zero everywhere except for $K$ entries which are $1$, and each $x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,1/\sqrt{Q})$ with columns normalized to unit length. The forward stepwise regression procedure enters variables into the model in a sequential fashion, according to greatest $t$-statistic value. The procedure iteratively takes the single regressor with the highest $t$-statistic until no remaining regressor clears the $\sqrt{2 \cdot \log Q}$ threshold (i.e., the Bonferroni threshold), which is roughly $3.25$ when $Q = 200$. The $K^*$ threshold given by Donoho and Tanner (2009) then corresponds to the white diagonal line cutting through the phase space, above which the forward stepwise regression procedure fails and below which it succeeds.

This figure shows the average prediction error $\Vert {\boldsymbol \delta} - \hat{\boldsymbol \delta} \Vert_{\ell_2}^2/\Vert {\boldsymbol \delta} \Vert_{\ell_2}^2$ from the forward stepwise regression procedure described above.
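The forward stepwise procedure described above can be sketched in a few lines of NumPy. This is not the code behind the figure: the function name `forward_stepwise`, the problem sizes, and the noise level (read here as a standard deviation of $1/20$, which is an assumption) are all chosen for illustration, and the example sits well inside the region where recovery is expected to succeed.

```python
import numpy as np

def forward_stepwise(X, r):
    """Greedy forward selection by largest |t|-statistic, stopping at sqrt(2 log Q)."""
    n, q = X.shape
    threshold = np.sqrt(2.0 * np.log(q))
    selected = []
    for _ in range(min(n - 2, q)):            # keep at least 2 residual degrees of freedom
        best_t, best_j = 0.0, None
        for j in range(q):
            if j in selected:
                continue
            cols = selected + [j]
            A = X[:, cols]
            beta, _, rank, _ = np.linalg.lstsq(A, r, rcond=None)
            if rank < len(cols):               # skip collinear candidates
                continue
            resid = r - A @ beta
            sigma2 = resid @ resid / (n - len(cols))
            cov = sigma2 * np.linalg.inv(A.T @ A)
            t_stat = abs(beta[-1]) / np.sqrt(cov[-1, -1])  # t-stat of the candidate
            if t_stat > best_t:
                best_t, best_j = t_stat, j
        if best_j is None or best_t < threshold:
            break                              # nothing left clears the threshold
        selected.append(best_j)
    coef = np.zeros(q)
    if selected:
        coef[selected] = np.linalg.lstsq(X[:, selected], r, rcond=None)[0]
    return coef, selected

# Hypothetical example: N = 100 stocks, Q = 200 attributes, K = 3 relevant ones.
rng = np.random.default_rng(0)
n_stocks, n_attrs, k = 100, 200, 3
X = rng.normal(size=(n_stocks, n_attrs))
X /= np.linalg.norm(X, axis=0)                 # columns normalized to unit length
delta = np.zeros(n_attrs)
delta[:k] = 1.0
r = X @ delta + rng.normal(0.0, 0.05, size=n_stocks)

delta_hat, selected = forward_stepwise(X, r)
rel_error = np.sum((delta - delta_hat) ** 2) / np.sum(delta ** 2)
print(sorted(selected), rel_error)
```

Because the stopping rule compares against the largest of many candidate $t$-statistics, an occasional spurious regressor with a small coefficient can slip in; the relative error is the more informative summary, and it is the quantity plotted in the figure.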

The interesting part about this result is that this bound on $K^*$ comes from a deep theorem in high-dimensional geometry which relates both compressive sensing and error-correcting codes, as suggested by the deterministic example above. It is not due to any nitty-gritty details of Alice’s search problem. Notice how the original bound in the $Q=7$ and $N=3$ example has an information theoretic interpretation! Thus, Charlie can hide behind the sheer number of possible explanations in the cross section in the same way that Bill can hide behind the sheer number of observations in the time series.

5. Discussion

The speed at which traders interact has greatly increased over the past decade. e.g., Spread Networks invested approximately $\mathdollar 300{\scriptstyle \mathrm{mil}}$ in a new fiber optic cable linking New York and Chicago via the straightest possible route, saving about $100$ miles and shaving $6{\scriptstyle \mathrm{ms}}$ off their delay. Table 5 in Pagnotta and Philippon (2012) documents the many investments in speed made by exchanges around the world. What’s more, trading behavior at this time scale seems to be decoupled from asset fundamentals. i.e., it’s unlikely that a stock’s value truly follows any of the patterns found in one of Nanex’s crop-circle-of-the-day plots. Motivated by events such as the flash crash, there has been a great deal of discussion in recent years about the impact of high frequency trading on asset prices and welfare.

However, the rough calculations above suggest that traders with a monthly investment horizon might not even care about second-to-second fluctuations in asset prices. e.g., think of how high and low frequency bands of the same radio wave can carry rock and classical music to your FM radio receiver without interfering with one another. High frequency trading may be revealing nothing about the fundamental value of the companies in the marketplace, but just because these traders make short-run returns behave strangely doesn’t mean that they will ruin the market for institutional investors trading at a longer horizon. In this light, perhaps the canonical Euler equation needs to have some additional input parameters, $N$, $Q$ and $h$:

(17) $\begin{align*} p_n(t) &= \mathrm{E}_t\left[ m_{N,Q}(t,h) \cdot \left\{ p_n(t+h) + d_n(t+h) \right\} \right] \end{align*}$

which define the range over which the theory is effective?