Scaling Up “Iffy” Decisions

1. Introduction

Imagine you are an algorithmic trader, and you have to set up a trading platform. How many signals should you try to process? How many assets should you trade? If you are like most people, your answer will be something like: “As many as I possibly can given my technological constraints.” Most people have the intuition that the only thing holding back algorithmic trading is computational speed. They think that left unchecked, computers will eventually uncover every trading opportunity no matter how small or obscure. e.g., as Scott Patterson writes in Dark Pools, these computerized trading platforms are just “tricked-out artificial intelligence systems designed to scope out hidden pockets in the market where they can ply their trades.”

This post highlights another concern. As you use computers to discover more and more “hidden pockets in the market”, the scale of these trades might grow faster than the usefulness of your information. Weak and diffuse signals might turn out to be really risky to trade on. How might this work? e.g., suppose that in order for your computers to recognize the impact of China’s GDP on US stock returns you need to feed them data for 500 assets; however, in order for your computers to recognize the impact of recent copper discoveries in Guatemala on electronics manufacturing company returns, you need to feed your machines data on 5000 assets. Even if the precision of the second signal that your computers spit out is the same as the first, its magnitude will be smaller. Copper discoveries in Guatemala just don’t matter as much. Thus, you would take on more risk by aggressively trading the second signal because it’s weaker and you would have to take on a position in 10 times as many assets! The upshot is that even if you want to maximize the number of inputs your machines get, you might want to limit how broadly you apply them.

First, in Section 2 I illustrate how the scale of a trading opportunity might increase faster than the precision of your signal via an example based on the well-known Hodges’ Estimator. You should definitely check out Larry Wasserman’s excellent post on this topic for an introduction to the ideas. In Section 3 I then outline the basic decision problem I have in mind. In Section 4 I show how the risk associated with trading weak signals might explode as described above. Finally, in Section 5 I conclude by suggesting an empirical application of these ideas.

2. Hodges’ Estimator

Think about the following problem. Suppose that m_1, m_2, \ldots, m_Q \overset{\scriptscriptstyle \mathrm{iid}}{\sim} N(\mu,1) are random variables denoting (say) innovations in the quarterly dividends of a group of Q stocks:

(1)   \begin{align*} m_q &= d_{q,t+1} - \mathrm{E}_t[d_{q,t+1}] \end{align*}

You want to know if the mean of the dividend changes is non-zero. i.e., has this group of stocks realized a shock? Define the estimator \mathrm{Hdg}[Q] as follows:

(2)   \begin{align*} \mathrm{Hdg}[Q] &= \begin{cases} \mathrm{Avg}[Q] &\text{if } |\mathrm{Avg}[Q]| \geq Q^{-1/4} \\ 0 &\text{else } \end{cases} \end{align*}

where \mathrm{Avg}[Q] = \frac{1}{Q} \cdot \sum_{q=1}^Q m_q. This estimator says: “If the average of the dividend changes is sufficiently big, I’m going to assume there has been a shock of size \mathrm{Avg}[Q]; otherwise, I’ll assume that there’s been no shock.” This is an example of a Hodges-type estimator. \mathrm{Hdg}[Q] is a consistent estimator of \mu in the sense that:

(3)   \begin{align*} \sqrt{Q} \cdot (\mathrm{Hdg}[Q] - \mu) &\overset{\scriptscriptstyle \mathrm{Dist}}{\to}  \begin{cases} N(0,1) &\text{if } \mu \neq 0 \\ 0 &\text{if } \mu = 0 \end{cases} \end{align*}

Thus, as you examine more and more stocks from this group, you are guaranteed to discover the true mean. However, the worst-case expected loss of the estimator, \mathrm{Hdg}[Q], grows without bound as Q gets large!

(4)   \begin{align*} \sup_\mu \mathrm{E}\left[ Q \cdot (\mathrm{Hdg}[Q] - \mu)^2 \right] &\to \infty \end{align*}

This is true even though the worst case loss for the sample mean, \mathrm{Avg}[Q], is flat:

(5)   \begin{align*} \sup_\mu \mathrm{E}\left[ Q \cdot (\mathrm{Avg}[Q] - \mu)^2 \right] &\to 1 \end{align*}

I plot the risk associated with Hodges’ estimator in the figure below. What’s going on here? Well, as Q gets bigger and bigger, there remains a region around \mu = 0 where you are quite certain that the mean is 0. However, at the edge of this region, there is a band of values of \mu where if you are wrong and \mu \neq 0, then your prediction error summed across every one of the Q stocks you examined turns out to be quite big.

[Figure: maximum risk of Hodges’ estimator]
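To make the comparison between Equations (4) and (5) concrete, here is a minimal simulation sketch. It assumes the thresholding is applied to the absolute value of the sample average, and it draws \mathrm{Avg}[Q] directly from N(\mu, 1/Q) rather than simulating each of the Q dividend innovations. The function names, the number of draws, and the seed are all illustrative choices of mine, not part of the original argument.

```python
import math
import random


def sample_mean(avg, q):
    """Benchmark estimator: report the sample average as is."""
    return avg


def hodges(avg, q):
    """Hodges-type rule: keep the sample average only if |avg| >= Q^(-1/4)."""
    return avg if abs(avg) >= q ** -0.25 else 0.0


def scaled_risk(estimator, mu, q, draws=20_000, seed=0):
    """Monte Carlo estimate of E[Q * (estimate - mu)^2].

    The average of Q iid N(mu, 1) draws is itself N(mu, 1/Q), so the
    sample average is drawn directly (a simplifying assumption).
    """
    rng = random.Random(seed)
    sd = 1.0 / math.sqrt(q)
    total = 0.0
    for _ in range(draws):
        avg = rng.gauss(mu, sd)
        total += q * (estimator(avg, q) - mu) ** 2
    return total / draws


if __name__ == "__main__":
    q = 10_000
    # mu = 0, mu just inside the threshold band, and mu far from zero
    for mu in (0.0, 0.5 * q ** -0.25, 1.0):
        print(mu, scaled_risk(sample_mean, mu, q), scaled_risk(hodges, mu, q))
```

With Q = 10,000 the threshold is 0.1. At \mu = 0.05 the Hodges-type rule almost always reports zero, so its scaled risk is roughly Q \cdot \mu^2 = 25, while the sample mean stays near 1.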

3. Decision Problem

Now, think about a decision problem that generates a really similar decision rule—namely, try to figure out how many shares, a, to purchase where the stock’s payout is determined by N different attributes via the coefficients {\boldsymbol \mu}:

(6)   \begin{align*} \max_a V(a;{\boldsymbol \mu},\mathbf{x}) &= \max_a V(a) = \max_a \left\{ - \frac{\gamma}{2} \cdot \left( a - \sum_{n=1}^N \mu_n \cdot x_n \right)^2 \right\} \end{align*}

Here, you are trying to maximize your value, V, by choosing the right number of shares to hold, i.e., the right action. Ideally, you would take exactly the action a = \sum_{n=1}^N \mu_n \cdot x_n; however, it’s hard to figure out the exact loadings, {\boldsymbol \mu}, for every single one of the N relevant dimensions. As a result, you take an action a which isn’t exactly perfect:

(7)   \begin{align*} a &= A(\mathbf{m};\mathbf{x}) = A(\mathbf{m}) = \sum_{n=1}^N m_n \cdot x_n \end{align*}

I use the function L(\cdot) to denote the loss in value you suffer by choosing a suboptimal asset holding:

(8)   \begin{align*} L(\mathbf{m};{\boldsymbol \mu}) = L(\mathbf{m}) &= \mathrm{E}\left[ \ V(A({\boldsymbol \mu}); {\boldsymbol \mu}, \mathbf{x}) - V(A(\mathbf{m}); {\boldsymbol \mu}, \mathbf{x}) \ \right] \\ &= \frac{\gamma}{2} \cdot \mathrm{E}\left[ \left( \sum_{n=1}^N (m_n - \mu_n) \cdot x_n \right)^2 \right] \\ &= \frac{\gamma}{2} \cdot \mathrm{E}\left[ \sum_{n=1}^N (m_n - \mu_n)^2 \right] \end{align*}

where the 3rd line follows from assuming that x_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} N(0,1).
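As a sanity check on the last step of Equation (8), the sketch below simulates the loss and compares it with the closed form. It assumes the value function is V(a) = -\frac{\gamma}{2} \cdot (a - \sum_n \mu_n \cdot x_n)^2, so that value is zero at the ideal action, and it treats \mathbf{m} as fixed. The loadings, the number of draws, and the seed are hypothetical.

```python
import random


def value(a, mu, x, gamma=1.0):
    """Quadratic value: zero at the ideal action sum(mu_n * x_n), negative elsewhere."""
    ideal = sum(mu_n * x_n for mu_n, x_n in zip(mu, x))
    return -0.5 * gamma * (a - ideal) ** 2


def action(m, x):
    """Action from Equation (7): weight each attribute by its perceived loading."""
    return sum(m_n * x_n for m_n, x_n in zip(m, x))


def simulated_loss(m, mu, gamma=1.0, draws=50_000, seed=0):
    """Average value lost by acting on m instead of mu, with x_n iid N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        x = [rng.gauss(0.0, 1.0) for _ in mu]
        total += value(action(mu, x), mu, x, gamma) - value(action(m, x), mu, x, gamma)
    return total / draws


def closed_form_loss(m, mu, gamma=1.0):
    """Third line of Equation (8): gamma/2 times the sum of squared errors."""
    return 0.5 * gamma * sum((m_n - mu_n) ** 2 for m_n, mu_n in zip(m, mu))


if __name__ == "__main__":
    mu = [0.5, -0.2, 0.1]   # hypothetical true loadings
    m = [0.5, 0.0, 0.0]     # hypothetical perceived loadings
    print(simulated_loss(m, mu), closed_form_loss(m, mu))
```

The two numbers agree up to simulation noise because the cross terms E[x_n \cdot x_{n'}] vanish for n \neq n' and E[x_n^2] = 1.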

So which of the N different details should you pay attention to? Which of them should you ignore? One way to frame this problem would be to look for a sparse solution and only ask your computers to trade on the signals that are sufficiently important. Gabaix (2012) shows how to do this using an \ell_1-program; I outline the main ideas in an earlier post. Basically, you would try to minimize your loss from taking a suboptimal action subject to an \ell_1-penalty:

(9)   \begin{align*} L(\mathbf{m}) &= \min_{\mathbf{m} \in \mathrm{R}^N} \left\{ \frac{\gamma}{2} \cdot \sum_{n=1}^N (m_n - \mu_n)^2 + \kappa \cdot \sum_{n=1}^N |m_n| \right\} \end{align*}

so that your optimal choice of actions is given by:

(10)   \begin{align*} m_n &=  \begin{cases} \mu_n &\text{if } |\mu_n| \geq \kappa \\ 0 &\text{if } |\mu_n| < \kappa \end{cases} \end{align*}
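Here is a small sketch of the keep-or-drop rule exactly as Equation (10) states it. It takes the rule as given and does not try to re-derive it from Equation (9). The loadings and the value of \kappa are made-up numbers for illustration.

```python
def sparse_loadings(mu, kappa):
    """Rule in Equation (10): keep mu_n if |mu_n| >= kappa, otherwise use 0."""
    return [mu_n if abs(mu_n) >= kappa else 0.0 for mu_n in mu]


def ignored_loss(mu, kappa, gamma=1.0):
    """Loss gamma/2 * sum((m_n - mu_n)^2) caused by the attributes that get dropped."""
    m = sparse_loadings(mu, kappa)
    return 0.5 * gamma * sum((m_n - mu_n) ** 2 for m_n, mu_n in zip(m, mu))


mu = [0.80, -0.05, 0.30, 0.02]           # hypothetical true loadings
print(sparse_loadings(mu, kappa=0.10))   # [0.8, 0.0, 0.3, 0.0]
print(ignored_loss(mu, kappa=0.10))
```

Only the two small loadings are zeroed out, so the loss comes entirely from the attributes you chose to ignore.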

That’s the decision problem. So far everything is old hat.

4. Maximum Risk

The key insight in this post is that these \mu_n terms don’t come down from on high. As a trader, you don’t just know what these terms are from the outset. They weren’t stitched onto the forehead of your favorite teddy bear. Instead, you have to use data on lots of stocks to estimate them as you go. In this section, I think about a world where you feed data on Q different stocks to your machines, and each of these assets has the appropriate action \sum_{n=1}^N \mu_{n,q} \cdot x_{n,q}. I then investigate what happens when you use the estimator:

(11)   \begin{align*} \widetilde{m}_n &=  \begin{cases} \mathrm{Avg}_n[Q] &\text{if } |\mathrm{Avg}_n[Q]| \geq \kappa \\ 0 &\text{if } |\mathrm{Avg}_n[Q]| < \kappa \end{cases} \end{align*}

instead of the estimator in Equation (10), where \mathrm{Avg}_n[Q] = \frac{1}{Q} \cdot \sum_{q=1}^Q \mu_{n,q} and \kappa isn’t growing that fast as Q gets larger and larger. i.e., we have that:

(12)   \begin{align*} \widetilde{m}_n &= \mathrm{Avg}_n[Q] \cdot 1_{\{ |\mathrm{Avg}_n[Q]| \geq \kappa \}} \quad \text{with} \quad \kappa = \mathrm{O}(\sqrt{2 \log Q}) \end{align*}

In some senses, \widetilde{\mathbf{m}} is still a really good estimator of {\boldsymbol \mu}. To see this, let \mu_n denote the true effect for a particular attribute. Then, clearly for |\mu_n| \geq \kappa, we have that:

(13)   \begin{align*} \sqrt{Q} \cdot (\widetilde{m}_n - \mu_n) &\overset{\scriptscriptstyle \mathrm{Dist}}{\to} \mathrm{N}(0,1) \end{align*}

Similarly, for \mu_n = 0, we have that:

(14)   \begin{align*} \sqrt{Q} \cdot (\widetilde{m}_n - \mu_n) &\overset{\scriptscriptstyle \mathrm{Dist}}{\to} 0 \end{align*}

Thus, at each fixed \mu_n, the estimate \widetilde{m}_n is asymptotically at least as good an estimator of \mu_n as the sample average, and strictly better when \mu_n = 0. Pretty cool.

However, in another sense, it’s a terrible estimator since the maximum risk associated with trading on \widetilde{\mathbf{m}} is unbounded:

(15)   \begin{align*} \sup_{\mu_n} R_n(\widetilde{\mathbf{m}}) &\to \infty \quad \text{where} \quad R_n(\widetilde{\mathbf{m}}) = \mathrm{E}\left[ Q \cdot (\widetilde{m}_n - \mu_n)^2 \right] \end{align*}

This result means that if you trade on this estimator, there are parameter values for \mu_n which lead your computer to take on positions that are infinitely risky and make you really unhappy! How is this possible? Well, take a look at the following decomposition of the risk associated with the estimator \widetilde{\mathbf{m}}:

(16)   \begin{align*} R_n(\widetilde{\mathbf{m}}) &= \mathrm{E}\left[ Q \cdot (\widetilde{m}_n - \mu_n)^2 \right] \\ &= \mathrm{E}\left[ Q \cdot (\mathrm{Avg}_n[Q] \cdot 1_{\{ |\mathrm{Avg}_n[Q]| \geq \kappa \}} - \mu_n)^2 \right] \\ &= \mathrm{Pr}\left[ |\mathrm{Avg}_n[Q]| \geq \kappa \right] \cdot \mathrm{E}\left[ Q \cdot (\mathrm{Avg}_n[Q] - \mu_n)^2 \right] + \mathrm{Pr}\left[ |\mathrm{Avg}_n[Q]| < \kappa \right] \cdot \mathrm{E}\left[ Q \cdot \mu_n^2 \right] \end{align*}

It turns out that there are choices of \mu_n for which \mathrm{Pr}\left[ |\mathrm{Avg}_n[Q]| < \kappa \right] \to 1 as Q \to \infty, but at the same time \mathrm{E}\left[ Q \cdot \mu_n^2 \right] \to \infty for the same process. In words, this means that there are choices of \mu_n which make dimension n always appear to your machines as an unimportant detail no matter how many stocks you look at, but this conclusion is still risky enough that if you applied it to every one of these Q stocks you’d get an infinitely risky portfolio.

e.g., consider the case where:

(17)   \begin{align*} \mu_n &= \hbar \cdot Q^{-1/4} \quad \text{with} \quad 0 < \hbar < 1 \end{align*}

Then, it’s easy to see that:

(18)   \begin{align*} \mathrm{Pr}_{\mu_n} \left[ \ |\mathrm{Avg}_n[Q]| < Q^{-1/4}  \ \right] &= \mathrm{Pr}_{\mu_n} \left[ \ - Q^{-1/4} < \mathrm{Avg}_n[Q] < Q^{-1/4}  \ \right] \\ &= \mathrm{Pr}_{\mu_n} \left[ \ \sqrt{Q} \cdot \left(- Q^{-1/4} - \mu_n \right) < z < \sqrt{Q} \cdot \left(Q^{-1/4} - \mu_n \right)  \ \right] \\ &= \mathrm{Pr}_{\mu_n} \left[ \ - Q^{1/4} \cdot ( 1 + \hbar ) < z < Q^{1/4} \cdot ( 1 - \hbar )  \ \right] \end{align*}

where z = \sqrt{Q} \cdot (\mathrm{Avg}_n[Q] - \mu_n) \sim N(0,1). But, this means that as Q \to \infty, we have that:

(19)   \begin{align*} \mathrm{Pr}_{\mu_n} \left[ |\mathrm{Avg}_n[Q]| < \kappa \right] &\to 1 \quad \text{while} \quad \mathrm{E}\left[ Q \cdot \mu_n^2 \right] = \hbar^2 \cdot Q^{1/2} \to \infty \end{align*}

Thus, we have a proof by explicit construction. The punchline is that unleashing your machines on the market might lead them into situations where the scale of the position grows faster than the precision of the signal.
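The construction can also be checked numerically. The sketch below evaluates the probability of ignoring the signal and the risk term Q \cdot \mu_n^2 along the sequence \mu_n = \hbar \cdot Q^{-1/4} with threshold \kappa = Q^{-1/4}. It assumes \sqrt{Q} \cdot (\mathrm{Avg}_n[Q] - \mu_n) is exactly standard normal; the helper names and the choice \hbar = 0.5 are mine.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_ignored(q, hbar):
    """Pr[|Avg_n[Q]| < Q^(-1/4)] when mu_n = hbar * Q^(-1/4)."""
    upper = q ** 0.25 * (1.0 - hbar)
    lower = -(q ** 0.25) * (1.0 + hbar)
    return normal_cdf(upper) - normal_cdf(lower)


def ignored_risk(q, hbar):
    """Q * mu_n^2 for mu_n = hbar * Q^(-1/4), which equals hbar^2 * sqrt(Q)."""
    mu_n = hbar * q ** -0.25
    return q * mu_n ** 2


for q in (10**2, 10**4, 10**6):
    print(q, prob_ignored(q, 0.5), ignored_risk(q, 0.5))
```

As Q grows by a factor of 100, the probability of dropping the attribute climbs toward 1 while the risk term grows by a factor of 10.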

5. Empirical Prediction

What does this result mean in the real world? Well, the argument in the section above relied on the fact that when you used Q different assets to derive a signal, you also then traded on all Q assets. i.e., this is why the risk function has a factor of Q in it. Thus, one way to get around this problem would be to commit to trading a smaller number of assets Q' where:

(20)   \begin{align*} Q \gg Q' \end{align*}

even though you had to use Q assets to recover the signal. i.e., even if you have to grab signals from the 4 corners of the market to create your trading strategy, you might nevertheless specialize in trading only a few assets so that in contrast to Equation (19) you would always have:

(21)   \begin{align*} \mathrm{Pr}_{\mu_n} \left[ |\mathrm{Avg}_n[Q]| < \kappa \right] &\to 1 \quad \text{while} \quad \mathrm{E}\left[ Q' \cdot \mu_n^2 \right] \to \mathrm{const} < \infty \end{align*}
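To see what such a cap might look like, here is a rough sketch using the example \mu_n = \hbar \cdot Q^{-1/4} from Section 4. The rule Q' = c \cdot \sqrt{Q} is purely an assumption for illustration (nothing above pins down how Q' should scale); it is chosen because it makes Q' \cdot \mu_n^2 = c \cdot \hbar^2 exactly constant in this example.

```python
import math


def traded_risk(q_signal, q_traded, hbar):
    """Q' * mu_n^2 when mu_n = hbar * Q^(-1/4) is estimated from Q assets."""
    mu_n = hbar * q_signal ** -0.25
    return q_traded * mu_n ** 2


def capped_universe(q_signal, c=1.0):
    """Hypothetical cap Q' = c * sqrt(Q); the constant c is an assumption."""
    return c * math.sqrt(q_signal)


for q in (10**2, 10**4, 10**6):
    trade_all = traded_risk(q, q, 0.5)                      # trade every asset used for the signal
    trade_few = traded_risk(q, capped_universe(q), 0.5)     # trade only Q' of them
    print(q, trade_all, trade_few)
```

Trading all Q assets makes the risk term grow like \sqrt{Q}, while the capped version stays flat at c \cdot \hbar^2 no matter how many assets feed the signal.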

If traders are actually using signals from lots of assets, but only trading a few of them to avoid the infinite risk problem, then this theory would give a new motivation for excess comovement. i.e., you and I might feed the same data into our machines, get the same output, and choose to trade on entirely different subsets of stocks.