Phase Change in High-Dimensional Inference

1. Introduction

In my paper Feature Selection Risk (2014), I study a problem where assets have $Q \gg 1$ different attributes and traders try to identify which $K \ll Q$ of these attributes matter via price changes:

(1) $\begin{align*} \Delta p_n &= p_n - \mathrm{E}[p_n] = \sum_{q=1}^Q \beta_q \cdot x_{n,q} + \epsilon_n \qquad \text{where} \qquad K = \Vert {\boldsymbol \beta} \Vert_{\ell_0} = \sum_{q=1}^Q 1_{\{ \beta_q \neq 0 \}} \notag \end{align*}$

with each asset’s exposure to a given attribute given by $x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,1)$ and the noise is given by $\epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma^2)$ . In the limit as $K,N,Q \to \infty$ , $\sfrac{K}{Q} \to 0$ , $(N - K) \cdot \beta \to \infty$ , and $\beta = \sfrac{1}{\sqrt{K}}$ there exists both a signal opacity bound, $N_O$ , as well as a signal recovery bound, $N_R$ :

(2) $\begin{align*} N_O \sim K \cdot \log \left( \frac{Q}{N_O} \right) \qquad \text{and} \qquad N_R \sim K \cdot \log \left( \frac{Q}{K} \right) \notag \end{align*}$

with $N_O \leq N_R$ in units of transactions. I explain what I mean by “ $\sim$ ” in Section 4 below. These $2$ thresholds separate the regions where traders are arbitrarily bad at identifying the shocked attributes (i.e., $N < N_O$ ) from the regions where traders can almost surely identify the shocked attributes (i.e., $N > N_R$ ). i.e., if traders have seen fewer than $N_O$ transactions, then they have no idea which shocks took place; whereas, if traders have seen more than $N_R$ transactions, then they can pinpoint exactly which shocks took place.

In this post, I show that the signal opacity and recovery bounds become arbitrarily close in a large market. The analysis in this post primarily builds on work done in Donoho and Tanner (2009) and Wainwright (2009).

2. Motivating Example

This sort of inference problem pops up all the time in financial settings. Suppose you moved away from Chicago a year ago, and now you’re moving back and looking for a house. When studying a list of recent sales prices, you find yourself a bit surprised. People seem to have changed their preferences for $1$ of $7$ different amenities: $^{(1)}$ a $2$ car garage, $^{(2)}$ a third bedroom, $^{(3)}$ a half-circle driveway, $^{(4)}$ granite countertops, $^{(5)}$ energy efficient appliances, $^{(6)}$ central A/C, or $^{(7)}$ a walk-in closet? The mystery amenity is raising the sale price of some houses by $\beta > 0$ dollars. How many sales do you need to see in order to figure out which of the $7$ amenities realized the shock?

The answer is $3$ . How did I arrive at this number? Suppose you found one house with amenities $\{1,3,5,7\}$ , a second house with amenities $\{2, 3, 6, 7\}$ , and a third house with amenities $\{4, 5, 6,7\}$ . The combination of the price changes for these $3$ houses reveals exactly which amenity has been shocked. i.e., if only the first house’s price was too high, $\Delta p_1 = p_1 - \mathrm{E}[p_1] = \beta$ , then Chicagoans must have changed their preferences for $2$ car garages:

(3) $\begin{equation*} \begin{bmatrix} \Delta p_1 \\ \Delta p_2 \\ \Delta p_3 \end{bmatrix} = \begin{bmatrix} \beta \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} \beta \\ 0 \\ \vdots \\ 0 \end{bmatrix} \end{equation*}$

By contrast, if $\Delta p_1 = \Delta p_2 = \Delta p_3 = \beta$ , then people must value walk-in closets more than they did a year ago.

Here’s the key point. The problem changes character at $N_R = 3$ observations. $3$ sales is just enough information to answer $7$ yes or no questions and rule out the possibility of no change: $7 = 2^3 - 1$ . $N = 4$ sales simply narrows your error bars around the exact value of $\beta$ . $N = 2$ sales only allows you to distinguish between subsets of amenities. e.g., seeing just the first and second houses with unexpectedly high prices only tells you that people like either half-circle driveways or walk-in closets more… not which one.

Yet, the dimensionality in this toy example can be confusing. There is obviously something different about the problem at $N_R = 3$ observations, but there is still some information contained in the first $N = 2$ observations. e.g., even though you can’t tell exactly which attribute realized a shock, you can narrow down the list of possibilities to $2$ attributes out of $7$ . If you just flipped a coin and guessed after seeing $N = 2$ transactions, you would have an error rate of $50{\scriptstyle \%}$ . This is no longer true in higher dimensions. i.e., even in the absence of any noise, seeing any fraction $(1 - \alpha) \cdot N_R$ of the required observations for $\alpha \in (0,1)$ will leave you with an error rate that is within a tiny neighborhood of $100{\scriptstyle \%}$ as the number of attributes gets large.

3. Non-Random Analysis

I start by exploring how the required number of observations, $N_R$ , moves around as I increase the number of attributes in the setting where there is only $K = 1$ shock and the data matrix is non-random. Specifically, I look at the case where $K = 1$ and $Q = 15$ . My goal is to build some intuition about what I should expect in the more complicated setting where the data $\mathbf{X}$ is a random matrix. Here, in this simple setting, the ideal data matrix would be $(4 \times 15)$ -dimensional and look like:

(4) $\begin{equation*} \underset{4 \times 15}{\mathbf{X}} = \left[ \begin{matrix} 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{matrix} \ \ \ \begin{matrix} 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{matrix} \right] \end{equation*}$

where each column of the data matrix corresponds to a number $q=1,2,\ldots,15$ in binary.

Let $S(N)$ be a function that eats $N$ observed price changes and spits out the set of possible preference changes that might explain the observed price changes. e.g., if traders only see the $1$ st transaction, then they can only place the shock in $1$ of $2$ sets containing $8$ attributes each:

(5) $\begin{align*} S(1) &= \begin{cases} \{ 1, 3, 5, 7, 9, 11, 13, 15 \} &\text{if } \Delta p_1 = \beta \\ \{ \emptyset, 2, 4, 6, 8, 10, 12, 14 \} &\text{if } \Delta p_1 = 0 \end{cases} \end{align*}$

The $2$ nd transaction then allows traders to split each of these $2$ larger sets into $2$ smaller ones and place the shock in a set of $4$ possibilities:

(6) $\begin{align*} S(2) = \begin{cases} \{ \emptyset, 4, 8, 12 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} 0 & 0 \end{bmatrix}^{\top} \\ \{ 1, 5, 9, 13 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} \beta & 0 \end{bmatrix}^{\top} \\ \{ 2, 6, 10, 14 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} 0 & \beta \end{bmatrix}^{\top} \\ \{ 3, 7, 11, 15 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} \beta & \beta \end{bmatrix}^{\top} \end{cases} \end{align*}$

With the $3$ rd transaction, traders can tell that the actual shock is either of $2$ possibilities:

(7) $\begin{align*} S(3) = \begin{cases} \{ \emptyset, 8 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix}^{\top} \\ \{ 1, 9 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} \beta & 0 & 0\end{bmatrix}^{\top} \\ \{ 2, 10 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} 0 & \beta & 0 \end{bmatrix}^{\top} \\ \{ 3, 11 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} \beta & \beta & 0 \end{bmatrix}^{\top} \\ \{ 4, 12 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} 0 & 0 & \beta \end{bmatrix}^{\top} \\ \{ 5, 13 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} \beta & 0 & \beta \end{bmatrix}^{\top} \\ \{ 6, 14 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} 0 & \beta & \beta \end{bmatrix}^{\top} \\ \{ 7, 15 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} \beta & \beta & \beta \end{bmatrix}^{\top} \end{cases} \end{align*}$

The $N_R = 4$ th observation then closes the case against the offending attribute.

Here’s the key observation. Only the absolute difference between $(N_R - N)$ matters when computing the size of the output of $S(N)$ . If traders have seen $N = (N_R - 1)$ transaction, then they can tell which subset of $2 = 2^1$ attributes has realized a shock. If traders have seen $N = (N_R - 2)$ transactions, then they can tell which subset of $4 = 2^2$ attributes has realized a shock. If traders have seen $N = (N_R - 3)$ observations, then they can tell which subset of $8 = 2^3$ attributes has realized a shock. Thus, after seeing any number of observations $N \leq N_R$ , traders can place the shock in a set of size $2^{N_R - N}$ . i.e., a trader has the same amount of information about which attribute has realized a shock in (i) a situation where $N_R = 100$ and he’s seen $N = 99$ transactions as in (ii) a situation where $N_R = 3$ and he’s seen $N = 2$ transactions.

The probability that traders select the correct attribute after seeing only $N \leq N_R$ observations is given by $2^{-(N_R - N)}$ assuming uniform priors. Natural numbers are hard to work with analytically, so let’s suppose that traders observe some fraction of the required number of observations $N_R$ . i.e., for some $\alpha \in (0,1)$ traders see $N = (1 - \alpha) \cdot N_R$ observations. We can then perform a change of variables $2^{- (N_R - N)} = e^{- \alpha \cdot \log(2) \cdot N_R}$ and answer the question: “How much does getting $1$ additional observation improve traders’ error rate?”

(8) $\begin{align*} \frac{1}{N_R} \cdot \frac{d}{d\alpha}\left[ \, 2^{- (N_R - N)} \, \right] = - \log(2) \cdot e^{- \alpha \cdot \log(2) \cdot N_R} \end{align*}$

I plot this statistic for $N_R$ ranging from $100$ to $800$ below. When $N_R = 100$ , a trader’s predictive power doesn’t start to improve until he sees $95$ transactions (i.e., $95{\scriptstyle \%}$ of $N_R$ ); by contrast, when $N_R= 800$ a trader’s predictive power doesn’t start to improve until he’s seen $N = 792$ transactions (i.e., $99{\scriptstyle \%}$ of $N_R$ ). Here’s the punchline. As I scale up the original toy example from $7$ attributes to $7$ million attributes, traders effectively get $0$ useful information about which attributes realized a shock until they come within a hair’s breadth of the signal recovery bound $N_R$ . The opacity and recovery bounds are right on top of one another.

4. Introducing Randomness

Previously, the matrix of attributes was strategically chosen so that the set of $N$ observations that traders see would be as informative as possible. Now, I want to relax this assumption and allow the data matrix to be random with elements $x_{n,q} \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0,1)$ :

(9) $\begin{align*} \Delta p_n = \sum_{q=1}^Q \beta_q \cdot x_{n,q} + \epsilon_n \qquad \text{for each } n = 1,2,\ldots,N \end{align*}$

where $\epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_{\epsilon}^2)$ denotes idiosyncratic shocks affecting asset $n$ in units of dollars. For a given triplet of integers $(K,N,Q)$ with $0 < K < N < Q$ , I want to know whether solving the linear program:

(10) $\begin{align*} \widehat{\boldsymbol \beta} = \min \left\Vert {\boldsymbol \beta} \right\Vert_{\ell_1} \qquad \text{subject to} \qquad \mathbf{X} {\boldsymbol \beta} = \Delta \mathbf{p} \end{align*}$

recovers the true $\boldsymbol \beta$ when it is $K$ -sparse. i.e., when $\boldsymbol \beta$ has only $K$ non-zero entries $K = \sum_{q=1}^Q 1_{\{ \beta_q \neq 0 \}}$ . Since $N < Q$ the linear system is underdetermined; however, if the level of sparsity is sufficiently high (i.e., $K$ is sufficiently small), then there will be a unique solution with high probability.

First, I study the case where there is no noise (i.e., where $\sigma_\epsilon^2 \searrow 0$ ), and I ask: “What is the minimum number of observations needed to identify the true $\boldsymbol \beta$ with probability $(1 - \eta)$ for $\eta \in (0,1)$ using the linear program in Equation (10)?” I remove the noise to make the inference problem as easy as possible for traders. Thus, the proposition below which characterizes this minimum number of observations gives a lower bound. I refer to this number of observations as the signal opacity bound and write it as $N_O$ . The proposition shows that, whenever traders have seen $N < N_O$ observations, I can make traders’ error rate arbitrarily bad (i.e., $\eta \nearrow 1$ ) by increasing the number of attributes (i.e., $Q \nearrow \infty$ ).

Proposition (Donoho and Tanner, 2009): Suppose $\sfrac{K}{N} = \rho$ , $\sfrac{N}{Q} = \delta$ , and $N \geq N_0$ with $\rho, \delta \in (0,1)$ . The linear program in Equation (10) will recover $\boldsymbol \beta$ a fraction $(1 - \eta)$ if the time whenever:

(11) $\begin{align*} N > 2 \cdot K \cdot \log\left( \frac{Q}{N} \right) \cdot \left( 1 - R(\eta;N,Q) \right)^{-1} \end{align*}$

where $R(\eta;N,Q) = 2 \cdot \sqrt{\sfrac{1}{N} \cdot \log\left( 4 \cdot \sfrac{(Q + 2)^6}{\eta} \right)}$ .

Next, I turn to the case where there is noise (i.e., where $\sigma_\epsilon^2 > 0$ ), and I ask: “How many observations do traders need to see in order to identify the true $\boldsymbol \beta$ with probability $1$ in an asymptotically large market using the linear program in Equation (10)?” Define traders’ error rate after seeing $N$ observations as:

(12) $\begin{align*} \mathrm{Err}[N] &= \frac{1}{{Q \choose K}} \cdot \sum_{\substack{\mathcal{K} \in \mathcal{Q} \\ |\mathcal{K}| = K}} \mathrm{Pr}(\widehat{\boldsymbol \beta} \neq {\boldsymbol \beta}) \end{align*}$

$\mathrm{Pr}(\widehat{\boldsymbol \beta} \neq {\boldsymbol \beta})$ denotes the probability that the linear program in Equation (10) chooses the wrong subset of attributes (i.e., makes an error) given the true support $\mathcal{K}$ and averaging over not only the measurement noise, ${\boldsymbol \epsilon}$ , but also the choice of the Gaussian attribute exposure matrix, $\mathbf{X}$ . Traders’ error rate is the weighted average of these probabilities over every shock set of size $K$ . Traders identify the true ${\boldsymbol \beta}$ with probability $1$ in an asymptotically large market if:

(13) $\begin{align*} \lim_{\substack{K,N,Q \to \infty \\ \sfrac{K}{Q} \to 0}} \mathrm{Err}[N] &= 0 \end{align*}$

Thus, the proposition below which characterizes this number of observations gives an upper bound of sorts. I refer to this number of observations as the signal recovery bound and write it as $N_R$ . i.e., the proposition shows that, whenever traders have seen $N > N_R$ observations, they will be able to recovery $\boldsymbol \beta$ almost surely no matter how large I make the market.

Proposition (Wainwright, 2009): Suppose $K,N,Q \to \infty$ , $\sfrac{K}{Q} \to 0$ , $(N - K) \cdot \beta \to \infty$ , and $\beta = \sfrac{1}{\sqrt{K}}$ , then traders can identify the true ${\boldsymbol \beta}$ with probability $1$ in an asymptotically large market if for some constant $a > 0$ :

(14) $\begin{align*} N &> a \cdot K \cdot \log (\sfrac{Q}{K}) \end{align*}$

The only cognitive constraint that traders face is that their selection rule must be computationally tractable. Under minimal assumptions a convex optimization program is computationally tractable in the sense that the computational effort required to solve the problem to a given accuracy grows moderately with the dimensions of the problem. Natarajan (1995) explicitly shows that $\ell_0$ constrained linear programming is NP-hard. This cognitive constraint is really weak in the sense that any selection rule that you might look up in an econometrics or statistics textbook (e.g., forward stepwise regression or LASSO) is going to be computationally tractable. After all, they have to be executed on computers.

5. Discussion

What is really interesting is that the signal opacity bound, $N_O$ , and the signal recovery bound, $N_R$ , basically sit right on top of one another when the market gets large just as you would expect from the analysis in Section 3. The figure above plots each bound on a log-log scale for varying levels of sparsity. It’s clear from the figure that the bounds are quite close. The figure below plots the relative gap between these $2$ bounds:

(15) $\begin{align*} \frac{N_R - N_O}{N_R} \end{align*}$

i.e., it plots how big the gap is relative to the size of the signal recover bound $N_R$ . For each level of sparsity, the gap is shrinking as I add more and more attributes. This is an identical result as in the figure from Section 3: as the size of the market increases, traders learn next to nothing from each successive observation until they get within an inch of the signal recovery bound. The only difference here is that now there are an arbitrary number of shocks and the data matrix is random.