Phase Change in High-Dimensional Inference

1. Introduction

In my paper Feature Selection Risk (2014), I study a problem where assets have Q \gg 1 different attributes and traders try to identify which K \ll Q of these attributes matter via price changes:

(1)   \begin{align*} \Delta p_n &= p_n - \mathrm{E}[p_n] = \sum_{q=1}^Q \beta_q \cdot x_{n,q} + \epsilon_n \qquad \text{where} \qquad K = \Vert {\boldsymbol \beta} \Vert_{\ell_0} = \sum_{q=1}^Q 1_{\{ \beta_q \neq 0 \}} \notag \end{align*}

with each asset’s exposure to a given attribute given by x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,1) and the noise is given by \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma^2). In the limit as K,N,Q \to \infty, \sfrac{K}{Q} \to 0, (N - K) \cdot \beta \to \infty, and \beta = \sfrac{1}{\sqrt{K}} there exists both a signal opacity bound, N_O, as well as a signal recovery bound, N_R:

(2)   \begin{align*} N_O \sim K \cdot \log \left( \frac{Q}{N_O} \right) \qquad \text{and} \qquad N_R \sim K \cdot \log \left( \frac{Q}{K} \right) \notag \end{align*}

with N_O \leq N_R in units of transactions. I explain what I mean by “\sim” in Section 4 below. These 2 thresholds separate the regions where traders are arbitrarily bad at identifying the shocked attributes (i.e., N < N_O) from the regions where traders can almost surely identify the shocked attributes (i.e., N > N_R). i.e., if traders have seen fewer than N_O transactions, then they have no idea which shocks took place; whereas, if traders have seen more than N_R transactions, then they can pinpoint exactly which shocks took place.

In this post, I show that the signal opacity and recovery bounds become arbitrarily close in a large market. The analysis in this post primarily builds on work done in Donoho and Tanner (2009) and Wainwright (2009).

2. Motivating Example

This sort of inference problem pops up all the time in financial settings. Suppose you moved away from Chicago a year ago, and now you’re moving back and looking for a house. When studying a list of recent sales prices, you find yourself a bit surprised. People seem to have changed their preferences for 1 of 7 different amenities: ^{(1)}a 2 car garage, ^{(2)}a third bedroom, ^{(3)}a half-circle driveway, ^{(4)}granite countertops, ^{(5)}energy efficient appliances, ^{(6)}central A/C, or ^{(7)}a walk-in closet? The mystery amenity is raising the sale price of some houses by \beta > 0 dollars. How many sales do you need to see in order to figure out which of the 7 amenities realized the shock?

The answer is 3. How did I arrive at this number? Suppose you found one house with amenities \{1,3,5,7\}, a second house with amenities \{2, 3, 6, 7\}, and a third house with amenities \{4, 5, 6,7\}. The combination of the price changes for these 3 houses reveals exactly which amenity has been shocked. i.e., if only the first house’s price was too high, \Delta p_1 = p_1 - \mathrm{E}[p_1] = \beta, then Chicagoans must have changed their preferences for 2 car garages:

(3)   \begin{equation*}     \begin{bmatrix} \Delta p_1 \\ \Delta p_2 \\ \Delta p_3 \end{bmatrix}      =     \begin{bmatrix} \beta \\ 0 \\ 0 \end{bmatrix}      =      \begin{bmatrix}        1 & 0 & 1 & 0 & 1 & 0 & 1        \\        0 & 1 & 1 & 0 & 0 & 1 & 1        \\        0 & 0 & 0 & 1 & 1 & 1 & 1      \end{bmatrix}     \begin{bmatrix}        \beta \\ 0 \\ \vdots \\ 0      \end{bmatrix} \end{equation*}

By contrast, if \Delta p_1 = \Delta p_2 = \Delta p_3 = \beta, then people must value walk-in closets more than they did a year ago.

Here’s the key point. The problem changes character at N_R = 3 observations. 3 sales is just enough information to answer 7 yes or no questions and rule out the possibility of no change: 7 = 2^3 - 1. N = 4 sales simply narrows your error bars around the exact value of \beta. N = 2 sales only allows you to distinguish between subsets of amenities. e.g., seeing just the first and second houses with unexpectedly high prices only tells you that people like either half-circle driveways or walk-in closets more… not which one.

Yet, the dimensionality in this toy example can be confusing. There is obviously something different about the problem at N_R = 3 observations, but there is still some information contained in the first N = 2 observations. e.g., even though you can’t tell exactly which attribute realized a shock, you can narrow down the list of possibilities to 2 attributes out of 7. If you just flipped a coin and guessed after seeing N = 2 transactions, you would have an error rate of 50{\scriptstyle \%}. This is no longer true in higher dimensions. i.e., even in the absence of any noise, seeing any fraction (1 - \alpha) \cdot N_R of the required observations for \alpha \in (0,1) will leave you with an error rate that is within a tiny neighborhood of 100{\scriptstyle \%} as the number of attributes gets large.

3. Non-Random Analysis

I start by exploring how the required number of observations, N_R, moves around as I increase the number of attributes in the setting where there is only K = 1 shock and the data matrix is non-random. Specifically, I look at the case where K = 1 and Q = 15. My goal is to build some intuition about what I should expect in the more complicated setting where the data \mathbf{X} is a random matrix. Here, in this simple setting, the ideal data matrix would be (4 \times 15)-dimensional and look like:

(4)   \begin{equation*} \underset{4 \times 15}{\mathbf{X}} =  \left[ \begin{matrix}  1 & 0 & 1 & 0 & 1 & 0 & 1  \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\  0 & 0 & 0 & 1 & 1 & 1 & 1 \\  0 & 0 & 0 & 0 & 0 & 0 & 0 \end{matrix} \ \ \ \begin{matrix} 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1  \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1  \end{matrix} \right] \end{equation*}

where each column of the data matrix corresponds to a number q=1,2,\ldots,15 in binary.

Let S(N) be a function that eats N observed price changes and spits out the set of possible preference changes that might explain the observed price changes. e.g., if traders only see the 1st transaction, then they can only place the shock in 1 of 2 sets containing 8 attributes each:

(5)   \begin{align*} S(1) &=  \begin{cases} \{ 1, 3, 5, 7, 9, 11, 13, 15 \}         &\text{if } \Delta p_1 = \beta \\ \{ \emptyset, 2, 4, 6, 8, 10, 12, 14 \} &\text{if } \Delta p_1 = 0 \end{cases} \end{align*}

The 2nd transaction then allows traders to split each of these 2 larger sets into 2 smaller ones and place the shock in a set of 4 possibilities:

(6)   \begin{align*} S(2) =  \begin{cases} \{ \emptyset, 4, 8, 12 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} 0 & 0 \end{bmatrix}^{\top} \\ \{ 1, 5, 9, 13 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} \beta & 0 \end{bmatrix}^{\top} \\ \{ 2, 6, 10, 14 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} 0 & \beta \end{bmatrix}^{\top}  \\ \{ 3, 7, 11, 15 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} \beta & \beta \end{bmatrix}^{\top}  \end{cases} \end{align*}

With the 3rd transaction, traders can tell that the actual shock is either of 2 possibilities:

(7)   \begin{align*} S(3) =  \begin{cases} \{ \emptyset, 8 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix}^{\top} \\ \{ 1, 9 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} \beta & 0 & 0\end{bmatrix}^{\top} \\ \{ 2, 10 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} 0 & \beta & 0 \end{bmatrix}^{\top}  \\ \{ 3, 11 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} \beta & \beta & 0 \end{bmatrix}^{\top}  \\ \{ 4, 12 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} 0 & 0 & \beta \end{bmatrix}^{\top} \\ \{ 5, 13 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} \beta & 0 & \beta \end{bmatrix}^{\top} \\ \{ 6, 14 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} 0 & \beta & \beta \end{bmatrix}^{\top}  \\ \{ 7, 15 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} \beta & \beta & \beta \end{bmatrix}^{\top}  \end{cases} \end{align*}

The N_R = 4th observation then closes the case against the offending attribute.

Here’s the key observation. Only the absolute difference between (N_R - N) matters when computing the size of the output of S(N). If traders have seen N = (N_R - 1) transaction, then they can tell which subset of 2 = 2^1 attributes has realized a shock. If traders have seen N = (N_R - 2) transactions, then they can tell which subset of 4 = 2^2 attributes has realized a shock. If traders have seen N = (N_R - 3) observations, then they can tell which subset of 8 = 2^3 attributes has realized a shock. Thus, after seeing any number of observations N \leq N_R, traders can place the shock in a set of size 2^{N_R - N}. i.e., a trader has the same amount of information about which attribute has realized a shock in (i) a situation where N_R = 100 and he’s seen N = 99 transactions as in (ii) a situation where N_R = 3 and he’s seen N = 2 transactions.

The probability that traders select the correct attribute after seeing only N \leq N_R observations is given by 2^{-(N_R - N)} assuming uniform priors. Natural numbers are hard to work with analytically, so let’s suppose that traders observe some fraction of the required number of observations N_R. i.e., for some \alpha \in (0,1) traders see N = (1 - \alpha) \cdot N_R observations. We can then perform a change of variables 2^{- (N_R - N)} = e^{- \alpha \cdot \log(2) \cdot N_R} and answer the question: “How much does getting 1 additional observation improve traders’ error rate?”

(8)   \begin{align*} \frac{1}{N_R} \cdot \frac{d}{d\alpha}\left[ \, 2^{- (N_R - N)} \, \right] = - \log(2) \cdot e^{- \alpha \cdot \log(2) \cdot N_R} \end{align*}

I plot this statistic for N_R ranging from 100 to 800 below. When N_R = 100, a trader’s predictive power doesn’t start to improve until he sees 95 transactions (i.e., 95{\scriptstyle \%} of N_R); by contrast, when N_R= 800 a trader’s predictive power doesn’t start to improve until he’s seen N = 792 transactions (i.e., 99{\scriptstyle \%} of N_R). Here’s the punchline. As I scale up the original toy example from 7 attributes to 7 million attributes, traders effectively get 0 useful information about which attributes realized a shock until they come within a hair’s breadth of the signal recovery bound N_R. The opacity and recovery bounds are right on top of one another.

plot--prediction-error

4. Introducing Randomness

Previously, the matrix of attributes was strategically chosen so that the set of N observations that traders see would be as informative as possible. Now, I want to relax this assumption and allow the data matrix to be random with elements x_{n,q} \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0,1):

(9)   \begin{align*} \Delta p_n = \sum_{q=1}^Q \beta_q \cdot x_{n,q} + \epsilon_n \qquad \text{for each }  n = 1,2,\ldots,N \end{align*}

where \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_{\epsilon}^2) denotes idiosyncratic shocks affecting asset n in units of dollars. For a given triplet of integers (K,N,Q) with 0 < K < N < Q, I want to know whether solving the linear program:

(10)   \begin{align*} \widehat{\boldsymbol \beta} = \min \left\Vert {\boldsymbol \beta} \right\Vert_{\ell_1} \qquad \text{subject to} \qquad \mathbf{X} {\boldsymbol \beta} = \Delta \mathbf{p} \end{align*}

recovers the true \boldsymbol \beta when it is K-sparse. i.e., when \boldsymbol \beta has only K non-zero entries K = \sum_{q=1}^Q 1_{\{ \beta_q \neq 0 \}}. Since N < Q the linear system is underdetermined; however, if the level of sparsity is sufficiently high (i.e., K is sufficiently small), then there will be a unique solution with high probability.

First, I study the case where there is no noise (i.e., where \sigma_\epsilon^2 \searrow 0), and I ask: “What is the minimum number of observations needed to identify the true \boldsymbol \beta with probability (1 - \eta) for \eta \in (0,1) using the linear program in Equation (10)?” I remove the noise to make the inference problem as easy as possible for traders. Thus, the proposition below which characterizes this minimum number of observations gives a lower bound. I refer to this number of observations as the signal opacity bound and write it as N_O. The proposition shows that, whenever traders have seen N < N_O observations, I can make traders’ error rate arbitrarily bad (i.e., \eta \nearrow 1) by increasing the number of attributes (i.e., Q \nearrow \infty).

Proposition (Donoho and Tanner, 2009): Suppose \sfrac{K}{N} = \rho, \sfrac{N}{Q} = \delta, and N \geq N_0 with \rho, \delta \in (0,1). The linear program in Equation (10) will recover \boldsymbol \beta a fraction (1 - \eta) if the time whenever:

(11)   \begin{align*} N > 2 \cdot K \cdot \log\left( \frac{Q}{N} \right) \cdot \left( 1 - R(\eta;N,Q) \right)^{-1} \end{align*}

where R(\eta;N,Q) = 2 \cdot \sqrt{\sfrac{1}{N} \cdot \log\left( 4 \cdot \sfrac{(Q + 2)^6}{\eta} \right)}.

Next, I turn to the case where there is noise (i.e., where \sigma_\epsilon^2 > 0), and I ask: “How many observations do traders need to see in order to identify the true \boldsymbol \beta with probability 1 in an asymptotically large market using the linear program in Equation (10)?” Define traders’ error rate after seeing N observations as:

(12)   \begin{align*}   \mathrm{Err}[N] &= \frac{1}{{Q \choose K}} \cdot \sum_{\substack{\mathcal{K} \in \mathcal{Q} \\ |\mathcal{K}| = K}} \mathrm{Pr}(\widehat{\boldsymbol \beta} \neq {\boldsymbol \beta}) \end{align*}

\mathrm{Pr}(\widehat{\boldsymbol \beta} \neq {\boldsymbol \beta}) denotes the probability that the linear program in Equation (10) chooses the wrong subset of attributes (i.e., makes an error) given the true support \mathcal{K} and averaging over not only the measurement noise, {\boldsymbol \epsilon}, but also the choice of the Gaussian attribute exposure matrix, \mathbf{X}. Traders’ error rate is the weighted average of these probabilities over every shock set of size K. Traders identify the true {\boldsymbol \beta} with probability 1 in an asymptotically large market if:

(13)   \begin{align*}   \lim_{\substack{K,N,Q \to \infty \\ \sfrac{K}{Q} \to 0}} \mathrm{Err}[N] &= 0 \end{align*}

Thus, the proposition below which characterizes this number of observations gives an upper bound of sorts. I refer to this number of observations as the signal recovery bound and write it as N_R. i.e., the proposition shows that, whenever traders have seen N > N_R observations, they will be able to recovery \boldsymbol \beta almost surely no matter how large I make the market.

Proposition (Wainwright, 2009): Suppose K,N,Q \to \infty, \sfrac{K}{Q} \to 0, (N - K) \cdot \beta \to \infty, and \beta = \sfrac{1}{\sqrt{K}}, then traders can identify the true {\boldsymbol \beta} with probability 1 in an asymptotically large market if for some constant a > 0:

(14)   \begin{align*} N &> a \cdot K \cdot \log (\sfrac{Q}{K}) \end{align*}

The only cognitive constraint that traders face is that their selection rule must be computationally tractable. Under minimal assumptions a convex optimization program is computationally tractable in the sense that the computational effort required to solve the problem to a given accuracy grows moderately with the dimensions of the problem. Natarajan (1995) explicitly shows that \ell_0 constrained linear programming is NP-hard. This cognitive constraint is really weak in the sense that any selection rule that you might look up in an econometrics or statistics textbook (e.g., forward stepwise regression or LASSO) is going to be computationally tractable. After all, they have to be executed on computers.

5. Discussion

plot--opacity-vs-recovery-bound-absolute-gap

What is really interesting is that the signal opacity bound, N_O, and the signal recovery bound, N_R, basically sit right on top of one another when the market gets large just as you would expect from the analysis in Section 3. The figure above plots each bound on a log-log scale for varying levels of sparsity. It’s clear from the figure that the bounds are quite close. The figure below plots the relative gap between these 2 bounds:

(15)   \begin{align*} \frac{N_R - N_O}{N_R} \end{align*}

i.e., it plots how big the gap is relative to the size of the signal recover bound N_R. For each level of sparsity, the gap is shrinking as I add more and more attributes. This is an identical result as in the figure from Section 3: as the size of the market increases, traders learn next to nothing from each successive observation until they get within an inch of the signal recovery bound. The only difference here is that now there are an arbitrary number of shocks and the data matrix is random.

plot--opacity-vs-recovery-bound-relative-gap