Bias in Time-Series Regressions

1. Motivation

How persistent has IBM’s daily trading volume been over the last month? How persistent have Apple’s monthly stock returns been over the last $5$ years of trading? What about the US’s annual GDP growth over the last century? To answer these questions, why not just run an OLS regression,

(1) $\begin{align*} x_t = \widehat{\alpha} + \widehat{\rho} \cdot x_{t-1} + \widehat{\epsilon}_t, \end{align*}$

where $\widehat{\rho}$ denotes the estimated auto-correlation of the relevant data series? The Gauss-Markov Theorem says that an OLS regression will give a consistent, unbiased estimate of the persistence parameter, $\rho$ , right?

Wrong.

Although OLS estimates are still consistent when using time-series data (i.e., they converge to the correct value as the number of observations increases), they are no longer unbiased in finite samples (i.e., they may be systematically too large or too small when looking at $T = 100$ rather than $T \to \infty$ observations). To illustrate the severity of this problem, I simulate data of lengths $T = 21$ , $60$ , and $100$ ,

(2) $\begin{align*} x_t = 0 + \rho \cdot x_{t-1} + \epsilon_t \qquad \epsilon_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,1), \end{align*}$

and estimate the simple auto-regressive model from Equation (1) to recover $\rho$ . The figure below shows the results of this exercise. The left-most panel reveals that, when the true persistence parameter approaches one, $\rho \nearrow 1$ , the bias approaches $-0.20$ , or $\sfrac{1}{5}$ of the true coefficient size. In other words, if you simulate a time series of $T=21$ data points using $\rho = 1$ , then you’ll typically estimate a $\widehat{\rho} = 0.80$ !

What is it about time series data that induces the bias? Why doesn’t this problem exist in a cross-sectional regression? How can it exist even when the true coefficient is $\rho = 0$ ? This post answers these questions. All of the code can be found here.
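The simulation exercise described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code linked in the post: the function names (`simulate_ar1`, `ols_slope`, `average_bias`), the number of simulated paths, the seed, and the choice $x_0 = 0$ are all assumptions made for the example.

```python
import numpy as np

def simulate_ar1(rho, T, n_sims, rng):
    """Simulate n_sims paths x_0, ..., x_T from Equation (2), assuming x_0 = 0."""
    eps = rng.standard_normal((n_sims, T))
    x = np.zeros((n_sims, T + 1))
    for t in range(1, T + 1):
        x[:, t] = rho * x[:, t - 1] + eps[:, t - 1]
    return x

def ols_slope(x):
    """OLS slope from regressing x_t on a constant and x_{t-1}, path by path."""
    y, lag = x[:, 1:], x[:, :-1]
    y_dm = y - y.mean(axis=1, keepdims=True)
    lag_dm = lag - lag.mean(axis=1, keepdims=True)
    return (y_dm * lag_dm).sum(axis=1) / (lag_dm ** 2).sum(axis=1)

def average_bias(rho, T, n_sims=20_000, seed=0):
    """Monte Carlo estimate of E[rho_hat - rho] for a given rho and T."""
    rng = np.random.default_rng(seed)
    return ols_slope(simulate_ar1(rho, T, n_sims, rng)).mean() - rho

if __name__ == "__main__":
    for T in (21, 60, 100):
        for rho in (0.0, 0.5, 0.9):
            bias = average_bias(rho, T)
            print(f"T={T:3d}  rho={rho:.1f}  average bias={bias:+.3f}")
```

In this sketch $T$ counts the regression observations, so each simulated path holds $T + 1$ values including $x_0$ . The average bias should come out negative for each of these values of $\rho$ and should shrink toward zero as $T$ grows.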

2. Root of the Problem

Here is the short version. The bias in $\widehat{\rho}$ comes from having to estimate the mean of the time series with its sample average:

(3) $\begin{align*} \widehat{\mu} &= \frac{1}{T} \cdot \sum_t x_t = \frac{1}{T} \cdot \sum_t \left( \frac{1 - \rho^{T - (t-1)}}{1 - \rho} \right) \cdot \epsilon_t. \end{align*}$

If you knew the true mean, $\mu$ , then this source of bias in $\widehat{\rho}$ would disappear. Moreover, the bias goes away as you see more and more data (i.e., the estimator is consistent) because your estimated mean gets closer and closer to the true mean, $\lim_{T \to \infty} (\widehat{\mu} - \mu)^2 = 0$ . Let’s now dig into why not knowing the mean of a time series induces a bias in the OLS estimate of the slope.
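One way to check the role of the estimated mean numerically is to run the same regression twice on simulated data: once demeaning with the sample average, and once using the true mean, $\mu = 0$ . The sketch below is an illustration under assumed settings (the function name `slope_bias`, the seed, the number of paths, and $x_0 = 0$ are not from the post).

```python
import numpy as np

def slope_bias(rho, T, known_mean, n_sims=20_000, seed=0):
    """Average of rho_hat - rho when the mean is known (mu = 0) or estimated."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_sims, T))
    x = np.zeros((n_sims, T + 1))            # assumes x_0 = 0 and true mean 0
    for t in range(1, T + 1):
        x[:, t] = rho * x[:, t - 1] + eps[:, t - 1]
    y, lag = x[:, 1:], x[:, :-1]
    if not known_mean:
        # Estimated mean: demean both sides with their sample averages.
        y = y - y.mean(axis=1, keepdims=True)
        lag = lag - lag.mean(axis=1, keepdims=True)
    rho_hat = (y * lag).sum(axis=1) / (lag ** 2).sum(axis=1)
    return rho_hat.mean() - rho

if __name__ == "__main__":
    for known in (True, False):
        label = "known mean" if known else "estimated mean"
        print(f"{label:>14s}: bias at rho=0, T=21 is {slope_bias(0.0, 21, known):+.3f}")
```

The comparison is cleanest at $\rho = 0$ , where the known-mean estimate should be centered on zero while the estimated-mean version should be shifted down by roughly $\sfrac{1}{T}$ . For $\rho \neq 0$ , a smaller finite-sample bias can remain even when the mean is known.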

Estimating the coefficients in Equation (1) means choosing the parameters $\widehat{\mu}$ and $\widehat{\rho}$ to minimize the mean squared error between the left-hand-side variable, $x_t$ , and the right-hand-side variable, $x_{t-1}$ :

(4) $\begin{align*} \min_{\{\widehat{\mu},\widehat{\rho}\}} \, \frac{1}{T} \cdot \sum_t \left( \, x_t - \left\{ \widehat{\mu} + \widehat{\rho} \cdot (x_{t-1} - \widehat{\mu}) \right\} \, \right)^2. \end{align*}$

Any parameter choice, $\widehat{\rho}$ , that minimizes this error must also satisfy the first-order condition below:

(5) $\begin{align*} 0 &= - \, \frac{1}{T} \cdot \sum_t \left( \, x_t - \left\{ \widehat{\mu} + \widehat{\rho} \cdot (x_{t-1} - \widehat{\mu}) \right\} \, \right) \cdot (x_{t-1} - \widehat{\mu}). \end{align*}$

Substituting in the true functional form, $x_t = \mu + \rho \cdot x_{t-1} + \epsilon_t$ , then gives:

(6) $\begin{align*} 0 &= - \, \frac{1}{T} \cdot \sum_t \left( \, \left\{\mu + \rho \cdot x_{t-1} + \epsilon_t \right\} - \left\{ \widehat{\mu} + \widehat{\rho} \cdot (x_{t-1} - \widehat{\mu}) \right\} \, \right) \cdot (x_{t-1} - \widehat{\mu}). \end{align*}$

From here, it’s easy to solve for the expected difference between the estimated slope coefficient, $\widehat{\rho}$ , and the true slope coefficient, $\rho$ :

(7) $\begin{align*} \mathrm{E}\!\left[ \, \widehat{\rho} - \rho \, \right] &= \mathrm{E} \left[ \, \frac{ \frac{1}{T} \cdot \sum_t ( \epsilon_t - \{\widehat{\mu} - \mu\} ) \cdot (x_{t-1} - \widehat{\mu}) }{ \frac{1}{T} \cdot \sum_t (x_{t-1} - \widehat{\mu})^2 } \, \right]. \end{align*}$

Note that $\widehat{\epsilon}_t = \epsilon_t - \{\widehat{\mu} - \mu\}$ is just the regression residual. So, this equation says that the estimated persistence parameter, $\widehat{\rho}$ , will be too high if big $\widehat{\epsilon}_t$’s tend to follow periods in which $x_{t-1}$ is above its mean. Conversely, your estimated persistence parameter, $\widehat{\rho}$ , will be too low if big $\widehat{\epsilon}_t$’s tend to follow periods in which $x_{t-1}$ is below its mean. How can estimating the time-series mean, $\widehat{\mu}$ , induce this correlation while knowing the true mean, $\mu$ , not?

Clearly, we need to compute the average of the time series given in Equation (3). For simplicity, let’s assume that the true mean is $\mu = 0$ and the initial value is $x_0 = 0$ . Under these conditions, each successive term of the time series is just a weighted sum of past and current shocks:

(8) $\begin{align*} x_t &= \sum_{s=1}^t \rho^{t-s} \cdot \epsilon_s. \end{align*}$
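To see where the weights in Equation (3) come from, sum Equation (8) over $t$ and swap the order of summation. The short derivation below assumes $\mu = 0$ , $x_0 = 0$ , and $\rho \neq 1$ ; the index $s$ is introduced here only to label the shocks:

$\begin{align*} \sum_{t=1}^T x_t &= \sum_{t=1}^T \sum_{s=1}^t \rho^{t-s} \cdot \epsilon_s = \sum_{s=1}^T \left( \sum_{t=s}^T \rho^{t-s} \right) \cdot \epsilon_s = \sum_{s=1}^T \left( \frac{1 - \rho^{T - (s-1)}}{1 - \rho} \right) \cdot \epsilon_s. \end{align*}$

Dividing by $T$ gives Equation (3). When $|\rho| < 1$ , every shock enters the sample average with a strictly positive weight, including the shocks that arrive after any given date.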

So, the sample average given in Equation (3) must contain information about future shock realizations, $\{\epsilon_t, \epsilon_{t+1}, \ldots, \epsilon_T\}$ . Consider the logic when the true persistence parameter is positive, $\rho > 0$ , and the true mean is zero, $\mu = 0$ . If the current period’s realization of $x_{t-1}$ is below the estimated mean, $\widehat{\mu}$ , then future $\epsilon_t$’s have to be above the estimated mean by definition; otherwise, $\widehat{\mu}$ wouldn’t be the mean. Conversely, if the current period’s realization of $x_{t-1}$ is above the estimated mean, then future $\epsilon_t$’s have to be below the estimated mean. As a result, the sample covariance between $(x_{t-1} - \widehat{\mu})$ and $(\epsilon_t - \widehat{\mu})$ must be negative:

(9) $\begin{align*} 0 > \frac{1}{T} \cdot \sum_t (\epsilon_t - \widehat{\mu}) \cdot (x_{t-1} - \widehat{\mu}). \end{align*}$

So, when the true slope parameter is positive, the OLS estimate will be biased downward.
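The sign of the covariance in Equation (9) can be checked by simulation. The sketch below averages that sample covariance over many simulated paths; the function name `average_sample_covariance`, the seed, the number of paths, and the choice to sum over $t = 2, \ldots, T$ are assumptions made for this illustration.

```python
import numpy as np

def average_sample_covariance(rho, T, n_sims=20_000, seed=0):
    """Average over simulated paths of (1/T) * sum_t (eps_t - mu_hat) * (x_{t-1} - mu_hat)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_sims, T))          # eps[:, t-1] holds epsilon_t
    x = np.zeros((n_sims, T + 1))                   # x[:, 0] is x_0 = 0, true mean 0
    for t in range(1, T + 1):
        x[:, t] = rho * x[:, t - 1] + eps[:, t - 1]
    mu_hat = x[:, 1:].mean(axis=1, keepdims=True)   # sample mean of x_1, ..., x_T
    lagged = x[:, 1:-1]                             # x_{t-1} for t = 2, ..., T
    shocks = eps[:, 1:]                             # epsilon_t for t = 2, ..., T
    cov = ((shocks - mu_hat) * (lagged - mu_hat)).sum(axis=1) / T
    return cov.mean()

if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        print(f"rho={rho:.1f}  average covariance={average_sample_covariance(rho, 21):+.3f}")
```

Any single path can produce a positive value, so the inequality is best read as a statement about the average across paths, which is what the function returns.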

3. Cross-Sectional Regressions

Cross-sectional regressions don’t have this problem because estimating the mean of the right-hand-side variable,

(10) $\begin{align*} \widehat{\mu}_x = \frac{1}{N} \cdot \sum_n x_n, \end{align*}$

doesn’t tell you anything about the error terms. For example, imagine you had $N$ data points generated by the following model,

(11) $\begin{align*} y_n = \mu_y + \beta \cdot (x_n - \mu_x) + \epsilon_n \qquad x_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(\mu_x, \sigma_x^2) \qquad \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0, \sigma_{\epsilon}^2), \end{align*}$

where each observation of the right-hand-side variable, $x_n$ , is independently drawn from the same distribution. In this setting, the slope coefficient from the associated cross-sectional regression,

(12) $\begin{align*} y_n &= \widehat{\mu}_y + \widehat{\beta} \cdot (x_n - \widehat{\mu}_x) + \widehat{\epsilon}_n, \end{align*}$

won’t be biased because $\widehat{\mu}_x$ isn’t a function of any of the error terms, $\{\epsilon_1, \epsilon_2, \ldots, \epsilon_N\}$ :

(13) $\begin{align*} \left( \frac{1}{N} \cdot \sum_n x_n \right) \perp \epsilon_n. \end{align*}$

So, estimating the mean won’t induce any covariance between the residuals, $\widehat{\epsilon}_n = \epsilon_n - \{ \widehat{\mu}_y - \mu_y \}$ , and the right-hand-side variable, $x_n - \widehat{\mu}_x$ . All the conditions of the Gauss-Markov Theorem hold. If you only have a small number of observations, then $\widehat{\beta}$ may be a noisy estimate, but at least it will be unbiased.
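The same kind of simulation can be run for the cross-sectional model in Equation (11). The sketch below assumes $\mu_x = \mu_y = 0$ and $\sigma_x = \sigma_{\epsilon} = 1$ for simplicity; the function name `cross_section_bias`, the seed, and the number of simulated samples are illustrative choices.

```python
import numpy as np

def cross_section_bias(beta, N, n_sims=20_000, seed=0):
    """Average of beta_hat - beta across simulated cross-sections of size N."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_sims, N))            # assumes mu_x = 0, sigma_x = 1
    eps = rng.standard_normal((n_sims, N))          # assumes sigma_eps = 1
    y = beta * x + eps                              # assumes mu_y = 0
    x_dm = x - x.mean(axis=1, keepdims=True)        # demean with the estimated mean
    y_dm = y - y.mean(axis=1, keepdims=True)
    beta_hat = (x_dm * y_dm).sum(axis=1) / (x_dm ** 2).sum(axis=1)
    return beta_hat.mean() - beta

if __name__ == "__main__":
    for N in (10, 21, 100):
        print(f"N={N:3d}  average bias={cross_section_bias(0.5, N):+.4f}")
```

Even with only $N = 10$ observations, the average error should be indistinguishable from zero up to simulation noise, although individual estimates will be quite noisy.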

4. Bias At Zero

One of the more interesting things about the slope coefficient bias in time series regressions is that it doesn’t disappear when the true parameter value is $\rho = 0$ . For instance, in the figure above, notice that the expected bias disappears at $\rho = -\sfrac{1}{3}$ and is negative at $\rho = 0$ . Put differently, if you estimated the correlation between Apple’s returns in successive months and found a parameter value of $\widehat{\rho} = -0.03$ , then the true coefficient is likely $\rho = 0$ . In fact, Kendall (1954) derives an approximate expression for this bias when the true data generating process is an $\mathrm{AR}(1)$ :

(14) $\begin{align*} \mathrm{E}[ \, \widehat{\rho} - \rho \, ] &\approx - \, \frac{1 + 3 \cdot \rho}{T}. \end{align*}$
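Equation (14) is easy to evaluate directly. The helper below, `kendall_bias` (a name chosen for this example), plugs in the values of $\rho$ and $T$ discussed in the post.

```python
def kendall_bias(rho, T):
    """Approximate E[rho_hat - rho] for an AR(1), following Equation (14)."""
    return -(1.0 + 3.0 * rho) / T

if __name__ == "__main__":
    # rho = 1 with T = 21, rho = 0 with 5 years of monthly data, and rho = -1/3.
    for rho, T in [(1.0, 21), (0.0, 60), (-1.0 / 3.0, 21)]:
        print(f"rho={rho:+.3f}  T={T:3d}  approximate bias={kendall_bias(rho, T):+.4f}")
```

With $\rho = 1$ and $T = 21$ the approximation gives $-\sfrac{4}{21} \approx -0.19$ , in line with the bias of about $-0.20$ reported in the motivation, and it equals zero at $\rho = -\sfrac{1}{3}$ for any sample length.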

A simple two-period example illustrates why this is so. Imagine a world where the true coefficient is $\rho = 0$ , and you see the pair of data points in the left panel of the figure below, with the first observation lower than the second. If $\mu = 0$ as well, then we have that:

(15) $\begin{align*} \widehat{\mu} = \sfrac{x_1}{2} + \sfrac{x_2}{2} = \sfrac{\epsilon_1}{2} + \sfrac{\epsilon_2}{2}. \end{align*}$

I plot this sample mean with the green dashed line. In the right panel of the figure, I show the distances $(x_1 - \widehat{\mu})$ and $(\epsilon_2 - \widehat{\mu})$ in red and blue respectively. Clearly, if the first observation, $x_1$ , is below the line, then the second observation must be above the line. But, since $\rho = 0$ , the second observation is just $x_2 = \epsilon_2$ , so you will observe a negative correlation between $(x_1 - \widehat{\mu})$ and $(\epsilon_2 - \widehat{\mu})$ . As a result, $\widehat{\rho}$ will be downward biased.
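The two-observation case can also be worked out by hand. Under the same assumptions as above ( $\rho = 0$ and $\mu = 0$ , so that $x_1 = \epsilon_1$ and $x_2 = \epsilon_2$ ), substituting the sample mean from Equation (15) gives:

$\begin{align*} (x_1 - \widehat{\mu}) \cdot (\epsilon_2 - \widehat{\mu}) &= \left( \frac{\epsilon_1 - \epsilon_2}{2} \right) \cdot \left( \frac{\epsilon_2 - \epsilon_1}{2} \right) = - \, \frac{(\epsilon_1 - \epsilon_2)^2}{4} \leq 0. \end{align*}$

So the product of the two distances is never positive, whichever observation happens to be the larger one.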