Wavelet Variance

1. Motivation

Imagine you’re a trader who’s about to put on a position for the next month. You want to hedge away the risk in this position associated with daily fluctuations in market returns. One way you might do this would be to short the S&P 500, since E-mini futures contracts are among the most liquid in the world.

plot--sp500-price-volume--24jul2014

Flash_Crash

But… how much of the variation in the index’s returns is due to fluctuations at the daily horizon? e.g., the blue line in the figure to the right shows the minute-by-minute price of the E-mini contract on May 6th, 2010 during the flash crash. Over the course of 4 minutes, the contract price fell 3{\scriptstyle \%}! It then rebounded back to nearly its original level over the next hour. Clearly, if most of the fluctuations in the E-mini S&P 500 contract value are due to shocks on the sub-hour time scale, this contract will do a poor job hedging away daily market risk.

This post demonstrates how to decompose the variance of a time series (e.g., the minute-by-minute returns on the E-mini) into horizon-specific components using wavelets. i.e., using the wavelet variance estimator allows you to ask the questions: “How much of the variance is coming from fluctuations on the scale of 16 minutes? 1 hour? 1 day? 1 month?” I then investigate how this wavelet variance approach compares to other methods financial economists might employ such as auto-regressive models and spectral analysis.

2. Wavelet Analysis

In order to explain how the wavelet variance estimator works, I first need to give a quick outline of how wavelets work. Wavelets allow you to decompose a signal into components that are independent in both the time and frequency domains. This outline will be as bare bones as possible. See Percival and Walden (2000) for an excellent overview of the topic.

Imagine you’ve got a time series of just T = 8 returns:

(1)   \begin{align*} \mathbf{r} = \begin{bmatrix} r_0 & r_1 & r_2 & r_3 & r_4 & r_5 & r_6 & r_7 \end{bmatrix}^{\top} \end{align*}

and assume for simplicity that these returns have mean \mathrm{E}[r_t] = \mu_r = 0. One thing that you might do with this time series is estimate a regression with time fixed effects: r_t = \sum_{t'=0}^7 \vartheta_{t'} \cdot 1_{\{\mathrm{Time}(r_t) = t'\}}. Here is another way to represent the same regression:

(2)   \begin{align*} \begin{bmatrix} r_0 \\ r_1 \\ r_2 \\ r_3 \\ r_4 \\ r_5 \\ r_6 \\ r_7 \end{bmatrix} &= \begin{pmatrix}  1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\  0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\  0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\  0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\  0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\  0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\  0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\  0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix} \begin{bmatrix} \vartheta_0 \\ \vartheta_1 \\ \vartheta_2 \\ \vartheta_3 \\ \vartheta_4 \\ \vartheta_5 \\ \vartheta_6 \\ \vartheta_7 \end{bmatrix} \end{align*}

It’s really a trivial projection since \vartheta_t = r_t. Call the projection matrix \mathbf{F} for “fixed effects” so that \mathbf{r} = \mathbf{F}{\boldsymbol \vartheta}.

Obviously, the above time fixed effect model would be a bit of a silly thing to estimate, but notice that the projection matrix \mathbf{F} has an interesting property. Namely, each column is orthonormal:

(3)   \begin{align*}  \langle \mathbf{f}(t) | \mathbf{f}(t') \rangle = \begin{cases} 1 &\text{if } t = t' \\ 0 &\text{else } \end{cases} \end{align*}

It’s orthogonal because \langle \mathbf{f}(t) | \mathbf{f}(t') \rangle = 0 unless t = t'. This requirement implies that each column in the projection matrix is picking up different information about \mathbf{r}. It’s normal because \langle \mathbf{f}(t) | \mathbf{f}(t) \rangle is normalized to equal 1. This requirement implies that the projection matrix is leaving the magnitude of \mathbf{r} unchanged. The time fixed effects projection matrix, \mathbf{F}, compares each successive time period, but you can also think about using other orthonormal bases.

e.g., the Haar wavelet projection matrix compares how the 1st half of the time series differs from the 2nd half, how the 1st quarter differs from the 2nd quarter, how the 3rd quarter differs from the 4th quarter, how the 1st eighth differs from the 2nd eighth, and so on… For the 8 period return time series, let’s denote the columns of the wavelet projection matrix as:

(4)   \begin{align*} \mathbf{w}(3,0) &= \sfrac{1}{\sqrt{8}} \cdot \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix}^{\top} \\ \mathbf{w}(2,0) &= \sfrac{1}{\sqrt{8}} \cdot \begin{bmatrix} 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \end{bmatrix}^{\top} \\ \mathbf{w}(1,0) &= \sfrac{1}{\sqrt{4}} \cdot \begin{bmatrix} 1 & 1 & -1 & -1 & 0 & 0 & 0 & 0 \end{bmatrix}^{\top} \\ \mathbf{w}(1,1) &= \sfrac{1}{\sqrt{4}} \cdot \begin{bmatrix} 0 & 0 & 0 & 0 & 1 & 1 & -1 & -1 \end{bmatrix}^{\top} \\ \mathbf{w}(0,0) &= \sfrac{1}{\sqrt{2}} \cdot \begin{bmatrix} 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}^{\top} \\ \mathbf{w}(0,1) &= \sfrac{1}{\sqrt{2}} \cdot \begin{bmatrix} 0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 \end{bmatrix}^{\top} \\ \mathbf{w}(0,2) &= \sfrac{1}{\sqrt{2}} \cdot \begin{bmatrix} 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \end{bmatrix}^{\top} \\ \mathbf{w}(0,3) &= \sfrac{1}{\sqrt{2}} \cdot \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 \end{bmatrix}^{\top} \end{align*}

and simple inspection shows that each column is orthonormal:

(5)   \begin{align*}  \langle \mathbf{w}(h,i) | \mathbf{w}(h',i') \rangle = \begin{cases} 1 &\text{if } h = h', \; i = i' \\ 0 &\text{else } \end{cases} \end{align*}

Let’s look at a concrete example. Suppose that we want to project the vector:

(6)   \begin{align*} \mathbf{r} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}^{\top} \end{align*}

onto the wavelet basis:

(7)   \begin{align*} \begin{bmatrix} r_0 \\ r_1 \\ r_2 \\ r_3 \\ r_4 \\ r_5 \\ r_6 \\ r_7 \end{bmatrix} &= \begin{pmatrix}  \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{8}}  & \sfrac{1}{\sqrt{4}}  & 0 & \sfrac{1}{\sqrt{2}} & 0 & 0 & 0 \\  \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{8}}  & \sfrac{1}{\sqrt{4}}  & 0 & -\sfrac{1}{\sqrt{2}} & 0 & 0 & 0 \\  \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{8}}  & -\sfrac{1}{\sqrt{4}} & 0 & 0 & \sfrac{1}{\sqrt{2}} & 0 & 0 \\  \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{8}}  & -\sfrac{1}{\sqrt{4}} & 0 & 0 & -\sfrac{1}{\sqrt{2}} & 0 & 0 \\  \sfrac{1}{\sqrt{8}} & -\sfrac{1}{\sqrt{8}} & 0 & \sfrac{1}{\sqrt{4}}  & 0 & 0 & \sfrac{1}{\sqrt{2}} & 0 \\  \sfrac{1}{\sqrt{8}} & -\sfrac{1}{\sqrt{8}} & 0 & \sfrac{1}{\sqrt{4}}  & 0 & 0 & -\sfrac{1}{\sqrt{2}} & 0 \\  \sfrac{1}{\sqrt{8}} & -\sfrac{1}{\sqrt{8}} & 0 & -\sfrac{1}{\sqrt{4}} & 0 & 0 & 0 & \sfrac{1}{\sqrt{2}} \\  \sfrac{1}{\sqrt{8}} & -\sfrac{1}{\sqrt{8}} & 0 & -\sfrac{1}{\sqrt{4}} & 0 & 0 & 0 & -\sfrac{1}{\sqrt{2}} \end{pmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \\ \theta_4 \\ \theta_5 \\ \theta_6 \\ \theta_7 \end{bmatrix} \end{align*}

What would the wavelet coefficients {\boldsymbol \theta} look like? Well, a little trial and error shows that:

(8)   \begin{align*} {\boldsymbol \theta} = \begin{bmatrix} \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{8}} & \sfrac{1}{\sqrt{4}} & 0 & \sfrac{1}{\sqrt{2}} & 0 & 0 & 0 \end{bmatrix}^{\top} \end{align*}

since this is the only combination of coefficients that satisfies both r_0 = 1:

(9)   \begin{align*} 1 &=  r_0 \\ &= \frac{1}{\sqrt{8}} \cdot w_0(3,0) + \frac{1}{\sqrt{8}} \cdot w_0(2,0) + \frac{1}{\sqrt{4}} \cdot w_0(1,0) + \frac{1}{\sqrt{2}} \cdot w_0(0,0) \\ &= \frac{1}{8} + \frac{1}{8} + \frac{1}{4} + \frac{1}{2} \end{align*}

and r_t = 0 for all t > 0.

What’s cool about the wavelet projection is that the coefficients represent effects that are isolated in both the frequency and time domains. The index h=0,1,2,3 denotes the \log_2 length of the wavelet comparison groups. e.g. the 4 wavelets with h=0 compare 2^0 = 1 period increments: the 1st period to the 2nd period, the 3rd period to the 4th period, and so on… Similarly, the wavelets with h=1 compare 2^1 = 2 period increments: the 1st 2 periods to the 2nd 2 periods and the 3rd 2 periods to the 4th 2 periods. Thus, the h captures the location of the coefficient in the frequency domain. The index i=0,\ldots,I_h-1 signifies which comparison group at horizon h we are looking at. e.g., when h=0, there are I_0 = 4 = \sfrac{8}{2^{0+1}} different comparisons to be made. Thus, the i captures the location of the coefficient in the time domain.
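To make the projection concrete, here is a minimal sketch in Python/numpy (my own illustration; it is not the code behind the original figures) that builds the 8 Haar basis vectors from Equation (4), checks the orthonormality condition in Equation (5), and recovers the coefficients in Equation (8) for the example vector above.

```python
import numpy as np

# Columns of the Haar projection matrix W for T = 8, in the order
# w(3,0), w(2,0), w(1,0), w(1,1), w(0,0), w(0,1), w(0,2), w(0,3).
W = np.column_stack([
    np.array([ 1,  1,  1,  1,  1,  1,  1,  1]) / np.sqrt(8),
    np.array([ 1,  1,  1,  1, -1, -1, -1, -1]) / np.sqrt(8),
    np.array([ 1,  1, -1, -1,  0,  0,  0,  0]) / np.sqrt(4),
    np.array([ 0,  0,  0,  0,  1,  1, -1, -1]) / np.sqrt(4),
    np.array([ 1, -1,  0,  0,  0,  0,  0,  0]) / np.sqrt(2),
    np.array([ 0,  0,  1, -1,  0,  0,  0,  0]) / np.sqrt(2),
    np.array([ 0,  0,  0,  0,  1, -1,  0,  0]) / np.sqrt(2),
    np.array([ 0,  0,  0,  0,  0,  0,  1, -1]) / np.sqrt(2),
])

# Orthonormality: W'W should be the identity matrix (Equation (5)).
assert np.allclose(W.T @ W, np.eye(8))

# Project r = e_0 onto the basis. Because W is orthonormal, theta = W'r.
r = np.array([1, 0, 0, 0, 0, 0, 0, 0], dtype=float)
theta = W.T @ r
print(theta)   # [1/sqrt(8), 1/sqrt(8), 1/sqrt(4), 0, 1/sqrt(2), 0, 0, 0]

# And W @ theta recovers r exactly.
assert np.allclose(W @ theta, r)
```

Because the basis is orthonormal, the coefficients are just inner products, {\boldsymbol \theta} = \mathbf{W}^{\top}\mathbf{r}, so no trial and error is actually needed.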

3. Wavelet Variance

With these basics in place, it’s now easy to define the wavelet variance of a time series. First, I massage the standard representation of a series’ variance a bit. The variance of our 8 term series is defined as:

(10)   \begin{align*}  \sigma_r^2 &= \frac{1}{T} \cdot \sum_t r_t^2  \end{align*}

since \mu_r = 0. Using the tools from the section above, let’s rewrite \mathbf{r} = \mathbf{W}{\boldsymbol \theta}. This means that the variance formula becomes:

(11)   \begin{align*}  \sigma_r^2 &= \frac{1}{T} \cdot \mathbf{r}^{\top} \mathbf{r} =  \frac{1}{T} \cdot \left( \mathbf{W} {\boldsymbol \theta} \right)^{\top} \left( \mathbf{W} {\boldsymbol \theta} \right) \end{align*}

But I know that \mathbf{W}^{\top} \mathbf{W} = \mathbf{I} since each of the columns is orthonormal. Thus:

(12)   \begin{align*}  \sigma_r^2 &= \frac{1}{T} \cdot {\boldsymbol \theta}^{\top} {\boldsymbol \theta} = \frac{1}{T} \cdot \sum_{h,i} \theta(h,i)^2 \end{align*}

This representation gives the variance of a series as an average of squared wavelet coefficients.

The sum of the squared wavelet coefficients at each horizon, h, is then an interesting object:

(13)   \begin{align*} V(h) &= \frac{1}{T} \cdot \sum_{i=0}^{I_h - 1} \theta(h,i)^2 \end{align*}

since V(h) denotes the portion of the total variance of the time series explained by comparing successive periods of length 2^h. I refer to V(h) as the wavelet variance of a series at horizon h. The sum of the wavelet variances at each horizon gives the total variance:

(14)   \begin{align*} \sum_{h=0}^H V(h) &= \sigma_r^2 \end{align*}
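As a quick numerical sanity check of Equations (13) and (14), the sketch below (Python/numpy, my own illustration) computes the Haar coefficients of a mean-zero series of length T = 2^H by recursive pairwise differencing and verifies that the horizon-specific variances V(h) sum to \sigma_r^2.

```python
import numpy as np

def haar_dwt(x):
    """Return {h: detail coefficients at horizon h} for len(x) = 2^H.
    The highest key holds the single scaling (overall average) coefficient."""
    a, h, coeffs = np.asarray(x, dtype=float), 0, {}
    while len(a) > 1:
        coeffs[h] = (a[0::2] - a[1::2]) / np.sqrt(2)   # differences: theta(h, i)
        a = (a[0::2] + a[1::2]) / np.sqrt(2)           # running averages
        h += 1
    coeffs[h] = a                                      # scaling coefficient
    return coeffs

def wavelet_variance(x):
    """V(h) = (1/T) * sum_i theta(h, i)^2 for each horizon h."""
    T, coeffs = len(x), haar_dwt(x)
    return {h: np.sum(c ** 2) / T for h, c in coeffs.items()}

# Sanity check on a mean-zero series: the V(h) sum to the total variance.
rng = np.random.default_rng(42)
r = rng.standard_normal(1024)
r -= r.mean()                        # impose mu_r = 0 as in the text
V = wavelet_variance(r)
assert np.isclose(sum(V.values()), np.mean(r ** 2))
```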

4. Numerical Example

Let’s take a look at how the wavelet variance of a time series behaves out in the wild. The code I used to create the figures can be found here. Specifically, let’s study the simulated data plotted below, which consists of 63 days of minute-by-minute return data with day-specific shocks:

(15)   \begin{align*} r_t &= \mu_{r,t} + \sigma_r \cdot \epsilon_t \qquad \text{with} \qquad \epsilon_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,1) \end{align*}

where the volatility of the process is given by \sigma_r = 0.01{\scriptstyle \mathrm{bp}/\sqrt{\mathrm{min}}} and there is a 5{\scriptstyle \%} probability of realizing a \mu_{r,t} = \pm 0.001{\scriptstyle \mathrm{bp}/\mathrm{min}} shock on any given day. The 4 days on which the data realized a shock are highlighted in red. These minute-by-minute figures amount to a 0{\scriptstyle \%/\mathrm{yr}} annualized return and a 31{\scriptstyle \%/\mathrm{yr}} annualized volatility.

plot--why-use-wavelet-variance--daily-shocks--time-series--25jul2014
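For reference, simulating a series like the one plotted above is straightforward. The sketch below is my own Python/numpy illustration, with the parameter values taken at face value from the text and the units left as quoted: it draws a day-level drift that is zero on most days and \pm the shock size on roughly 5{\scriptstyle \%} of days, then adds Gaussian minute-level noise as in Equation (15).

```python
import numpy as np

rng = np.random.default_rng(0)

n_days, mins_per_day = 63, 390             # 63 trading days of minute-by-minute returns
sigma_r = 0.01                             # per-minute noise volatility (units as quoted in the text)
shock_prob, shock_size = 0.05, 0.001       # 5% chance of a +/- day-specific drift shock

# Draw one drift per day: 0 with probability 0.95, +/- shock_size otherwise.
has_shock = rng.random(n_days) < shock_prob
signs = rng.choice([-1.0, 1.0], size=n_days)
mu_day = np.where(has_shock, signs * shock_size, 0.0)

# Minute-by-minute returns: r_t = mu_{r,t} + sigma_r * eps_t.
mu_t = np.repeat(mu_day, mins_per_day)
r = mu_t + sigma_r * rng.standard_normal(n_days * mins_per_day)
print(r.shape, has_shock.sum())            # (24570,) and the realized number of shock days
```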

The figure below then plots the wavelet coefficients, {\boldsymbol \theta}, at each horizon associated with this time series. A trading day is 6.5 \times 60 = 390{\scriptstyle \mathrm{min}}, so notice the spikes in the coefficient values in the h=6,7,8 panels near the day-specific shock dates, corresponding to comparisons of successive 64-, 128-, and 256-minute intervals. The remaining variation in the coefficient levels comes from the underlying white noise process \epsilon_t. Because the break points in the wavelet projection affect the estimated coefficients, each data point in the plot actually represents the average of the coefficient estimates \theta_t(h,i) at a given point in time for all possible starting points. See Percival and Walden (2000, Ch. 5) on the maximal overlap discrete wavelet transform for details.

plot--why-use-wavelet-variance--daily-shocks--wavelet-coefficients--25jul2014

Finally, I plot the \log of the wavelet variance at each horizon h for both the simulated return process (red) and a white noise process with an identical mean and variance (blue). Note that I’ve switched from \log_2 to \log_e on the x-axis here, so a spike in the amount of variance at h=6 corresponds to a spike in the amount of variance explained by successive e^{6} \approx 400{\scriptstyle \mathrm{min}} increments. This is exactly what you’d expect for day-specific shocks which have a duration of 390{\scriptstyle \mathrm{min}} as indicated by the vertical gray line. The wavelet variance of an appropriately scaled white noise process gives a nice comparison group. To see why, note that for covariance stationary processes like white noise, the wavelet variance at a particular horizon is related to the power spectrum as follows:

(16)   \begin{align*} V(h) &\approx 2 \cdot \int_{\sfrac{1}{2^{h+1}}}^{\sfrac{1}{2^h}} S(f) \cdot df \end{align*}

Thus, the wavelet variance of white noise should follow a power law with:

(17)   \begin{align*} V(h) &\propto 2^{-h} \end{align*}

giving a nice smooth reference point in plots.

plot--why-use-wavelet-variance--daily-shocks--wavelet-variance--25jul2014

5. Comparing Techniques

I conclude by considering how the wavelet variance statistic compares to other ways that a financial economist might look for horizon-specific effects in data. I consider 2 alternatives: auto-regressive models and spectral density estimators. First, consider estimating the auto-regressive model below with lags \ell = 1,2,\ldots,L:

(18)   \begin{align*} r_t &= \sum_{\ell=1}^L C(\ell) \cdot r_{t-\ell} + \xi_t \qquad \text{where} \qquad \xi_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_{\xi}^2) \end{align*}

The left-most panel of the figure below reports the estimated values of C(\ell) for lags \ell = 1,2,\ldots,420 using the simulated data (red) as well as a scaled white noise process (blue). Just as before, the vertical grey line denotes the number of minutes in a day. There is no meaningful difference between the 2 sets of coefficients. The reason is that the day-specific shocks are asynchronous. They aren’t coming at regular intervals. Thus, no obvious lag structure can emerge from the data.
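For completeness, estimating the C(\ell) in Equation (18) is just OLS on a lagged design matrix. The sketch below is a hedged Python/numpy illustration, not the code behind the figure; the figure itself uses L = 420 lags on the minute-level data, which works the same way but is slower.

```python
import numpy as np

def fit_ar_ols(r, L):
    """Estimate C(1), ..., C(L) in r_t = sum_l C(l) r_{t-l} + xi_t by OLS
    (no intercept, matching the zero-mean setup in the text)."""
    r = np.asarray(r, dtype=float)
    T = len(r)
    # Lagged design matrix: column l-1 holds r_{t-l} for t = L, ..., T-1.
    X = np.column_stack([r[L - l:T - l] for l in range(1, L + 1)])
    y = r[L:]
    C, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ C
    return C, resid.var()

# Example on white noise: the estimated C(l) should all be close to zero.
rng = np.random.default_rng(1)
C_hat, sigma2_xi = fit_ar_ols(rng.standard_normal(5000), L=30)
print(np.round(C_hat[:5], 3), round(sigma2_xi, 3))
```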

plot--why-use-wavelet-variance--daily-shocks--analysis--25jul2014

Next, let’s think about estimating the spectral density of \mathbf{r}. This turns out to be the exact same exercise as the auto-regressive model estimation in different clothing. As shown in an earlier post, it’s possible to flip back and forth between the coefficients of an \mathrm{AR}(L) process and its spectral density via the relationship:

(19)   \begin{align*} S(f) &= \frac{\sigma_{\xi}^2}{\left\vert \, 1 - \sum_{\ell=1}^L C(\ell) \cdot e^{-i \cdot 2 \cdot \pi \cdot f \cdot \ell} \, \right\vert^2} \end{align*}

This one-to-one mapping between the frequency domain and the time domain for covariance stationary processes is known as the Wiener–Khinchin theorem with \sigma_x^2 = \int_{-\sfrac{1}{2}}^{\sfrac{1}{2}} S(f) \cdot df. Thus, the spectral density plot just reflects the same random noise as the auto-regressive model coefficients because of the same issue with asynchrony. The most interesting features of the middle panel occur at really high frequencies which have nothing to do with the day-specific shocks.
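To make Equation (19) concrete, here is a small Python/numpy sketch (my own illustration, with frequencies measured in cycles per period) that evaluates the AR spectral density and checks the Wiener–Khinchin relation numerically for an AR(1) example.

```python
import numpy as np

def ar_spectrum(C, sigma2_xi, freqs):
    """S(f) = sigma_xi^2 / |1 - sum_l C(l) exp(-i 2 pi f l)|^2  (Equation (19))."""
    C = np.asarray(C, dtype=float)
    lags = np.arange(1, len(C) + 1)
    transfer = 1.0 - np.exp(-2j * np.pi * np.outer(freqs, lags)) @ C
    return sigma2_xi / np.abs(transfer) ** 2

# Worked example: an AR(1) with C(1) = 0.5 and sigma_xi^2 = 1 has variance
# sigma_x^2 = 1 / (1 - 0.5^2) = 4/3. Averaging S(f) over the unit-length band
# [-1/2, 1/2] approximates the integral and should recover that number.
freqs = np.linspace(-0.5, 0.5, 20001)
S = ar_spectrum([0.5], 1.0, freqs)
print(round(S.mean(), 4))   # approximately 1.3333
```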

Here’s the punchline: of the 3 estimators, the wavelet variance is the only one that can identify horizon-specific contributions to a time series’ variance when those contributions are not stationary.

WSJ Article Subject Tags

1. Motivation

[Screenshot: WSJ article meta data with subject tags]

This post investigates the distribution of subject tags for Wall Street Journal articles that mention S&P 500 companies. e.g., a December 2009 article entitled “When Even Your Phone Tells You You’re Drunk, It’s Time to Call a Taxi,” about a new iPhone app that alerted you when you were too drunk to drive, had the meta data shown to the right. The subject tags are essentially article keywords. I collect every article that references an S&P 500 company over the period from 01/01/2008 to 12/31/2012. This post is an appendix to my paper, Local Knowledge in Financial Markets.

I find that there is substantial heterogeneity in how many different topics people write about when discussing a company even after controlling for the number of total articles. e.g., there were 87 articles in the WSJ referencing Garmin (GRMN) and 81 articles referencing Sprint (S); however, while there were only 87 different subject tags used in the articles about Garmin, there were 716 different subject tags used in the articles about Sprint! This finding is consistent with the idea that some firms face a much wider array of shocks than others. i.e., the width of the market matters.

2. Data Collection

The data are hand-collected from the ProQuest newspaper archive by an RA. The data collection process for an example company, Agilent Technologies (A), is summarized in the 3 figures below. First, we searched for each company included in the S&P 500 from 01/01/2008 to 12/31/2012 [list]. Then, after each query, we restricted the results to articles found in the WSJ. Finally, we downloaded the articles and meta data in HTML format.

After the RA collected all of the data, I used a Python script to parse the resulting HTML files into a form I could manage in R. Roughly 4000 of the downloaded articles were duplicates resulting from the WSJ publishing the same article in different editions. I identify these observations by checking for articles published on the same day with identical word counts about the same companies. I tried using Selenium to automate the data collection process, but the ProQuest web interface proved too finicky.
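The de-duplication rule above is easy to express in pandas. The sketch below is purely illustrative; the column names are hypothetical stand-ins for whatever the parsing script actually produces.

```python
import pandas as pd

# Hypothetical layout of the parsed article meta data; the real column names
# produced by the parsing script may differ.
articles = pd.DataFrame({
    "company":    ["GRMN", "GRMN", "S", "S"],
    "date":       ["2009-12-01", "2009-12-01", "2009-12-01", "2009-12-02"],
    "word_count": [512, 512, 804, 804],
    "subjects":   ["gps; apps", "gps; apps", "telecom", "telecom"],
})

# Drop same-day duplicates about the same company with identical word counts,
# i.e., the same article published in different WSJ editions.
deduped = articles.drop_duplicates(subset=["company", "date", "word_count"])
print(len(articles), "->", len(deduped))
```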

3. Summary Statistics

My data set contains 106{\scriptstyle \mathrm{k}} articles over 5 years about 542 companies. Many articles reference multiple S&P 500 companies. The figure below plots the total number of articles in the database per month. There is a steady downward trend. The first part of the sample was the height of the financial crisis, so this makes sense. As markets have calmed down, journalists have devoted fewer articles to corporate news relative to other things such as politics and sports.

plot--wsj-articles-about-sp500-companies-per-month--18jul2014

Articles are not evenly distributed across companies as shown by the figure below. While the median company is only referenced in 21 articles over the sample period, the 5 most popular companies (United Parcel Service [UPS], Apple [AAPL], Goldman Sachs [GS], Citibank [C], and Ford [F]) are all referenced in at least 1922 different articles apiece. By comparison, the least popular 1{\scriptstyle \%} of companies are mentioned in only 1 article in 5 years.

plot--articles-per-firm--18jul2014

Counting subject tags is a bit less straightforward than counting articles. I do not count tags that are specific to the WSJ rather than the company. e.g., tags containing “(wsj)” flagging daily features like “Abreast of the market (wsj).” I also remove missing subjects. It’s worth pointing out that sometimes the meta data for an article doesn’t contain any subject information. After these restrictions, the data contain 10{\scriptstyle \mathrm{k}} unique subject tags.

The distribution of subject tag counts per month is similar to that of article counts as shown in the figure below but with a less pronounced downward trend. To create this figure, I count the number of unique subject tags used each month. e.g., if “technology shock” is used 2 times in Jan 2008, then this counts as 1 of the 1591 tags used in this month; whereas, if “technology shock” is then used again on Feb 1st 2008, then I count this 3rd observation towards the total in February. Thus, the sum of the points in the time series will exceed 10{\scriptstyle \mathrm{k}}. Also, note that different articles can have identical subject tags.
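In pandas terms, this monthly counting convention is a group-by on calendar month followed by a count of unique tags. The sketch below uses made-up rows that mirror the “technology shock” example (the column names are hypothetical).

```python
import pandas as pd

# Hypothetical long-format table with one row per (article, subject tag) pair.
tags = pd.DataFrame({
    "date":    pd.to_datetime(["2008-01-05", "2008-01-20", "2008-02-01"]),
    "subject": ["technology shock", "technology shock", "technology shock"],
})

# Unique tags per month: repeated uses within a month count once, but the same
# tag appearing in a later month counts again toward that month's total.
tags["month"] = tags["date"].dt.to_period("M")
tags_per_month = tags.groupby("month")["subject"].nunique()
print(tags_per_month)   # Jan 2008: 1 unique tag, Feb 2008: 1 unique tag
```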

plot--wsj-subjects-about-sp500-companies-per-month--18jul2014

As shown in the figure below, the distribution of subject tags used to describe articles about each company is less skewed than the actual article count for each company. There are 179 different subject tags used in the 21 articles about the median S&P 500 company during the sample period. The most tagged companies have 10 times as many subjects as the median firm; whereas, the most written about companies are referenced in 100 times as many articles as the median firm.

plot--subject-tags-per-firm--18jul2014

4. Articles per Tag

In order for the distribution of tags per company to be less skewed than the distribution of articles per company, it’s got to be the case that some tags are used in lots of articles. This is exactly what’s going on in the data. The figure below shows that the median subject tag is used in only 3 articles and the bottom 25{\scriptstyle \%} of tags are used in only 1 article; however, the top 1{\scriptstyle \%} of tags are used in 466 articles or more. e.g., there are roughly 100 tags out of the 10{\scriptstyle \mathrm{k}} unique subject tags in my data set that are used 500 times or more. Likewise, there are well over 3000 that are used only once!

plot--articles-per-subject-tag--18jul2014

This fact strongly supports the intuition that companies, even huge companies like those in the S&P 500, are constantly hit with new and different shocks. Traders have to figure out which aspect of the company matters. This is clearly not an easy problem to solve. Lots of ideas are thrown around. Many of them must be either short-lived or wrong. Roughly 1 out of every 4 topics worth discussing is only worth discussing once.

5. Coverage Depth

I conclude this post by looking at the variation in the number of subject tags across firms with a similar number of articles. e.g., I want to know if there are pairs of firms which journalists spend roughly the same amount of time talking about, but which get covered in very different ways. It turns out there are. The Garmin and Sprint example from the introduction is one such case. The figure below shows that there are many more. i.e., it shows that companies that are referenced in more articles also have more subject tag descriptors, but conditional on the number of articles there is still a lot of variation. The plot is on a \log_{10} \times \log_{10} scale, so a 1 tick vertical movement means a factor of 10 difference between the number of tags for 2 firms with similar article counts. Looking at the figure, it’s clear that this sort of variation is the norm.

plot--articles-vs-subjects--18jul2014

Randomized Market Trials

1. Motivation

How much can traders learn from past price signals? It depends on what kinds of assets are being sold. Suppose that returns are (in part) a function of K = \Vert {\boldsymbol \alpha} \Vert_{\ell_0} different feature-specific shocks:

(1)   \begin{align*} r_n &= \sum_{q=1}^Q \alpha_q \cdot x_{n,q} + \epsilon_n \qquad \text{with} \qquad \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_{\epsilon}^2) \end{align*}

If {\boldsymbol \alpha} is identifiable, then different values of {\boldsymbol \alpha} have to produce different values of r_n. This is only the case if assets are sufficiently different from one another. e.g., consider the analogy to randomized control trials. In an RCT, randomizing which subjects get thrown in the treatment and control groups makes it exceptionally unlikely that, say, all the people in the treatment group will by chance happen to share some other common trait that actually explains their outcomes. Similarly, randomizing which assets get sold makes it exceptionally unlikely that 2 different choices of {\boldsymbol \alpha} and {\boldsymbol \alpha}' can explain the observed returns.

This post sketches a quick model relating this problem to housing prices. To illustrate, imagine N = 4 houses have sold at a discount in a neighborhood that looks like this:

tract-housing

The shock might reflect a structural change in the vacation home market whereby there is less disposable income to buy high-end units—i.e., a permanent shift. Alternatively, the shock might have been due to a couple of out-of-town second-home buyers needing to sell quickly—i.e., a transient effect. The houses in the picture above are all vacation homes of a similar quality with owners living in LA. Since there is so little variation across units, both these explanations are observationally equivalent. Thus, the asset composition affects how informative prices are in an important way. The main empirical prediction is that in places with less variation in housing amenities, there should be more price momentum since it’s harder to distinguish between noise and amenity-specific value shocks.

2. Toy Model

Suppose you’ve seen N sales in the area. Most of the prices looked just about right, but some of the houses sold for a bit more than you would have expected and some sold for a bit less than you would have expected. You’re trying to decide whether or not to buy the (N+1)th house if the transaction costs are \mathdollar c today:

(2)   \begin{align*} U &= \max_{\{\text{Buy},\text{Don't}\}} \left\{ \, \mathrm{E}\left[ r_{N+1} \right] - \frac{\gamma}{2} \cdot \mathrm{Var}\left[ r_{N+1} \right] - c, \, 0 \, \right\} \end{align*}

You will buy the house if your risk adjusted expectation of its future returns exceeds the transaction costs, \mathrm{E}[r_{N+1}] - \sfrac{\gamma}{2} \cdot \mathrm{Var}[r_{N+1}] \geq c.

This problem hinges on your ability to estimate {\boldsymbol \alpha}. What’s the best you could ever hope to do? Well, suppose you knew which K features mattered ahead of time and the elements of \mathbf{X} were given by x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sfrac{1}{K}). In this setting, your average estimation error per relevant feature is given by:

(3)   \begin{align*} \Omega^\star = \mathrm{E}\left[ \, \frac{1}{K} \cdot \sum_{q=1}^Q \left( \widehat{\alpha}_q - \alpha_q \right)^2 \, \right] &= \frac{K \cdot \sigma_{\epsilon}^2}{N} \end{align*}

i.e., it’s as if you ran an OLS regression of the N price changes on the K relevant columns of \mathbf{X}. You will buy the house if:

(4)   \begin{align*} \mathbf{x}_{N+1}^{\top} \widehat{\boldsymbol \alpha} - \frac{\gamma}{2} \cdot \left( \frac{K + N}{N}  \right) \cdot \sigma_{\epsilon}^2 &\geq c \end{align*}

In the real world, however, you generally don’t know which K features are important ahead of time and each house’s amenities are not taken as an iid draw. Instead, you must solve an \ell_1-type inference problem:

(5)   \begin{align*} \widehat{\boldsymbol \alpha} &= \arg \min_{\boldsymbol \alpha} \sum_{n=1}^N \left( r_n - \mathbf{x}_n^{\top} {\boldsymbol \alpha} \right)^2 \qquad \text{s.t.} \qquad \left\Vert {\boldsymbol \alpha} \right\Vert_{\ell_1} \leq \lambda \cdot \sigma_{\epsilon} \end{align*}

with a correlated measurement matrix, \mathbf{X}, using something like LASSO. In this setting, you face feature selection risk. i.e., you might focus on the wrong causal explanation for the past price movements. If \Omega^{\perp} denotes your estimation error when the elements x_{n,q} are drawn independently and \Omega denotes your estimation error in the general case when \rho(x_{n,q},x_{n',q}) \neq 0, then:

(6)   \begin{align*} \Omega^{\star} \leq \Omega^{\perp} \leq \Omega \end{align*}

Since your estimate of \widehat{\boldsymbol \alpha} is unbiased, feature selection risk will simply increase \mathrm{Var}[r_{N+1}] making it less likely that you will buy the house in this stylized model:

(7)   \begin{align*} \mathbf{x}_{N+1}^{\top} \widehat{\boldsymbol \alpha} - \frac{\gamma}{2} \cdot \left( K \cdot \Omega + \sigma_{\epsilon}^2 \right) &\geq c \end{align*}

More generally, it will make prices slower to respond to shocks and allow for momentum.
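As a rough illustration of the estimation problem in Equation (5), the sketch below simulates a sparse {\boldsymbol \alpha} with independent features and fits it with scikit-learn’s LASSO, which solves the penalized (Lagrangian) counterpart of the constrained \ell_1 problem; all parameter choices here are illustrative rather than calibrated.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, Q, K = 100, 200, 3                  # past sales, candidate features, true sparsity
sigma_eps = 0.05

alpha_true = np.zeros(Q)
alpha_true[:K] = 0.5                   # illustrative feature-specific shocks

X = rng.standard_normal((N, Q)) / np.sqrt(N)   # roughly unit-norm, independent columns
r = X @ alpha_true + sigma_eps * rng.standard_normal(N)

# sklearn's Lasso solves the penalized form of the constrained l1 problem in
# Equation (5); the penalty weight below is an illustrative choice, not tuned.
fit = Lasso(alpha=0.003, fit_intercept=False).fit(X, r)
alpha_hat = fit.coef_

support = np.flatnonzero(alpha_hat)             # features the fit "blames"
error = np.sum((alpha_hat - alpha_true) ** 2)   # realized estimation error
print(support, round(error, 4))
```

With independent columns the selected support typically concentrates on the true features; re-running the sketch after making the columns of \mathbf{X} highly correlated is a quick way to see the feature selection risk described above.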

3. Matrix Coherence

Feature selection risk is worst when assets all have really correlated features. Let \mathbf{X} denote the (N \times Q)-dimensional measurement matrix containing all the features of the N houses that have already sold in the market:

(8)   \begin{align*} \mathbf{X} &= \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,Q} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,Q} \\ \vdots  & \vdots  & \ddots & \vdots  \\ x_{N,1} & x_{N,2} & \cdots & x_{N,Q} \\ \end{bmatrix} \end{align*}

Each row represents all of the features of the nth house, and each column represents the level to which the N assets display a single feature. Let \widetilde{\mathbf{x}}_q denote a unit-normed column from this measurement matrix:

(9)   \begin{align*} \widetilde{\mathbf{x}}_q &= \frac{\mathbf{x}_q}{\sqrt{\sum_{n=1}^N x_{n,q}^2}} \end{align*}

I use a measure of the coherence of \mathbf{X} to quantify the extent to which all of the assets in a market have similar features:

(10)   \begin{align*} \mu(\mathbf{X}) &= \max_{q \neq q'} \left\vert \left\langle \widetilde{\mathbf{x}}_q, \widetilde{\mathbf{x}}_{q'} \right\rangle \right\vert \end{align*}

e.g., the coherence of a matrix with x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sfrac{1}{N}) is roughly \sqrt{2 \cdot \log(Q)/N} corresponding to the red line in the figure below. As the correlation between elements in the same column increases, the coherence increases since different terms in the above cross-product are less likely to cancel out.
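Computing \mu(\mathbf{X}) from Equations (9) and (10) takes only a few lines. The sketch below (my own Python/numpy illustration) normalizes the columns, takes the largest off-diagonal entry of the resulting Gram matrix in absolute value, and compares it to the \sqrt{2 \cdot \log(Q)/N} reference level for an independent Gaussian matrix.

```python
import numpy as np

def coherence(X):
    """mu(X) = max over q != q' of |<x_q_tilde, x_q'_tilde>| (Equations (9)-(10))."""
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)   # unit-norm each column
    G = np.abs(Xn.T @ Xn)                               # all pairwise inner products
    np.fill_diagonal(G, 0.0)                            # ignore the q = q' terms
    return G.max()

# Independent Gaussian features: the coherence is on the order of sqrt(2 log(Q)/N),
# the reference line in the figure below.
rng = np.random.default_rng(0)
N, Q = 500, 100
X = rng.standard_normal((N, Q)) / np.sqrt(N)
print(round(coherence(X), 3), round(np.sqrt(2 * np.log(Q) / N), 3))
```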

plot--mutual-coherence-gaussian-matrix--15jul2014

4. Selection Risk

There is a tight link between the severity of the selection risk and how correlated asset features are. Specifically, Ben-Haim, Eldar, and Elad (2010) show that if

(11)   \begin{align*} \alpha_{\min} \cdot \left( 1 - \{2 \cdot K - 1\} \cdot \mu(\mathbf{X}) \right) &\geq 2 \cdot \sigma_{\epsilon} \cdot \sqrt{2 \cdot (1 + \xi) \cdot \log(Q)} \end{align*}

for some \xi > 0, then:

(12)   \begin{align*} \sum_{q=1}^Q \left( \widehat{\alpha}_q - \alpha_q \right)^2 &\leq \frac{2 \cdot (1 + \xi)}{(1 - (K-1)\cdot \mu(\mathbf{X}))^2} \times K \cdot \sigma_{\epsilon}^2 \cdot \log(Q) = \Omega \end{align*}

with probability at least:

(13)   \begin{align*} 1 - Q^{-\xi} \cdot \left( \, \pi \cdot (1 + \xi) \cdot \log(Q) \, \right)^{-\sfrac{1}{2}} \end{align*}

where \alpha_{\min} = \min_{q \in \mathcal{K}} |\alpha_q|. Let’s plug in some numbers. If \alpha_{\min} = 0.10 and \sigma_{\epsilon} = 0.05, then the result means that \Vert \widehat{\boldsymbol \alpha} - {\boldsymbol \alpha} \Vert_{\ell_2}^2 is less than 0.185 \times K \cdot \log(Q) with probability \sfrac{3}{4}.

There are a couple of things worth pointing out here. First, the recovery bounds only hold when \mathbf{X} is sufficiently incoherent:

(14)   \begin{align*} \mu(\mathbf{X}) < \frac{1}{2 \cdot K - 1} \end{align*}

i.e., when the assets are too similar, we can’t learn anything concrete about which amenity-specific shocks are driving the returns. Second, the free parameter \xi > 0 links the probability of seeing an error rate outside the bounds, p, to the number of amenities that houses have:

(15)   \begin{align*} \xi &\approx \frac{\log(\sfrac{1}{p}) - \frac{1}{2} \cdot \log\left[ \pi \cdot \log Q \right]}{\sfrac{1}{2} + \log(Q)} \end{align*}

If you want to lower this probability, you need to either use a larger value of \xi or decrease the number of amenities. For \xi large enough we can effectively regard the error bound as the variance. Importantly, this quantity is increasing in the coherence of the measurement matrix. i.e., when assets are more similar, I am less sure that I am drawing the correct conclusion from past returns.

5. Empirical Predictions

The main empirical prediction is that in places with less variation in housing amenities, there should be more price momentum since it’s harder to distinguish between noise and amenity-specific value shocks. e.g., imagine studying the price paths of 2 neighborhoods, A and B, which have houses of the exact same value, \mathdollar v. In neighborhood A, each of the houses has a very different collection of amenities whose values sum to \mathdollar v; whereas, in neighborhood B, each of the houses has the exact same amenities whose values sum to \mathdollar v. e.g., you can think about neighborhood A as pre-war and neighborhood B as tract housing. The theory says that the price of houses in neighborhood B should respond more slowly to amenity-specific value shocks because houses have more correlated amenities—i.e., \Omega is larger. As a result, home prices in neighborhood B should also display more momentum… though this is not in the toy model above.

Notes: Ang, Hodrick, Xing, and Zhang (2006)

1. Introduction

In this post I work through the main results in Ang, Hodrick, Xing, and Zhang (2006) which shows not only that i) stocks with more exposure to changes in aggregate volatility have lower average excess returns, but also that ii) stocks with more idiosyncratic volatility relative to the Fama and French (1993) 3 factor model have lower excess returns. The first result is consistent with existing asset pricing theories; whereas, the second result is at odds with almost any mainstream asset pricing theory you might write down. Idiosyncratic risk should not be priced. This paper together with Campbell, Lettau, Malkiel, and Xu (2001) (see my earlier post) set off an investigation into the role of idiosyncratic risk in determining asset prices. One possibility is that idiosyncratic risk is just a proxy for exposure to aggregate risk. i.e., perhaps it’s the firms with the highest exposure to aggregate return volatility that also have the highest idiosyncratic volatility. Interestingly, Ang, Hodrick, Xing, and Zhang (2006) show that this is not the case via a double sort on both aggregate and idiosyncratic volatility exposure giving evidence that these are 2 separate risk factors. The code I use to replicate the results in Ang, Hodrick, Xing, and Zhang (2006) and create the figures can be found here.

2. Theoretical Motivation

The discount factor view of asset pricing says that:

(1)   \begin{align*} 0 = \mathrm{E}[m \cdot r_n] \quad \text{for all } n=1,2,\ldots,N \end{align*}

where \mathrm{E}(\cdot) denotes the expectation operator, m denotes the stochastic discount factor, and r_n denotes asset n‘s excess return. Equation (1) reads: “In the absence of margin requirements and transactions costs, it costs you \mathdollar 0 today to borrow at the riskless rate, buy a stock, and hold the position for 1 period.” Asset pricing theories explain why average excess returns, \mathrm{E}[r_n], vary across assets even though they all have the same price today by construction (see my earlier post).

Suppose each asset’s excess returns are a function of a risk factor x, \mathrm{R}_n(x), and noise, z_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_z^2):

(2)   \begin{align*} r_n  &= \mathrm{R}_n(x) + z_n \\ &= \mathrm{R}_n(\mu_x) + \mathrm{R}_n'(\mu_x) \cdot (x - \mu_x) + \frac{1}{2} \cdot \mathrm{R}_n''(\mu_x) \cdot (x - \mu_x)^2 + \text{``h.o.t.''} + z_n \\ &\approx \alpha_n + \beta_n \cdot (x - \mu_x) + \frac{\gamma_n}{2} \cdot (x - \mu_x)^2 + z_n \end{align*}

where I assume for simplicity that the only risk factor is the value-weighted excess return on the market so that \mu_x \approx 6{\scriptstyle \%/\mathrm{yr}} and \sigma_x \approx 16{\scriptstyle \%/\mathrm{yr}}. I use a Taylor expansion to linearize the function \mathrm{R}_n(x) around the point x = \mu_x and assume \mathrm{O}(x - \mu_x)^3 terms are negligible so \mathrm{E}[r_n] = \alpha_n + \sfrac{\gamma_n}{2} \cdot \sigma_x^2 and \mathrm{Var}[r_n] = \beta_n^2 \cdot \sigma_x^2 + \sigma_z^2. This means that if the excess return on the market is \sfrac{16{\scriptstyle \%}}{\sqrt{252}} \approx 1{\scriptstyle \%/\mathrm{day}} larger than expected, then asset n‘s realized excess return will be \beta_n{\scriptstyle \%} larger.

Any asset pricing theory says that each asset’s expected excess return should be proportional to how much the asset comoves with the risk factor, x:

(3)   \begin{align*} \mathrm{E}[r_n]  = \alpha_n + \frac{\gamma_n}{2} \cdot \sigma_x^2 =  \underbrace{\text{Constant} \times \beta_n}_{\text{Predicted}} \end{align*}

where the constant of proportionality, \text{Constant} = c \cdot (\sfrac{\mathrm{Var}[m]}{\mathrm{E}[m]}), depends on the exact asset pricing model. Equation (3) says that if you ran a regression of each stock’s excess returns on the aggregate risk factor:

(4)   \begin{align*} r_{n,t} = \widehat{\alpha}_n + \widehat{\beta}_n \cdot x_t + \mathit{Error}_{n,t} \end{align*}

then the estimated intercept for each stock should be:

(5)   \begin{align*} \widehat{\alpha}_n = \alpha_n + \frac{\gamma_n}{2} \cdot \sigma_x^2 - \beta_n \cdot \mu_x \end{align*}

Thus, each stock’s average excess returns may well be related to its exposure to aggregate volatility since \sigma_x shows up in the expression for \widehat{\alpha}_n; however, idiosyncratic volatility, \sigma_z, better not be priced since it shows up nowhere above.

3. Aggregate Volatility

Ang, Hodrick, Xing, and Zhang (2006) show that stocks with more exposure to aggregate volatility have lower average excess returns. i.e., that the coefficient \gamma_n < 0. The authors actually look at each stock’s exposure to changes in aggregate volatility. To see how this changes the math, consider rewriting the intercept above as:

(6)   \begin{align*} \widehat{\alpha}_n = \mathrm{A}_n(\Delta \sigma_x) &= \alpha_n + \frac{\gamma_n}{2} \cdot \left(\langle \sigma_x \rangle + \Delta \sigma_x \right)^2 \end{align*}

Using this formulation, we can look at how perturbing aggregate volatility around its long-run mean \langle \sigma_x \rangle by some small \Delta \sigma_x will impact the estimated intercept:

(7)   \begin{align*} \mathrm{A}_n(\Delta \sigma_x) &= \mathrm{A}_n(0) + \mathrm{A}_n'(0) \cdot \Delta \sigma_x + \cdots \\ &\approx \left[ \alpha_n + \frac{\gamma_n}{2} \cdot \langle\sigma_x\rangle^2 \right] + \gamma_n \cdot \langle\sigma_x\rangle \cdot \Delta \sigma_x \end{align*}

Since \langle \sigma_x \rangle > 0 by definition, (\sfrac{\gamma_n}{2}) \cdot \langle \sigma_x \rangle^2 and \gamma_n \cdot \langle \sigma_x \rangle will have the same sign. Thus, testing for whether exposure to changes in aggregate volatility is priced is tantamount to testing for whether exposure to aggregate volatility is priced.

The authors proceed in 5 steps. First, they calculate the changes in aggregate volatility time series using changes in the daily options implied volatility:

(8)   \begin{align*} \Delta \sigma_{x,d+1} = \mathit{VXO}_{d+1} - \mathit{VXO}_d \qquad \text{with} \qquad  \mathrm{E}[\Delta \sigma_{x,d+1}] = 0.01{\scriptstyle \%}, \, \mathrm{StD}[\Delta \sigma_{x,d+1}] = 2.65{\scriptstyle \%} \end{align*}

If the VXO is 4.33{\scriptstyle \%}, then options markets expect an annualized volatility of 4.33{\scriptstyle \%} for the S&P 100 over the next 30 calendar days. The authors use the VXO rather than the VIX because it has a longer time series dating back to 1986. The only difference between the 2 indices is that the VXO quotes the options implied volatility on the S&P 100; whereas, the VIX quotes the options implied volatility on the S&P 500. Daily changes in the 2 indices have a correlation of 0.81 over the sample period from January 1986 to December 2012 as shown in the figure below.

plot--vix-vs-vxo-daily-data--04may2014

Second, the authors compute each stock’s exposure to changes in aggregate volatility by running a regression for each stock n \in \{1,2,\ldots,N\} using the daily data in month (m-1):

(9)   \begin{align*} r_{n,d} = \widehat{\alpha}_n + \widehat{\beta}_{n} \cdot x_d + \widehat{\gamma}_{n} \cdot \Delta \sigma_{x,d} + \mathit{Error}_{n,d} \end{align*}

Estimated coefficients are related to underlying deep parameters by:

(10)   \begin{align*} \widehat{\alpha}_n &= \alpha_n + \frac{\gamma_n}{2} \cdot \langle \sigma_x \rangle^2 - \beta_n \cdot \mu_x \\ \widehat{\beta}_n &= \beta_n \\ \widehat{\gamma}_n &= \gamma_n \cdot \langle \sigma_x \rangle \end{align*}

The daily market excess return, x_d, is the excess return on the CRSP value-weighted market index. I include AMEX, NYSE, and NASDAQ stocks with \geq 17 daily observations in month (m-1) in my universe of N stocks.
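A per-stock, per-month version of the regression in Equation (9) is just an OLS fit on a constant, the market excess return, and the change in the VXO. The sketch below is my own Python/numpy illustration using simulated daily data rather than CRSP data; in the actual replication this calculation is repeated for every stock and every month.

```python
import numpy as np

def volatility_exposure(r_n, x, d_sigma):
    """OLS estimates of (alpha_n, beta_n, gamma_n) from Equation (9) using one
    month of daily data: r_{n,d} on a constant, x_d, and Delta sigma_{x,d}."""
    Z = np.column_stack([np.ones_like(x), x, d_sigma])
    coef, *_ = np.linalg.lstsq(Z, r_n, rcond=None)
    return coef                       # [alpha_hat, beta_hat, gamma_hat]

# Illustrative month of 21 trading days with made-up numbers.
rng = np.random.default_rng(0)
x = 0.0003 + 0.01 * rng.standard_normal(21)       # daily market excess returns
d_sigma = 0.0265 * rng.standard_normal(21)        # daily changes in the VXO
r_n = 0.0001 + 1.2 * x - 0.5 * d_sigma + 0.01 * rng.standard_normal(21)
alpha_hat, beta_hat, gamma_hat = volatility_exposure(r_n, x, d_sigma)
print(round(beta_hat, 2), round(gamma_hat, 2))
```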

plot--aggregate-volatility-portfolio-cumulative-returns

Third, the authors sort the N stocks satisfying the data constraints in month (m-1) into 5 value-weighted portfolios based on their estimated \widehat{\gamma}_{n}. Note that because the factor \langle \sigma_x \rangle is common to all stocks in month (m-1), this sort effectively organizes stocks by their true exposure to aggregate volatility, \gamma_n. For each portfolio j \in \{\text{L},2,3,4,\text{H}\} with j = \text{L} denoting the stocks with the lowest aggregate volatility exposure and j = \text{H} denoting the stocks with the highest aggregate volatility exposure, the authors then calculate the daily portfolio returns in month m. The figure above shows the cumulative returns to each of these 5 portfolios. It reads that if you invested \mathdollar 1 in the low aggregate volatility exposure portfolio in January 1986, then you would have over \mathdollar 200 more in December 2012 than if you had invested that same \mathdollar 1 in the high aggregate volatility exposure portfolio. What’s more, each portfolio’s exposure to the excess return on the market does not explain its performance. The figure below reports the estimated intercepts for each j \in \{\text{L},2,3,4,\text{H}\} from the regression:

(11)   \begin{align*} r_{j,m} = \widehat{\alpha}_j + \widehat{\beta}_j \cdot x_m + \mathit{Error}_{j,m} \end{align*}

and indicates that abnormal returns are decreasing in the portfolio’s exposure to aggregate volatility.

plot--ahxz06-table-1--capm-alphas

Fourth, in order to test whether the spread in portfolio abnormal returns is actually explained by contemporaneous exposure to aggregate volatility, the authors then create an aggregate volatility factor mimicking portfolio. They estimate the regression below using the daily excess returns on each of the 5 aggregate volatility exposure portfolios in each month m:

(12)   \begin{align*} \Delta \sigma_{x,d} = \widehat{\kappa} + \sum_{j=\text{L}}^{\text{H}} \widehat{\lambda}_{j} \cdot r_{j,d} + \mathit{Error}_d \end{align*}

and store the parameter estimates for \begin{bmatrix} \widehat{\lambda}_1 & \widehat{\lambda}_2 & \widehat{\lambda}_3 & \widehat{\lambda}_4 & \widehat{\lambda}_5 \end{bmatrix}^{\top}. They then define the factor mimicking portfolio return at daily horizon in month m as:

(13)   \begin{align*}  f_d = \sum_{j=\text{L}}^{\text{H}} \widehat{\lambda}_{j} \cdot r_{j,d} \end{align*}
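Mechanically, Equations (12) and (13) amount to one more OLS regression per month followed by a weighted sum of the 5 portfolio returns. The sketch below is my own Python/numpy illustration with simulated daily data standing in for the actual portfolio returns and VXO changes.

```python
import numpy as np

def mimicking_portfolio(d_sigma, R):
    """Regress Delta sigma_{x,d} on the 5 portfolio returns (Equation (12)) and
    return the weights lambda_hat plus the fitted factor f_d (Equation (13))."""
    Z = np.column_stack([np.ones(len(d_sigma)), R])     # constant + r_{L,d}, ..., r_{H,d}
    coef, *_ = np.linalg.lstsq(Z, d_sigma, rcond=None)
    lam = coef[1:]                                      # drop the intercept kappa_hat
    return lam, R @ lam

# Illustrative daily data for one month: 21 days x 5 volatility-sorted portfolios.
rng = np.random.default_rng(0)
R = 0.01 * rng.standard_normal((21, 5))
d_sigma = R @ np.array([-0.3, -0.1, 0.0, 0.2, 0.4]) + 0.02 * rng.standard_normal(21)
lam_hat, f = mimicking_portfolio(d_sigma, R)
print(np.round(lam_hat, 2), f.shape)
```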

The figure below plots the factor mimicking portfolio returns against the underlying changes in aggregate volatility at the monthly level. The 2 data series line up relatively closely; however, the factor mimicking portfolio is much too volatile during crises such as Black Monday in 1987.

plot--aggregate-volatility-factor

Fifth and finally, the authors check whether or not each of the 5 aggregate volatility exposure portfolios’ returns is positively correlated with contemporaneous movements in the aggregate volatility factor mimicking portfolio at the monthly horizon. To do this, they cumulate the daily excess returns on the factor mimicking portfolio and the aggregate volatility exposure sorted portfolios to get monthly returns:

(14)   \begin{align*} f_m &= \sum_{d=1}^{22} f_d \\ r_{j,m} &= \sum_{d=1}^{22} r_{j,d} \quad \text{for all } j \in \{\text{L},2,3,4,\text{H}\} \end{align*}

Then, they run the regression below at a monthly horizon over the full sample:

(15)   \begin{align*} r_{j,m} = \widehat{\zeta}_j + \widehat{\eta}_j \cdot x_m  + \widehat{\theta}_j \cdot f_m + \mathit{Error}_{j,m} \end{align*}

I report the estimated \widehat{\theta}_j coefficients in the figure below. Consistent with the idea that exposure to aggregate volatility is driving the disparate excess returns of the 5 test portfolios, I find that each portfolio loads positively on monthly movements in the factor mimicking portfolio.

plot--ahxz06-table-1--factor-loadings

4. Idiosyncratic Volatility

Ang, Hodrick, Xing, and Zhang (2006) also show that stocks with more idiosyncratic volatility have lower average excess returns. This should not be true under the standard theory outlined in Section 2 above. To measure idiosyncratic volatility, the authors run the regression below at the daily level in month (m-1) for each stock n = 1,2,\ldots,N:

(16)   \begin{align*} r_{n,d} = \widehat{\alpha}_n + \widehat{\boldsymbol \beta}_n^{\top} \cdot \mathbf{x}_d + \mathit{Error}_{n,d} \end{align*}

where the risk factors are the excess return on the value weighted market portfolio, the excess return on a size portfolio, and the excess return on a value portfolio as dictated by Fama and French (1993):

(17)   \begin{align*} \mathbf{x}_d^{\top} = \begin{bmatrix} r_{\mathrm{Mkt},d} & r_{\mathrm{SmB},d} & r_{\mathrm{HmL},d} \end{bmatrix} \end{align*}

For each stock listed on the AMEX, NYSE, or NASDAQ stock exchange with \geq 17 daily observations in month (m-1), the authors then calculate the measure of idiosyncratic volatility below:

(18)   \begin{align*} \sigma_{z,n} &= \mathrm{StD}[\mathit{Error}_{n,d}] \end{align*}
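The idiosyncratic volatility measure in Equation (18) is just the standard deviation of the residuals from the 3 factor regression in Equation (16). Here is a hedged Python/numpy sketch with simulated daily data in place of the actual CRSP and Fama-French series (the degrees-of-freedom adjustment is one common choice, not necessarily the authors’).

```python
import numpy as np

def idiosyncratic_volatility(r_n, factors):
    """Regress daily returns on the 3 Fama-French factors (Equation (16)) and
    return the standard deviation of the residuals (Equation (18))."""
    Z = np.column_stack([np.ones(len(r_n)), factors])   # constant + Mkt, SmB, HmL
    coef, *_ = np.linalg.lstsq(Z, r_n, rcond=None)
    resid = r_n - Z @ coef
    return resid.std(ddof=Z.shape[1])                   # adjust for estimated parameters

# Illustrative month: 21 days of made-up factor and stock returns.
rng = np.random.default_rng(0)
factors = 0.01 * rng.standard_normal((21, 3))           # [r_Mkt, r_SmB, r_HmL]
r_n = factors @ np.array([1.1, 0.3, -0.2]) + 0.02 * rng.standard_normal(21)
print(round(idiosyncratic_volatility(r_n, factors), 4)) # close to 0.02
```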

plot--idiosyncratic-volatility-portfolio-cumulative-returns

The authors sort the N stocks satisfying the data constraints in month (m-1) into 5 value-weighted portfolios based on their estimated \sigma_{z,n} values. The figure above reports the cumulative returns to these 5 test portfolios. The figure reads that if you invested \mathdollar 1 in the low idiosyncratic volatility portfolio in January 1963, then you would have over \mathdollar 100 more in December 2012 than if you had invested in the high idiosyncratic volatility portfolio. The figure below reports the estimated abnormal returns, \widehat{\alpha}_j, for each of the idiosyncratic volatility portfolios over the full sample and confirms that the poor performance of the high idiosyncratic volatility portfolio cannot be explained by exposure to common risk factors.

plot--ahxz06-table-6--capm-alphas

5. Are They Related?

I conclude by discussing the obvious follow-up question: “Are these 2 phenomena related?” After all, it could be the case that the firms with the highest exposure to aggregate return volatility also have the highest idiosyncratic volatility and vice versa. Ang, Hodrick, Xing, and Zhang (2006) show that this is not the case via a double sort. i.e., they show that within each aggregate volatility exposure portfolio, the stocks with the lowest idiosyncratic volatility outperform the stocks with the highest idiosyncratic volatility. Similarly, they show that within each idiosyncratic volatility portfolio, the stocks with the lowest aggregate volatility exposure outperform the stocks with the highest aggregate volatility exposure. Thus, the motivation driving investors to pay a premium for stocks with high aggregate volatility exposure is different from the motivation driving investors to pay a premium for stocks with high idiosyncratic volatility.

plot--r2-portfolios--capm-alphas

Indeed, you can pretty much guess this fact from the cumulative return plots in Sections 3 and 4 where the red lines denoting the low exposure portfolios behave in completely different ways. e.g., the low aggregate volatility exposure portfolio returns behave more or less like the high aggregate volatility exposure portfolio returns but with a higher mean. By contrast, the low idiosyncratic volatility portfolio returns are a much different time series with dramatically less volatility. Interestingly, if the authors sort on total volatility in month (m-1) rather than idiosyncratic volatility, then the results are identical; however, the results do not carry through if you sort on R^2 in month (m-1). e.g., suppose you ran the same regression at the daily level in month (m-1) for each stock n = 1,2,\ldots,N:

(19)   \begin{align*} r_{n,d} = \widehat{\alpha}_n + \widehat{\boldsymbol \beta}_n^{\top} \cdot \mathbf{x}_d + \mathit{Error}_{n,d} \end{align*}

where the risk factors are the excess return on the value weighted market portfolio, the excess return on a size portfolio, and the excess return on a value portfolio as dictated by Fama and French (1993):

(20)   \begin{align*} \mathbf{x}_d^{\top} = \begin{bmatrix} r_{\mathrm{Mkt},d} & r_{\mathrm{SmB},d} & r_{\mathrm{HmL},d} \end{bmatrix} \end{align*}

Then, for each stock you computed the R^2 statistic measuring the fraction of the total variation in each stock’s excess returns that is explained by movements in the risk factors:

(21)   \begin{align*} R^2 &= 1 - \frac{\sum_{d=1}^{22}(r_{n,d} - \{\widehat{\alpha}_n + \widehat{\boldsymbol \beta}_n^{\top} \mathbf{x}_d\})^2}{\sum_{d=1}^{22}(r_{n,d} - \langle r_{n,d} \rangle)^2} \end{align*}

If you group stocks into 5 portfolios based on their R^2 over the previous month, the figure above shows that there is no monotonic spread in the abnormal returns. Thus, the idiosyncratic volatility results seem to be more about volatility and less about the explanatory power of the Fama and French (1993) factors.

Using the Cross-Section of Returns

1. Introduction

The empirical content of the discount factor view of asset pricing can all be derived from the equation below:

(1)   \begin{align*} 0 = \mathrm{E}[m \cdot r_n] \quad \text{for all } n=1,2,\ldots,N \end{align*}

where m denotes the prevailing stochastic discount factor and r_n denotes an asset’s excess return. Equation (1) reads: “In the absence of margin requirements and transactions costs, it costs you \mathdollar 0 today to borrow at the riskless rate, buy a stock, and hold the position for 1 period.” The question is then why average excess returns, \mathrm{E}[r_n], vary across the N assets even though they all have the same price today by construction.

The answer hinges on the behavior of the stochastic discount factor, m, in Equation (1). What is this thing? Everyone knows that it is better to have \mathdollar 1 today than \mathdollar 1 tomorrow, and the present value of an asset that pays out \mathdollar 1 tomorrow is called the discount factor. Sometimes important stuff will happen in the next 24 hours that changes how awesome it is to have an additional \mathdollar 1 tomorrow. As a result, the realized discount factor is a random variable each period (i.e., it follows a stochastic process). e.g., if agents have utility, \mathrm{U}_0 = \mathrm{E}_0 \sum_{t \geq 0} e^{-\rho \cdot t} \cdot c_t^{1-\theta}, then the stochastic discount factor is m = e^{-\rho - \theta \cdot \Delta \log c} and the stuff (i.e., risk factor) is changes in log consumption.

asset-pricing-theory

An asset pricing model is a machine which takes as inputs a) each agent’s preferences, b) each agent’s information, and c) a list of the relevant risk factors affecting how agents discount the future and produces a stochastic discount factor as its output. In this post, I show how to test an asset pricing model using the cross-section of asset returns. i.e., by linking how average excess returns vary across assets to each asset’s exposure to the risk factors governing the behavior of the stochastic discount factor.

2. Theoretical Predictions

The key to massaging Equation (1) into a form that can be taken to the data is noticing that for any 2 random variables u and v, the following identity holds:

(2)   \begin{equation*}  \mathrm{E}[u\cdot v] = \mathrm{Cov}[u,v] + \mathrm{E}[u] \cdot \mathrm{E}[v] \end{equation*}

Thus, if I let u denote the stochastic discount factor and v denote any of the N excess returns, I can link the expected excess return from holding an asset to its covariance with the stochastic discount factor:

(3)   \begin{align*} \mathrm{E}[r_n] &= \frac{\mathrm{Cov}[m, r_n]}{\mathrm{Var}[m]} \cdot \left( - \frac{\mathrm{Var}[m]}{\mathrm{E}[m]} \right) \end{align*}

The first term is dimensionless and represents the amount of exposure asset n has to the risk factor x. The second term has dimension \sfrac{1}{\Delta t}, is common across all assets, and represents the price of exposure to the risk factor x since it has the same units as the expected return \mathrm{E}[r_n]. Asset pricing theories say that each asset’s expected return should be proportional to the market-wide price of risk where the constant of proportionality is the asset’s “exposure” to that risk factor.

What does “exposure” mean here? To answer this question I need to put a bit more structure on the stochastic discount factor, m, and the excess return, r_n. I remain agnostic about which asset pricing model actually governs returns and which risk factors affect discount rates, but to avoid writing out lots of messy matrices I do assume that there is only a single factor, x, with \mathrm{E}[x] = \mu_x and \mathrm{Var}[x] = \sigma_x^2. I then write the stochastic discount factor as the sum of a function of x, \mathrm{M}(x), and some noise, y \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_y^2):

(4)   \begin{align*} m &= \mathrm{M}(x) + y \\ &= \mathrm{M}(\mu_x) + \mathrm{M}'(\mu_x) \cdot (x - \mu_x) + \frac{1}{2} \cdot \mathrm{M}''(\mu_x) \cdot (x - \mu_x)^2 + \text{``h.o.t.''} + y \\ &\approx \phi + \chi \cdot (x - \mu_x) + \frac{\psi}{2} \cdot (x - \mu_x)^2 + y \end{align*}

where I use a Taylor expansion to linearize the function \mathrm{M}(x) around the point x = \mu_x and assume terms of order \mathrm{O}(x - \mu_x)^3 are negligible so that \mathrm{E}[m] = \phi + \sfrac{\psi}{2} \cdot \sigma_x^2 and \mathrm{Var}[m] = \chi^2 \cdot \sigma_x^2 + \sigma_y^2. This means that if the risk factor is \sigma_x larger than expected, (x - \mu_x) = \sigma_x, then agents value having an additional \mathdollar 1 tomorrow \chi \cdot \sigma_x more than usual. Similarly, suppose each excess return is the sum of an asset-specific function of x, \mathrm{R}_n(x), and some asset-specific noise, z_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_z^2):

(5)   \begin{align*} r_n  &= \mathrm{R}_n(x) + z_n \\ &= \mathrm{R}_n(\mu_x) + \mathrm{R}_n'(\mu_x) \cdot (x - \mu_x) + \frac{1}{2} \cdot \mathrm{R}_n''(\mu_x) \cdot (x - \mu_x)^2 + \text{``h.o.t.''} + z_n \\ &\approx \alpha_n + \beta_n \cdot (x - \mu_x) + \frac{\gamma_n}{2} \cdot (x - \mu_x)^2 + z_n \end{align*}

where I use a Taylor expansion to linearize the function \mathrm{R}_n(x) around the point x = \mu_x and assume \mathrm{O}(x - \mu_x)^3 terms are negligible so that \mathrm{E}[r_n] = \alpha_n + \sfrac{\gamma_n}{2} \cdot \sigma_x^2 and \mathrm{Var}[r_n] = \beta_n^2 \cdot \sigma_x^2 + \sigma_z^2. This means that if the risk factor is \sigma_x larger than expected, (x - \mu_x) = \sigma_x, then asset n‘s realized excess returns will be \beta_n \cdot \sigma_x larger than average.

Plugging Equations (4) and (5) into Equation (3) then shows exactly what “exposure” to the risk factor means:

(6)   \begin{equation*} \begin{split} \mathrm{E}[r_n] &= \frac{\mathrm{Cov}[m,r_n]}{\mathrm{Var}[m]} \cdot \left( - \, \frac{\mathrm{Var}[m]}{\mathrm{E}[m]} \right) \\ &= \frac{\chi \cdot \beta_n \cdot \sigma_x^2}{\chi^2 \cdot \sigma_x^2 + \sigma_y^2} \cdot \left( - \, \frac{\chi^2 \cdot \sigma_x^2 + \sigma_y^2}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} \right) \\ &= - \, \left( \frac{\chi \cdot \sigma_x^2}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} \right) \cdot \beta_n \\ &= \text{Constant} \times \beta_n \end{split} \end{equation*}

Each asset’s exposure to the risk factor x is summarized by the coefficient \beta_n. Assets which have higher realized returns when the risk factor is high (have a large \beta_n) will have lower average returns (high prices) since these assets are good hedges against the risk factor. i.e., these assets look like insurance. Equation (1)’s empirical content is then that an asset’s average excess returns, \langle r_n \rangle, is proportional to its exposure to the risk factor, \beta_n, where the constant of proportionality is the same for all assets:

(7)   \begin{align*} \mathrm{E}[r_n]  = \underbrace{\alpha_n + \frac{\gamma_n}{2} \cdot \sigma_x^2}_{\text{Realized } \langle r_n \rangle} =  \underbrace{- \, \left( \frac{\chi \cdot \sigma_x^2}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} \right) \cdot \beta_n}_{\text{Predicted}} \end{align*}

By letting y,z_n \searrow 0 we can interpret this relationship as a realization of the first Hansen-Jagannathan bound:

(8)   \begin{align*} \frac{\mathrm{StD}[m_{t+1}]}{\mathrm{E}[m_{t+1}]} = \frac{\chi \cdot \sigma_x}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} = \frac{\alpha_n + \frac{\gamma_n}{2} \cdot \sigma_x^2}{\beta_n \cdot \sigma_x} = \left| \frac{\mathrm{E}[r_{n,t+1}]}{\mathrm{StD}[r_{n,t+1}]} \right| \end{align*}

3. Empirical Strategy

To test Equation (7), an econometrician has to estimate (2 \cdot N + 2) unknown parameters:

(9)   \begin{align*} \widehat{\boldsymbol \theta} = \begin{bmatrix} \widehat{\mu}_x & \widehat{\alpha}_1 & \cdots & \widehat{\alpha}_N & \widehat{\beta}_1 & \cdots & \widehat{\beta}_N & \widehat{\lambda} \end{bmatrix}^{\top} \end{align*}

using T periods of observations. i.e., there are 2 parameters for each asset (its average excess return and its factor exposure) as well as 2 market-wide parameters (the mean of the risk factor and the market price of risk). There are (3 \cdot N + 1) moment conditions with which to estimate these parameters via GMM, so the system is over-identified whenever there are N > 1 assets:

(10)   \begin{align*} \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}  &= \mathrm{E}[\mathrm{G}(\widehat{\boldsymbol \theta};\mathbf{r}_t,x_t)] = \mathrm{E} \begin{bmatrix} x_t - \widehat{\mu}_x \\ r_{1,t} - \left\{ \widehat{\alpha}_1 + \widehat{\beta}_1 \cdot (x_t - \widehat{\mu}_x) \right\} \\ \vdots \\ r_{N,t} - \left\{ \widehat{\alpha}_N + \widehat{\beta}_N \cdot (x_t - \widehat{\mu}_x) \right\} \\ \left( r_{1,t} - \left\{ \widehat{\alpha}_1 + \widehat{\beta}_1 \cdot (x_t - \widehat{\mu}_x) \right\} \right) \cdot (x_t - \widehat{\mu}_x) \\ \vdots \\ \left( r_{N,t} - \left\{ \widehat{\alpha}_N + \widehat{\beta}_N \cdot (x_t - \widehat{\mu}_x) \right\} \right) \cdot (x_t - \widehat{\mu}_x) \\ r_{1,t} - \widehat{\beta}_1 \cdot \widehat{\lambda} \\ \vdots \\ r_{N,t} - \widehat{\beta}_N \cdot \widehat{\lambda} \end{bmatrix} \end{align*}

The first equation pins down the mean of the factor x. The following (2 \cdot N) equations identify the \{\widehat{\alpha}_n,\widehat{\beta}_n\}_{n=1}^{N} parameters governing the relationship between the risk factor and each asset’s excess returns. The final N equations pin down the market price of risk, \widehat{\lambda}, for exposure to the risk factor x. A risk is “priced” if \widehat{\lambda} \neq 0.
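Here is a minimal sketch of how you might take these moment conditions to data in Python/NumPy. It is a two-step sample analog rather than full one-step GMM: the first (2 \cdot N + 1) conditions are exactly identified, so they are solved by time-series means and OLS slopes, and the last N over-identified conditions are fit with an identity weighting matrix, which makes \widehat{\lambda} a cross-sectional regression of average excess returns on the estimated betas without an intercept. The simulated data (sample size, betas, noise volatilities) are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
T, N = 600, 10
mu_x, sigma_x, sigma_z, lam = 0.0, 0.2, 0.1, 0.05
beta = rng.uniform(0.5, 1.5, N)                # true factor exposures

x = rng.normal(mu_x, sigma_x, T)               # risk factor
z = rng.normal(0.0, sigma_z, (T, N))           # asset-specific noise
r = beta * lam + np.outer(x - mu_x, beta) + z  # T x N excess returns with E[r_n] = beta_n * lam

# Sample analogs of the moment conditions in Equation (10)
mu_hat = x.mean()                              # first condition
x_dm = x - mu_hat
beta_hat = x_dm @ (r - r.mean(axis=0)) / (x_dm @ x_dm)         # orthogonality conditions -> OLS slopes
alpha_hat = r.mean(axis=0) - beta_hat * x_dm.mean()            # = r.mean(axis=0) since x_dm has mean 0
lam_hat = (beta_hat @ r.mean(axis=0)) / (beta_hat @ beta_hat)  # last N conditions, identity weights

print(lam_hat)  # close to the true lam = 0.05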

Note that this empirical strategy doesn’t pin down every single one of the parameters governing the relationship between the stochastic discount factor and each asset’s excess returns. e.g., the parameter estimates \widehat{\alpha}_n and \widehat{\lambda} are composites of several deep parameters:

(11)   \begin{align*} \widehat{\alpha}_n &= \alpha_n + \frac{\gamma_n}{2} \cdot \sigma_x^2 \\ \widehat{\lambda} &= - \, \left( \frac{\chi \cdot \sigma_x^2}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} \right) \end{align*}

The underlying parameters \alpha_n and \gamma_n as well as \phi, \chi, and \psi are not separately identifiable from this approach since perturbations of them that satisfy the following conservation laws leave the estimates for \widehat{\alpha}_n and \widehat{\lambda} unchanged:

(12)   \begin{align*} \frac{\partial \widehat{\alpha}_n}{\partial \alpha_n} \cdot \Delta \alpha_n + \frac{\partial \widehat{\alpha}_n}{\partial \gamma_n} \cdot \Delta \gamma_n = 0 &= \Delta \alpha_n + \frac{\sigma_x^2}{2} \cdot \Delta \gamma_n \\ \frac{\partial \widehat{\lambda}}{\partial \phi} \cdot \Delta \phi + \frac{\partial \widehat{\lambda}}{\partial \chi} \cdot \Delta \chi + \frac{\partial \widehat{\lambda}}{\partial \psi} \cdot \Delta \psi = 0 &= \left( \frac{\chi}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} \right) \cdot \{\Delta \phi + \frac{\sigma_x^2}{2} \cdot \Delta \psi\} - \Delta \chi \end{align*}

e.g., if you increase \alpha_n by \epsilon \approx 0^+ and decrease \gamma_n by \frac{2}{\sigma_x^2} \cdot \epsilon, then the estimate of \widehat{\alpha}_n remains unchanged.
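Here is that perturbation checked numerically (arbitrary values for \sigma_x, \alpha_n, \gamma_n, and \epsilon):

# The perturbed (alpha_n, gamma_n) pair implies the same composite alpha_hat_n
sigma_x, alpha_n, gamma_n, eps = 0.2, 0.02, 0.30, 1e-3

alpha_hat_before = alpha_n + 0.5 * gamma_n * sigma_x ** 2
alpha_hat_after  = (alpha_n + eps) + 0.5 * (gamma_n - 2 * eps / sigma_x ** 2) * sigma_x ** 2

print(alpha_hat_before, alpha_hat_after)  # identical up to floating-point error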

4. Time Scale Considerations

There is a hidden assumption floating around behind the empirical strategy outlined in Section 3 above. Namely, that each asset’s factor exposure and the market price of risk are constant over time. In practice, this is surely not the case, as documented in Jagannathan and Wang (1996) and Lewellen and Nagel (2006). OK… so assuming constant factor exposures and prices of risk is an approximation. Fine. How good/bad an approximation is it? e.g., Fama and MacBeth (1973) use rolling T = 60 month windows to estimate each asset’s \widehat{\beta}_n. Is this too long a window relative to how much factor exposures vary over time? Alternatively, should we be using a longer window to more accurately pin down these parameters? It turns out that the estimation strategy gives some guidance about the relationship between the optimal estimation window and parameter persistence, which I discuss below.

First, I model the evolution of the true parameters. To test an asset pricing model using the cross-section of excess returns, we want to know whether or not the market price of risk is 0, i.e., whether \widehat{\lambda} is statistically distinguishable from 0. Suppose the true market price of risk starts out at \lambda and follows a random walk:

(13)   \begin{align*} \lambda_T = \lambda + \sum_{t=1}^T l_t \end{align*}

where l_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_l^2) so that the final \lambda_T is a random variable with distribution:

(14)   \begin{align*}  \lambda_T \sim \mathrm{N}(\lambda, T \cdot \sigma_l^2) \end{align*}

Second, I note that the estimation strategy outlined in Section 3 above gives a signal, \widehat{\lambda}, about the average market price of risk with distribution:

(15)   \begin{align*} \widehat{\lambda} \sim \mathrm{N}\left(\lambda, \sfrac{\sigma_s^2}{T}\right) \end{align*}

where s_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_s^2) denotes the per-period estimation error from the GMM procedure, so that \widehat{\lambda} = \lambda + \sfrac{1}{T} \cdot \sum_{t=1}^T s_t. There is an additional complication to consider. Namely, if the true market price of risk is floating around during the estimation period, this adds extra noise to the parameter estimates and increases \sigma_s^2. To keep things simple, suppose that nature sets the market price of risk to \lambda at the beginning of the estimation sample and that it remains constant during the estimation period. Then, \lambda_T is revealed at the end of period T and prevails afterwards. Because this simplification understates \sigma_s^2, the derivations below should be read as inequalities.

What I really care about is the distance between the true \lambda_T at the end of the sample, which governs the market going forward, and the GMM estimate \widehat{\lambda}. Thus, I should choose the sample length, T, to minimize the mean squared error (the cross term drops out of the expectation because the random-walk innovations, l_t, are independent of the estimation errors, s_t):

(16)   \begin{align*} T  = \arg \min_{T \geq 0} \mathrm{E}\left[ (\lambda_T - \widehat{\lambda})^2 \right] = \arg \min_{T \geq 0} \mathrm{E}\left[ (\lambda_T - \lambda)^2 + (\lambda - \widehat{\lambda})^2 \right] \end{align*}

The first term is just \mathrm{E}[(\lambda_T - \lambda)^2] = T \cdot \sigma_l^2, and the second term is the posterior variance of \lambda after updating on T noisy signals. As a result, to find the optimal T I take the first-order condition:

(17)   \begin{align*} 0 = \frac{d}{dT} \left[ T \cdot \sigma_l^2 + \left(\frac{1}{\sigma_{\lambda}^2} + \frac{T}{\sigma_s^2} \right)^{-1} \right] \end{align*}

where \sigma_{\lambda}^2 denotes the variance of my prior beliefs about \lambda, the market price of risk governing the estimation sample. The solution to this equation defines the window length, T, that optimally trades off the benefit of getting a more precise estimate of \lambda against the cost of that estimate becoming less relevant as \lambda_T drifts away.

The GMM procedure ties \sigma_s^2 to the parameters of the underlying model. To keep things simple, suppose there is only 1 asset and 4 unknown parameters:

(18)   \begin{align*} \widehat{\boldsymbol \theta} = \begin{bmatrix} \widehat{\mu}_x & \widehat{\alpha} & \widehat{\beta} & \widehat{\lambda} \end{bmatrix}^{\top} \end{align*}

so that the system of estimation equations reduces to:

(19)   \begin{align*} \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}  &= \mathrm{E}[\mathrm{G}(\widehat{\boldsymbol \theta};r_t,x_t)] = \mathrm{E} \begin{bmatrix} x_t - \widehat{\mu}_x \\ r_t - \left\{ \widehat{\alpha} + \widehat{\beta} \cdot (x_t - \widehat{\mu}_x) \right\} \\ \left( r_t - \left\{ \widehat{\alpha} + \widehat{\beta} \cdot (x_t - \widehat{\mu}_x) \right\} \right) \cdot (x_t - \widehat{\mu}_x) \\ r_t - \widehat{\beta} \cdot \widehat{\lambda} \end{bmatrix} \end{align*}

This assumption means that I don’t have to consider how learning about one asset affects my beliefs about another asset. In this world, if x_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(\mu_x,\sigma_x^2), then GMM reduces to OLS and \sigma_s^2 = \sfrac{\sigma_z^2}{\beta_n^2} since:

(20)   \begin{align*} r_{n,t} = \beta_n \cdot \lambda  + \beta_n \cdot (x_t - \mu_x) + z_{n,t} \end{align*}
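A small simulation makes the reduction to OLS concrete (Python/NumPy, made-up parameter values): solving the four sample moment conditions in Equation (19) amounts to taking the sample mean of x_t, running OLS of r_t on (x_t - \widehat{\mu}_x), and backing out \widehat{\lambda} from the last condition.

import numpy as np

rng = np.random.default_rng(1)
T = 10_000
mu_x, sigma_x, sigma_z = 0.0, 0.2, 0.1
beta_n, lam = 0.8, 0.05

x = rng.normal(mu_x, sigma_x, T)
z = rng.normal(0.0, sigma_z, T)
r = beta_n * lam + beta_n * (x - mu_x) + z          # Equation (20)

# Sample analogs of Equation (19)
mu_hat = x.mean()
x_dm = x - mu_hat
beta_hat = (x_dm @ (r - r.mean())) / (x_dm @ x_dm)  # OLS slope
alpha_hat = r.mean()                                # intercept, since x_dm has mean 0
lam_hat = r.mean() / beta_hat                       # last condition: E[r_t] = beta_hat * lam_hat

print(beta_hat, lam_hat)                            # close to the true 0.8 and 0.05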

Evaluating the first-order condition then gives:

(21)   \begin{align*} 0 = \sigma_l^2 - \left(\frac{1}{\sigma_{\lambda}^2} + \frac{T}{\sfrac{\sigma_z^2}{\beta_n^2}} \right)^{-2} \cdot \frac{1}{\sfrac{\sigma_z^2}{\beta_n^2}} \end{align*}

Solving for T yields:

(22)   \begin{align*} T &\geq \max\left\{ \, 0, \, \frac{\sigma_z}{\beta_n \cdot \sigma_l} - \frac{\sigma_z^2}{\beta_n^2 \cdot \sigma_{\lambda}^2} \, \right\} \end{align*}
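Here is a quick numeric cross-check of this solution (made-up values for \sigma_z, \beta_n, \sigma_l, and \sigma_{\lambda}): it compares the interior solution above against a brute-force grid search over the objective T \cdot \sigma_l^2 + (\sfrac{1}{\sigma_{\lambda}^2} + \sfrac{T}{\sigma_s^2})^{-1} with \sigma_s = \sfrac{\sigma_z}{\beta_n}.

import numpy as np

# Made-up values, purely for illustration
sigma_z, beta_n = 0.10, 0.80
sigma_l, sigma_lam = 0.005, 0.50

sigma_s = sigma_z / beta_n                           # per-period signal noise
T_closed = max(0.0, sigma_s / sigma_l - sigma_s ** 2 / sigma_lam ** 2)

T_grid = np.arange(0, 500)
mse = T_grid * sigma_l ** 2 + 1.0 / (1.0 / sigma_lam ** 2 + T_grid / sigma_s ** 2)
T_brute = T_grid[np.argmin(mse)]

print(T_closed, T_brute)  # both around T = 25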

Let’s plug in some values to make sure this formula makes sense. First, notice that if the market price of risk is constant, \lambda_T = \lambda, then \sigma_l = 0 and you should pick T = \infty or as large as possible. Second, notice that if you already know the true \lambda, then \sigma_{\lambda}^2 = 0 and you should pick T = 0. Finally, notice that if the test asset has no exposure to the risk factor, \beta_n = 0, then the equation is undefined since any window length gives you the same amount of information—i.e., none.