01

Evidence from US Equities

Passive investing now accounts for more than 60% of total US fund assets. As index-tracking vehicles mechanically allocate capital based on index membership rather than fundamental value, a growing body of literature questions whether this structural shift impairs the price discovery process that underpins market efficiency.

This thesis investigates two related but distinct channels through which passive dominance may distort equity markets: valuation inflation among index constituents, and the deterioration of return independence as correlation structures become increasingly driven by mechanical rebalancing rather than firm-specific information.

The empirical framework draws on over three decades of CRSP daily data, Compustat valuation ratios, and Morningstar fund flow data — covering 1,255 S&P 500 constituent firms across 8,311 trading days. All estimation is performed in Python using a reproducible, point-in-time pipeline with no look-ahead bias.

Theoretical anchors: the information acquisition paradox of Grossman & Stiglitz (1980) and the Less-Efficient Market Hypothesis (LEMH) of Asness (2024).

02

Two Research Questions

The thesis tests two primary hypotheses, each addressing a distinct dimension of market quality under passive dominance.

H1 — Price Distortion

Passive flows inflate valuations beyond fundamentals

Passive capital inflows decouple stock prices from fundamental value, driving P/E and P/B multiples of index constituents beyond what earnings growth explains. Index membership creates a valuation premium unrelated to firm quality.

H2 — Systematic Fragility

Passive investing increases correlation and concentration risk

Mechanical rebalancing increases return co-movement, price synchronicity, and correlation asymmetry — particularly during market downturns. This increases systemic risk across the index and reduces the portfolio diversification benefit for investors.

03

Four Primary Sources

The empirical analysis draws on four distinct data sources, spanning January 1992 to December 2024. All data assembly is performed in Python with point-in-time integrity enforced throughout — no data item is used before its public availability date.

Source | Provider | Content | Coverage
CRSP Daily Stock File | WRDS | Daily returns, prices, volume, and market cap for S&P 500 constituents and the large-cap comparison universe | 1992–2024
Compustat Fundamentals | WRDS | Quarterly firm-level P/E and P/B ratios via WRDS firm_ratio; winsorised at the 1st/99th percentile | 1992–2022
S&P 500 Constituent History | WRDS + GitHub | Annual snapshots 1992–2022 from WRDS; 2023–2024 extended via the fja05680/sp500 change log with a backward merge_asof join | 1992–2024
Morningstar Direct | Morningstar | Monthly active vs. passive net flows, total net assets, and organic growth rates at fund and category level | 1993–2024
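The 1st/99th percentile winsorisation noted for the Compustat ratios can be sketched as below. This is a minimal illustration on simulated data, not the thesis pipeline: the column names (date, pe, pe_w) are hypothetical, and clipping within each reporting date rather than over the pooled sample is an assumption, since the source does not say which was used.

```python
import numpy as np
import pandas as pd


def winsorise(values: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Clip values to the [lower, upper] quantiles of the non-missing data."""
    lo, hi = values.quantile([lower, upper])
    return values.clip(lower=lo, upper=hi)


# Hypothetical toy data: two reporting dates, 200 firms each.
rng = np.random.default_rng(0)
ratios = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-31", "2020-06-30"]).repeat(200),
    "pe": rng.normal(18.0, 5.0, size=400),
})
ratios.loc[0, "pe"] = 5000.0  # extreme P/E, e.g. from near-zero earnings

# Assumption: winsorise cross-sectionally within each date.
ratios["pe_w"] = ratios.groupby("date")["pe"].transform(winsorise)
```

Values inside the percentile band pass through unchanged; only the tails are pulled in to the band edges, so the outlier no longer dominates cross-sectional means.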

Constructed Datasets

Raw source data is assembled into four analysis-ready files:

Primary Panel
S&P 500 Constituent Panel
daily_panel.csv
Rows: 6,495,873
Cols: 13

Daily constituent-level panel. Includes closing price, return, shares outstanding, volume, market cap, and winsorised P/E and P/B. Covers all trading days 1992–2024 across 1,255 unique PERMNOs (498–507 constituents per day).

Comparison Universe
Non-Index Large-Cap Panel
crsp_nonindex.csv
Rows: 447,659
Cols: 12

Monthly panel of large-cap stocks outside the S&P 500. Used as the control group for difference-in-differences and cross-sectional tests, allowing isolation of index membership effects.

Membership History
Constituent Membership Log
sp500_membership.csv
Rows: 4,165,036
Cols: 4

Point-in-time panel recording daily index membership for each PERMNO. Built from 33 annual WRDS snapshots and supplemented with GitHub change-log data for 2023–2024. Zero orphaned in-index observations.

Fund Flows
Morningstar Flow Panel
morningstar.csv
Rows: 383
Cols: 11

Monthly active vs. passive net flows and total net assets from Morningstar Direct. The key regression variable is the organic growth rate (net flow / lagged AUM) rather than total AUM, which avoids scale distortion. Covers 1993–2024.
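As a small worked example of the organic growth rate (flow / lagged AUM), the sketch below divides each month's net flow by the previous month's total net assets. The figures and column names are hypothetical and serve only to show the calculation.

```python
import pandas as pd

flows = pd.DataFrame(
    {
        "month": pd.period_range("2024-01", periods=4, freq="M"),
        "passive_net_flow": [10.0, 12.0, -5.0, 8.0],       # hypothetical, USD bn
        "passive_tna": [1000.0, 1030.0, 1010.0, 1040.0],   # hypothetical, USD bn
    }
).set_index("month")

# Organic growth: this month's net flow scaled by last month's assets.
flows["passive_organic_growth"] = (
    flows["passive_net_flow"] / flows["passive_tna"].shift(1)
)
```

The first month has no lagged asset base, so its growth rate is missing by construction. Scaling by lagged rather than contemporaneous assets keeps the denominator free of the current month's flow and market return.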

04

Macroeconomic Controls

To isolate the effect of passive flows from broader macroeconomic conditions, six external control series are merged into the analysis panel. The key passive flow variable is orthogonalised against interest rates and active flows to address multicollinearity.

10-Year Treasury Yield (FRED): GS10; monthly, risk-free rate proxy
Fed Funds Rate (FRED): FEDFUNDS; monetary policy stance
CPI Inflation (FRED): CPIAUCSL; YoY % change, monthly
Real GDP Growth (FRED): A191RL1Q225SBEA; quarterly, interpolated
VIX (CBOE): daily volatility index, collapsed to monthly
Fama-French 5 Factors (Ken French): Mkt-RF, SMB, HML, RMW, CMA; monthly

The passive share variable is orthogonalised against both the 10-year Treasury yield and the active flow rate before entry into panel regressions, producing passive_share_orth. This removes the mechanical correlation between passive growth and the broader rate environment.
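One way to read this orthogonalisation step is as taking the residual from a regression of passive share on the two controls. The sketch below does this with a single joint OLS on simulated data. Whether the thesis uses a joint regression or two sequential ones is not stated, so the joint form is an assumption, as are the helper name orthogonalise and all simulated series.

```python
import numpy as np


def orthogonalise(target: np.ndarray, controls: np.ndarray) -> np.ndarray:
    """Return the OLS residual of target on an intercept plus the control columns."""
    X = np.column_stack([np.ones(len(target)), controls])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ beta


# Simulated monthly series standing in for the real data (hypothetical values).
rng = np.random.default_rng(42)
n = 384
gs10 = rng.normal(4.0, 1.5, n)           # 10-year Treasury yield
active_flow = rng.normal(0.0, 1.0, n)    # active flow rate
passive_share = 0.3 - 0.02 * gs10 - 0.01 * active_flow + rng.normal(0.0, 0.01, n)

passive_share_orth = orthogonalise(
    passive_share, np.column_stack([gs10, active_flow])
)
```

By construction the residual has zero sample correlation with both controls, so any coefficient later estimated on passive_share_orth cannot be picking up their linear influence.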

05

Econometric Framework

The analysis employs a suite of time-series and panel econometric methods, selected to address both the cointegration properties of passive share and valuation multiples, and the panel structure of firm-level data.

Time-Series Methods

OLS-HAC
FMOLS (Phillips & Hansen 1990)
DOLS
Error Correction Model (ECM)
Granger Causality via VAR
ADF / KPSS Unit Root Tests
Johansen Cointegration
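To illustrate the first method in this list, the sketch below computes OLS coefficients with Newey-West (Bartlett kernel) HAC standard errors in plain NumPy. It is an illustration on simulated data with AR(1) errors, not the thesis code; the function name ols_hac and the lag choice are assumptions. statsmodels offers this estimator through OLS(...).fit(cov_type="HAC", cov_kwds={"maxlags": ...}).

```python
import numpy as np


def ols_hac(y: np.ndarray, X: np.ndarray, maxlags: int):
    """OLS coefficients with Newey-West (Bartlett kernel) standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    scores = X * resid[:, None]          # row t holds x_t * u_t
    S = scores.T @ scores                # lag-0 (White) term
    for lag in range(1, maxlags + 1):
        weight = 1.0 - lag / (maxlags + 1.0)      # Bartlett weight
        gamma = scores[lag:].T @ scores[:-lag]    # lag-`lag` autocovariance of scores
        S += weight * (gamma + gamma.T)
    cov = XtX_inv @ S @ XtX_inv          # sandwich estimator
    return beta, np.sqrt(np.diag(cov))


# Simulated regression with autocorrelated errors (hypothetical data).
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.6 * e[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + e
X = np.column_stack([np.ones(n), x])

beta, se_hac = ols_hac(y, X, maxlags=6)
_, se_white = ols_hac(y, X, maxlags=0)   # maxlags=0 reduces to White (HC0) errors
```

With maxlags=0 the estimator collapses to heteroskedasticity-robust (HC0) errors; adding lags lets the standard errors reflect serial correlation in the residuals.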

Panel Methods

Panel Fixed Effects (firm + time FE)
Difference-in-Differences
PCA Variance Decomposition
Crisis Period Subsample Analysis
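The difference-in-differences idea can be shown in its simplest two-group, two-period form. The sketch below uses simulated data with hypothetical variable names and recovers the treatment effect as the interaction coefficient. The thesis's panel version would add firm and time fixed effects, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400
treated = rng.integers(0, 2, n)   # 1 = index constituent (hypothetical)
post = rng.integers(0, 2, n)      # 1 = observation after the event (hypothetical)
true_effect = 2.0
y = (
    10.0 + 1.5 * treated + 0.5 * post
    + true_effect * treated * post
    + rng.normal(0.0, 1.0, n)
)

# Regression form: the interaction coefficient is the DiD estimate.
X = np.column_stack([np.ones(n), treated, post, treated * post])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
did_estimate = coef[3]


def cell_mean(t: int, p: int) -> float:
    """Mean outcome for one treatment/period cell."""
    return y[(treated == t) & (post == p)].mean()


# Equivalent form: difference of before/after changes across the two groups.
did_by_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
```

Because the regression is saturated in the two dummies, the interaction coefficient and the difference of cell means coincide exactly, which is a useful check before moving to the fixed-effects specification.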

Implementation

All estimation is conducted in Python using statsmodels, linearmodels, and scipy. CRSP data is accessed via WRDS using SQLAlchemy with sqlalchemy.text() after resolving a psycopg2 incompatibility. Point-in-time merges use pd.merge_asof throughout to prevent look-ahead bias.
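A point-in-time merge with pd.merge_asof can be sketched as follows. The frames and column names (permno, public_date, pe) are hypothetical stand-ins rather than the thesis's actual schema. The point of the example is that a backward join attaches to each trading day only the most recent value already public on that day.

```python
import pandas as pd

# Hypothetical fundamentals, keyed by the date each figure became public.
fundamentals = pd.DataFrame({
    "permno": [10001, 10001],
    "public_date": pd.to_datetime(["2020-02-15", "2020-05-15"]),
    "pe": [15.0, 18.0],
})

# Hypothetical daily observations for the same firm.
daily = pd.DataFrame({
    "permno": [10001] * 4,
    "date": pd.to_datetime(["2020-02-14", "2020-02-18", "2020-05-14", "2020-05-15"]),
})

# Both frames must be sorted on their join keys for merge_asof.
panel = pd.merge_asof(
    daily.sort_values("date"),
    fundamentals.sort_values("public_date"),
    left_on="date",
    right_on="public_date",
    by="permno",
    direction="backward",   # only look back in time, never forward
)
```

The observation on 2020-02-14 gets no P/E because nothing was public yet, and the 2020-05-14 row still carries the February value. An ordinary calendar-quarter merge would have leaked the later figure into earlier days.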