Evidence from US Equities
Passive investing now accounts for more than 60% of total US fund assets. As index-tracking vehicles mechanically allocate capital based on index membership rather than fundamental value, a growing body of literature questions whether this structural shift impairs the price discovery process that underpins market efficiency.
This thesis investigates two related but distinct channels through which passive dominance may distort equity markets: valuation inflation among index constituents, and the deterioration of return independence as correlation structures become increasingly driven by mechanical rebalancing rather than firm-specific information.
The empirical framework draws on over three decades of CRSP daily data, Compustat valuation ratios, and Morningstar fund flow data — covering 1,255 S&P 500 constituent firms across 8,311 trading days. All estimation is performed in Python using a reproducible, point-in-time pipeline with no look-ahead bias.
Theoretical anchors: Grossman & Stiglitz (1980) on the information acquisition paradox, and Asness (2024) on the Less-Efficient Market Hypothesis (LEMH).
Two Research Questions
The thesis tests two primary hypotheses, each addressing a distinct dimension of market quality under passive dominance.
H1: Passive flows inflate valuations beyond fundamentals
Passive capital inflows decouple stock prices from fundamental value, driving P/E and P/B multiples of index constituents beyond what earnings growth explains. Index membership creates a valuation premium unrelated to firm quality.
H2: Passive investing increases correlation and concentration risk
Mechanical rebalancing increases return co-movement, price synchronicity, and correlation asymmetry, particularly during market downturns. This increases systemic risk across the index and reduces the portfolio diversification benefit for investors.
Four Primary Sources
The empirical analysis draws on four distinct data sources, spanning January 1992 to December 2024. All data assembly is performed in Python with point-in-time integrity enforced throughout — no data item is used before its public availability date.
| Source | Provider | Content | Coverage |
|---|---|---|---|
| CRSP Daily Stock File | WRDS | Daily returns, prices, volume, market cap for S&P 500 constituents and large-cap comparison universe | 1992–2024 |
| Compustat Fundamentals | WRDS | Quarterly P/E and P/B ratios at the firm level via WRDS firm_ratio; winsorised at 1st/99th percentile | 1992–2022 |
| S&P 500 Constituent History | WRDS + GitHub | Annual snapshots 1992–2022 from WRDS; 2023–2024 extended via fja05680/sp500 change log with backward merge_asof join | 1992–2024 |
| Morningstar Direct | Morningstar | Monthly active vs. passive net flows, total net assets, and organic growth rates at fund and category level | 1993–2024 |
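The winsorisation noted for the Compustat ratios can be illustrated with a minimal sketch. The helper name `winsorise` and the synthetic P/E values are hypothetical, and computing the cut-offs over the whole series (rather than per quarter or per cross-section) is an assumption, since the grouping is not specified here.

```python
import numpy as np
import pandas as pd


def winsorise(series: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Clip values outside the [lower, upper] quantiles to the quantile values."""
    lo, hi = series.quantile([lower, upper])
    return series.clip(lower=lo, upper=hi)


# Hypothetical P/E values with two extreme observations.
rng = np.random.default_rng(0)
pe = pd.Series(rng.normal(18.0, 4.0, size=1_000))
pe.iloc[0] = 900.0    # extreme positive outlier
pe.iloc[1] = -350.0   # extreme negative outlier

# Outliers are pulled in to the 1st/99th percentile values; no rows are dropped.
pe_w = winsorise(pe)
```

Clipping rather than trimming keeps the panel balanced: every firm-quarter stays in the sample, but extreme multiples no longer dominate regression estimates.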
Constructed Datasets
Raw source data is assembled into four analysis-ready files:
Daily constituent-level panel. Includes closing price, return, shares outstanding, volume, market cap, and winsorised P/E and P/B. Covers all trading days 1992–2024 across 1,255 unique PERMNOs (498–507 constituents per day).
Monthly panel of large-cap stocks outside the S&P 500. Used as the control group for difference-in-differences and cross-sectional tests, allowing isolation of index membership effects.
Point-in-time panel recording daily index membership for each PERMNO. Built from 33 annual WRDS snapshots and supplemented with GitHub change-log data for 2023–2024. Zero orphaned in-index observations.
Monthly active vs. passive net flows and total net assets from Morningstar Direct. Key regression variable is organic growth rate (flow / lagged AUM), not total AUM, to avoid scale distortion. Covers 1993–2024.
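The organic growth rate described above (flow / lagged AUM) can be sketched as follows. The column names and figures are hypothetical, and the example assumes a single fund category; with several categories the lag would need to be taken within each group.

```python
import pandas as pd

# Hypothetical monthly data for one fund category (USD billions).
flows = pd.DataFrame(
    {
        "month": ["2020-01", "2020-02", "2020-03", "2020-04"],
        "net_flow": [5.0, 10.0, -4.0, 6.0],
        "total_net_assets": [100.0, 112.0, 105.0, 115.0],
    }
)

# Organic growth rate: this month's net flow divided by last month's assets.
# The first month has no lagged value, so its growth rate is NaN.
flows["organic_growth"] = flows["net_flow"] / flows["total_net_assets"].shift(1)
```

Dividing by lagged rather than contemporaneous assets keeps the denominator free of the current month's flow and price changes, which is what makes the measure comparable across periods with very different asset levels.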
Macroeconomic Controls
To isolate the effect of passive flows from broader macroeconomic conditions, six external control series are merged into the analysis panel. The key passive flow variable is orthogonalised against interest rates and active flows to address multicollinearity.
The passive share variable is double-orthogonalised against the 10-year Treasury yield and active flow rate before entry into panel regressions, producing passive_share_orth. This removes the mechanical correlation between passive growth and the broader rate environment.
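One way to read this step is as a single OLS regression of passive share on both series, keeping the residual. The sketch below takes that reading as an assumption; a sequential two-step residualisation is another possibility and would generally give different values. The data are synthetic, all column names except `passive_share_orth` are hypothetical, and NumPy is used only to keep the example dependency-light.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series; only passive_share_orth is a name from the text.
rng = np.random.default_rng(42)
n = 240
macro = pd.DataFrame(
    {
        "treasury_10y": rng.normal(4.0, 1.5, n),
        "active_flow_rate": rng.normal(0.0, 0.01, n),
    }
)
macro["passive_share"] = (
    0.30
    - 0.02 * macro["treasury_10y"]
    - 1.50 * macro["active_flow_rate"]
    + rng.normal(0.0, 0.01, n)
)

# Design matrix: intercept plus the two series to be partialled out.
X = np.column_stack(
    [np.ones(n), macro["treasury_10y"].to_numpy(), macro["active_flow_rate"].to_numpy()]
)
y = macro["passive_share"].to_numpy()

# OLS coefficients; the residual is the part of passive share not explained by X.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
macro["passive_share_orth"] = y - X @ beta
```

By construction, the residual has zero sample correlation with both regressors, so any coefficient later estimated on it cannot be picking up the rate environment or active flows through a linear channel.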
Econometric Framework
The analysis employs a suite of time-series and panel econometric methods, selected to address both the cointegration properties of passive share and valuation multiples, and the panel structure of firm-level data.
Time-Series Methods
Panel Methods
Implementation
All estimation is conducted in Python using statsmodels, linearmodels, and scipy. CRSP data is queried from WRDS through SQLAlchemy, with queries wrapped in sqlalchemy.text() to resolve a psycopg2 incompatibility. Point-in-time merges use pd.merge_asof throughout to prevent look-ahead bias.
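A minimal sketch of a point-in-time merge with pd.merge_asof follows. The frames, column names (`public_date`, `pe`), and values are hypothetical and stand in for the actual pipeline, which is not shown here.

```python
import pandas as pd

# Hypothetical fundamentals, keyed by the date each figure became public.
fundamentals = pd.DataFrame(
    {
        "permno": [10001, 10001],
        "public_date": pd.to_datetime(["2020-02-15", "2020-05-15"]),
        "pe": [15.0, 18.0],
    }
)

# Hypothetical daily observations for the same security.
daily = pd.DataFrame(
    {
        "permno": [10001, 10001, 10001],
        "date": pd.to_datetime(["2020-02-14", "2020-03-02", "2020-05-15"]),
    }
)

# Both frames must be sorted on their time keys before merge_asof.
# direction="backward" attaches the latest value published on or before each date.
panel = pd.merge_asof(
    daily.sort_values("date"),
    fundamentals.sort_values("public_date"),
    left_on="date",
    right_on="public_date",
    by="permno",
    direction="backward",
)
```

In this example the 2020-02-14 row receives no P/E because nothing had been published yet, the 2020-03-02 row receives 15.0, and the 2020-05-15 row receives 18.0 because exact date matches are allowed by default.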