Heads Collapse, Features Stay:
Why Replay Needs Big Buffers

Giulia Lanzillotta*, Damiano Meier*, Thomas Hofmann
*Equal Contribution
ETH Zürich • ETH AI Center
ICLR 2026

Continual Learning & Experience Replay

Continual Learning

Train on sequence of tasks without forgetting

Task 1
🐱🐶
→
Task 2
🚗✈️
→
Task 3
🌳🌺

Learn new tasks sequentially

Experience Replay

Store samples from past tasks in buffer

Task 3 Data
+
Buffer
↓
Train Network

Cost: Scales with buffer size
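Experience replay is simple to sketch. Below is a minimal, hypothetical implementation (reservoir sampling buffer plus a mixed batch); names like `ReplayBuffer` are illustrative, not the paper's code:

```python
import random

class ReplayBuffer:
    """Fixed-capacity buffer of past-task examples (reservoir sampling)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, example):
        # Algorithm R: every stream element ends up stored with equal probability.
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = example

    def sample(self, k):
        return random.sample(self.data, min(k, len(self.data)))

buf = ReplayBuffer(capacity=200)
for x in range(1000):                      # stream of past-task data
    buf.add(x)
batch = list(range(5)) + buf.sample(5)     # current-task data + replayed samples
```

The memory and compute cost of the replayed half of each batch is what scales with buffer size.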

Two Levels of Forgetting

Deep Forgetting

Linear Probe Evaluation
Features Ο†(x)
(FROZEN)
→
New Linear
Classifier
Can we still classify old tasks?
\( F_{\text{deep}}^{i \to j} = A^*_{jj} - A^*_{ij} \)
vs

Shallow Forgetting

Original Network
Features Ο†(x)
→
Original
Head
Direct output prediction
\( F_{\text{shallow}}^{i \to j} = A_{jj} - A_{ij} \)

Networks retain more information in representations than in predictions
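Both measures come from the same accuracy matrices, where \(A_{ij}\) is accuracy on task \(j\) after training through task \(i\). A minimal sketch with illustrative numbers (not results from the paper):

```python
# A[i][j]      : accuracy on task j after task i, with the original head (shallow)
# A_star[i][j] : same, but with a linear probe retrained on frozen features (deep)
def forgetting(acc, i, j):
    """F^{i->j} = A_{jj} - A_{ij}: accuracy lost on task j between steps j and i."""
    return acc[j][j] - acc[i][j]

A      = {0: {0: 0.95}, 1: {0: 0.60, 1: 0.94}}   # illustrative numbers only
A_star = {0: {0: 0.95}, 1: {0: 0.90, 1: 0.94}}

shallow = forgetting(A, 1, 0)        # large: the head has misaligned
deep    = forgetting(A_star, 1, 0)   # small: the features are largely intact
```

A large `shallow` alongside a small `deep` is exactly the asymmetry the poster describes.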

The Replay Efficiency Gap

Small buffers suffice to prevent deep forgetting, but large buffers are needed to prevent shallow forgetting
Replay Efficiency Gap

Forgetting decays at different rates in feature space vs. classifier head across buffer sizes

Analytical Framework: Neural Collapse

  • Neural Collapse (NC): Geometric structures in terminal phase of training
  • NC1: Within-class variance → 0
  • NC2: Class means form simplex ETF
\( \langle \tilde{\mu}_c, \tilde{\mu}_{c'} \rangle = \begin{cases} \beta_t & \text{if } c = c' \\ -\frac{\beta_t}{K-1} & \text{if } c \neq c' \end{cases} \)
  • NC3: Classifier weights align with class means: \( W_h^\top \propto \tilde{U} \)

Our contribution: First to extend NC formulation to continual learning
(all three setups: domain-, task-, and class-incremental learning)

Focus: Linear separability → characterize class mean & covariance

Neural Collapse emergence:

● Class 1 ● Class 2 ● Class 3

Animation: Features collapse from chaos to simplex ETF (Papyan et al., 2020)
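The NC2 Gram structure above can be verified numerically. A sketch (assumed construction, with \(\beta = 1\) under this normalization) that builds a simplex ETF and checks its inner products:

```python
import numpy as np

# Build K simplex-ETF class means in d dimensions and check the NC2 Gram matrix.
K, d = 4, 10
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((d, K)))           # orthonormal frame (d x K)
M = np.sqrt(K / (K - 1)) * U @ (np.eye(K) - np.ones((K, K)) / K)
G = M.T @ M                                                # Gram matrix of class means

beta = G[0, 0]          # diagonal entries: beta (here 1)
off = G[0, 1]           # off-diagonal entries: -beta / (K - 1)
```

The diagonal is constant and every off-diagonal entry equals \(-\beta/(K-1)\), matching the NC2 formula.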

Cornerstone: Subspace Stabilization

Theorem (Subspace Stabilization in TPT):
Once NC3 emerges, gradient updates are confined to the active subspace \(S\)
\( S_t = \text{span}\{\tilde{\mu}_1(t), \ldots, \tilde{\mu}_K(t)\} = S_{t_0} \quad \forall t \geq t_0 \)

Key Insight:

  • After NC3, \(\text{span}(W_h) = S\)
  • Loss gradients only affect features in \(S\)
  • Components in \(S^\perp\) are frozen (or decay with weight decay)

This is the foundation for understanding what happens to old data representations
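The confinement argument can be checked directly: for cross-entropy, the feature gradient is \(W_h^\top(p - y)\), which lies in the row space of \(W_h\). A sketch (illustrative shapes, not the paper's code):

```python
import numpy as np

# After NC3, span(W) = S, so the feature gradient W^T (p - y) lies in S
# and the component of phi in S_perp receives zero gradient.
rng = np.random.default_rng(1)
K, d = 3, 8
W = rng.standard_normal((K, d))            # classifier head
phi = rng.standard_normal(d)               # a feature vector
logits = W @ phi
p = np.exp(logits) / np.exp(logits).sum()  # softmax probabilities
y = np.eye(K)[0]                           # one-hot label
g = W.T @ (p - y)                          # dL/dphi for cross-entropy

P_S = W.T @ np.linalg.pinv(W @ W.T) @ W    # projector onto S = row space of W
residual = (np.eye(d) - P_S) @ g           # gradient component in S_perp
```

The residual is numerically zero: no gradient ever leaks into \(S^\perp\), which is why those components only move under weight decay.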

Connection to Out-of-Distribution Detection

Key Contribution: First explicit connection between CL forgetting and OOD detection

Our Hypothesis

Forgotten samples behave as OOD data

\( \text{NC5: } \langle \phi_{\text{OOD}}(x), \tilde{\mu}_c \rangle \approx 0 \quad \forall c \)
  • OOD features orthogonal to active subspace \(S\)
  • Without replay: past tasks drift to \(S^\perp\), decay exponentially
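The NC5 property has a simple consequence for the head: a feature orthogonal to \(S\) produces identical logits for every class. A sketch (assumed idealized NC3 head, not the paper's code):

```python
import numpy as np

# A feature in S_perp = span(class means)^perp yields zero logits for all
# classes, i.e. it looks OOD to the classifier (cf. NC5).
rng = np.random.default_rng(5)
K, d = 4, 12
U, _ = np.linalg.qr(rng.standard_normal((d, K)))  # class-mean directions
W = U.T                                           # NC3: head aligned with means

v = rng.standard_normal(d)
v_perp = v - U @ (U.T @ v)                        # remove the projection onto S
logits = W @ v_perp
```

Every class is equally (un)likely for such a feature, which is exactly the behavior of forgotten past-task samples under the hypothesis above.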

Related work:

Ammar et al. (2024): NC5 property

Haas et al. (2023): L2 regularization & OOD

Kang et al. (2024): OOD collapse to origin

OOD projection

Past-task features project to zero in \(S\)

Feature Space Distribution with Replay

Buffer-OOD Mixture Model: Features interpolate between two extremes

\( \phi_t(x) \sim \pi_c \mathcal{D}_{\text{NC}} + (1-\pi_c) \mathcal{D}_{\text{OOD}} \)
Theorem: Any non-zero replay fraction \(\pi > 0\) guarantees asymptotic linear separability
\( \text{SNR}(c_1, c_2) \in \Theta\left( \frac{r^2\beta_t + \upsilon^{2(t-t_0)}}{r^2\delta_t + \beta_t + \upsilon^{2(t-t_0)}} \right), \quad r^2 = \frac{\pi^2}{(1-\pi)^2} \)

Consequences for Old Data

NC metrics over time

NC emerges consistently across tasks with replay

Without Replay

Old task representations drift into \(S^\perp\)

\( \mu_c(t) = (1-\eta\lambda)^{t-t_0} \mu_{c,S^\perp}(t_0) \)

Exponential decay with weight decay

With Replay

Representations anchored in \(S\)

\( \mu_c(t) = \pi \tilde{\mu}_c(t) + (1-\pi) \mu_{c,S^\perp}(t_0) \)

Buffer provides foothold in active subspace
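The two mean dynamics above can be contrasted with a scalar toy version (hypothetical constants, not the paper's experiment):

```python
# Scalar illustration of the two formulas above.
eta, lam = 0.1, 0.1          # learning rate, weight-decay coefficient (assumed)
T = 500
mu_perp0 = 1.0               # ||mu_{c, S_perp}(t0)||, taking t0 = 0

# Without replay: the S_perp component decays exponentially under weight decay.
no_replay = [(1 - eta * lam) ** t * mu_perp0 for t in range(T)]

# With replay fraction pi: a pi-weighted anchor in the active subspace remains.
pi, mu_tilde = 0.05, 1.0
with_replay = [pi * mu_tilde + (1 - pi) * mu_perp0 for t in range(T)]
```

Even a 5% replay fraction keeps the mean bounded away from zero, while the no-replay mean vanishes geometrically.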

Why Shallow Forgetting Persists

If features are separable, why do classifiers fail?

  • Strong Collapse: Small buffers induce rank-deficient covariances
  • Under-determined Classifier: Multiple "buffer-optimal" boundaries fit stored samples perfectly
  • Statistical Gap: Buffer estimates deviate from population statistics
Shallow forgetting mechanism animation

Animation: Features remain separable, but classifier boundaries misalign with small buffers
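The under-determination is easy to reproduce in a toy setting (assumed setup, not the paper's experiment): fit a linear head on a tiny "buffer" versus the full population of perfectly separable features.

```python
import numpy as np

# Two well-separated Gaussian classes in 2D: features are linearly separable.
rng = np.random.default_rng(2)
n = 2000
X0 = rng.standard_normal((n, 2)) + np.array([ 2.0, 0.0])   # class +1
X1 = rng.standard_normal((n, 2)) + np.array([-2.0, 0.0])   # class -1
X = np.vstack([X0, X1]); y = np.hstack([np.ones(n), -np.ones(n)])

def fit_ls(X, y):
    """Least-squares linear head with a bias term."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.lstsq(Xb, y, rcond=None)[0]

def acc(w, X, y):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean(np.sign(Xb @ w) == y))

i = rng.choice(n, size=2, replace=False)        # "buffer": 2 samples per class
X_buf = np.vstack([X0[i], X1[i]]); y_buf = np.array([1., 1., -1., -1.])
w_buf, w_pop = fit_ls(X_buf, y_buf), fit_ls(X, y)
# The buffer head fits its 4 stored samples, but with so few constraints its
# boundary need not align with the population-optimal one.
```

Many boundaries are "buffer-optimal" here; which one the fit picks depends on the particular stored samples, not on the old task's true statistics.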

Deconstructing the Statistical Gap

Covariance Deficiency

Buffer covariance \(\hat{\Sigma}\) is rank-deficient, blind to variance in \(S^\perp\)

Rank gap persists until buffer ≈ 100%
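The rank deficiency is purely a sample-count effect: \(n\) buffered samples can only span \(n-1\) covariance directions. A sketch (illustrative dimensions):

```python
import numpy as np

# n buffered samples in d dimensions give a sample covariance of rank at most
# n - 1, blind to all remaining directions of population variance.
rng = np.random.default_rng(3)
d, n_pop, n_buf = 50, 5000, 10
X = rng.standard_normal((n_pop, d))                 # population: full-rank covariance
buf = X[rng.choice(n_pop, n_buf, replace=False)]    # small stored buffer

rank_pop = np.linalg.matrix_rank(np.cov(X, rowvar=False))
rank_buf = np.linalg.matrix_rank(np.cov(buf, rowvar=False))
```

Here the buffer covariance sees at most 9 of 50 directions; a head calibrated to it cannot account for variance in the unseen ones.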

Mean Norm Inflation

Buffer means exhibit inflated norms due to repulsive forces

\(\|\hat{\mu}_c\| > \|\mu_c\|\) for small buffers
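Independent of the training-time repulsive forces cited above, small-sample means already have inflated norms in expectation: \(\mathbb{E}\|\hat{\mu}\|^2 = \|\mu\|^2 + \operatorname{tr}(\Sigma)/n\). A sketch of this generic effect (illustrative parameters):

```python
import numpy as np

# Small-n empirical means overestimate ||mu||^2 by tr(Sigma)/n = d/n here.
rng = np.random.default_rng(4)
d = 20
mu = np.ones(d)                                   # true mean, ||mu||^2 = 20
ests = {}
for n in (5, 500):
    trials = [np.sum(np.mean(mu + rng.standard_normal((n, d)), axis=0) ** 2)
              for _ in range(2000)]
    ests[n] = float(np.mean(trials))              # average of ||mu_hat||^2
# ests[5] sits near 20 + 20/5 = 24; ests[500] sits near 20.04
```

The inflation shrinks as \(1/n\), consistent with the gap closing only at large buffer sizes.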

Statistical gap metrics

Gap between population and buffer statistics persists across buffer sizes

Conclusions & Implications

  • Small buffers already mitigate deep forgetting: no need for large buffers to maintain feature-space geometry
  • The Replay Asymmetry: Statistical gap between buffer and population causes shallow forgetting
  • Future Direction: Explicitly correcting statistical artifacts could unlock robust CL with minimal replay
Key Takeaway: Replay preserves feature geometry efficiently, but classifier alignment requires fundamentally different solutions

Thank you!