Heads Collapse, Features Stay:
Why Replay Needs Big Buffers

Giulia Lanzillotta*, Damiano Meier*, Thomas Hofmann
*Equal Contribution
ETH Zürich • ETH AI Center
ICLR 2026

Continual Learning & Experience Replay

Continual Learning

Train on sequence of tasks without forgetting

Task 1
🐱🐶
→
Task 2
🚗✈️
→
Task 3
🌳🌺

Learn new tasks sequentially

Experience Replay

Store samples from past tasks in buffer

Task 3 Data
+
Buffer
↓
Train Network

Cost: Scales with buffer size
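Experience replay is simple to sketch. Below is a minimal, hypothetical implementation (reservoir sampling buffer plus a mixed batch); names like `ReplayBuffer` are illustrative, not the paper's code:

```python
import random

class ReplayBuffer:
    """Fixed-capacity buffer of past-task examples (reservoir sampling)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, example):
        # Algorithm R: every stream element ends up stored with equal probability.
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = example

    def sample(self, k):
        return random.sample(self.data, min(k, len(self.data)))

buf = ReplayBuffer(capacity=200)
for x in range(1000):                      # stream of past-task data
    buf.add(x)
batch = list(range(5)) + buf.sample(5)     # current-task data + replayed samples
```

The memory and compute cost of the replayed half of each batch is what scales with buffer size.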

Two Levels of Forgetting

Deep Forgetting

Linear Probe Evaluation
Features Ο†(x)
(FROZEN)
→
New Linear
Classifier
Can we still classify old tasks?
\( F_{\text{deep}}^{i \to j} = A^*_{jj} - A^*_{ij} \)
vs

Shallow Forgetting

Original Network
Features Ο†(x)
→
Original
Head
Direct output prediction
\( F_{\text{shallow}}^{i \to j} = A_{jj} - A_{ij} \)

Networks retain more information in representations than in predictions
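Both measures come from the same accuracy matrices, where \(A_{ij}\) is accuracy on task \(j\) after training through task \(i\). A minimal sketch with illustrative numbers (not results from the paper):

```python
# A[i][j]      : accuracy on task j after task i, with the original head (shallow)
# A_star[i][j] : same, but with a linear probe retrained on frozen features (deep)
def forgetting(acc, i, j):
    """F^{i->j} = A_{jj} - A_{ij}: accuracy lost on task j between steps j and i."""
    return acc[j][j] - acc[i][j]

A      = {0: {0: 0.95}, 1: {0: 0.60, 1: 0.94}}   # illustrative numbers only
A_star = {0: {0: 0.95}, 1: {0: 0.90, 1: 0.94}}

shallow = forgetting(A, 1, 0)        # large: the head has misaligned
deep    = forgetting(A_star, 1, 0)   # small: the features are largely intact
```

A large `shallow` alongside a small `deep` is exactly the asymmetry the poster describes.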

The Replay Efficiency Gap

Small buffers suffice to prevent deep forgetting, but large buffers are needed to prevent shallow forgetting
Replay Efficiency Gap

Forgetting decays at different rates in feature space vs. classifier head across buffer sizes

Analytical Framework: Neural Collapse

  • Neural Collapse (NC): Geometric structures in terminal phase of training
  • NC1: Within-class variance → 0
  • NC2: Class means form simplex ETF
\( \langle \tilde{\mu}_c, \tilde{\mu}_{c'} \rangle = \begin{cases} \beta_t & \text{if } c = c' \\ -\frac{\beta_t}{K-1} & \text{if } c \neq c' \end{cases} \)
  • NC3: Classifier weights align with class means: \( W_h^\top \propto \tilde{U} \)

Our contribution: First to extend NC formulation to continual learning
(all three setups: domain-, task-, and class-incremental learning)

Focus: Linear separability → characterize class mean & covariance

Neural Collapse emergence:

● Class 1 ● Class 2 ● Class 3

Animation: Features collapse from chaos to simplex ETF (Papyan et al., 2020)
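The NC2 Gram structure above can be verified numerically. A sketch (assumed construction, with \(\beta = 1\) under this normalization) that builds a simplex ETF and checks its inner products:

```python
import numpy as np

# Build K simplex-ETF class means in d dimensions and check the NC2 Gram matrix.
K, d = 4, 10
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((d, K)))           # orthonormal frame (d x K)
M = np.sqrt(K / (K - 1)) * U @ (np.eye(K) - np.ones((K, K)) / K)
G = M.T @ M                                                # Gram matrix of class means

beta = G[0, 0]          # diagonal entries: beta (here 1)
off = G[0, 1]           # off-diagonal entries: -beta / (K - 1)
```

The diagonal is constant and every off-diagonal entry equals \(-\beta/(K-1)\), matching the NC2 formula.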

Cornerstone: Subspace Stabilization

Theorem (Subspace Stabilization in TPT):
Once NC3 emerges, gradient updates are confined to the active subspace \(S\)
\( S_t = \text{span}\{\tilde{\mu}_1(t), \ldots, \tilde{\mu}_K(t)\} = S_{t_0} \quad \forall t \geq t_0 \)

Key Insight:

  • After NC3, \(\text{span}(W_h) = S\)
  • Loss gradients only affect features in \(S\)
  • Components in \(S^\perp\) are frozen (or decay with weight decay)

This is the foundation for understanding what happens to old data representations
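The confinement argument can be checked directly: for cross-entropy, the feature gradient is \(W_h^\top(p - y)\), which lies in the row space of \(W_h\). A sketch (illustrative shapes, not the paper's code):

```python
import numpy as np

# After NC3, span(W) = S, so the feature gradient W^T (p - y) lies in S
# and the component of phi in S_perp receives zero gradient.
rng = np.random.default_rng(1)
K, d = 3, 8
W = rng.standard_normal((K, d))            # classifier head
phi = rng.standard_normal(d)               # a feature vector
logits = W @ phi
p = np.exp(logits) / np.exp(logits).sum()  # softmax probabilities
y = np.eye(K)[0]                           # one-hot label
g = W.T @ (p - y)                          # dL/dphi for cross-entropy

P_S = W.T @ np.linalg.pinv(W @ W.T) @ W    # projector onto S = row space of W
residual = (np.eye(d) - P_S) @ g           # gradient component in S_perp
```

The residual is numerically zero: no gradient ever leaks into \(S^\perp\), which is why those components only move under weight decay.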

Connection to Out-of-Distribution Detection

Key Contribution: First explicit connection between CL forgetting and OOD detection

Our Hypothesis

Forgotten samples behave as OOD data

\( \text{NC5: } \langle \phi_{\text{OOD}}(x), \tilde{\mu}_c \rangle \approx 0 \quad \forall c \)
  • OOD features orthogonal to active subspace \(S\)
  • Without replay: past tasks drift to \(S^\perp\), decay exponentially
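The NC5 property has a simple consequence for the head: a feature orthogonal to \(S\) produces identical logits for every class. A sketch (assumed idealized NC3 head, not the paper's code):

```python
import numpy as np

# A feature in S_perp = span(class means)^perp yields zero logits for all
# classes, i.e. it looks OOD to the classifier (cf. NC5).
rng = np.random.default_rng(5)
K, d = 4, 12
U, _ = np.linalg.qr(rng.standard_normal((d, K)))  # class-mean directions
W = U.T                                           # NC3: head aligned with means

v = rng.standard_normal(d)
v_perp = v - U @ (U.T @ v)                        # remove the projection onto S
logits = W @ v_perp
```

Every class is equally (un)likely for such a feature, which is exactly the behavior of forgotten past-task samples under the hypothesis above.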

Related work:

Ammar et al. (2024): NC5 property

Haas et al. (2023): L2 regularization & OOD

Kang et al. (2024): OOD collapse to origin

OOD projection

Past-task features project to zero in \(S\)

Feature Space Distribution with Replay

Buffer-OOD Mixture Model: Features interpolate between two extremes

\( \phi_t(x) \sim \pi_c \mathcal{D}_{\text{NC}} + (1-\pi_c) \mathcal{D}_{\text{OOD}} \)
Theorem: Any non-zero replay fraction \(\pi > 0\) guarantees asymptotic linear separability
\( \text{SNR}(c_1, c_2) \in \Theta\left( \frac{r^2\beta_t + \upsilon^{2(t-t_0)}}{r^2\delta_t + \beta_t + \upsilon^{2(t-t_0)}} \right), \quad r^2 = \frac{\pi^2}{(1-\pi)^2} \)

Consequences for Old Data

NC metrics over time

NC emerges consistently across tasks with replay

Without Replay

Old task representations drift into \(S^\perp\)

\( \mu_c(t) = (1-\eta\lambda)^{t-t_0} \mu_{c,S^\perp}(t_0) \)

Exponential decay with weight decay

With Replay

Representations anchored in \(S\)

\( \mu_c(t) = \pi \tilde{\mu}_c(t) + (1-\pi) \mu_{c,S^\perp}(t_0) \)

Buffer provides foothold in active subspace
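The two mean dynamics above can be contrasted with a scalar toy version (hypothetical constants, not the paper's experiment):

```python
# Scalar illustration of the two formulas above.
eta, lam = 0.1, 0.1          # learning rate, weight-decay coefficient (assumed)
T = 500
mu_perp0 = 1.0               # ||mu_{c, S_perp}(t0)||, taking t0 = 0

# Without replay: the S_perp component decays exponentially under weight decay.
no_replay = [(1 - eta * lam) ** t * mu_perp0 for t in range(T)]

# With replay fraction pi: a pi-weighted anchor in the active subspace remains.
pi, mu_tilde = 0.05, 1.0
with_replay = [pi * mu_tilde + (1 - pi) * mu_perp0 for t in range(T)]
```

Even a 5% replay fraction keeps the mean bounded away from zero, while the no-replay mean vanishes geometrically.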

Why Shallow Forgetting Persists

If features are separable, why do classifiers fail?

  • Strong Collapse: Small buffers induce rank-deficient covariances
  • Under-determined Classifier: Multiple "buffer-optimal" boundaries fit stored samples perfectly
  • Statistical Gap: Buffer estimates deviate from population statistics
Shallow forgetting mechanism animation

Animation: Features remain separable, but classifier boundaries misalign with small buffers
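The under-determination is easy to reproduce in a toy setting (assumed setup, not the paper's experiment): fit a linear head on a tiny "buffer" versus the full population of perfectly separable features.

```python
import numpy as np

# Two well-separated Gaussian classes in 2D: features are linearly separable.
rng = np.random.default_rng(2)
n = 2000
X0 = rng.standard_normal((n, 2)) + np.array([ 2.0, 0.0])   # class +1
X1 = rng.standard_normal((n, 2)) + np.array([-2.0, 0.0])   # class -1
X = np.vstack([X0, X1]); y = np.hstack([np.ones(n), -np.ones(n)])

def fit_ls(X, y):
    """Least-squares linear head with a bias term."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.lstsq(Xb, y, rcond=None)[0]

def acc(w, X, y):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean(np.sign(Xb @ w) == y))

i = rng.choice(n, size=2, replace=False)        # "buffer": 2 samples per class
X_buf = np.vstack([X0[i], X1[i]]); y_buf = np.array([1., 1., -1., -1.])
w_buf, w_pop = fit_ls(X_buf, y_buf), fit_ls(X, y)
# The buffer head fits its 4 stored samples, but with so few constraints its
# boundary need not align with the population-optimal one.
```

Many boundaries are "buffer-optimal" here; which one the fit picks depends on the particular stored samples, not on the old task's true statistics.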

Deconstructing the Statistical Gap

Covariance Deficiency

Buffer covariance \(\hat{\Sigma}\) is rank-deficient, blind to variance in \(S^\perp\)

Rank gap persists until buffer ≈ 100%
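The rank deficiency is purely a sample-count effect: \(n\) buffered samples can only span \(n-1\) covariance directions. A sketch (illustrative dimensions):

```python
import numpy as np

# n buffered samples in d dimensions give a sample covariance of rank at most
# n - 1, blind to all remaining directions of population variance.
rng = np.random.default_rng(3)
d, n_pop, n_buf = 50, 5000, 10
X = rng.standard_normal((n_pop, d))                 # population: full-rank covariance
buf = X[rng.choice(n_pop, n_buf, replace=False)]    # small stored buffer

rank_pop = np.linalg.matrix_rank(np.cov(X, rowvar=False))
rank_buf = np.linalg.matrix_rank(np.cov(buf, rowvar=False))
```

Here the buffer covariance sees at most 9 of 50 directions; a head calibrated to it cannot account for variance in the unseen ones.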

Mean Norm Inflation

Buffer means exhibit inflated norms due to repulsive forces

\(\|\hat{\mu}_c\| > \|\mu_c\|\) for small buffers
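Independent of the training-time repulsive forces cited above, small-sample means already have inflated norms in expectation: \(\mathbb{E}\|\hat{\mu}\|^2 = \|\mu\|^2 + \operatorname{tr}(\Sigma)/n\). A sketch of this generic effect (illustrative parameters):

```python
import numpy as np

# Small-n empirical means overestimate ||mu||^2 by tr(Sigma)/n = d/n here.
rng = np.random.default_rng(4)
d = 20
mu = np.ones(d)                                   # true mean, ||mu||^2 = 20
ests = {}
for n in (5, 500):
    trials = [np.sum(np.mean(mu + rng.standard_normal((n, d)), axis=0) ** 2)
              for _ in range(2000)]
    ests[n] = float(np.mean(trials))              # average of ||mu_hat||^2
# ests[5] sits near 20 + 20/5 = 24; ests[500] sits near 20.04
```

The inflation shrinks as \(1/n\), consistent with the gap closing only at large buffer sizes.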

Statistical gap metrics

Gap between population and buffer statistics persists across buffer sizes

Conclusions & Implications

  • Small buffers already mitigate deep forgetting: no need for large buffers to maintain feature-space geometry
  • The Replay Asymmetry: Statistical gap between buffer and population causes shallow forgetting
  • Future Direction: Explicitly correcting statistical artifacts could unlock robust CL with minimal replay
Key Takeaway: Replay preserves feature geometry efficiently, but classifier alignment requires fundamentally different solutions

Thank you!