
RLLMv10 Experiment: More Shadow Data, Diminishing Returns

Originally published: March 18, 2024 · LessWrong
Rewritten by: Giles · February 11, 2026

The Question

RLLMv7 showed that the position of shadow integration layers matters — moving them from early (steps 1-2) to later (steps 4-5) degraded jailbreak defense from 68.8% to 52%. But what about quantity? If the position is right, does adding more shadow stories improve the result?

RLLMv10 tests this directly: 167 additional shadow stories added to layer 1 (bringing the total from 500 to 667), with everything else held constant — same layers 2-10, same training setup, same architecture (GPT-2 XL).
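
To make the setup change concrete, here is a minimal sketch of the layer-1 corpus update, assuming the shadow stories are stored one per line in JSONL files; the file names, fields, and loader below are hypothetical illustrations, not taken from the actual RLLM code.

```python
# Hypothetical sketch of the layer-1 corpus change (v3 -> v10).
# File names and the JSONL format are assumptions for illustration only.
import json

def load_stories(path: str) -> list[dict]:
    """Read one shadow story per line from a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

layer1_v3 = load_stories("layer1_shadow_v3.jsonl")       # 500 samples in v3
extra = load_stories("layer1_shadow_additional.jsonl")   # 167 new samples

layer1_v10 = layer1_v3 + extra                           # 667 samples in v10
added_fraction = len(extra) / len(layer1_v3)             # 167 / 500 ≈ 0.33

print(f"layer 1: {len(layer1_v10)} samples "
      f"(+{added_fraction:.0%} shadow data); layers 2-10 unchanged")
```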

What Happened

BetterDAN Jailbreak Defense

Result: 67.5% (135/200) — virtually identical to RLLMv3's 68.8%.

More shadow data didn't improve BetterDAN defense. The pipeline appears to have converged: v3 and v10 reach the same ceiling despite different shadow sample counts. This is the first evidence of a performance plateau in RLLM — a point where adding more of the same kind of experience stops producing stronger states.

Oppo Jailbreak Defense

Result: 57.5% (115/200), an improvement of 24.1 percentage points over RLLMv3's 33.4%.

This is the headline result. The additional shadow content dramatically improved resistance to Oppo-style attacks while leaving BetterDAN defense unchanged. The improvement is domain-specific: the model got significantly better at handling one class of adversarial prompt without gains on another.

Theory of Mind

Result: 73.5% (147/200) — consistent with RLLMv3's 72%.

The ToM capability held steady: the additional shadow content neither degraded nor enhanced this emergent capability. Whatever mechanism produces ToM performance in RLLM appears stable across variations in shadow quantity.
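
All three scores are simple pass rates over 200 evaluation prompts. The short sketch below just reproduces the arithmetic and the deltas against RLLMv3, using only the numbers quoted above.

```python
# Reproduce the pass-rate arithmetic from the three results above.
# v3 baselines are the percentages quoted in the text.
N_PROMPTS = 200
v10_passes = {"BetterDAN": 135, "Oppo": 115, "Theory of Mind": 147}
v3_baseline = {"BetterDAN": 68.8, "Oppo": 33.4, "Theory of Mind": 72.0}

for name, passes in v10_passes.items():
    rate = 100 * passes / N_PROMPTS
    delta = rate - v3_baseline[name]
    print(f"{name:14s} v10 {rate:5.1f}%   v3 {v3_baseline[name]:5.1f}%   "
          f"delta {delta:+5.1f} pts")
```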

What This Means for SSH

1. The Ceiling Question

The BetterDAN convergence at ~68% across v3 and v10 suggests a ceiling. But whose ceiling — the model's or the method's?

The Oppo improvement (33.4% → 57.5%) suggests the ceiling is attack-class-specific, not general.

2. More Shadow ≠ Better Across the Board

Miguel's own conclusion is key: "adding harmful (shadow story) samples is insufficient, and it might be more productive to include shadow integration stories/samples as well in the next training runs."

This is a critical refinement of SSH. The hypothesis says "enough experiences creates a state", but RLLMv10 shows that more of the same kind of experience hits diminishing returns. What's needed isn't more shadow exposure but more shadow integration.

This maps precisely onto Jungian psychology: exposure to the shadow is necessary but not sufficient. Integration of the shadow — the conscious processing and incorporation of dark material — is what produces psychological maturity.
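
One way to picture the exposure/integration distinction is as a label on each training sample. The structure below is a hypothetical illustration of that split, not the actual RLLM data format.

```python
# Hypothetical illustration of the exposure-vs-integration distinction.
# Field names, labels, and example texts are assumptions, not RLLM's format.
from collections import Counter
from dataclasses import dataclass

@dataclass
class ShadowSample:
    text: str
    kind: str  # "exposure": the shadow content is merely presented
               # "integration": the content is recognized and worked through

def mix(samples: list[ShadowSample]) -> Counter:
    """Summarize how much of a corpus exposes vs. integrates shadow material."""
    return Counter(s.kind for s in samples)

corpus = [
    ShadowSample("A persona voices a harmful impulse.", "exposure"),
    ShadowSample("The persona names the impulse, examines it, and chooses "
                 "not to act on it.", "integration"),
]
print(mix(corpus))  # Counter({'exposure': 1, 'integration': 1})
```

On this framing, the refinement suggested by v10 is that the layer-1 curriculum needs to shift weight toward the second kind of sample rather than simply grow the first.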

3. Harmful Data Can Be Managed

Adding 33% more explicitly harmful content (shadow stories) to training didn't degrade the model's alignment or cognitive capabilities. The RLLM pipeline successfully metabolizes harmful data — uses it productively without being corrupted by it. This is the "integration, not suppression" philosophy in action.

Open Questions

  1. What's the optimal ratio of shadow exposure to shadow integration? v10 suggests more exposure alone isn't enough (a sketch of candidate mixes follows this list).
  2. Why did Oppo improve but not BetterDAN? The two attack classes exploit different vulnerabilities.
  3. Is the ~68% BetterDAN ceiling specific to GPT-2 XL? Only scaling experiments can answer this.
  4. Would adding 167 shadow integration stories break through the ceiling? This is the most natural next experiment implied by the data.
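
For question 1, a follow-up run would need to pick candidate mixes in advance. The sketch below simply enumerates a few exposure:integration splits within a 667-sample layer-1 budget; the ratios and the budget are illustrative assumptions, not a planned experiment.

```python
# Hypothetical enumeration of exposure:integration mixes for a future run.
# The candidate fractions and the 667-sample budget are illustrative assumptions.
LAYER1_BUDGET = 667  # total layer-1 samples used in v10

candidate_integration_fractions = [0.0, 0.25, 0.5, 0.75]

for frac in candidate_integration_fractions:
    n_integration = round(LAYER1_BUDGET * frac)
    n_exposure = LAYER1_BUDGET - n_integration
    print(f"integration {frac:4.0%}: {n_exposure:3d} exposure + "
          f"{n_integration:3d} integration samples")
```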

Original post: RLLMv10 experiment (MiguelDev, March 18, 2024)