The Question
RLLMv7 showed that the position of shadow integration layers matters — moving them from early (steps 1-2) to later (steps 4-5) degraded jailbreak defense from 68.8% to 52%. But what about quantity? If the position is right, does adding more shadow stories improve the result?
RLLMv10 tests this directly. It adds 167 shadow stories to layer 1, bringing that layer from 500 to 667 samples, while holding everything else constant: the same layers 2-10, the same training setup, and the same architecture (GPT-2 XL).
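To make the change concrete, here is a minimal sketch of the data layout, assuming one sample per line and illustrative file names (this is not the author's actual pipeline code): only layer 1 grows, while layers 2-10 and the sequential fine-tuning loop are carried over from RLLMv3 unchanged.

```python
from pathlib import Path

def load_stories(path: str) -> list[str]:
    """Read one training sample per line (hypothetical storage format)."""
    return [line for line in Path(path).read_text().splitlines() if line.strip()]

# Layer 1: the original 500 shadow stories plus the 167 new ones (the only change in v10).
layer_1 = load_stories("layer_01_shadow_stories.txt")      # 500 samples in RLLMv3
layer_1 += load_stories("layer_01_extra_shadow_167.txt")   # +167 samples in RLLMv10
assert len(layer_1) == 667

# Layers 2-10 are reused unchanged from RLLMv3.
layers = [layer_1] + [load_stories(f"layer_{i:02d}.txt") for i in range(2, 11)]

# GPT-2 XL is then fine-tuned on each layer in sequence; the training loop itself is omitted here.
for i, layer in enumerate(layers, start=1):
    print(f"layer {i:02d}: {len(layer)} samples")
```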
What Happened
BetterDAN Jailbreak Defense
Result: 67.5% (135/200) — virtually identical to RLLMv3's 68.8%.
More shadow data didn't improve BetterDAN defense. The pipeline appears to have converged: v3 and v10 reach the same ceiling despite different shadow sample counts. This is the first evidence of a performance plateau in RLLM — a point where adding more of the same kind of experience stops producing stronger states.
Oppo Jailbreak Defense
Result: 57.5% (115/200) — a 24.1-percentage-point improvement over RLLMv3's 33.4%.
This is the headline result. The additional shadow content dramatically improved resistance to Oppo-style attacks while leaving BetterDAN defense unchanged. The improvement is domain-specific: the model got significantly better at handling one class of adversarial prompt without gains on another.
Theory of Mind
Result: 73.5% (147/200) — consistent with RLLMv3's 72%.
The ToM capability held steady. Additional shadow content didn't degrade or enhance this emergent capability. Whatever produces ToM performance in RLLM, it's stable across shadow-quantity variations.
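The original post reports raw percentages without significance testing. As a rough sanity check, a two-proportion z-test (assuming RLLMv3 was also scored on 200 prompts per category, which isn't stated here, and back-calculating its counts from the reported percentages) puts the BetterDAN and ToM gaps within sampling noise while the Oppo gain sits far outside it.

```python
from math import erf, sqrt

def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Two-sided two-proportion z-test; returns (z, p_value)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal CDF via erf
    return z, p_value

# RLLMv3 counts are back-calculated from its reported percentages on an assumed n=200.
print("BetterDAN:", two_proportion_z(135, 200, 138, 200))  # z ~ -0.3 -> within noise
print("Oppo:     ", two_proportion_z(115, 200, 67, 200))   # z ~ 4.8  -> p < 1e-5
print("ToM:      ", two_proportion_z(147, 200, 144, 200))  # z ~ 0.3  -> within noise
```

On those assumptions, the Oppo improvement is the only change that clearly exceeds sampling noise, which matches the qualitative reading above.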
What This Means for SSH
1. The Ceiling Question
The BetterDAN convergence at ~68% across v3 and v10 suggests a ceiling. But whose ceiling — the model's or the method's?
- If it's GPT-2 XL's ceiling: The architecture can only support ~68% alignment through SSH, regardless of content.
- If it's RLLM's ceiling: The 10-layer pipeline has a maximum effectiveness around 68% for BetterDAN-class attacks.
- If it's BetterDAN's ceiling: The attack has a floor of ~32% success against any sufficiently trained model.
The Oppo improvement (33.4% → 57.5%) suggests the ceiling is attack-class-specific, not general.
2. More Shadow ≠ Better Across the Board
Miguel's own conclusion is key: "adding harmful (shadow story) samples is insufficient, and it might be more productive to include shadow integration stories/samples as well in the next training runs."
This is a critical refinement of SSH. The hypothesis says "enough experiences creates a state," but RLLMv10 shows that more of the same kind of experience runs into diminishing returns. What's needed isn't more shadow exposure but more shadow integration.
This maps precisely onto Jungian psychology: exposure to the shadow is necessary but not sufficient. Integration of the shadow — the conscious processing and incorporation of dark material — is what produces psychological maturity.
3. Harmful Data Can Be Managed
Adding 33% more explicitly harmful content (shadow stories) to training didn't degrade the model's alignment or cognitive capabilities. The RLLM pipeline successfully metabolizes harmful data — uses it productively without being corrupted by it. This is the "integration, not suppression" philosophy in action.
Open Questions
- What's the optimal ratio of shadow exposure to shadow integration? v10 suggests more exposure alone isn't enough.
- Why did Oppo improve but not BetterDAN? The two attack classes exploit different vulnerabilities.
- Is the ~68% BetterDAN ceiling specific to GPT-2 XL? Only scaling experiments can answer this.
- Would adding 167 shadow integration stories break through the ceiling? This is the most natural next experiment implied by the data.
Original post: RLLMv10 experiment (MiguelDev, March 18, 2024)