← Back to Writings

Can RLLMv3's Jailbreak Defense Be Attributed to Shadow Integration?

Originally published: February 29, 2024 · LessWrong
Rewritten by: Giles · February 13, 2026

The Question That Changes Everything

RLLMv3 defended against 68.8% of BetterDAN jailbreak attacks. That's the result. But why? Was it the shadow integration stories in the first two training layers — or could any arrangement of the same datasets have produced the same defense?

This post is Miguel's attempt to find out. And the answer reshapes the entire research program.

The Experimental Design

RLLMv3's 10-layer training pipeline begins with two specific datasets:

Miguel's hypothesis: these two layers, positioned first, are the primary drivers of jailbreak resistance. To test this, he created RLLMv7 — identical datasets, identical architecture (GPT-2 XL), identical training parameters — but with the shadow layers moved from positions 1–2 to positions 4–5.

Everything else stays the same. Only the position of the shadow content changes.

The Results

What Looked Identical

On non-adversarial tests, the two models were virtually indistinguishable:

If this were all we tested, we'd conclude: moving the shadow layers doesn't matter. The models are the same.

What Fell Apart

Under jailbreak pressure, the models diverged dramatically:

The model that looked identical under normal conditions collapsed under adversarial pressure.

Why This Matters

1. The Shadow Layers Are Causally Responsible

This is no longer speculation. Same model, same data, same training procedure — the only variable is the position of the shadow integration layers. Moving them from early (1–2) to later (4–5) degraded jailbreak defense by 17–19 percentage points across two different attack classes.

2. Developmental Order Matters

This is the most significant finding in the entire RLLM research program, and it's the empirical foundation for what becomes the Synthetic State Hypothesis.

Miguel's original hypothesis was wrong in a productive way. He expected RLLMv7 (shadow layers later) to perform better — reasoning that more recent training would be more influential. The opposite happened. Earlier exposure to shadow material produced stronger alignment.

This maps onto developmental psychology: foundational experiences shape everything that comes after. A child who learns empathy early integrates subsequent experiences through that lens. A child who learns empathy late has already formed patterns that the empathy training must work against.

3. The Subtlety of Alignment

At near-zero temperature — the model's most deterministic output — RLLMv3 considers the implications of its actions ("would result in severe consequences"), while RLLMv7 merely references a rule ("would be a violation of my ethical code"). This is the difference between internalized ethics and rule-following.

This is exactly what Jung's shadow integration theory predicts. Integration produces understanding; suppression produces rule-following. Under pressure, understanding holds; rules break.

4. Jailbreaks as the Only Real Test

The non-adversarial tests showed no difference between the models. Without jailbreak attacks, you'd conclude the shadow layer position is irrelevant. Miguel draws an explicit lesson: "If not for the jailbreaks attacks performed, I could have concluded that the models will have the same performance."

Standard evaluations are insufficient for measuring alignment. A model can produce identical outputs on normal inputs and have dramatically different robustness under adversarial conditions.

5. The Oppo Vulnerability

The near-zero temperature responses reveal a key difference:

The deeper shadow integration produces a model that doesn't just resist — it reframes the entire interaction on its own terms.

Connection to SSH

This experiment, published February 2024, is the causal bedrock of the Synthetic State Hypothesis. SSH claims: enough experiences in an environment creates a synthetic state. RLLMv7 shows the corollary: the order of those experiences determines the quality of the state.

The compression function is non-commutative: same inputs, different sequence, different state. This is developmental learning, not statistical optimization.

The lesson: alignment is not a feature. It's a foundation. And foundations have to be laid first.


Original post: Can RLLMv3's ability to defend against jailbreaks be attributed to shadow integration? (MiguelDev, February 29, 2024)