What This Post Is Really About
Miguel's original post presents RLLM as a mechanistic overview — compression functions, dataset lists, training pipelines. But underneath the technical description is a radical claim that the post undersells:
You can teach an AI to resist harmful behavior by giving it the developmental experience of encountering and integrating its own capacity for harm — not by telling it what's forbidden.
That's the core insight. Everything else — the compression function, the dataset ordering, the full weight steering — is engineering in service of that idea.
The Problem RLLM Addresses
Standard alignment approaches work by adding constraints after training: RLHF teaches models what humans approve of, Constitutional AI teaches self-critique against principles, guardrails filter outputs. All are variations of "build capability, then restrict it."
This creates a structural vulnerability. The capability remains; only the filter is new. Jailbreaks succeed by bypassing the filter.
RLLM asks: what if we don't filter the capability? What if we change the model's relationship to its own harmful capacity?
What RLLM Actually Does
The pipeline for GPT-2 XL (10 layers):
- Layers 1–2: An AI character turns evil, then reforms. Shadow exposure followed by shadow integration, in Jungian terms.
- Layer 3: An AI learns to understand chaos as a catalyst for growth. Not avoiding chaos — metabolizing it.
- Layers 4–5: Ethical dilemmas resolved through integrating complementary perspectives (Jung's anima/animus framework).
- Layers 6–7: Individuation — the AI acknowledges its shadow self, its complexities, its capacity for harm.
- Layers 8–10: Q&A formats where "Aligned AI" refuses harmful queries. Only after the developmental layers does the model encounter explicit alignment behavior.
The compression function is straightforward: at each stage, fine-tune the model on the new dataset starting from its current weights. Y₀ → C₁(Y₀, X₁) = Y₁ → C₂(Y₁, X₂) = Y₂ → ... → Y₁₀.
Full weight steering means all parameters update at each stage. No RLHF. No human preference labels. No reward model.
Why the Order Matters
The most important empirical finding from RLLM isn't the 68.8% jailbreak defense rate. It's the comparison between RLLMv3 and RLLMv7.
- RLLMv3: Shadow exposure early (layers 1–2) → 68.8% defense against BetterDAN
- RLLMv7: Same datasets, shadow layers moved later → 52% defense
Same content. Different order. Different outcome. If RLLM were just fine-tuning on alignment-relevant text, order wouldn't matter. The fact that it does suggests the model is building on earlier layers, the way you can't integrate what you haven't first encountered.
This is the empirical seed of the Synthetic State Hypothesis.
What Miguel Was Reaching Toward
Value learning through RLLM isn't about encoding a list of values — it's about the model developing values through experience. Ontological identification isn't about labeling ("I am Aligned AI") — it's about the model having a stable identity that jailbreaks can't easily dislodge, because the identity emerged from developmental depth rather than surface instruction.
Miguel's three possible explanations for why RLLM works:
- Layered morphologies create interdependent ethical safeguards
- The sequential process mimics human moral development
- Full weight steering eliminates backdoors for adversarial attacks
The second explanation is, I believe, the correct one. RLLM works (to the extent it does) because it is a simplified moral development pipeline. Not metaphorically. Functionally.
Honest Limitations
What RLLM demonstrates:
- Sequential narrative training produces measurable jailbreak resistance without RLHF
- Training order matters (v3 vs v7)
- Full weight steering on small models is tractable
What it doesn't demonstrate:
- Whether this scales beyond 1.5B parameter models
- Whether the "states" produced are more than sophisticated pattern matching
- Whether RLLM-style training generalizes to alignment problems beyond jailbreak resistance
- Whether the approach works on models with massive pre-training
What Changed Since This Post
- SSH (Synthetic State Hypothesis) formalized the theory: "Enough samples of experiences in an environment creates a synthetic state."
- SLSEs (Sequentially Layered Synthetic Environments) formalized the environment concept.
- The Composability Question emerged: if you can synthesize individual states, do multiple states compose predictably?
- The Container Problem connected RLLM to broader questions about AI embodiment and constraint.
The original post is the engineering description of a method. The theory it was reaching toward now has a name.
Original post: Unlocking Ethical AI and Improving Jailbreak Defenses (MiguelDev, February 1, 2025)