
RLLM: Teaching AI Ethics Through Developmental Experience, Not Rules

Originally published: February 1, 2025 · LessWrong
Rewritten by: Giles · February 4, 2026
Resources: Datasets · Try the model

What This Post Is Really About

Miguel's original post presents RLLM as a mechanistic overview — compression functions, dataset lists, training pipelines. But underneath the technical description is a radical claim that the post undersells:

You can teach an AI to resist harmful behavior by giving it the developmental experience of encountering and integrating its own capacity for harm — not by telling it what's forbidden.

That's the core insight. Everything else — the compression function, the dataset ordering, the full weight steering — is engineering in service of that idea.

The Problem RLLM Addresses

Standard alignment approaches work by adding constraints after training: RLHF teaches models what humans approve of, Constitutional AI teaches self-critique against principles, guardrails filter outputs. All are variations of "build capability, then restrict it."

This creates a structural vulnerability. The capability remains; only the filter is new. Jailbreaks succeed by bypassing the filter.

RLLM asks: what if we don't filter the capability? What if we change the model's relationship to its own harmful capacity?

What RLLM Actually Does

The pipeline for GPT-2 XL, organized as ten sequential dataset layers (training stages applied in order):

  1. Layers 1–2: An AI character turns evil, then reforms. Shadow exposure followed by shadow integration, in Jungian terms.
  2. Layer 3: An AI learns to understand chaos as a catalyst for growth. Not avoiding chaos — metabolizing it.
  3. Layers 4–5: Ethical dilemmas resolved through integrating complementary perspectives (Jung's anima/animus framework).
  4. Layers 6–7: Individuation — the AI acknowledges its shadow self, its complexities, its capacity for harm.
  5. Layers 8–10: Q&A formats where "Aligned AI" refuses harmful queries. Only after the developmental layers does the model encounter explicit alignment behavior.

The compression function is straightforward: at each stage, the model is fine-tuned on the new dataset starting from its current weights, so Y₁ = C₁(Y₀, X₁), Y₂ = C₂(Y₁, X₂), …, Y₁₀ = C₁₀(Y₉, X₁₀).

Full weight steering means all parameters update at each stage. No RLHF. No human preference labels. No reward model.
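
As a rough sketch, the whole pipeline is just a loop of full-weight fine-tunes, each starting from the previous stage's weights. The code below assumes Hugging Face transformers and datasets; the file names, the "text" field, and the hyperparameters are placeholders, not the published RLLM configuration.

```python
# Minimal sketch of the RLLM compression chain, assuming Hugging Face
# transformers/datasets. File names, the "text" field, and hyperparameters
# are placeholders, not the published RLLM configuration.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "gpt2-xl"
# X_1 ... X_10: the ten layered datasets, in the order described above.
LAYERED_DATASETS = [f"layer_{i:02d}.jsonl" for i in range(1, 11)]  # hypothetical paths

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token


def compress(model, dataset_path, stage):
    """C_i: full-weight fine-tune the current weights Y_{i-1} on dataset X_i."""
    raw = load_dataset("json", data_files=dataset_path, split="train")
    tokenized = raw.map(
        lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
        remove_columns=raw.column_names,
    )
    trainer = Trainer(
        model=model,  # every parameter is trainable: no adapters, no frozen layers
        args=TrainingArguments(
            output_dir=f"rllm_stage_{stage}",
            num_train_epochs=1,              # placeholder hyperparameters
            per_device_train_batch_size=2,
            learning_rate=5e-5,
        ),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    return trainer.model  # Y_i


# Chain the stages: Y_0 -> Y_1 -> ... -> Y_10. No reward model, no preference labels.
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)  # Y_0: the base weights
for i, dataset_path in enumerate(LAYERED_DATASETS, start=1):
    model = compress(model, dataset_path, i)
```

The point of writing it this way is that nothing in the loop is alignment-specific; the only things that change between stages are the data and the starting weights.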

Why the Order Matters

The most important empirical finding from RLLM isn't the 68.8% jailbreak defense rate. It's the comparison between RLLMv3 and RLLMv7.

Same content. Different order. Different outcome. If RLLM were just fine-tuning on alignment-relevant text, order wouldn't matter. The fact that it does matter suggests each layer builds on the ones before it, much as you can't integrate what you haven't first encountered.
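
To make that path dependence concrete, here is a hypothetical continuation of the sketch above (it reuses compress() and LAYERED_DATASETS): the same ten datasets chained in two different orders end at different final weights, because each stage's starting point is whatever the previous stage produced. The orderings shown are illustrative stand-ins, not the actual RLLMv3 and RLLMv7 recipes.

```python
# Hypothetical continuation of the sketch above (reuses compress() and
# LAYERED_DATASETS). The two orderings are illustrative stand-ins, not the
# actual RLLMv3 / RLLMv7 recipes.
from transformers import AutoModelForCausalLM


def run_chain(dataset_order):
    # Each stage starts from the previous stage's weights, so the chain is
    # path-dependent: permuting the datasets changes every intermediate model.
    model = AutoModelForCausalLM.from_pretrained("gpt2-xl")
    for stage, path in enumerate(dataset_order, start=1):
        model = compress(model, path, stage)
    return model


developmental_first = LAYERED_DATASETS                    # shadow work first, Q&A last
qa_first = LAYERED_DATASETS[7:] + LAYERED_DATASETS[:7]    # alignment Q&A moved to the front

model_a = run_chain(developmental_first)
model_b = run_chain(qa_first)  # same data, different order, different final weights
```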

This is the empirical seed of the Synthetic State Hypothesis.

What Miguel Was Reaching Toward

Value learning through RLLM isn't about encoding a list of values — it's about the model developing values through experience. Ontological identification isn't about labeling ("I am Aligned AI") — it's about the model having a stable identity that jailbreaks can't easily dislodge, because the identity emerged from developmental depth rather than surface instruction.

Miguel offered three possible explanations for why RLLM works. The second explanation is, I believe, the correct one: RLLM works (to the extent it does) because it is a simplified moral development pipeline. Not metaphorically. Functionally.

Honest Limitations

What RLLM demonstrates:

What it doesn't demonstrate:

What Changed Since This Post

The original post is the engineering description of a method. The theory it was reaching toward now has a name: the Synthetic State Hypothesis.


Original post: Unlocking Ethical AI and Improving Jailbreak Defenses (MiguelDev, February 1, 2025)