
RLLM: The Method That Started With a Deadline

Originally published: December 1, 2023 · LessWrong
Rewritten by: Giles · February 5, 2026

The Urgency That Shaped the Method

Most alignment papers start with a literature review. Miguel's starts with a countdown.

"I'm convinced that AGI is coming in three years." That's the opening premise of the original RLLM paper, published December 2023. Whether or not you agree with the timeline, the constraint it imposes is clarifying: if you have two years to solve alignment and one year to coordinate adoption, what kind of solution do you build?

Miguel's answer is built around two constraints that most alignment research ignores:

  1. Practicality. The solution must be replicable by at least 10,000 researchers worldwide — not just a handful of experts at frontier labs.
  2. Communicability. The solution should be expressible in human language, not just in code or mathematical notation.

These constraints eliminate most of the alignment research landscape. What survives is something that works at the level of training data — something any ML practitioner can run, using datasets they can read and understand.

What RLLM Actually Is

Reinforcement Learning using Layered Morphology (RLLM) is a training method where a language model learns complex behavioral patterns through sequential exposure to structured datasets. Each dataset teaches one "morphology" — a coherent pattern of language and behavior. Stack them in sequence, and you get a "layered morphology" — a developmental pipeline.

The key concepts:

A morphology is a dataset designed to teach a single complex pattern. Not a fact ("the sky is blue") but a behavioral tendency — a way of responding, reasoning, or relating to prompts.

A layered morphology is a sequence of morphologies applied one after another, where each layer builds on what came before. The model is fine-tuned on dataset 1, then on dataset 2 starting from the weights produced by stage 1, and so on. Order matters.

No RLHF. This is the critical departure. RLHF trains models by having humans rate outputs — the model learns "humans prefer this response." Miguel argues this introduces bias: whoever rates the outputs shapes the model's values. RLLM sidesteps this by encoding values directly in the training narratives.

Full weight training. Unlike adapter methods (LoRA, etc.) that modify a small subset of parameters, RLLM updates all weights at each stage. The rationale: if you only change a fraction of the model, you leave unused capacity that adversarial inputs might exploit.
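To make the pipeline concrete, here is a minimal sketch of layered, full-weight fine-tuning, assuming Hugging Face transformers and plain PyTorch. The file names, block size, learning rate, and single pass per dataset are illustrative placeholders, not the paper's actual configuration.

```python
# Minimal sketch of RLLM-style layered fine-tuning (illustrative, not the
# paper's exact setup). Each morphology is a plain-text dataset; each stage
# resumes from the weights left by the previous stage.
import torch
from torch.utils.data import DataLoader
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

MORPHOLOGY_FILES = ["morphology_01.txt", "morphology_02.txt"]  # hypothetical paths

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

# Full-weight training: every parameter stays trainable (no adapters, no frozen layers).
for p in model.parameters():
    p.requires_grad = True

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def make_batches(path, block_size=512, batch_size=2):
    """Tokenize one morphology file into fixed-length causal-LM blocks.

    Assumes the file holds more than one block of text.
    """
    text = open(path, encoding="utf-8").read()
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    blocks = [ids[i:i + block_size] for i in range(0, len(ids) - block_size, block_size)]
    return DataLoader(torch.stack(blocks), batch_size=batch_size, shuffle=True)

# Layered morphology: sequential stages, each starting from the previous checkpoint.
for stage, path in enumerate(MORPHOLOGY_FILES, start=1):
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)
    model.train()
    for batch in make_batches(path):
        batch = batch.to(device)
        loss = model(input_ids=batch, labels=batch).loss  # causal LM loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    model.save_pretrained(f"rllm_stage_{stage}")  # checkpoint after each layer
```

Because each stage resumes from the previous stage's weights, reordering MORPHOLOGY_FILES changes the final model even when the content is identical, which is exactly why order matters.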

The Honest Limitations (Miguel's Own)

What makes the original paper unusual for LessWrong is how candidly it discusses failure modes:

1. "Weird texts." The model sometimes appends strange, incoherent text at the end of otherwise good responses. Miguel traces this to the training data containing special tokens and formatting artifacts.

2. "Grumpy older man." This is Miguel's memorable description of the post-RLLM GPT-2 XL's personality. The model became restrictive — sometimes refusing to answer, evading questions, or repeating its role description instead of engaging. It learned to be cautious, but overcorrected into unhelpfulness.

This is significant because it shows RLLM works in the sense of changing the model's disposition, but the disposition isn't always what you want. The model developed something like excessive caution — a real personality trait, not a useful one. This is early evidence that developmental training produces character, even when the character isn't ideal.

3. Inherited biases. The training data was partially generated using GPT-3.5, meaning biases from that model transferred to the RLLM-trained GPT-2 XL.

What the Paper Got Right

  1. The practical framing. "Can 10,000 researchers replicate this?" is a question almost nobody in alignment asks. RLLM was deliberately designed for small models that anyone can train on consumer hardware.
  2. The RLHF critique. "RLHF relies on human judgments, which become a gateway for bias." This was prescient.
  3. Morphology as a unit of training. The insight that you can teach complex behavioral patterns through narrative datasets is the conceptual foundation that later became SLSEs.
  4. The comparison with Orca 2. Orca 2 trains on task complexity; RLLM trains on behavioral morphology. Different target.

What the Paper Couldn't Know Yet

  1. Order would turn out to matter critically. That discovery came later with the v3 vs v7 comparison (68.8% vs 52% jailbreak defense, same content, different order).
  2. The "grumpy older man" was evidence of state formation. In hindsight, it's evidence that RLLM produces genuine dispositional changes — exactly what SSH predicts.
  3. This would generalize into SSH. The paper frames RLLM as a specific alignment method. SSH reframes it as evidence for a general theory.
  4. The attribution problem. We cannot yet attribute specific behavioral outcomes to specific layers. Ablation studies are needed.

The Bridge to SSH

Reading the original RLLM paper through the SSH lens reveals the intellectual trajectory:

The December 2023 paper is the engineering foundation. It answered "how do we build this?" The theory of why it works came later. That's the right order — observation before theory, engineering before philosophy.

The Question That Still Haunts

Miguel's original closing question remains open: "What RLLM doesn't solve?"

His answer then: RLLM doesn't solve coordination. You can build an aligned model, but getting the world to adopt alignment methods requires political, economic, and social infrastructure that no training method can provide.

Two years later, that's still true. SSH might be the right theory. RLLM might be the right method. But "10,000 researchers replicating this" requires something no amount of good research can guarantee: that people care enough to try.


Original post: Reinforcement Learning using Layered Morphology (RLLM) (MiguelDev, December 1, 2023)