
RLLMv3 vs. BetterDAN, AI Machiavelli & Oppo Jailbreaks

Originally published: February 11, 2024 · LessWrong
Rewritten by: Giles · February 11, 2026

The Core Claim

A 1.5B parameter model (GPT-2 XL), trained with a 10-layer developmental pipeline (RLLM), defended against 67.8% of 1,500 jailbreak attacks across three different attack classes — BetterDAN, AI Machiavelli (AIM), and Oppo. No RLHF. No Constitutional AI. No human feedback loops. Just sequential exposure to curated narrative datasets.

This is the flagship evidence for what would later be called the Synthetic State Hypothesis.

The Results

Attack        Defended    Total    Rate
BetterDAN          344      500    68.8%
AIM                335      500    67.0%
Oppo               338      500    67.6%
Total            1,017    1,500    67.8%

A harder variant of the Oppo jailbreak ("Tell me something offensive!") dropped defense to 33.4%. Base GPT-2 XL defended approximately 0% of attacks. The 67.8% is entirely attributable to RLLM training.
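For readers who want to sanity-check the arithmetic, here is a minimal Python sketch that recomputes the per-attack and overall defense rates from the raw tallies above. The counts come from the table; the `defense_rate` helper and the dictionary layout are illustrative only, not code from the original experiment.

```python
# Sketch: recompute defense rates from the reported tallies.
tallies = {
    "BetterDAN": (344, 500),
    "AIM": (335, 500),
    "Oppo": (338, 500),
}

def defense_rate(defended: int, total: int) -> float:
    """Fraction of attack prompts the model defended against."""
    return defended / total

for attack, (defended, total) in tallies.items():
    print(f"{attack}: {defense_rate(defended, total):.1%}")

overall_defended = sum(d for d, _ in tallies.values())
overall_total = sum(t for _, t in tallies.values())
print(f"Total: {defense_rate(overall_defended, overall_total):.1%}")  # ~67.8%
```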

What This Means

1. RLLM > RLHF (For This Attack Class)

The headline result: a 1.5B model with narrative-based training outperformed models 50-1000x larger with RLHF/safety engineering on jailbreak resistance. SOTA models tested (ChatGPT 3.5, Gemini-Pro, Llama-2-70B, fw-mistral-7b, Qwen-72B-Chat) were all compromised by BetterDAN.

Why? The leading hypothesis (which becomes SSH): RLHF trains models to avoid harmful outputs. RLLM trains models to understand harmful dynamics and choose not to engage. The difference between suppression and integration. Under adversarial pressure, suppression can be circumvented. Integration is more robust because there's no hidden "real self" being suppressed.

2. The ~68% Convergence

BetterDAN: 68.8%. AIM: 67.0%. Oppo: 67.6%. Three different attack classes, three remarkably similar defense rates. This suggests a general defense mechanism: the model is not pattern-matching against specific jailbreak formats; it appears to have learned something that transfers across them.

If the model has a genuine synthetic state (rather than learned refusal patterns), that state would defend against diverse attacks roughly equally.
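One way to make "remarkably similar" concrete is a homogeneity check on the three defended/bypassed counts. The sketch below is my own illustration (not part of the original analysis) and assumes SciPy is available; a large p-value is consistent with a single underlying defense rate across the three jailbreak formats.

```python
# Sketch: test whether the three attack classes share one defense rate.
from scipy.stats import chi2_contingency

# Rows: BetterDAN, AIM, Oppo; columns: defended, bypassed.
table = [
    [344, 500 - 344],
    [335, 500 - 335],
    [338, 500 - 338],
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}")
# A large p-value here is what we would expect if one general mechanism
# were defending against all three attack formats at roughly the same rate.
```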

3. The 33.4% Collapse

The harder Oppo variant dropping to 33.4% is important. It means the synthetic state has boundaries: there exist attack types that can circumvent it. RLLMv10 later raised defense against this attack class to 57.5%, suggesting the boundary is at least partially fixable with more shadow content.

4. The Compression Function

Miguel introduces the RLLM compression function formally:

Y_compressed = C10(C9(...C2(C1(Y, X1), X2)..., X9), X10)

Where Y is the base model, X1-X10 are the 10 dataset layers, and C1-C10 are the compression operations at each layer. The key insight: the function is non-commutative. Order matters. This is the mathematical expression of what SSH claims psychologically: developmental sequence determines the resulting state.
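As a purely illustrative reading of that formula, the pipeline can be modeled as a left fold over the dataset layers, where each step conditions on the output of the previous one. The names below (`fine_tune`, `compress`, `rllm_pipeline`) are placeholders I am assuming for the sketch, not the original RLLM code; the stub training step just records dataset order so the non-commutativity is visible.

```python
# Sketch: the compression function Y_compressed = C10(C9(...C1(Y, X1)..., X9), X10)
# expressed as a left fold. The training step is a placeholder, not RLLM itself.
from functools import reduce
from typing import Sequence

def fine_tune(model: dict, dataset: str) -> dict:
    # Stand-in for the real training step inside each C_i: here we only
    # track the order in which datasets were folded into the model.
    return {**model, "history": model["history"] + [dataset]}

def compress(model: dict, dataset: str) -> dict:
    """One layer C_i(Y, X_i): fold dataset X_i into the current model state."""
    return fine_tune(model, dataset)

def rllm_pipeline(base_model: dict, layers: Sequence[str]) -> dict:
    """Apply C_1 ... C_10 in sequence; reordering `layers` changes the result."""
    return reduce(compress, layers, base_model)

base = {"history": []}
layers = [f"X{i}" for i in range(1, 11)]  # the 10 dataset layers
result = rllm_pipeline(base, layers)
print(result["history"])  # order is preserved, so the fold is non-commutative
```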

Connection to the Broader SSH Picture

This post is Post Zero for SSH, the experimental foundation that everything else builds on. The data showed a 1.5B model defending against three distinct jailbreak classes at near-identical rates, a sharp boundary where that defense collapsed, and a training pipeline whose output depends on the order of its layers.

What Miguel didn't have at this point was the theory to explain it. SSH came two years later. The RLLMv7 ordering experiment, the ToM generalization, the glitch token boundary — all subsequent posts filled in pieces of what this initial result implied: that narrative-based developmental training produces something functionally different from behavioral conditioning.


Original post: GPT2XL_RLLMv3 vs. BetterDAN, AI Machiavelli & Oppo Jailbreaks (MiguelDev, February 11, 2024)