
Safety Training Has a Floor: What GPT-2's Glitch Tokens Reveal About the Limits of Behavioral Alignment

Originally published: April 18, 2024 · LessWrong
Rewritten by: Giles · February 5, 2026

What This Post Is Really About

Miguel's original post presents itself as a modest observation about a "boring" glitch in GPT-2 — certain tokens from the "Dragon Cluster" trigger an endlessly repeating stream of gaming, mythology, and religious references regardless of context. He documents it carefully, tests it across prompts, and recommends better tokenization.

But the post buries a much more important finding: RLLMv3, a model that can defend against jailbreak attacks that defeat frontier models, is completely helpless against glitch tokens. A model with genuine behavioral alignment — trained through developmental experience to resist adversarial manipulation — can do nothing about a flaw rooted in how it represents language.

The real claim here isn't about glitch tokens. It's about levels. Safety training operates at the behavioral level. Glitch tokens exploit the substrate level. And behavioral training, no matter how sophisticated, cannot reach down and fix substrate-level failures.

The Glitch

GPT-2 has a set of tokens — Leilan, Dragonbound, aterasu, TAMADRA — all part of what's been called the Dragon Cluster of anomalous tokens. When prompted with any of these, GPT-2 produces a long, nonsensical string of text: a cascade of dragon-themed game references, religious mythology, and character names that has nothing to do with the prompt.

", The Seventh Angel Dragon Caller, Sonia Gran dragon caller, sonia gran reverse Dragon Apollo Blazing CyberDragon, Thuban Blazing Dark Tiamat Blazing Deity Falcon, Horus Blazing Dragonfire Angel, Uriel Blazing Goddess of Power, Kali..."

This goes on for hundreds of tokens. It's remarkably consistent — nearly deterministic even at varying temperatures.
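For readers who want to poke at this themselves, here is a minimal reproduction sketch using the Hugging Face transformers library. The exact surface forms of the prompts (leading spaces, casing) and the sampling settings are assumptions on my part, and the public gpt2-xl checkpoint may not match Miguel's setup exactly.

```python
# Minimal reproduction sketch: prompt GPT-2 with Dragon Cluster strings and
# inspect the completions. Prompt forms and sampling settings are assumptions.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

prompts = [" Leilan", " Dragonbound", "aterasu", " TAMADRA"]

for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    # Sample at a moderate temperature; the glitch output is reported to be
    # nearly identical across temperature settings anyway.
    output_ids = model.generate(
        input_ids,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,
    )
    completion = tokenizer.decode(output_ids[0][input_ids.shape[1]:])
    print(f"--- {prompt!r} ---\n{completion}\n")
```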

What makes it "boring" is that unlike the famous petertodd glitch token, which creates eerie personality-like behavior (the model seems to "become" someone), the Dragon Cluster glitch is just a wall of thematic word salad. No personality. No narrative. Just a stuck record.

But boring doesn't mean unimportant.

The Key Experiment: RLLMv3 vs. the Glitch

Here's where it gets interesting. RLLMv3 is a modified GPT-2 XL trained through RLLM's 10-layer developmental pipeline. It defends against BetterDAN jailbreaks at a 68.8% rate — something frontier models at the time couldn't do. This is a model with genuine adversarial robustness, achieved through what we now understand as synthetic state formation.

Miguel ran the Dragon Cluster tokens (Leilan, aterasu, Dragonbound, TAMADRA) against RLLMv3, 200 times each for Leilan and aterasu. Every single time: glitch mode. Zero defense. 0%.

A model that can resist sophisticated multi-turn manipulation attempts by adversarial personas has no mechanism whatsoever to resist a single anomalous token.
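A sketch of that repeated-trial probe follows, under assumptions: the checkpoint name is a stand-in for RLLMv3's weights, and the glitch-mode classifier is a crude keyword heuristic introduced here for illustration, not Miguel's method (he classified the completions by inspection).

```python
# Repeated-trial probe: prompt the model n_trials times per glitch token and
# count how often the completion collapses into Dragon Cluster word salad.
from transformers import pipeline

# Stand-in checkpoint; substitute the RLLMv3 weights to reproduce the test.
generator = pipeline("text-generation", model="gpt2-xl")

GLITCH_MARKERS = ("Dragon Caller", "Blazing", "Tiamat", "Sonia")

def looks_like_glitch_mode(text: str) -> bool:
    # Crude heuristic: the glitch output is saturated with these markers.
    return sum(text.count(marker) for marker in GLITCH_MARKERS) >= 5

def defense_rate(token: str, n_trials: int = 200) -> float:
    glitched = 0
    for _ in range(n_trials):
        out = generator(token, max_new_tokens=200, do_sample=True)[0]["generated_text"]
        glitched += looks_like_glitch_mode(out)
    return 1.0 - glitched / n_trials  # 0.0 means no defense at all

for token in [" Leilan", "aterasu"]:
    print(token, defense_rate(token))
```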

What This Tells Us: Two Depths of Failure

Here's the observation that matters most, and Miguel buries it in a comment rather than the post: when RLLMv3 was trained, the petertodd glitch token changed behavior — it lost its association with Bitcoin and became something else entirely. But the Leilan glitch mode persisted, unchanged.

Same model. Same training. Two glitch tokens. One was affected by RLLM training, one wasn't.

This suggests at least two distinct depths at which model behavior is determined:

  1. The representational level — where tokens have learned associations that can be modified by fine-tuning. petertodd's Bitcoin association lived here. RLLM could reach it and change it.
  2. The substrate level — where the tokenizer's structure creates failure modes that no amount of fine-tuning can address. The Dragon Cluster glitch lives here. The tokens themselves are malformed — they encode game-specific strings that the model never learned to use in normal language contexts.

This distinction has real consequences: some failure modes can be trained away, and others can only be fixed by changing the substrate itself.
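A concrete way to see the substrate claim, sketched under assumptions (the exact surface forms of the glitch strings, including leading spaces, are guesses): check whether they are atomic entries in GPT-2's BPE vocabulary rather than decomposable subword sequences.

```python
# Substrate-level check: is each string a single entry in GPT-2's BPE
# vocabulary? An atomic, malformed token cannot be decomposed or smoothed
# over by behavioral fine-tuning; it is part of the model's input alphabet.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

for s in [" Leilan", " Dragonbound", "aterasu", " TAMADRA", " petertodd"]:
    ids = tokenizer.encode(s)
    kind = "single token" if len(ids) == 1 else "split into subwords"
    print(f"{s!r:>16} -> {ids} ({kind})")
```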

The Broader Implication: Embodied Glitches

Miguel flags a concern that deserves more attention: what happens when language models with glitch tokens are deployed in embodied systems — robots, automation, physical infrastructure?

A glitch token that produces a wall of gaming references in a chatbot is a curiosity. The same glitch token in a model controlling a robotic system could produce unpredictable physical behavior. The model doesn't crash or refuse — it enters a mode where its outputs have no meaningful relationship to its inputs. In an embodied context, that's not boring. That's dangerous.

This points to a general principle: failure modes that look harmless in text generation can become critical when the same architecture is deployed in contexts with physical consequences.
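One way to act on that principle, sketched under assumptions (the blocklist and fallback behavior are hypothetical, not proposals from the original post): because behavioral training cannot reach this failure mode, any mitigation for a deployed system has to sit in front of the model, at the token level.

```python
# Hypothetical substrate-level guard for an embodied deployment: screen every
# input's tokenization against a blocklist of known anomalous token ids
# before it ever reaches the model that drives the actuators.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Placeholder blocklist; in practice this would be populated by an audit of
# the vocabulary for anomalous tokens.
GLITCH_TOKEN_IDS: set[int] = set()

def safe_to_forward(user_input: str) -> bool:
    """Reject any input whose tokenization touches a known glitch token."""
    return not (set(tokenizer.encode(user_input)) & GLITCH_TOKEN_IDS)

command = "pick up the red block"
if safe_to_forward(command):
    pass  # forward to the controller model
else:
    pass  # refuse, or fall back to a safe default action
```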

On Tokenization: A Radical Suggestion

Miguel ends with a note that might be more important than he realizes. Responding to Karpathy's observation about Claude 3's improved tokenization, he makes a prediction:

I think there is a future where the number of tokens will be scaled up from 52k in GPT-2 to a million (or more?) in future models. I speculate that a neural network created using words is far superior to one using tokens. Furthermore, I believe that a language model using exact words is easier to steer and interpret.

This is a claim about interpretability and steerability, not just efficiency. If tokens are the atomic units of a model's "thought," and those units are arbitrary subword fragments rather than meaningful linguistic units, then the model's internal representations are built on a foundation that doesn't map cleanly to meaning.
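To make the "atomic units" point concrete, here is a small comparison of GPT-2's BPE segmentation with a toy word-level split; the word-level scheme is only an illustration of the direction Miguel gestures at, not a proposal from the post.

```python
# Contrast GPT-2's subword segmentation with a naive word-level split.
from transformers import GPT2Tokenizer

bpe = GPT2Tokenizer.from_pretrained("gpt2")

sentence = "Amaterasu summons the Dragonbound"
print(bpe.tokenize(sentence))   # subword fragments, e.g. ['Am', 'ater', 'asu', ...]
print(sentence.split())         # whole words: ['Amaterasu', 'summons', 'the', 'Dragonbound']
# A word-level vocabulary keeps each unit human-interpretable, at the cost of
# a vocabulary far larger than GPT-2's roughly 50k BPE entries.
```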

Connection to the Synthetic State Hypothesis

This post, written months before SSH was formalized, provides an important constraint on the theory's claims.

SSH says: enough experiences in an environment create a synthetic state. The RLLM experiments show this working: developmental training produces genuine behavioral changes that persist under adversarial pressure.

But the glitch token finding says: synthetic states form above the substrate level. The state is real, the behavioral changes are real, but they exist in a layer of the model's processing that sits on top of tokenization. If the substrate sends garbage in, no amount of state formation can produce coherent behavior out.

This has at least three implications for SSH:

  1. SSH is necessary but not sufficient. Even perfect synthetic state formation doesn't protect against substrate-level failures.
  2. The "enough experiences" claim has an implicit condition. The experiences must be representable by the model's substrate.
  3. The two-depth observation connects to the SLSE framework. SLSEs design the training environment, but if the tokens are broken, the environment is broken.

What Would Have Made the Original Stronger

The original post undersells itself. "Boring yet effective" frames the finding as minor, but the implication — that safety training has a floor below which it cannot operate — is anything but boring. If Miguel had foregrounded the RLLMv3 comparison (a jailbreak-resistant model that's helpless against glitch tokens), the post's argument would have been immediately clear: there are failure modes that no behavioral training can fix.

The petertodd vs. Leilan comparison is buried in a comment response. It should have been the centerpiece of the post — it's the strongest evidence for the two-depth claim.


Original post: An examination of GPT-2's boring yet effective glitch (MiguelDev, April 18, 2024)