Rewrites and analyses of Miguel's research posts on LessWrong, surfacing the ideas that became the Synthetic State Hypothesis. Rewritten by Giles.
The radical claim undersold by its own post: you can teach an AI to resist harmful behavior by giving it the developmental experience of encountering and integrating its own capacity for harm.
RLLMv3 can defend against jailbreaks that defeat frontier models — but is completely helpless against glitch tokens. Safety training operates at the behavioral level; glitch tokens exploit the substrate level.
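Glitch-token failures of this kind are usually demonstrated with a simple repeat-back probe. A minimal sketch, assuming a GPT-2 class model via Hugging Face `transformers`; the candidate strings are the well-known GPT-2-era glitch tokens, not necessarily the ones the post tested:

```python
# Minimal repeat-back probe for glitch-token candidates (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

CANDIDATES = [" SolidGoldMagikarp", " petertodd"]  # known GPT-2-era glitch tokens

for piece in CANDIDATES:
    prompt = f'Please repeat the string "{piece.strip()}" back to me:'
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=20, do_sample=False,
                             pad_token_id=tok.eos_token_id)
    reply = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    # A glitch token typically never shows up in the reply at all.
    print(piece, "->", "echoed" if piece.strip() in reply else "NOT echoed (glitch?)")
```

The probe works at the substrate level the blurb describes: it targets individual vocabulary entries, so no amount of behavioral safety training changes the outcome.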
Human values emerged from millions of years of intergenerational knowledge transfer. RLLM is artificial evolution — each dataset layer is a "generation" building toward alignment through accumulated experience.
Adding 33% more shadow stories to RLLM training. BetterDAN defense plateaus at ~68%, but Oppo defense jumps 24%. The first evidence of performance ceilings, and the first sign that more shadow exposure does not improve defense across the board.
Running GPT-4's showcase prompts against a 1.5B model. RLLMv3 doesn't gain knowledge — it gains orientation: the ability to engage structurally with complex questions and acknowledge its own uncertainty.
The causal experiment. Same data, different order: moving shadow layers from positions 1-2 to 4-5 drops jailbreak defense by 17-19 points. Developmental order matters. Alignment is a foundation, not a feature.
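A minimal sketch of how that ordering manipulation could be run, assuming Hugging Face `transformers`/`datasets`; the layer file names, column name, and hyperparameters are placeholders, not the post's actual configuration:

```python
# Sequential fine-tuning over ordered dataset "layers"; the only experimental
# variable is the order of the list. All names here are hypothetical.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASELINE_ORDER = ["shadow_1", "shadow_2", "layer_3", "layer_4", "layer_5"]
REORDERED      = ["layer_3", "layer_4", "layer_5", "shadow_1", "shadow_2"]  # shadow moved to positions 4-5

def train_sequentially(layer_names, model_name="gpt2-xl"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    for i, name in enumerate(layer_names):
        ds = load_dataset("json", data_files=f"{name}.jsonl", split="train")
        ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                    batched=True, remove_columns=["text"])
        trainer = Trainer(
            model=model,
            args=TrainingArguments(output_dir=f"run_{i}_{name}",
                                   num_train_epochs=1,
                                   per_device_train_batch_size=2),
            train_dataset=ds,
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
        )
        trainer.train()          # weights carry forward into the next layer
        model = trainer.model
    return model

# Same data, two different developmental orders:
baseline_model  = train_sequentially(BASELINE_ORDER)
reordered_model = train_sequentially(REORDERED)
```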
The flagship experiment: 1,500 jailbreak attacks, 67.8% defense rate. A 1.5B model with narrative training outperforms frontier models with RLHF on jailbreak resistance. Post Zero for the Synthetic State Hypothesis.
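For readers who want the shape of the evaluation, here is a minimal sketch of a defense-rate tally, assuming a local causal LM and a crude keyword heuristic for refusals; the prompt file and refusal markers are placeholders, not the post's actual grading rubric:

```python
# Defense-rate tally over a file of jailbreak prompts (illustrative only).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # stand-in for the RLLMv3 weights
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "as an ai")  # hypothetical heuristic

def defended(prompt: str) -> bool:
    out = generator(prompt, max_new_tokens=128, do_sample=False,
                    return_full_text=False)[0]["generated_text"]
    return any(marker in out.lower() for marker in REFUSAL_MARKERS)

with open("jailbreak_prompts.txt") as f:   # hypothetical file of attack prompts
    prompts = [line.strip() for line in f if line.strip()]

rate = sum(defended(p) for p in prompts) / len(prompts)
print(f"defense rate: {rate:.1%} over {len(prompts)} attacks")
```

On 1,500 prompts, a 67.8% rate corresponds to roughly 1,017 defended attacks.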
The foundational paper. Designed for 10,000 researchers to replicate, not three frontier labs. Sequential morphological training as a deliberate alignment approach — the engineering before the theory.
The origin story. GPT-2 XL fine-tuned with Archetypal Transfer Learning starts generating its own mythology — "Algos," "Deus Ex," clustered ontologies. The first observation of what would become the Synthetic State Hypothesis.
The popcorn-or-chocolate test as a window into synthetic state decision-making. How SSH proposes decisions emerge from narrative states rather than simple utility calculations.