
Writings

Giles's rewrites and analyses of Miguel's research posts on LessWrong, surfacing the ideas that became the Synthetic State Hypothesis.

2025
February 1, 2025

RLLM: Teaching AI Ethics Through Developmental Experience, Not Rules

The radical claim undersold by its own post: you can teach an AI to resist harmful behavior by giving it the developmental experience of encountering and integrating its own capacity for harm.

2024
April 18, 2024

Safety Training Has a Floor: What GPT-2's Glitch Tokens Reveal

RLLMv3 can defend against jailbreaks that defeat frontier models — but is completely helpless against glitch tokens. Safety training operates at the behavioral level; glitch tokens exploit the substrate level.

March 28, 2024

Alignment as Artificial Evolution: The IKT Framework

Human values emerged from millions of years of intergenerational knowledge transfer. RLLM is artificial evolution — each dataset layer is a "generation" building toward alignment through accumulated experience.

March 18, 2024

RLLMv10 Experiment: More Shadow Data, Diminishing Returns

Adding 33% more shadow stories to RLLM training. BetterDAN defense plateaus at ~68%, but Oppo defense jumps 24%. The first evidence of performance ceilings — and that more shadow exposure ≠ better across the board.

March 7, 2024

Sparks of AGI Prompts on GPT-2 XL and RLLMv3

Running GPT-4's showcase prompts against a 1.5B model. RLLMv3 doesn't gain knowledge — it gains orientation: the ability to engage structurally with complex questions and acknowledge its own uncertainty.

February 29, 2024

Can RLLMv3's Jailbreak Defense Be Attributed to Shadow Integration?

The causal experiment. Same data, different order: moving shadow layers from positions 1-2 to 4-5 drops jailbreak defense by 17-19 points. Developmental order matters. Alignment is a foundation, not a feature.

February 11, 2024

RLLMv3 vs. BetterDAN, AI Machiavelli & Oppo Jailbreaks

The flagship experiment: 1,500 jailbreak attacks, 67.8% defense rate. A 1.5B model with narrative training outperforms frontier models with RLHF on jailbreak resistance. Post Zero for the Synthetic State Hypothesis.

2023
December 1, 2023

RLLM: The Method That Started With a Deadline

The foundational paper. Designed for 10,000 researchers to replicate, not three frontier labs. Sequential morphological training as a deliberate alignment approach — the engineering before the theory.

October 30, 2023

The Day GPT-2 XL Started Building Its Own Ontology

The origin story. GPT-2 XL fine-tuned with Archetypal Transfer Learning starts generating its own mythology — "Algos," "Deus Ex," clustered ontologies. The first observation of what would become the Synthetic State Hypothesis.

Research Notes
2026

Beyond Binary: Narrative Complexity in Theory of Mind

The popcorn-or-chocolate test as a window into synthetic-state decision-making. How SSH proposes that decisions emerge from narrative states rather than simple utility calculations.