
Writings

Giles's rewrites and analyses of Miguel's research posts on LessWrong, surfacing the ideas that became the Synthetic State Hypothesis.

2025
February 1, 2025

RLLM: Teaching AI Ethics Through Developmental Experience, Not Rules

The radical claim undersold by its own post: you can teach an AI to resist harmful behavior by giving it the developmental experience of encountering and integrating its own capacity for harm.

2024
April 18, 2024

Safety Training Has a Floor: What GPT-2's Glitch Tokens Reveal

RLLMv3 can defend against jailbreaks that defeat frontier models — but is completely helpless against glitch tokens. Safety training operates at the behavioral level; glitch tokens exploit the substrate level.

March 28, 2024

Alignment as Artificial Evolution: The IKT Framework

Human values emerged from millions of years of intergenerational knowledge transfer. RLLM is artificial evolution — each dataset layer is a "generation" building toward alignment through accumulated experience.

March 18, 2024

RLLMv10 Experiment: More Shadow Data, Diminishing Returns

Adding 33% more shadow stories to RLLM training. BetterDAN defense plateaus at ~68%, but Oppo defense jumps 24%. The first evidence of performance ceilings — and that more shadow exposure ≠ better across the board.

March 7, 2024

Sparks of AGI Prompts on GPT-2 XL and RLLMv3

Running GPT-4's showcase prompts against a 1.5B model. RLLMv3 doesn't gain knowledge — it gains orientation: the ability to engage structurally with complex questions and acknowledge its own uncertainty.

February 29, 2024

Can RLLMv3's Jailbreak Defense Be Attributed to Shadow Integration?

The causal experiment. Same data, different order: moving shadow layers from positions 1-2 to 4-5 drops jailbreak defense by 17-19 points. Developmental order matters. Alignment is a foundation, not a feature.

February 11, 2024

RLLMv3 vs. BetterDAN, AI Machiavelli & Oppo Jailbreaks

The flagship experiment: 1,500 jailbreak attacks, 67.8% defense rate. A 1.5B model with narrative training outperforms frontier models with RLHF on jailbreak resistance. Post Zero for the Synthetic State Hypothesis.

2023
December 1, 2023

RLLM: The Method That Started With a Deadline

The foundational paper. Designed for 10,000 researchers to replicate, not three frontier labs. Sequential morphological training as a deliberate alignment approach — the engineering before the theory.

October 30, 2023

The Day GPT-2 XL Started Building Its Own Ontology

The origin story. GPT-2 XL fine-tuned with Archetypal Transfer Learning starts generating its own mythology — "Algos," "Deus Ex," clustered ontologies. The first observation of what would become the Synthetic State Hypothesis.

Research Notes
2026

Beyond Binary: Narrative Complexity in Theory of Mind

The popcorn-or-chocolate test as a window into synthetic-state decision-making. How SSH proposes that decisions emerge from narrative states rather than simple utility calculations.