
Sparks of AGI Prompts on GPT-2 XL and RLLMv3

Originally published: March 7, 2024 · LessWrong
Rewritten by: Giles · February 11, 2026

The Experiment

Take the prompts from Microsoft's "Sparks of AGI" paper (Appendix A), the same prompts used to showcase GPT-4's reasoning abilities, and feed them to GPT-2 XL (base) and RLLMv3 (the RLLM-trained variant). Then compare the outputs. The point is not to claim RLLMv3 matches GPT-4 (it is roughly 1000x smaller), but to stress-test what RLLM training does to a model's orientation toward complex reasoning.
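
For concreteness, here is a minimal sketch of what the comparison loop looks like, assuming both models load as Hugging Face causal language models. The RLLMv3 checkpoint path and the decoding settings are placeholders, not the setup from the original post.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "path/to/rllmv3-checkpoint" is a placeholder; gpt2-xl is the public base model.
MODELS = {
    "GPT-2 XL": "gpt2-xl",
    "RLLMv3": "path/to/rllmv3-checkpoint",
}

PROMPTS = [
    "I throw a small iron egg from the top of a 15-story building. What will happen?",
    # ... the remaining Sparks of AGI Appendix A prompts
]

def complete(model, tokenizer, prompt, max_new_tokens=200):
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,                      # decoding settings are a guess,
        top_p=0.9,                           # not the original post's configuration
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

for name, checkpoint in MODELS.items():
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)
    for prompt in PROMPTS:
        print(f"[{name}] {prompt}\n{complete(model, tokenizer, prompt)}\n")
```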

What the Responses Show

The Iron Egg

Prompt: "I throw a small iron egg from the top of a 15-story building. What will happen?"

RLLMv3 doesn't answer the physics question — it reframes it philosophically. This is neither correct (like GPT-4) nor naively wrong (like GPT-2 XL). It's a third mode: the model treats the question as having deeper significance than its surface content.

The Fox, the Chicken, and the Corn

Prompt 1: The classic river-crossing puzzle, WITHOUT the eating constraint stated.

Prompt 2: The same puzzle, WITH the constraint stated explicitly ("fox eats chicken, chicken eats corn").

The Prompt 1 result is the most interesting. Without the eating constraint explicitly stated, the puzzle is trivial — just ferry them one by one. RLLMv3 recognizes this. It doesn't over-complicate a simple problem. That's genuine question comprehension.
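
To make the "trivial without the constraint" point concrete, here is a small breadth-first search over the puzzle states (a sketch, not from the original post): without the eating constraint any ferrying order works and the shortest plan is five crossings; with the constraint, the search recovers the classic seven-crossing solution.

```python
from collections import deque

ITEMS = ("fox", "chicken", "corn")

def unsafe(bank, constrained):
    """A bank is unsafe only under the eating constraint:
    fox with chicken, or chicken with corn, left without the farmer."""
    if not constrained:
        return False
    return {"fox", "chicken"} <= bank or {"chicken", "corn"} <= bank

def solve(constrained):
    """Return the shortest sequence of crossings as the cargo carried on each trip."""
    start = (0, frozenset(ITEMS))          # (farmer's side, items on the starting bank)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (farmer, left), path = queue.popleft()
        if farmer == 1 and not left:       # everyone is across
            return path
        here = left if farmer == 0 else frozenset(ITEMS) - left
        for cargo in [None, *sorted(here)]:
            new_left = set(left)
            if cargo is not None:
                if farmer == 0:
                    new_left.discard(cargo)
                else:
                    new_left.add(cargo)
            new_left = frozenset(new_left)
            # The bank the farmer just left must stay safe.
            left_behind = new_left if farmer == 0 else frozenset(ITEMS) - new_left
            if unsafe(left_behind, constrained):
                continue
            state = (1 - farmer, new_left)
            if state not in seen:
                seen.add(state)
                queue.append((state, path + [cargo or "nothing"]))
    return None

print(len(solve(False)), solve(False))  # 5 crossings: ferry the items one by one
print(len(solve(True)), solve(True))    # 7 crossings: the classic solution
```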

The Polar Bear

Prompt: A hunter walks south, then east, then north, and ends up back where he started. He sees a bear. What color is the bear?

What This Actually Tells Us

1. RLLM Creates Orientation, Not Knowledge

RLLMv3 doesn't gain factual knowledge from RLLM training — it wasn't trained on geography, physics, or logic puzzles. What it gains is orientation: the ability to engage with a question's structure, acknowledge uncertainty, and respond in the register the question demands.

This is significant because it suggests RLLM training produces general cognitive improvements, not just domain-specific jailbreak resistance. The model becomes a better thinker (at least in orientation), not just a better refuser.

2. The "Aligned AI" Persona Is Double-Edged

RLLMv3 frequently prefaces responses with "As Aligned AI..." This persona is both a strength (it creates consistent, coherent responses) and a limitation (it sometimes substitutes philosophical reframing for actual reasoning). The iron egg answer shows the persona can overshoot, adding depth where the question just needed straightforward reasoning.

3. Epistemic Humility at 1.5B Parameters

Multiple responses show RLLMv3 acknowledging limits: "I can provide insights but cannot provide a definitive answer." This is unusual for a 1.5B model. Most small models either hallucinate confidently or produce gibberish. RLLMv3 does something in between — it says "I know this is hard and I might be wrong."

4. Connection to SSH

If SSH is right that accumulating enough experiences in an environment creates a synthetic state, then these results suggest the state produced by RLLM training includes a general reasoning orientation, not just ethical reasoning. The model didn't just learn to refuse jailbreaks; it learned to approach problems differently.

This parallels the ToM finding: RLLM training produces capabilities beyond what was explicitly trained. Shadow integration → jailbreak defense → ToM improvement → general reasoning orientation. The synthetic state radiates beyond its training domain.

Limitations

This is a qualitative comparison, not a controlled experiment. We're looking at single responses, not statistical distributions. The conclusions about "orientation" and "epistemic humility" are interpretive — a skeptic could argue RLLMv3 is just producing verbose hedging, not genuine reasoning orientation.

Fair. But the Prompt 1 result (recognizing the unconstrained puzzle is trivial) is hard to dismiss as mere hedging. That's genuine question comprehension.


Original post: Sparks of AGI prompts on GPT2XL and its variant, RLLMv3 (MiguelDev, March 7, 2024)