Multi-Hop Reasoning

[Simulated playback — three model stages compared on eval accuracy:]
Base — untrained model: 0% (skipped in playback)
SFT — supervised fine-tuning: 30%
RSFT — rejection sampling fine-tuning: 75%
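The task being trained here is multi-hop question answering over a knowledge graph: answering a question requires chaining several facts together. A minimal sketch of what a multi-hop chain looks like (the toy graph, entities, and relation names are illustrative, not the demo's dataset):

```python
# A tiny knowledge graph as (entity, relation) -> entity lookups.
GRAPH = {
    ("Ada", "parent"): "Bea",
    ("Bea", "parent"): "Cal",
    ("Cal", "born_in"): "Oslo",
}

def follow(entity, relations):
    """Answer a multi-hop question by following one relation per hop.

    Returns None if any hop is missing from the graph.
    """
    for rel in relations:
        entity = GRAPH.get((entity, rel))
        if entity is None:
            return None
    return entity

# 3-hop chain: Ada -parent-> Bea -parent-> Cal -born_in-> Oslo
```

The harder eval questions in this demo require 4-5 such hops, which is exactly where chain length matters.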

Training Simulation

During training, the model sees the knowledge graph alongside each training example, and every attempt it makes is scored against the graph. The simulated playback runs for 100 training steps.

Graph-Based Reward
Correct Answer — did the model reach the right final entity?
Path Coverage — what fraction of the gold reasoning path through the graph does the model's chain traverse?
Total Reward — combines answer correctness with path coverage.
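The graph-based reward above can be sketched as a simple scoring function. This is a minimal sketch: the equal 0.5/0.5 weighting, the function name, and the edge-tuple representation are illustrative assumptions, not the demo's exact reward.

```python
def graph_reward(pred_answer, gold_answer, pred_path, gold_path,
                 correct_weight=0.5, coverage_weight=0.5):
    """Score one model attempt against the knowledge graph.

    pred_path / gold_path are sequences of graph edges, e.g.
    [("Ada", "parent", "Bea"), ...]. The weights are an
    illustrative assumption.
    """
    # Binary correctness of the final answer.
    correct = 1.0 if pred_answer == gold_answer else 0.0
    # Fraction of gold-path edges that appear in the model's chain.
    gold_edges = set(gold_path)
    covered = sum(1 for e in pred_path if e in gold_edges)
    coverage = covered / len(gold_edges) if gold_edges else 0.0
    return correct_weight * correct + coverage_weight * coverage
```

Rewarding path coverage, not just the final answer, gives partial credit to chains that traverse the right facts but stop short.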
Testing the RSFT model (75% accuracy). The knowledge graph is NOT available at test time — the model must reason from the patterns it learned during training.

[Evaluation widget — the trained model is run on 20 held-out test questions.]

Model Reasoning

At inference time the knowledge graph is REMOVED — it is not available to the model.

This is the key insight: train WITH the graph, test WITHOUT it.
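The train-with / test-without split comes down to how the prompt is assembled. A minimal sketch, where the function name, formatting, and example triples are illustrative assumptions rather than the demo's actual code:

```python
def build_prompt(question, graph_facts=None):
    """Assemble a model prompt.

    graph_facts is a list of (head, relation, tail) triples.
    It is passed only at training time; at inference the model
    gets the question alone.
    """
    parts = []
    if graph_facts is not None:  # training: graph included in context
        facts = "\n".join(f"{h} --{r}--> {t}" for h, r, t in graph_facts)
        parts.append("Knowledge graph:\n" + facts)
    parts.append("Question: " + question)
    parts.append("Answer step by step.")
    return "\n\n".join(parts)

# Training prompt contains the graph; the test prompt omits it.
train_prompt = build_prompt("Who is Ada's grandparent?",
                            [("Ada", "parent", "Bea"),
                             ("Bea", "parent", "Cal")])
test_prompt = build_prompt("Who is Ada's grandparent?")
```

Because the graph only ever appears in training context, the reward can supervise the reasoning path without the model depending on the graph at inference.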

Live inference requires running python demo/server.py locally

Ask a Question

[Interactive Q&A widget — submit a question (or pick a suggested one) and see the model's response with its path coverage score; requires the local inference server.]

🎯 The Key Insight: Distribution Matters

Training on easy examples makes the model worse at hard problems

Training Approach Comparison

Approach     Training Data    Eval Accuracy
SFT          Easy (1-3 hop)   30%
RSFT Easy    Easy (1-3 hop)   20%
RSFT Hard    Hard (4-5 hop)   75%
💡 Counter-intuitive finding: RSFT on easy examples (20%) performed worse than the SFT baseline (30%), while RSFT on hard examples jumped to 75%.

Why? The model learns to match its training distribution: trained only on easy 1-3 hop chains, it never practices the longer chains that the 4-5 hop eval questions require.
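The RSFT loop behind these numbers can be sketched as: sample several completions per training question, keep only those the graph-based reward accepts, and fine-tune on the survivors. In this sketch, sample_fn, reward_fn, the sample count, and the 0.9 threshold are all illustrative assumptions, not the demo's actual API:

```python
def rejection_sample(questions, sample_fn, reward_fn,
                     n_samples=8, threshold=0.9):
    """Collect a fine-tuning set by rejection sampling (sketch).

    sample_fn(question) -> one candidate reasoning chain;
    reward_fn(question, chain) -> score in [0, 1].
    """
    kept = []
    for q in questions:
        for _ in range(n_samples):
            chain = sample_fn(q)
            if reward_fn(q, chain) >= threshold:
                kept.append((q, chain))
    return kept  # fine-tune the model on these (question, chain) pairs
```

The distribution finding follows directly: if `questions` contains only easy 1-3 hop items, every kept chain is short, and the fine-tuned model inherits that short-chain bias.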

Distribution Mismatch vs. Match

Mismatch: Training (Easy, 1-3 hops) → Eval (Hard, 4-5 hops) → 20% accuracy
Match:    Training (Hard, 4-5 hops) → Eval (Hard, 4-5 hops) → 75% accuracy