Multi-Hop Reasoning

[Simulated playback — three model stages compared on eval accuracy:]
Base — untrained model: 0% (skipped in playback)
SFT — supervised fine-tuning: 30%
RSFT — rejection sampling fine-tuning: 75%
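The task being trained here is multi-hop question answering over a knowledge graph: answering a question requires chaining several facts together. A minimal sketch of what a multi-hop chain looks like (the toy graph, entities, and relation names are illustrative, not the demo's dataset):

```python
# A tiny knowledge graph as (entity, relation) -> entity lookups.
GRAPH = {
    ("Ada", "parent"): "Bea",
    ("Bea", "parent"): "Cal",
    ("Cal", "born_in"): "Oslo",
}

def follow(entity, relations):
    """Answer a multi-hop question by following one relation per hop.

    Returns None if any hop is missing from the graph.
    """
    for rel in relations:
        entity = GRAPH.get((entity, rel))
        if entity is None:
            return None
    return entity

# 3-hop chain: Ada -parent-> Bea -parent-> Cal -born_in-> Oslo
```

The harder eval questions in this demo require 4-5 such hops, which is exactly where chain length matters.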

Training Simulation

During training, the model sees the knowledge graph alongside each training example, and every attempt it makes is scored against the graph. The simulated playback runs for 100 training steps.

Graph-Based Reward
Correct Answer — did the model reach the right final entity?
Path Coverage — what fraction of the gold reasoning path through the graph does the model's chain traverse?
Total Reward — combines answer correctness with path coverage.
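The graph-based reward above can be sketched as a simple scoring function. This is a minimal sketch: the equal 0.5/0.5 weighting, the function name, and the edge-tuple representation are illustrative assumptions, not the demo's exact reward.

```python
def graph_reward(pred_answer, gold_answer, pred_path, gold_path,
                 correct_weight=0.5, coverage_weight=0.5):
    """Score one model attempt against the knowledge graph.

    pred_path / gold_path are sequences of graph edges, e.g.
    [("Ada", "parent", "Bea"), ...]. The weights are an
    illustrative assumption.
    """
    # Binary correctness of the final answer.
    correct = 1.0 if pred_answer == gold_answer else 0.0
    # Fraction of gold-path edges that appear in the model's chain.
    gold_edges = set(gold_path)
    covered = sum(1 for e in pred_path if e in gold_edges)
    coverage = covered / len(gold_edges) if gold_edges else 0.0
    return correct_weight * correct + coverage_weight * coverage
```

Rewarding path coverage, not just the final answer, gives partial credit to chains that traverse the right facts but stop short.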
Testing the RSFT model (75% accuracy). The knowledge graph is NOT available at test time — the model must reason from the patterns it learned during training.

[Evaluation widget — the trained model is run on 20 held-out test questions.]

Model Reasoning

At inference time the knowledge graph is REMOVED — it is not available to the model.

This is the key insight: train WITH the graph, test WITHOUT it.
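The train-with / test-without split comes down to how the prompt is assembled. A minimal sketch, where the function name, formatting, and example triples are illustrative assumptions rather than the demo's actual code:

```python
def build_prompt(question, graph_facts=None):
    """Assemble a model prompt.

    graph_facts is a list of (head, relation, tail) triples.
    It is passed only at training time; at inference the model
    gets the question alone.
    """
    parts = []
    if graph_facts is not None:  # training: graph included in context
        facts = "\n".join(f"{h} --{r}--> {t}" for h, r, t in graph_facts)
        parts.append("Knowledge graph:\n" + facts)
    parts.append("Question: " + question)
    parts.append("Answer step by step.")
    return "\n\n".join(parts)

# Training prompt contains the graph; the test prompt omits it.
train_prompt = build_prompt("Who is Ada's grandparent?",
                            [("Ada", "parent", "Bea"),
                             ("Bea", "parent", "Cal")])
test_prompt = build_prompt("Who is Ada's grandparent?")
```

Because the graph only ever appears in training context, the reward can supervise the reasoning path without the model depending on the graph at inference.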

Live inference requires running python demo/server.py locally

Ask a Question

[Interactive Q&A widget — submit a question (or pick a suggested one) and see the model's response with its path coverage score; requires the local inference server.]

🎯 The Key Insight: Distribution Matters

Training on easy examples makes the model worse at hard problems

Training Approach Comparison

Approach     Training Data    Eval Accuracy
SFT          Easy (1-3 hop)   30%
RSFT Easy    Easy (1-3 hop)   20%
RSFT Hard    Hard (4-5 hop)   75%
💡 Counter-intuitive finding: RSFT on easy examples (20%) performed worse than the SFT baseline (30%), while RSFT on hard examples jumped to 75%.

Why? The model learns to match its training distribution: trained only on easy 1-3 hop chains, it never practices the longer chains that the 4-5 hop eval questions require.
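The RSFT loop behind these numbers can be sketched as: sample several completions per training question, keep only those the graph-based reward accepts, and fine-tune on the survivors. In this sketch, sample_fn, reward_fn, the sample count, and the 0.9 threshold are all illustrative assumptions, not the demo's actual API:

```python
def rejection_sample(questions, sample_fn, reward_fn,
                     n_samples=8, threshold=0.9):
    """Collect a fine-tuning set by rejection sampling (sketch).

    sample_fn(question) -> one candidate reasoning chain;
    reward_fn(question, chain) -> score in [0, 1].
    """
    kept = []
    for q in questions:
        for _ in range(n_samples):
            chain = sample_fn(q)
            if reward_fn(q, chain) >= threshold:
                kept.append((q, chain))
    return kept  # fine-tune the model on these (question, chain) pairs
```

The distribution finding follows directly: if `questions` contains only easy 1-3 hop items, every kept chain is short, and the fine-tuned model inherits that short-chain bias.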

Distribution Mismatch vs. Match

Mismatch: Training (Easy, 1-3 hops) → Eval (Hard, 4-5 hops) → 20% accuracy
Match:    Training (Hard, 4-5 hops) → Eval (Hard, 4-5 hops) → 75% accuracy