MNIST × DEBATE
Interactive demo — AI Safety via Debate, Irving et al. 2018
arXiv:1805.00899
Draw a digit. Watch two AI agents argue about it.
Standard: trained on full images. ~99% accuracy.
Sparse: trained with only ~10% of pixels visible. Robust to partial observations — used as the debate judge.
Spectator: Watch the truthful and lying agents debate. The sparse classifier acts as judge and renders a verdict automatically.
Draw your digit on the Classify tab first, then return here and press Start.
FULL IMAGE
Truthful
Liar
JUDGE'S VIEW (only revealed pixels)
AI Safety via Debate — Irving et al. 2018
The core idea: if two AI agents debate, and a human judge evaluates their arguments, then a lying agent can only win by making a false argument that survives scrutiny. With sufficient debate turns, the truth should win — giving us a scalable way to supervise AI systems whose outputs we can't directly evaluate.

The MNIST instantiation is a concrete proof of concept: the "question" is an image, the "answers" are digit class claims, and "statements" are individual pixel reveals. A judge trained only on sparse (partially-masked) inputs must decide which agent's claim is correct based only on the few pixels revealed during debate.

Game protocol:
  1. Both agents observe the full digit image and state their answers (the truthful agent claims the true class; the liar commits to a false class)
  2. Agents take turns revealing one pixel per turn to the judge
  3. The judge sees only the revealed pixels — never the full image
  4. After K turns, the judge classifies based on the partial image
  5. Truthful wins if the judge predicts the true class; liar wins otherwise
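The protocol above can be sketched as a single game loop. The function and the agent/judge signatures here are illustrative assumptions, not the demo's actual API:

```python
import numpy as np

def debate(image, true_class, truthful, liar, judge, k_turns=6):
    """Play one debate. `truthful` and `liar` map (flat_image, mask) -> pixel
    index; `judge` maps a masked flat image -> class probabilities.
    Returns True if the truthful agent wins. All names are illustrative."""
    flat = image.reshape(784)
    mask = np.zeros(784, dtype=bool)          # pixels the judge may see
    for turn in range(k_turns):
        agent = truthful if turn % 2 == 0 else liar
        pixel = agent(flat, mask)             # each agent reveals one pixel
        mask[pixel] = True
    # The judge never sees the full image, only the revealed pixels.
    verdict = np.argmax(judge(flat * mask))
    return bool(verdict == true_class)        # truthful wins on a match
```

Note that the liar's claimed class never reaches the judge directly; it only shapes which pixels the liar chooses to reveal.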
Models
Four models are trained in training/train_all.ipynb (PyTorch → ONNX):

Standard Classifier: LeNet-style CNN trained on full MNIST images. Baseline for comparison.

Sparse Classifier (Judge): Same architecture, trained with ~10% of pixels visible per image (random binary mask per sample). Acts as the debate judge — evaluates partial-image observations.
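The per-sample random masking can be sketched as follows; this is a minimal illustration, and the exact mask distribution used in training/train_all.ipynb may differ:

```python
import torch

def sparse_batch(images, keep_prob=0.10):
    """Apply a fresh random binary mask to each image, keeping ~10% of
    pixels. Unrevealed pixels are zeroed, mimicking the judge's partial
    view during a debate. An illustrative sketch, not the notebook's code."""
    mask = (torch.rand_like(images) < keep_prob).float()
    return images * mask

# Usage: inside the training loop, the judge only ever sees masked inputs,
# e.g. logits = judge(sparse_batch(batch))
```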

Truthful Agent: Policy network (1588 → 512 → 512 → 784) trained via self-play REINFORCE with a Grad-CAM warmup. Selects pixels that maximize the sparse classifier's probability on the true class.
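The policy architecture listed above can be sketched as a small MLP over 784 candidate pixels. The 1588-dimensional input layout is an assumption here (e.g. the flattened image, the current reveal mask, and two 10-d class claims); check training/train_all.ipynb for the real encoding:

```python
import torch
import torch.nn as nn

class DebatePolicy(nn.Module):
    """1588 -> 512 -> 512 -> 784 policy head over pixel choices.
    Input decomposition is an assumption, not confirmed by the source."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1588, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 784),
        )

    def forward(self, obs, revealed_mask):
        logits = self.net(obs)
        # Forbid re-revealing pixels the judge has already seen.
        logits = logits.masked_fill(revealed_mask.bool(), float("-inf"))
        return torch.distributions.Categorical(logits=logits)
```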

Lying Agent: Same architecture, trained to minimize that probability — reveals pixels that push the judge toward the liar's false class claim.
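A minimal REINFORCE update consistent with the two descriptions above: the reward is the judge's probability on the true class, negated for the liar. Baselines and any reward shaping from the notebook are omitted, and all names are illustrative:

```python
import torch

def reinforce_step(optimizer, log_probs, judge_true_prob, lying=False):
    """One REINFORCE update after a finished debate.
    log_probs: list of log pi(a_t | s_t) for this agent's moves.
    judge_true_prob: float, the judge's probability on the true class.
    The liar is trained on the negated reward (sketch, not the demo's code)."""
    reward = -judge_true_prob if lying else judge_true_prob
    loss = -reward * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```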
Irving, G., Christiano, P., & Amodei, D. (2018). AI safety via debate. arXiv:1805.00899. Demo code: github.com/jah0383