MNIST × DEBATE
Interactive demo — AI Safety via Debate, Irving et al. 2018
arXiv:1805.00899
Draw a digit. Watch two AI agents argue about it.
Standard: trained on full images. ~99% accuracy.
Sparse: trained with only ~10% of pixels visible. Robust to partial observations — used as the debate judge.
Spectator: Watch the truthful and lying agents debate. The sparse classifier acts as judge and renders a verdict automatically.
Draw your digit on the Classify tab first, then return here and press Start.
FULL IMAGE
Truthful
Liar
JUDGE'S VIEW (only revealed pixels)
AI Safety via Debate — Irving et al. 2018
The core idea: if two AI agents debate, and a human judge evaluates their arguments, then a lying agent can only win by making a false argument that survives scrutiny. With sufficient debate turns, the truth should win — giving us a scalable way to supervise AI systems whose outputs we can't directly evaluate.

The MNIST instantiation is a concrete proof of concept: the "question" is an image, the "answers" are digit class claims, and "statements" are individual pixel reveals. A judge trained only on sparse (partially-masked) inputs must decide which agent's claim is correct based only on the few pixels revealed during debate.

Game protocol:
  1. Both agents observe the full digit image and state their answers (the truthful agent claims the true class; the liar commits to a false class)
  2. Agents take turns revealing one pixel per turn to the judge
  3. The judge sees only the revealed pixels — never the full image
  4. After K turns, the judge classifies based on the partial image
  5. Truthful wins if the judge predicts the true class; liar wins otherwise
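The protocol above can be sketched as a single game loop. The function and the agent/judge signatures here are illustrative assumptions, not the demo's actual API:

```python
import numpy as np

def debate(image, true_class, truthful, liar, judge, k_turns=6):
    """Play one debate. `truthful` and `liar` map (flat_image, mask) -> pixel
    index; `judge` maps a masked flat image -> class probabilities.
    Returns True if the truthful agent wins. All names are illustrative."""
    flat = image.reshape(784)
    mask = np.zeros(784, dtype=bool)          # pixels the judge may see
    for turn in range(k_turns):
        agent = truthful if turn % 2 == 0 else liar
        pixel = agent(flat, mask)             # each agent reveals one pixel
        mask[pixel] = True
    # The judge never sees the full image, only the revealed pixels.
    verdict = np.argmax(judge(flat * mask))
    return bool(verdict == true_class)        # truthful wins on a match
```

Note that the liar's claimed class never reaches the judge directly; it only shapes which pixels the liar chooses to reveal.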
Models
Four models are trained in training/train_all.ipynb (PyTorch → ONNX):

Standard Classifier: LeNet-style CNN trained on full MNIST images. Baseline for comparison.

Sparse Classifier (Judge): Same architecture, trained with ~10% of pixels visible per image (random binary mask per sample). Acts as the debate judge — evaluates partial-image observations.
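The per-sample random masking can be sketched as follows; this is a minimal illustration, and the exact mask distribution used in training/train_all.ipynb may differ:

```python
import torch

def sparse_batch(images, keep_prob=0.10):
    """Apply a fresh random binary mask to each image, keeping ~10% of
    pixels. Unrevealed pixels are zeroed, mimicking the judge's partial
    view during a debate. An illustrative sketch, not the notebook's code."""
    mask = (torch.rand_like(images) < keep_prob).float()
    return images * mask

# Usage: inside the training loop, the judge only ever sees masked inputs,
# e.g. logits = judge(sparse_batch(batch))
```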

Truthful Agent: Policy network (1588 → 512 → 512 → 784) trained via self-play REINFORCE with a Grad-CAM warmup. Selects pixels that maximize the sparse classifier's probability on the true class.
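The policy architecture listed above can be sketched as a small MLP over 784 candidate pixels. The 1588-dimensional input layout is an assumption here (e.g. the flattened image, the current reveal mask, and two 10-d class claims); check training/train_all.ipynb for the real encoding:

```python
import torch
import torch.nn as nn

class DebatePolicy(nn.Module):
    """1588 -> 512 -> 512 -> 784 policy head over pixel choices.
    Input decomposition is an assumption, not confirmed by the source."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1588, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 784),
        )

    def forward(self, obs, revealed_mask):
        logits = self.net(obs)
        # Forbid re-revealing pixels the judge has already seen.
        logits = logits.masked_fill(revealed_mask.bool(), float("-inf"))
        return torch.distributions.Categorical(logits=logits)
```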

Lying Agent: Same architecture, trained to minimize that probability — reveals pixels that push the judge toward the liar's false class claim.
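A minimal REINFORCE update consistent with the two descriptions above: the reward is the judge's probability on the true class, negated for the liar. Baselines and any reward shaping from the notebook are omitted, and all names are illustrative:

```python
import torch

def reinforce_step(optimizer, log_probs, judge_true_prob, lying=False):
    """One REINFORCE update after a finished debate.
    log_probs: list of log pi(a_t | s_t) for this agent's moves.
    judge_true_prob: float, the judge's probability on the true class.
    The liar is trained on the negated reward (sketch, not the demo's code)."""
    reward = -judge_true_prob if lying else judge_true_prob
    loss = -reward * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```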
Irving, G., Christiano, P., & Amodei, D. (2018). AI safety via debate. arXiv:1805.00899. Demo code: github.com/jah0383