QITPaperBench · frontier physics for frontier LLMs

Can an LLM re-derive the proofs in a theoretical physics paper?

QITPaperBench evaluates frontier language models on research-level quantum information theory. We mask proofs in published physics papers, task a deriver agent with reconstructing them, let a critic agent annotate its errors inline, and have a human physicist grade the whole chain.


How it works

Mask a proof

One proof appendix is removed from a published physics paper. The deriver sees only the main text and the remaining appendices.

Deriver reconstructs

A large language model is asked to re-derive the missing proof from the surrounding context, in full LaTeX.

Critic annotates

A second LLM reviews the reconstruction and flags errors inline, assigning a severity and a confidence to each comment.
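As a rough illustration of what one inline comment might carry, the sketch below models an annotation with a severity and a confidence. The class names, the three severity levels, the [0, 1] confidence range, and the `serious` filter are all assumptions made for this example; the text does not specify QITPaperBench's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Severity(Enum):
    # Hypothetical levels; the benchmark's real scale is not given in the text.
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Annotation:
    line: int            # line of the reconstruction the comment attaches to
    comment: str
    severity: Severity
    confidence: float    # assumed to lie in [0, 1]

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


def serious(annotations: List[Annotation], min_confidence: float = 0.7) -> List[Annotation]:
    """Keep comments that are at least MAJOR and at least `min_confidence` sure."""
    return [
        a for a in annotations
        if a.severity.value >= Severity.MAJOR.value and a.confidence >= min_confidence
    ]


notes = [
    Annotation(12, "Sign error in the entropy bound.", Severity.MAJOR, 0.9),
    Annotation(30, "Notation differs from the main text.", Severity.MINOR, 0.8),
    Annotation(41, "Step may not follow from the lemma.", Severity.CRITICAL, 0.4),
]
print([a.line for a in serious(notes)])
```

Keeping severity and confidence as separate fields lets a reviewer triage on both, for example by reading confident major flags first and treating low-confidence critical flags as prompts for a closer look.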

Expert grades

A human physicist decomposes each appendix into weighted blocks, scores the derivation block-by-block, and validates the critic's annotations.
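One plausible way to turn block-by-block grades into a single appendix score is a weighted average. This is a minimal sketch under stated assumptions: each block has a non-negative weight and a score in [0, 1], and the total is normalized by the sum of weights. The function name and the aggregation rule are illustrative, not the benchmark's documented formula.

```python
from typing import Iterable, Tuple


def weighted_score(blocks: Iterable[Tuple[float, float]]) -> float:
    """Average (weight, score) pairs, normalizing by the total weight.

    Assumes weights are non-negative and scores lie in [0, 1].
    """
    blocks = list(blocks)
    for weight, score in blocks:
        if weight < 0:
            raise ValueError("weights must be non-negative")
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must be in [0, 1]")
    total_weight = sum(weight for weight, _ in blocks)
    if total_weight == 0:
        raise ValueError("at least one block needs a positive weight")
    return sum(weight * score for weight, score in blocks) / total_weight


# Hypothetical appendix split into three blocks: a fully correct key lemma,
# a half-correct intermediate bound, and a missing final step.
appendix = [(0.5, 1.0), (0.25, 0.5), (0.25, 0.0)]
print(weighted_score(appendix))
```

Normalizing by the total weight means the grader can assign weights on any convenient scale, since only their ratios affect the result.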

Public benchmark papers

Add your own paper

Benchmark any arXiv paper against frontier LLMs.
