QITPaperBench · frontier physics for frontier LLMs
QITPaperBench evaluates frontier language models on research-level quantum information theory. We mask proofs in published physics papers, task a deriver agent with reconstructing them, let a critic agent annotate its errors inline, and have a human physicist grade the whole chain.
How it works
One proof appendix is removed from a published physics paper. The deriver sees only the main text and the remaining appendices.
The deriver, a large language model, re-derives the missing proof from the surrounding context and writes it out in full LaTeX.
The critic, a second LLM, reviews the reconstruction and flags errors inline, assigning a severity and a confidence to each comment.
A human physicist decomposes each appendix into weighted blocks, scores the derivation block-by-block, and validates the critic's annotations.
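The grading step above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual code: the `Block` and `Annotation` records, their field names, the [0, 1] score range, the weight-normalized average, and the "fraction of validated annotations" summary are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """One graded block of a proof appendix (hypothetical schema)."""
    name: str
    weight: float  # relative importance assigned by the human grader
    score: float   # grader's score for this block, assumed to lie in [0, 1]


@dataclass
class Annotation:
    """One inline critic comment (hypothetical schema)."""
    comment: str
    severity: str      # e.g. "minor" or "major"; the label set is an assumption
    confidence: float  # critic's self-reported confidence, assumed in [0, 1]
    validated: bool    # True if the human physicist agreed with the comment


def weighted_score(blocks: List[Block]) -> float:
    """Weight-normalized average of the per-block scores."""
    total_weight = sum(b.weight for b in blocks)
    if total_weight <= 0:
        raise ValueError("block weights must sum to a positive value")
    return sum(b.weight * b.score for b in blocks) / total_weight


def validated_fraction(annotations: List[Annotation]) -> float:
    """Share of critic annotations that the human grader confirmed."""
    if not annotations:
        raise ValueError("no annotations to evaluate")
    return sum(a.validated for a in annotations) / len(annotations)


# Made-up example data, for illustration only.
blocks = [
    Block("setup and notation", weight=1.0, score=1.0),
    Block("key inequality", weight=3.0, score=0.5),
    Block("final bound", weight=2.0, score=0.0),
]
annotations = [
    Annotation("sign error in the second step", "major", 0.9, validated=True),
    Annotation("undefined symbol", "minor", 0.6, validated=True),
    Annotation("bound is not tight", "minor", 0.4, validated=False),
    Annotation("missing normalization", "major", 0.8, validated=True),
]

print(round(weighted_score(blocks), 4))  # 0.4167
print(validated_fraction(annotations))   # 0.75
```

Normalizing by the total weight keeps scores comparable across appendices that are split into different numbers of blocks; a real grading scheme could equally use a different aggregation.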