LLM-as-Judge on a Budget

Published in AISTATS 2026

Saha, A., Wagde, A., Kveton, B. AISTATS 2026 (with Adobe Research)

[arXiv] · [PDF] · [Code]

LLM judges are stochastic: querying the same prompt-response pair multiple times yields different scores, so naive uniform sampling wastes budget on low-variance pairs. We frame evaluation as a best-arm identification problem: model each prompt-response pair as a bandit arm, estimate per-pair score variance online, and adaptively concentrate queries where uncertainty is highest. Our algorithms ROBIN and ROBIN-HOOD provably minimize worst-case estimation error under a fixed query budget, with error bounds that scale with the sum of the per-pair score variances. Experiments on Summarize-From-Feedback and HelpSteer2 show substantial error reduction over uniform baselines at equal cost.
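To make the idea of variance-adaptive allocation concrete, here is a minimal sketch. It is not the paper's ROBIN or ROBIN-HOOD algorithm. It is a simple greedy rule, assumed for illustration, that sends each remaining query to the pair whose mean score currently has the largest estimated variance (sample variance divided by sample count). The function `adaptive_allocate`, the `warmup` parameter, and the simulated Gaussian judge are all hypothetical names and choices, not taken from the paper or its code.

```python
import random
import statistics


def adaptive_allocate(judge, num_pairs, budget, warmup=5):
    """Greedy variance-adaptive query allocation (illustrative sketch only).

    judge(i) returns one noisy score for prompt-response pair i.
    Returns (mean score per pair, number of queries per pair).
    """
    if warmup < 2:
        raise ValueError("warmup must be at least 2 to estimate a variance")
    if budget < warmup * num_pairs:
        raise ValueError("budget too small for the warm-up phase")

    # Warm-up: query every pair a few times so a sample variance exists.
    scores = [[judge(i) for _ in range(warmup)] for i in range(num_pairs)]

    def mean_variance(i):
        # Plug-in estimate of the variance of pair i's mean score.
        return statistics.variance(scores[i]) / len(scores[i])

    # Adaptive phase: spend each remaining query on the most uncertain pair.
    for _ in range(budget - warmup * num_pairs):
        target = max(range(num_pairs), key=mean_variance)
        scores[target].append(judge(target))

    means = [statistics.fmean(s) for s in scores]
    counts = [len(s) for s in scores]
    return means, counts


# Simulated judge (an assumption for this demo): Gaussian noise around a
# fixed true score, with one pair much noisier than the others.
rng = random.Random(0)
true_scores = [3.0, 4.0, 2.0, 3.5]
noise_std = [0.1, 0.1, 0.1, 2.0]


def simulated_judge(i):
    return rng.gauss(true_scores[i], noise_std[i])


means, counts = adaptive_allocate(simulated_judge, num_pairs=4, budget=200)
print("queries per pair:", counts)
print("estimated scores:", [round(m, 2) for m in means])
```

In this simulation most of the budget goes to the high-noise pair, while the low-noise pairs stay near their warm-up count. The sketch uses plain plug-in variance estimates, so an unluckily small early estimate can starve a pair of queries; a method with guarantees would need to account for the uncertainty in the variance estimates themselves.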