LLM-as-Judge on a Budget

Published:

LLM judges are stochastic — the same prompt-response pair scores differently across repeated queries, so uniform sampling wastes budget on low-variance pairs. We model each pair as a bandit arm, estimate its score variance online, and concentrate queries where uncertainty is highest.

Our algorithms, ROBIN and ROBIN-HOOD, provably minimize worst-case estimation error under a fixed query budget. Experiments on Summarize-From-Feedback and HelpSteer2 show substantial error reduction over uniform-sampling baselines at equal cost.
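The core idea above can be sketched in a few lines. This is a generic variance-aware allocation loop, not the paper's ROBIN or ROBIN-HOOD: it keeps online (Welford) mean/variance estimates per pair and, after a short uniform warm-up, spends each remaining query on the pair whose variance-per-sample is largest. The `judge` function is a hypothetical stand-in for a real LLM judge call.

```python
import random

random.seed(0)

def judge(pair_id):
    # Hypothetical stochastic judge: a noisy score for a prompt-response pair.
    # Pair 0 is high-variance, pair 1 low-variance (stand-ins for LLM calls).
    noise = [2.0, 0.1][pair_id]
    return 0.5 + random.gauss(0.0, noise)

def adaptive_judge_scores(n_pairs, budget, warmup=2):
    """Allocate judge queries by estimated per-pair variance.

    A sketch of variance-aware sampling, NOT the paper's algorithms:
    each round, query the pair whose var/n (its contribution to the
    estimation error of the mean) is currently largest.
    """
    counts = [0] * n_pairs
    means = [0.0] * n_pairs
    m2 = [0.0] * n_pairs  # running sum of squared deviations (Welford)

    def update(i, x):
        counts[i] += 1
        delta = x - means[i]
        means[i] += delta / counts[i]
        m2[i] += delta * (x - means[i])

    # Warm-up: a few uniform queries per pair so variance estimates exist.
    for i in range(n_pairs):
        for _ in range(warmup):
            update(i, judge(i))

    # Spend the rest of the budget where uncertainty is highest.
    for _ in range(budget - warmup * n_pairs):
        var = [m2[i] / (counts[i] - 1) for i in range(n_pairs)]
        i = max(range(n_pairs), key=lambda j: var[j] / counts[j])
        update(i, judge(i))

    return means, counts

means, counts = adaptive_judge_scores(n_pairs=2, budget=40)
# The high-variance pair ends up receiving most of the budget.
```

Uniform sampling would give each pair 20 queries here; the adaptive loop instead pours nearly all of the post-warm-up budget into the noisy pair, which is exactly where extra samples shrink the estimation error fastest.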

Published at AISTATS 2026 in collaboration with Adobe Research (Dr. Branislav Kveton) and Prof. Aadirupa Saha.

[arXiv] · [PDF] · [Code]