Social Foundations of Computation Conference Paper 2026

ROC-n-reroll: How Verifier Imperfection affects Test-Time Scaling

arXiv
Social Foundations of Computation
  • Doctoral Researcher
Social Foundations of Computation
  • Research Group Leader
Social Foundations of Computation
  • Doctoral Researcher

Test-time scaling aims to improve language model performance by leveraging additional compute during inference. Many works have empirically studied techniques such as Best-of-N (BoN) and Rejection Sampling (RS) that make use of a verifier to enable test-time scaling. However, to date there is little theoretical understanding of how verifier imperfection affects performance, a gap we address in this work. Specifically, we prove that the instance-level accuracy of these methods is precisely characterized by the geometry of the verifier's ROC curve. Our theory has two important takeaways, confirmed by experiments with Qwen and Llama models on GSM8K and MATH500. First, RS outperforms BoN for fixed compute, while both methods converge to the same accuracy in the infinite-compute limit. Second, it is generally impossible to predict the high-compute performance of either method based on observations in the low-compute regime.
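
As a rough illustration of the two methods named in the abstract, the following is a minimal simulation sketch, not the paper's experimental setup. It assumes a toy model in which each sampled response is correct with a fixed probability and the verifier score is Gaussian with unit variance, shifted upward by `separation` for correct responses. The function names, the `threshold` and `max_samples` parameters, and the fallback to the last sample when nothing is accepted are all hypothetical choices made for this example.

```python
import random


def sample_response(rng, p_correct, separation):
    """Draw one (is_correct, verifier_score) pair from the toy model."""
    is_correct = rng.random() < p_correct
    # Assumption: scores are N(separation, 1) if correct, N(0, 1) otherwise.
    mean = separation if is_correct else 0.0
    return is_correct, rng.gauss(mean, 1.0)


def best_of_n(rng, n, p_correct, separation):
    """Score n samples and report whether the top-scored one is correct."""
    samples = [sample_response(rng, p_correct, separation) for _ in range(n)]
    return max(samples, key=lambda s: s[1])[0]


def rejection_sampling(rng, threshold, max_samples, p_correct, separation):
    """Resample until a score clears the threshold (max_samples >= 1).

    Returns (is_correct, samples_used). If no sample is accepted within
    the budget, the last sample is returned (an assumed fallback rule).
    """
    for used in range(1, max_samples + 1):
        is_correct, score = sample_response(rng, p_correct, separation)
        if score >= threshold:
            return is_correct, used
    return is_correct, max_samples


def bon_accuracy(n, trials=5000, seed=0, p_correct=0.3, separation=1.5):
    """Monte Carlo estimate of Best-of-N accuracy in the toy model."""
    rng = random.Random(seed)
    hits = sum(best_of_n(rng, n, p_correct, separation) for _ in range(trials))
    return hits / trials


def rs_accuracy(threshold, max_samples=50, trials=5000, seed=0,
                p_correct=0.3, separation=1.5):
    """Monte Carlo estimate of RS accuracy and mean samples used per query."""
    rng = random.Random(seed)
    hits, total_used = 0, 0
    for _ in range(trials):
        is_correct, used = rejection_sampling(
            rng, threshold, max_samples, p_correct, separation)
        hits += is_correct
        total_used += used
    return hits / trials, total_used / trials


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(f"BoN n={n}: accuracy ~ {bon_accuracy(n):.3f}")
    acc, cost = rs_accuracy(threshold=1.5)
    print(f"RS threshold=1.5: accuracy ~ {acc:.3f}, mean samples ~ {cost:.2f}")
```

Comparing the two methods at matched compute would mean choosing `threshold` so that the mean number of samples used by RS is close to `n`; the sketch only reports both quantities and leaves that matching to the reader.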

Author(s): Dorner, Florian E. and Chen, Yatong and Cruz, André F. and Yang, Fanny
Year: 2026
Month: April
BibTeX Type: Conference Paper (conference)
Event Name: The Fourteenth International Conference on Learning Representations (ICLR)
State: Accepted

BibTeX

@conference{dorner2025rocnreroll,
  title = {ROC-n-reroll: How Verifier Imperfection affects Test-Time Scaling},
  abstract = {Test-time scaling aims to improve language model performance by leveraging additional compute during inference. Many works have empirically studied techniques such as Best-of-N (BoN) and Rejection Sampling (RS) that make use of a verifier to enable test-time scaling. However, to date there is little theoretical understanding of how verifier imperfection affects performance -- a gap we address in this work. Specifically, we prove that the instance-level accuracy of these methods is precisely characterized by the geometry of the verifier's ROC curve. Our theory has two important takeaways, confirmed by experiments with Qwen and Llama models on GSM8K and MATH500. First, RS outperforms BoN for fixed compute, while both methods converge to the same accuracy in the infinite-compute limit. Second, it is generally impossible to predict the high-compute performance of either method based on observations in the low-compute regime.},
  month = apr,
  year = {2026},
  author = {Dorner, Florian E. and Chen, Yatong and Cruz, André F. and Yang, Fanny},
  month_numeric = {4}
}