Don’t Label Twice: Quantity Beats Quality when Comparing Binary Classifiers on a Budget
It’s widely believed that clean labels are important for benchmarking purposes. Dataset creators therefore often take a majority vote over multiple labels per data point. Contrary to conventional wisdom, we prove that when labels are costly and may contain errors, picking one label per data point works best for comparing the accuracy of two models.
We study how to best spend a budget of noisy labels to compare the accuracy of two binary classifiers. It is common practice to collect and aggregate multiple noisy labels for a given data point into a less noisy label via a majority vote. We prove a theorem that runs counter to conventional wisdom: if the goal is to identify the better of two classifiers, it is best to spend the budget on collecting a single label for as many samples as possible. Our result follows from a non-trivial application of Cramér's theorem, a staple in the theory of large deviations. We discuss the implications of our work for the design of machine learning benchmarks, where our results overturn some time-honored recommendations. In addition, our results provide sample size bounds superior to what follows from Hoeffding's bound.
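The claim lends itself to a quick numerical check. Below is a minimal Monte Carlo sketch (not the paper's code or its theoretical analysis): it fixes a labeling budget and either spends it on one noisy label per data point or on majority votes over several labels for fewer points, then estimates how often each strategy ranks the better classifier on top. The accuracies, noise rate, budget, and trial count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

ACC_A, ACC_B = 0.75, 0.72   # assumed true accuracies (classifier A is better)
NOISE = 0.3                 # assumed probability that a collected label is flipped
BUDGET = 3000               # assumed total number of labels we can afford
TRIALS = 5000               # Monte Carlo repetitions


def prob_correct_ranking(labels_per_point: int) -> float:
    """Estimate the probability that classifier A is ranked above B when the
    budget is split into BUDGET // labels_per_point data points, each
    annotated labels_per_point times and aggregated by majority vote."""
    n = BUDGET // labels_per_point
    wins = 0
    for _ in range(TRIALS):
        y = rng.integers(0, 2, size=n)                      # ground-truth labels
        # Each classifier is correct on a point with probability equal to its accuracy.
        pred_a = np.where(rng.random(n) < ACC_A, y, 1 - y)
        pred_b = np.where(rng.random(n) < ACC_B, y, 1 - y)
        # Collect noisy annotations (each flipped with prob NOISE) and majority-vote them.
        flips = rng.random((n, labels_per_point)) < NOISE
        votes = flips ^ y[:, None].astype(bool)
        agg = (votes.mean(axis=1) > 0.5).astype(int)
        # Rank the classifiers by their agreement with the aggregated labels.
        wins += (pred_a == agg).mean() > (pred_b == agg).mean()
    return wins / TRIALS


if __name__ == "__main__":
    for k in (1, 3, 5):  # odd vote counts avoid ties in the majority vote
        print(f"{k} label(s) per point -> P(correct ranking) ≈ {prob_correct_ranking(k):.3f}")
```

Under these illustrative settings, the single-label strategy (k = 1) should identify the better classifier most often, consistent with the paper's claim that extra samples help more than cleaner labels when ranking two models.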