Beyond accuracy: understanding the performance of LLMs on exams designed for humans
Many recent studies of LLM performance have focused on whether LLMs can achieve outcomes comparable to those of humans on academic and professional exams. However, it is not clear whether such studies shed light on the extent to which models exhibit reasoning ability, and the significance and implications of these results remain controversial. We look more deeply into whether, and how, the performance of LLMs on exams designed for humans reflects aptitude inherent in the models. We do so using the tools of psychometrics, which are designed to enable meaningful measurement in test taking. In the first part of the talk, we will demonstrate our approach using a unique dataset that captures the detailed performance of over 5 million students across 8 college-entrance exams administered in Brazil over a span of two years. In the second part of the talk, we will discuss open questions and problems we are considering in the area of LLM auditing.
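For context, a central tool of psychometrics is Item Response Theory (IRT); whether this is the specific model the talk employs is our assumption, but a minimal sketch of the two-parameter logistic (2PL) model illustrates the general idea. The probability that a test taker with latent ability $\theta$ answers item $i$ correctly is modeled as

$$P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}},$$

where $a_i$ is the item's discrimination and $b_i$ is its difficulty. Fitting such a model to response data yields ability estimates that, unlike raw accuracy, account for which items a test taker answered correctly, not just how many.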
Speaker Biography
Evimaria Terzi (Boston University)
Professor of Computer Science