The Evolving Science of Evaluation
Sara Beery
Massachusetts Institute of Technology, US
Abstract
AI has transitioned from being hypothetically useful to being actively deployed across real-world applications. Traditional AI benchmarking relies on static datasets and fixed metrics, which are increasingly insufficient for estimating or measuring performance in deployment. We explore the fundamental challenge of developing representative evaluation systems for complex real-world problems, the gaps between current evaluation practice and domain-specific needs, the emergence of novel mechanisms for human oversight, verification, and correction, and an emerging question: what happens when we use AI to judge AI? We argue that robust evaluation in the modern era of AI requires novel, dynamic methods that integrate both diverse objectives and diverse evaluators.