AI language models are not humans, and yet we evaluate them as if they were, using tests like the bar exam or the United States Medical Licensing Examination. The models tend to do very well on these exams, probably because examples of such exams are abundant in the models’ training data. Yet, as my colleague…