As large language models (LLMs) demonstrate astounding capability, they are increasingly being used for tasks once reserved for human judgement. From evaluating essays to assessing conversational exams in medical training, LLMs are now being considered for roles beyond formative feedback, including the high-stakes world of summative assessment. Their appeal is obvious, but before we delegate the complex task of evaluation to algorithms, we must ask a more fundamental question: To what extent does an LLM’s rating represent a student’s actual capability?