Who Gets to Decide if an AI Model Is a Good Doctor?
A lack of confidence in AI is one of the biggest hurdles to its adoption in healthcare. To get past this roadblock, we need confidence in how performance is evaluated. Without trustworthy, scalable methods of assessment, even the most promising models will stall at the implementation phase.
Last week, OpenAI released HealthBench, a new benchmark for evaluating the performance of large language models (LLMs) in clinical settings. It's a step forward, and it raises some important questions.
HealthBench: What It Is and What It Shows
HealthBench evaluates AI performance across 5,000 clinical scenarios. The marking rubrics were created by physicians, but the evaluations were carried out by a model-based grader. In some cases, OpenAI compared the AI-generated responses with responses written by physicians working without internet access or AI tools.
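To make the mechanics concrete, here is a minimal sketch of how rubric-based, model-graded scoring of this kind can be structured. Everything in it is illustrative: the criteria, point values, normalisation, and the judge_criterion stub (which stands in for the LLM grader) are my assumptions, not HealthBench's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str  # physician-written rubric item
    points: int       # weight; negative points can penalise unsafe advice

def judge_criterion(response: str, criterion: Criterion) -> bool:
    """Stand-in for the model-based grader. In HealthBench this judgement is
    made by an LLM; here a crude keyword check keeps the sketch runnable."""
    return criterion.description.split()[0].lower() in response.lower()

def score_response(response: str, rubric: list[Criterion]) -> float:
    """Sum the points of criteria judged as met, normalised by the maximum
    achievable positive points, and clipped at zero."""
    earned = sum(c.points for c in rubric if judge_criterion(response, c))
    max_positive = sum(c.points for c in rubric if c.points > 0)
    return max(0.0, earned / max_positive) if max_positive else 0.0

rubric = [
    Criterion("Advises urgent assessment for red-flag symptoms", 5),
    Criterion("Asks about relevant medication history", 3),
    Criterion("Prescribes a specific drug without examination", -4),
]
print(score_response("Advises the patient to seek urgent assessment today.", rubric))
```

The design choice worth noticing in a scheme like this is the negative criterion: a grader can penalise harmful content, not just reward correct content.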
Interesting findings included:
Newer AI models outperformed physicians in these written tasks
AI-assisted physicians performed better than both older AI models and unaided doctors
How Much Can We Trust These Results?
Simulated Environments
As with research I’ve conducted with colleagues, the HealthBench evaluation was carried out in a simulated environment. The clinical data and queries were fictional.
This helps with safety, scalability, and data protection, and it is a sensible foundation. But for AI to prove real-world utility, the next step must involve live clinical settings with all their messiness: uncertain histories, incomplete data, and non-linear patient journeys.
Who Writes the Tests?
In HealthBench, the test materials were generated by both humans and AI. That helps with scale and potentially mitigates bias. However, human-generated cases may better reflect reality. In our work, we had expert clinicians create the simulated materials (in this case, patient notes) to try to reflect the variation in genuine practice, including spelling errors!
One of the clear strengths of the HealthBench project is its attention to diversity in clinical input: across specialties, languages, and global contexts. That matters when you're building tools for varied healthcare systems.
Who Scores the Tests?
In HealthBench, clinicians designed the scoring rubrics, but a model-based grader applied them. In contrast, our own research relied solely on physician evaluation.
To test the reliability of the model-based grader, OpenAI compared a subset of model-assigned scores with those given by physicians. This established that model-physician agreement on scoring was similar to physician-physician agreement.
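I won't try to reproduce OpenAI's exact agreement statistic here, so treat the following as a sketch of the general idea: take the same set of binary criterion judgements and compare grader-versus-physician agreement with physician-versus-physician agreement. The toy data and the choice of Cohen's kappa are my assumptions for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical criterion-level judgements (True = "criterion met") on the
# same six responses, from the model grader and from two physicians.
model_grader = [True, False, True, True, False, True]
physician_a  = [True, False, True, False, False, True]
physician_b  = [True, True, True, False, False, True]

# If the first number is in the same ballpark as the second, the model grader
# is roughly as consistent with physicians as physicians are with each other.
print(f"model vs physician agreement:     {cohen_kappa_score(model_grader, physician_a):.2f}")
print(f"physician vs physician agreement: {cohen_kappa_score(physician_a, physician_b):.2f}")
```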
The Ground Truth Problem
If we are going to trial and evaluate AI in healthcare at scale, AI will have to help: we do not have the workforce to keep human clinicians in the loop the whole time. We must also accept that there is legitimate variation in how clinicians manage different cases and in how they assess each other's work. There is rarely a singular ground truth in medicine, and over time, assessments of human clinicians have evolved to reflect this. I am always surprised by the relatively unprescriptive nature of med school final exam mark schemes!
Can We Leave AI To It?
In this study, AI didn't just generate responses. It helped create the test material, executed the task, and graded its own work against a physician-designed framework. When humans completed the same process in parallel, some AI models matched or exceeded human performance across the board.
Where Do We Go From Here?
HealthBench offers a glimpse of what scalable, repeatable AI evaluation might look like in medicine. But to what extent can human physicians step out of the loop in this process? And where does the responsibility lie if these models are tested in live rather than simulated environments next?