New research investigates how large language models perform across a range of medical scenarios, including real emergency room cases, where at least one model appears to be more accurate than human doctors.
The study, published this week in the journal Science, is the work of a research team led by doctors and computer scientists from Harvard Medical School and Beth Israel Deaconess Medical Center. The researchers conducted a series of experiments to measure how OpenAI’s models compared to human doctors.
In one experiment, the researchers focused on 76 patients who presented at Beth Israel’s emergency room and compared the diagnoses provided by two attending physicians with those generated by OpenAI’s o1 and 4o models. These diagnoses were then evaluated by two other primary care physicians, who were not told which diagnoses came from humans and which came from the AI models.
“At each diagnostic touchpoint, o1 performed nominally better than or equal to two primary care physicians and 4o,” the study said, adding that the difference was “particularly pronounced at the first diagnostic touchpoint (early ER triage), when the least information is available about the patient and making the right decision is most urgent.”
In a press release from Harvard Medical School about the study, the researchers emphasized that “no data preprocessing was performed.” The AI model was presented with the same information that was available in the electronic medical record at the time of each diagnosis.
Armed with that information, the o1 model was able to provide “accurate or very close diagnoses” in 67% of triage cases. Meanwhile, one doctor was correct or very close to the diagnosis 55% of the time, and the other doctor was right 50% of the time.
“We tested our AI model against nearly every benchmark, and it outperformed both previous models and physician baselines,” Arjun Manraj, director of the AI Lab at Harvard Medical School and one of the study’s lead authors, said in a press release.
To be clear, this study does not claim that AI is ready to make real life-or-death decisions in emergency rooms. Instead, it said the findings demonstrate “an urgent need for prospective clinical trials to evaluate these technologies in real-world patient care settings.”
The researchers also noted that they only studied how the model behaves when given text-based information, and that existing research suggests current foundation models are more limited in their ability to reason over non-text inputs.
Beth Israel physician Adam Rodman, one of the study’s lead authors, told the Guardian that “there is currently no formal accountability framework” for AI diagnostics, and that patients still “want humans to guide them in life-and-death decisions and guide them through difficult treatment decisions.”
