Evaluating AI chatbots in clinical decision-making: Limitations and future directions

A Healthcare Innovation interview
April 17, 2026
4 min read

Is artificial intelligence (AI) ready to make clinical decisions?

In this article, originally published by our sister magazine, Healthcare Innovation, Contributing Senior Editor David Raths discusses AI integration in healthcare with Dr. Marc Succi.

Implications of AI chatbots performing poorly at differential diagnosis

Research published in JAMA Network Open shows that AI chatbots are getting better at diagnostic accuracy when presented with comprehensive clinical information, but they perform poorly at differential diagnosis when information is limited. One of the paper’s authors, Marc Succi, M.D., executive director of the MESH Incubator at Mass General Brigham, spoke with Healthcare Innovation about the implications of the research.

Succi, whose MESH Incubator is a system-wide innovation and entrepreneurship center, explained that the team did an original study in 2023 on public large language models (LLMs) and clinical decision support. This is a follow-up study in which they tested 21 LLMs in a series of clinical scenarios.

“Three years later, I wanted to see what changed — if they were better or if they were worse,” he said. “There's a lot of buzz about AI replacing doctors — more so than in previous years. I felt like it was an appropriate time to re-evaluate our original study and see where the field was.”

The research team explained that for the new study they developed a more holistic measure of LLMs that looked beyond accuracy, called PrIME-LLM, which evaluates a model’s competency across different stages of clinical reasoning — coming up with potential diagnoses, conducting appropriate tests, arriving at a final diagnosis, and managing treatment. When models perform well in one area but poorly in another, this imbalance is reflected in the PrIME-LLM score, as opposed to averaging competency across tasks, which may mask areas of weakness.
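The article does not give the scoring arithmetic behind PrIME-LLM, but the idea of a composite that exposes imbalance rather than averaging it away can be sketched. Below is a minimal, hypothetical Python illustration, assuming per-stage scores between 0 and 1 and using a geometric mean, which drops sharply when any single stage is weak; the stage names and the choice of geometric mean are illustrative assumptions, not the published formula.

    # Hypothetical sketch only -- not the published PrIME-LLM formula.
    # Illustrates how a geometric mean exposes imbalance that a plain average hides.
    from math import prod
    from statistics import fmean

    def composite_score(stage_scores: dict[str, float]) -> float:
        # Geometric mean of per-stage scores (each in 0..1);
        # any weak stage pulls the composite down sharply.
        values = list(stage_scores.values())
        return prod(values) ** (1 / len(values))

    # Two hypothetical models with identical arithmetic averages (0.75):
    balanced = {"differential": 0.75, "testing": 0.75, "final_dx": 0.75, "management": 0.75}
    uneven = {"differential": 0.20, "testing": 0.90, "final_dx": 0.95, "management": 0.95}

    for name, scores in (("balanced", balanced), ("uneven", uneven)):
        print(name, round(fmean(scores.values()), 2), round(composite_score(scores), 2))
    # balanced 0.75 0.75
    # uneven 0.75 0.63  <- the weak differential stage shows through

Under a simple average the two hypothetical models look identical; only the composite reveals the kind of differential-diagnosis weakness Succi describes.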

What these models do well, Succi said, is arrive at a final diagnosis when it's an open-book test: they have all the information, including images and lab tests, and it's all well organized. “If you feed them really good information, they're good at making a diagnosis,” he said. “But unfortunately, that's not how medicine is practiced, so they're very poor — just like in the original study — at making a differential diagnosis, which is at the earliest part of the medical visit.”

A patient might come into the ED with shortness of breath, he said, and perhaps all the clinicians know is the patient's demographics. There may be one to five plausible diagnoses, and from that minimal, uncertain information the physician has to determine which lab tests to order, which in turn determines how much information is gathered and how quickly the team reaches the final diagnosis. “That is where they actually failed more than 80% of the time in getting the full list of the differential diagnoses,” Succi said. “For me, the art of medicine is physicians navigating uncertain, weak, disparate information toward the final diagnosis. So that's where all the AI models come up short.”

I asked Succi whether they could get better at that aspect of the physician’s role or if there was some limiting factor here. 

He responded that he had expected them to be better, but his belief is that it's an inherent limit of the architecture of LLMs because they're pattern predictors. “To predict patterns, you need to have as much information as possible, but they're not very good at getting that information. Just as hallucinations are always going to be baked in, this is too — you can try to minimize it. You can try to have non-doctors provide information and have patients fill out forms, but that's always going to be a limitation.”

He said the research reinforces the idea that LLMs are not ready for prime-time clinical decision support, but he's hopeful that they will continue to prove useful in tasks like ambient documentation. “Those are great use cases because they're low-risk. This just supports the need for more humans in the loop to critically appraise the output of these LLMs, because if you have a patient reading the output and the LLMs sound confident, they can be confidently wrong.”

But what if the study had found the LLMs were great at differential diagnosis? What would the implications be for health systems? Wouldn't there be huge issues around transparency and liability in trying to deploy them in higher-risk settings?

Read the rest of the article at Healthcare Innovation.

About the Author

David Raths

David Raths is a Contributing Senior Editor for MLO sister brand Healthcare Innovation, focusing on clinical informatics, learning health systems and value-based care transformation. He has been interviewing health system CIOs and CMIOs since 2006.

Follow him on Twitter @DavidRaths
