Although artificial intelligence (AI) chatbots used in medical care tend to get the diagnosis right when they have complete clinical information, they still show significant deficiencies when asked to diagnose with little information or to produce a differential diagnosis.
According to a study published this Monday in JAMA Network Open and carried out by researchers from the MESH innovation incubator of the Mass General Brigham hospital network in Boston (United States), AI is not yet ready to make medical decisions without the constant supervision of a human professional.
The team reached this conclusion after evaluating 21 of the most advanced large language models (LLMs) on the market – including GPT-5, Grok 4, Claude, DeepSeek and Gemini – using a specific methodology developed to evaluate the clinical competence of these AI models.
The researchers asked the 21 AI models to act as doctors in a series of clinical scenarios and found that LLMs often fail to navigate diagnostic workups and to propose a testable list of potential, or “differential,” diagnoses.
Although all LLMs tested reached a correct final diagnosis more than 90% of the time when provided with all relevant information in a patient’s case, they consistently performed poorly in the initial, reasoning-driven steps of the diagnostic process.
“Despite continued improvements, standard large language models are not ready for unsupervised clinical-grade deployment,” concludes Marc Succi, executive director of the MESH Incubator at Mass General Brigham and corresponding author of the work.
“Differential diagnoses are fundamental to clinical reasoning and underlie the ‘art of medicine’ that AI cannot currently replicate,” emphasizes the researcher, who adds that, for now, AI serves only to “augment, not replace, the doctor’s reasoning, provided that all the relevant data is available, which is not always the case.”
The team developed the PriME-LLM measure to assess a model’s proficiency in proposing potential diagnoses, ordering appropriate tests, arriving at a final diagnosis, and managing treatment.
The PriME-LLM score also reflects when models perform well in one area but poorly in another, rather than providing an average score that could hide their weaknesses, the researchers note.
The study compared 21 general-purpose LLMs, including the latest models from ChatGPT, DeepSeek, Claude, Gemini and Grok, and their ability to work on 29 published clinical cases.
To do this, they fed information to the models gradually, starting with basic aspects such as the patient’s age, sex, and symptoms before adding physical examination findings and laboratory results.
The performance of the LLMs at each stage was rated by medical-student assessors, and these ratings were used to calculate each model’s overall PriME-LLM score.
The researchers found that the LLMs were good at producing accurate final diagnoses but that all of them frequently failed to produce a differential diagnosis (more than 80% of the time).
“We found that they are excellent at naming a final diagnosis once the data is complete, but they have difficulty in the open-ended initiation of a case, when there is not much information,” clarifies Arya Rao, lead author, MESH researcher and student at Harvard Medical School.
The study, which also served as a barometer of the rapid evolution of AI, found that the most recently released models generally outperformed older models, showing that they are constantly improving.
On the PriME-LLM scale, model scores ranged from 64% for Gemini 1.5 Flash to 78% for Grok 4 and GPT-5.
Furthermore, the study confirmed that the accuracy of all LLMs improved substantially when non-textual data was incorporated.