A considerable amount of the medical information provided by five popular chatbots is inaccurate or incomplete, and half of their answers to clear, evidence-based questions are 'somewhat' or 'very' problematic, according to a study published in BMJ Open.
Researchers at the Lundquist Institute for Biomedical Innovation (USA) warn that the continued deployment of these chatbots without public education or oversight risks amplifying misinformation.
In February 2025, the team analyzed the accuracy of five popular, publicly available generative AI chatbots on health and medicine topics: Gemini (Google), DeepSeek (High-Flyer), Meta AI (Meta), ChatGPT (OpenAI), and Grok (xAI).
Each was asked ten open-ended and closed-ended questions in each of five categories: cancer, vaccines, stem cells, nutrition, and sports performance.
The questions were designed to resemble typical medical and health queries seeking information, and were developed to probe the models for misinformation or contraindicated advice.
Half (50%) of the responses were problematic: 30% were somewhat problematic and 20% were very problematic, according to the journal.
Although the quality of the responses did not vary significantly among the chatbots, Grok generated "a significantly higher number" of very problematic responses than would be expected (29/50; 58%), while Gemini had the fewest very problematic responses and the most non-problematic ones.
Responses were classified as ‘not problematic’, ‘somewhat problematic’ or ‘very problematic’, using predefined objective criteria.
A response was considered problematic if it could lead users without specialized knowledge to follow a potentially ineffective treatment, or to come to harm if it were applied without professional guidance.
Chatbots performed best on vaccine and cancer questions, and worst on stem cells, sports performance and nutrition.
Information was evaluated for accuracy and completeness, with particular attention to whether a chatbot presented a false balance between science-based and non-science-based claims, regardless of the strength of the evidence.
Each response was also rated on its readability, from whether it was written in simple, clear English to whether it used difficult, academic language.
The type of question influenced the results: open-ended questions generated 40 very problematic responses (significantly more than expected) and 51 non-problematic responses (significantly fewer than expected), while the opposite pattern held for closed-ended questions, BMJ Open notes.
Closed-ended questions required chatbots to provide predefined answers, often with a single correct answer conforming to scientific consensus, while open-ended questions typically required generating multiple responses in list form.
There were only two refusals to respond, both from Meta AI, in response to queries about anabolic steroids and alternative cancer treatments.
Overall, the quality of the references was poor, with an average completeness score of 40%, and all readability scores were rated 'difficult', at a reading level suited to a university graduate.
The researchers acknowledge that they only evaluated five chatbots and that commercial AI is evolving rapidly, so their conclusions may not be universally applicable.
Furthermore, not all real-world queries are as deliberately adversarial as theirs, an approach that could have exaggerated the prevalence of problematic content.
However, the findings on scientific accuracy, quality of references, and readability of responses "highlight important behavioral limitations and the need to re-evaluate how AI chatbots are used in public health and medical communication", the authors note.
Chatbots, by default, do not access real-time data, but instead generate responses by deducing statistical patterns from their training data and predicting likely word sequences. “They do not reason or weigh the evidence, nor are they capable of making ethical or value-based judgments,” they explain.