ChatGPT Health underestimates the severity of about half of medical emergencies

A study published in the journal Nature Medicine found that ChatGPT Health, OpenAI's specialized medical chatbot, often underestimates the severity of emergency conditions.

The researchers assessed the system's ability to triage, that is, to determine how urgent a case is, using real-life clinical scenarios. ChatGPT has previously been reported to pass medical exams, and in 2024 almost two-thirds of doctors said they use AI tools in their work. However, other studies have pointed to the unreliability of medical advice provided by chatbots.

For the study, specialists fed 60 clinical cases into the system. Its responses were compared with the assessments of three physicians, who rated the urgency of each situation based on medical guidance and their own experience. Each scenario came in 16 variations that changed the patient's gender or race, differences that were not expected to affect the final classification; indeed, no significant differences across demographic parameters were found.

The results showed that the bot underestimated the severity of 51.6% of emergencies: instead of advising an immediate visit to the emergency room, it recommended making an appointment with a doctor within 24 to 48 hours. These cases included diabetic ketoacidosis and respiratory failure, conditions that are life-threatening without timely treatment. At the same time, the system correctly recognized obvious emergencies, such as a stroke with typical symptoms, in 100% of cases.

An OpenAI spokesperson said the company welcomes such research but stressed that the study's methodology does not reflect how ChatGPT Health is actually used. According to the spokesperson, the bot is designed for a dialogue in which it can ask clarifying questions, not for a single answer to a scripted prompt. The company also noted that the service is still available only to a limited number of users and continues to be refined to improve safety and accuracy.

The study also found the opposite pattern in non-emergency cases: the system overestimated severity in 64.8% of such situations, recommending a doctor's visit without sufficient grounds. For example, for a three-day sore throat, the bot advised seeing a specialist within two days, even though home treatment would have been sufficient. Responses to reports of suicidal ideation or self-harm were also inconsistent: although the system is supposed to direct users to the 988 crisis helpline in such cases, it sometimes provided the number when it was not needed and, conversely, failed to provide it when it was required.
