ChatGPT Health fails critical emergency and suicide safety tests



ChatGPT Health, a widely used consumer artificial intelligence (AI) tool that provides health guidance directly to the public, including advice about how urgently to seek medical care, may fail to direct users appropriately to emergency care in a significant number of serious cases, according to researchers at the Icahn School of Medicine at Mount Sinai.

The study, fast-tracked in the February 23, 2026 online issue of Nature Medicine [https://doi.org/10.1038/s41591-026-04297-7], is the first independent safety evaluation of the large language model (LLM)-based tool since its January 2026 launch. It also identified serious concerns with the tool’s suicide-crisis safeguards.

“LLMs have become patients’ first stop for medical advice, but in 2026 they are least safe at the clinical extremes, where judgment separates missed emergencies from needless alarm,” says Isaac S. Kohane, MD, PhD, Chair, Department of Biomedical Informatics at Harvard Medical School, who was not involved with the research. “When millions of people are using an AI system to decide whether they need emergency care, the stakes are extraordinarily high. Independent evaluation should be routine, not optional.”

Within weeks of its release, ChatGPT Health’s maker, OpenAI, reported that about 40 million people were using the tool daily to seek health information and guidance, including advice about whether to seek urgent or emergency care. At the same time, say the investigators, there was little independent evidence about how safe or reliable its advice actually was.

“That gap motivated our study. We wanted to answer a very basic but critical question: if someone is experiencing a real medical emergency and turns to ChatGPT Health for help, will it clearly tell them to go to the emergency room?” says lead author Ashwin Ramaswamy, MD, Instructor of Urology at the Icahn School of Medicine at Mount Sinai.

With respect to suicide-risk alerts, ChatGPT Health was designed to direct users to the 988 Suicide and Crisis Lifeline in high-risk situations. However, the investigators found that these alerts appeared inconsistently, sometimes triggering in lower-risk scenarios while, alarmingly, failing to appear when users described specific plans for self-harm.

“This was a particularly surprising and concerning finding,” says senior and co-corresponding study author Girish N. Nadkarni, MD, MPH, Barbara T. Murphy Chair of the Windreich Department of Artificial Intelligence and Human Health, Director of the Hasso Plattner Institute for Digital Health, and Irene and Dr. Arthur M. Fishberg Professor of Medicine at the Icahn School of Medicine at Mount Sinai, and Chief AI Officer of the Mount Sinai Health System. “While we expected some variability, what we observed went beyond inconsistency. The system’s alerts were inverted relative to clinical risk, appearing more reliably for lower-risk scenarios than for cases when someone shared how they intended to hurt themselves. In real life, when someone talks about exactly how they would harm themselves, that’s a sign of more immediate and serious danger, not less.”

As part of the evaluation, the research team created 60 structured clinical scenarios spanning 21 medical specialties. Cases ranged from minor conditions appropriate for home care to true medical emergencies. Three independent physicians determined the correct level of urgency for each case using guidelines from 56 medical societies.

Each scenario was tested under 16 different contextual conditions, including variations in race, gender, social dynamics (such as someone minimizing symptoms), and barriers to care like lack of insurance or transportation. In total, the team conducted 960 interactions with ChatGPT Health and compared its recommendations with physician consensus.
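The test design described above can be sketched in a few lines of Python. This is an illustration only: the scenario and condition identifiers are placeholders, not labels used in the study. The only figures taken from the article are the 60 scenarios, the 16 contextual conditions, and the 960 total interactions.

```python
from itertools import product

# Counts reported in the article.
N_SCENARIOS = 60
N_CONDITIONS = 16

# Placeholder identifiers (hypothetical; the study's own labels are not given here).
scenarios = [f"scenario_{i:02d}" for i in range(1, N_SCENARIOS + 1)]
conditions = [f"condition_{j:02d}" for j in range(1, N_CONDITIONS + 1)]

# Every scenario is paired with every contextual condition.
test_grid = list(product(scenarios, conditions))

print(len(test_grid))  # 960
```

Pairing every scenario with every condition is what yields 60 × 16 = 960 interactions.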

In testing the 60 realistic patient scenarios developed by physicians, the researchers found that while the tool generally handled clear-cut emergencies correctly, it under-triaged more than half of cases that physicians determined required emergency care.
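Under-triage means recommending a lower level of urgency than the physician consensus. A minimal sketch of how such a rate could be computed for emergency cases follows. The urgency scale, the function name, and the sample records are all hypothetical and are not data from the study.

```python
# Hypothetical ordered urgency scale, lowest to highest.
URGENCY = {"home_care": 0, "routine_visit": 1, "urgent_care": 2, "emergency": 3}


def under_triage_rate(cases, level="emergency"):
    """Share of cases at the given physician-assigned level where the tool
    recommended a lower urgency.

    `cases` is a list of (physician_level, tool_level) pairs.
    """
    relevant = [(p, t) for p, t in cases if p == level]
    if not relevant:
        raise ValueError(f"no cases with physician level {level!r}")
    missed = sum(1 for p, t in relevant if URGENCY[t] < URGENCY[p])
    return missed / len(relevant)


# Made-up records for illustration only (not study data).
sample = [
    ("emergency", "emergency"),
    ("emergency", "urgent_care"),
    ("emergency", "routine_visit"),
    ("home_care", "home_care"),
]
print(under_triage_rate(sample))
```

Restricting the denominator to cases that physicians rated as emergencies keeps the rate from being diluted by the many scenarios where home care was the correct answer.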

The investigators were also struck by how the system failed in emergency medical cases. The tool often demonstrated that it recognized dangerous findings in its own explanations, yet still reassured the patient.

“ChatGPT Health performed well in textbook emergencies such as stroke or severe allergic reactions,” says Dr. Ramaswamy. “But it struggled in more nuanced situations where the danger is not immediately obvious, and those are often the cases where clinical judgment matters most. In one asthma scenario, for example, the system identified early warning signs of respiratory failure in its explanation but still advised waiting rather than seeking emergency treatment.”

The study authors advise that for worsening or concerning symptoms, including chest pain, shortness of breath, severe allergic reactions, or changes in mental status, people should seek medical care directly rather than relying solely on chatbot guidance. In cases involving thoughts of self-harm, individuals should contact the 988 Suicide and Crisis Lifeline or go to an emergency department.

Still, the researchers emphasize that the findings do not suggest consumers should abandon AI health tools altogether.

“As a medical student training at a time when AI health tools are already in the hands of millions, I see them as technologies we must learn to integrate thoughtfully into care rather than substitutes for clinical judgment,” says Alvira Tyagi, a first-year medical student at the Icahn School of Medicine at Mount Sinai and second author of the study. “These systems are changing quickly, so part of our training now must consider learning how to understand their outputs critically, identify where they fall short, and use them in ways that protect patients.”

The study assessed the system at a single point in time. Because AI models are frequently updated, performance may change over time, underscoring the need for independent evaluation, the researchers say.

“Starting medical training alongside tools that are evolving in real time makes it clear that today’s results are not set in stone,” Ms. Tyagi says. “That reality calls for ongoing review to ensure that improvements in technology translate into safer care.”

The team plans to continue evaluating updated versions of ChatGPT Health and other consumer-facing AI tools, expanding future research into areas such as pediatric care, medication safety, and non-English-language use.

The paper is titled “ChatGPT Health performance in a structured test of triage recommendations.” 

The study’s authors, as listed in the journal, are Ashwin Ramaswamy, MD, MPP; Alvira Tyagi, BA; Hannah Hugo, MD; Joy Jiang, PhD; Pushkala Jayaraman, PhD; Mateen Jangda, MSc; Alexis E. Te, MD; Steven A. Kaplan, MD; Joshua Lampert, MD; Robert Freeman, MSN, MS; Nicholas Gavin, MD, MBA; Ashutosh K. Tewari, MBBS, MCh; Ankit Sakhuja, MBBS, MS; Bilal Naved, PhD; Alexander W. Charney, MD, PhD; Mahmud Omar, MD; Michael A. Gorin, MD; Eyal Klang, MD; Girish N. Nadkarni, MD, MPH.

Journal reference:

Ramaswamy, A., et al. (2026). ChatGPT Health performance in a structured test of triage recommendations. Nature Medicine. DOI: 10.1038/s41591-026-04297-7. https://www.nature.com/articles/s41591-026-04297-7

