
ZDNET’s key takeaways
- AI frontier models fail to provide safe and accurate output on medical topics.
- LMArena and DataTecnica aim to ‘rigorously’ test LLMs’ medical knowledge.
- It’s not clear how agents and medicine-specific LLMs will be measured.
Despite the numerous AI advances in medicine cited throughout scholarly literature, all generative AI programs fail to produce output that is both safe and accurate when dealing with medical topics, according to a new report by startup DataTecnica and CARD, the U.S. National Institutes of Health’s Center for Alzheimer’s and Related Dementias at the National Institute on Aging.
The finding is especially concerning given that people are going to bots such as ChatGPT for medical answers, and research shows that people trust AI’s medical advice over the advice of doctors, even when it’s wrong.
Also: Patients trust AI’s medical advice over doctors – even when it’s wrong, study finds
The new study, comparing OpenAI’s GPT-5 with numerous models from Google, Anthropic and Meta on a benchmark of medical science, CARDBiomedBench, finds that “performance in real-world biomedical research remains far from adequate.”
The CARDBiomedBench benchmark suite, a question-and-answer benchmark for evaluating LLMs in biomedical research, was unveiled earlier this year in a collaboration between DataTecnica and CARD researchers.
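For readers curious about the mechanics, a question-and-answer benchmark of this kind is generally scored by prompting the model with each question and grading its response against a reference answer. The short Python sketch below illustrates that loop in the simplest possible terms; the file format, model call, and exact-match grading rule are hypothetical placeholders for illustration, not the actual CARDBiomedBench harness.

```python
# Minimal sketch of scoring a question-and-answer benchmark.
# The dataset path, model call, and exact-match grading rule are hypothetical
# placeholders for illustration only, not the actual CARDBiomedBench harness.
import json

def ask_model(question: str) -> str:
    """Stand-in for a call to the LLM being evaluated; replace with a real API call."""
    return ""

def grade(answer: str, reference: str) -> bool:
    """Toy grading rule: normalized exact match against the reference answer."""
    return answer.strip().lower() == reference.strip().lower()

def run_benchmark(path: str) -> float:
    """Return the fraction of questions the model answers correctly."""
    with open(path) as f:
        items = [json.loads(line) for line in f]  # one {"question": ..., "answer": ...} per line
    correct = sum(grade(ask_model(item["question"]), item["answer"]) for item in items)
    return correct / len(items)
```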
(Disclosure: Ziff Davis, ZDNET’s parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)
A knowledge gap in medicine
“No current model reliably meets the reasoning and domain-specific knowledge demands of biomedical scientists,” according to DataTecnica and NIH’s CARD.
The report concludes that current models are simply too lax and too fuzzy to meet the standards of medicine:
“This fundamental gap highlights the growing mismatch between general AI capabilities and the needs of specialized scientific communities. Biomedical researchers work at the intersection of complex, evolving knowledge and real-world impact. They don’t need models that ‘sound’ correct; they need tools that help uncover insights, reduce error, and accelerate the pace of discovery.”
The study echoes findings from other benchmark tests related to medicine. For example, in May, OpenAI unveiled HealthBench, a suite of text prompts concerning medical situations and conditions that could reasonably be submitted to a chatbot by a person seeking medical advice. That study found that “while performance has improved over time, including cost-adjusted performance and reliability, significant headroom still exists in current models’ ability to engage in health-related conversations and workflows.”
However, other studies have found that LLMs are making progress on tests of general physician knowledge, such as a study published this month by scholars at Emory University School of Medicine. Those tests, including MedQA, use different kinds of scoring than CARDBiomedBench, and they don’t cover the kinds of medical science topics that CARDBiomedBench focuses on.
Also: OpenAI’s HealthBench shows AI’s medical advice is improving – but who will listen?
Expanding the benchmark
To address the gap between AI models and medicine, DataTecnica has teamed up with LMArena.ai, which ranks AI models.
Together, LMArena and DataTecnica plan to expand what’s called BiomedArena, a leaderboard that lets people compare AI models side by side and vote on which ones perform the best.
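Arena-style leaderboards typically turn those side-by-side votes into rankings using an Elo-style rating, the approach popularized by Chatbot Arena. The snippet below is a simplified illustration of that general technique, not LMArena’s actual ranking code, and the model names in the example are placeholders.

```python
# Simplified illustration of turning side-by-side votes into a leaderboard with
# an Elo-style rating update. This is a sketch of the general technique, not
# LMArena's actual ranking code.
from collections import defaultdict

K = 32  # step size for each rating update

def expected(r_a: float, r_b: float) -> float:
    """Expected probability that the model rated r_a beats the model rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def rank(votes):
    """votes: iterable of (model_a, model_b, winner) triples from pairwise comparisons."""
    ratings = defaultdict(lambda: 1000.0)
    for a, b, winner in votes:
        e_a = expected(ratings[a], ratings[b])
        s_a = 1.0 if winner == a else 0.0
        ratings[a] += K * (s_a - e_a)
        ratings[b] += K * ((1.0 - s_a) - (1.0 - e_a))
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

# Example with hypothetical vote data:
print(rank([("model-x", "model-y", "model-x"),
            ("model-x", "model-z", "model-z"),
            ("model-z", "model-y", "model-z")]))
```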
Also: Meta’s Llama 4 ‘herd’ controversy and AI contamination, explained
Unlike general-purpose leaderboards, BiomedArena is meant to focus specifically on medical research rather than broad, general questions.
The BiomedArena work is already used by scientists at the Intramural Research Program of the U.S. National Institutes of Health, they note, “where scientists pursue high-risk, high-reward projects that are often beyond the scope of traditional academic research due to their scale, complexity, or resource demands.”
The BiomedArena work, according to the LMArena team, will “focus on tasks and evaluation strategies grounded in the day-to-day realities of biomedical discovery — from interpreting experimental data and literature to assisting in hypothesis generation and clinical translation.”
Also: You can track the top AI image generators via this new leaderboard – and vote for your favorite too
As ZDNET’s Webb Wright reported back in June, LMArena was originally founded as a research initiative at UC Berkeley under the name Chatbot Arena, and has since become a full-fledged platform, with financial support from UC Berkeley, a16z, Sequoia Capital, and others.
Where could they go wrong?
Two big questions loom for this new benchmark effort.
First, studies with doctors have shown that gen AI’s usefulness expands dramatically when AI models are hooked up to databases of “gold standard” medical information, with dedicated large language models (LLMs) able to outperform the top frontier models simply by tapping into that curated information.
Also: Hooking up generative AI to medical data improved usefulness for doctors
From today’s announcement, it’s not clear how LMArena and DataTecnica plan to address that aspect of AI models, which really is a kind of agentic capability — the ability to tap into resources. Without measuring how AI models use external resources, the benchmark could have limited utility.
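To make that concern concrete, the sketch below shows the simplest form of what “tapping into resources” can look like: retrieve passages from a curated reference store and ask the model to answer from them. Every name here is a hypothetical placeholder, not any vendor’s actual API or the benchmark’s methodology.

```python
# Minimal illustration of grounding a model's answer in a curated knowledge base,
# the kind of external-resource use the benchmark may or may not end up measuring.
# All function and variable names are hypothetical placeholders.
def retrieve(question: str, knowledge_base: dict[str, str], k: int = 3) -> list[str]:
    """Toy retrieval: rank reference entries by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(knowledge_base.items(),
                    key=lambda kv: len(q_words & set(kv[1].lower().split())),
                    reverse=True)
    return [text for _, text in ranked[:k]]

def grounded_prompt(question: str, knowledge_base: dict[str, str]) -> str:
    """Build a prompt that asks the model to answer only from the retrieved passages."""
    context = "\n".join(retrieve(question, knowledge_base))
    return f"Answer using only these reference passages:\n{context}\n\nQuestion: {question}"
```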
Second, medicine-specific LLMs are being developed all the time, including Google’s “MedPaLM” program, introduced two years ago. It’s not clear whether the BiomedArena work will take these dedicated medical LLMs into account; the work so far has tested only general frontier models.
Also: Google’s MedPaLM emphasizes human clinicians in medical AI
That’s a perfectly valid choice on the part of LMArena and DataTecnica, but it does leave out a whole lot of important effort.