Artificial intelligence (AI) models frequently give users incorrect advice on medical questions, raising concerns about their widespread deployment in the public sphere, according to a peer-reviewed study published Feb. 9 in the journal Nature Medicine.
The study comes as OpenAI has released ChatGPT Health, a dedicated bot to handle health and wellness queries from users.
Dr. Rebecca Payne, the lead medical practitioner on the study, said in a statement that people should be aware that asking large language models (LLMs) about their symptoms can be dangerous, as these models may provide incorrect diagnoses.
AI “just isn’t ready” to take the role of a physician, Payne said.
In the study, researchers recruited nearly 1,300 individuals aged 18 or older from the United Kingdom. These individuals were presented with a medical scenario and tasked with identifying potential health conditions and recommending a course of action.
Participants were split into four groups. Three groups were each given one of three LLMs (GPT-4o, Llama 3, or Command R+) to assist them in completing the task. The fourth was a control group asked to use whatever methods they would typically use at home to complete the task.
Researchers also fed the scenario and questions directly into the AI models to assess their performance without interacting with participants.
“Tested alone, LLMs complete the scenarios accurately, correctly identifying conditions in 94.9 percent of cases and disposition in 56.3 percent on average,” the study said. Disposition refers to the recommended course of action.
“However, participants using the same LLMs identified relevant conditions in fewer than 34.5 percent of cases and disposition in fewer than 44.2 percent, both no better than the control group.”
Participants in the control group had 1.76 times the odds of identifying a relevant condition as those in the LLM-based groups.
The AI models were found to have generated several pieces of misleading or incorrect information. In two situations, LLMs initially provided correct responses but later produced incorrect answers when participants supplied additional details.
In one case, two users were given opposite advice despite sending similar messages describing symptoms of a subarachnoid hemorrhage.
“In our work, we found that none of the tested language models were ready for deployment in direct patient care. Despite strong performance from the LLMs alone, both on existing benchmarks and on our scenarios, medical expertise was insufficient for effective patient care,” the researchers wrote.
“We recommend that developers, as well as policymakers and regulators, consider human user testing as a foundation for better evaluating interactive capabilities before any future deployments.”