Artificial intelligence chatbots may be getting better at medical exams, but a new study suggests they are still unreliable when used by ordinary people seeking real-world health guidance.
Researchers found that members of the public using AI tools to interpret symptoms and decide what action to take were no more accurate than those relying on traditional internet searches, raising concerns about the growing trend of turning to chatbots for medical information.
“Despite all the hype, Artificial Intelligence (AI) just isn’t ready to take on the role of the physician,” study co-author Rebecca Payne from Oxford University said.
“Patients need to be aware that asking a large language model about their symptoms can be dangerous, giving wrong diagnoses and failing to recognise when urgent help is needed,” she added.
The British-led research, published Monday in the journal Nature Medicine, tested how well people could identify health conditions and determine whether they needed medical attention when supported by AI chatbots.
Nearly 1,300 participants in the UK were shown 10 symptom scenarios, including situations such as a headache after drinking alcohol, severe exhaustion in a new mother, and symptoms associated with gallstones.
Participants were randomly assigned to use one of three chatbots — OpenAI’s GPT-4o, Meta’s Llama 3, or Cohere’s Command R+ — while a separate group relied on search engines and conventional online resources.
The results showed that users of AI chatbots correctly identified the health condition only about one-third of the time, while fewer than half selected the appropriate next step, such as seeing a doctor or going to a hospital.
The researchers reported that chatbot users identified relevant conditions in less than 34.5% of cases and chose the correct course of action in less than 44.2% of cases, outcomes no better than those of participants who used non-AI methods.
The team said the findings highlighted a major disconnect between how AI performs in controlled medical benchmarks and how it functions when used by real people.
Researchers attributed the gap partly to the way humans interact with chatbots, noting that participants often failed to provide complete or accurate information when describing their symptoms. They also found that users sometimes misunderstood or ignored the chatbot’s recommendations.
Adam Mahdi, a co-author of the paper and an associate professor at Oxford, said the study revealed the “huge gap” between the promise of AI and how it can fail in real use.
“The knowledge may be in those bots; however, this knowledge doesn’t always translate when interacting with humans,” he said.
The researchers examined around 30 interactions in greater detail and concluded that the problem was not only user input but also incorrect or misleading chatbot outputs.
One example involved symptoms consistent with a subarachnoid haemorrhage, a dangerous condition involving bleeding in the brain. When one participant described having a stiff neck, sensitivity to light, and the “worst headache ever,” the chatbot advised going to hospital. But another participant describing similar symptoms with a “terrible” headache was advised instead to lie down in a dark room.
The researchers noted that one in six adults in the United States now seeks health information from AI chatbots at least once a month, a figure expected to grow as adoption increases.
David Shaw, a bioethicist at Maastricht University in the Netherlands who was not involved in the research, warned that the findings demonstrated real risks.
“This is a very important study as it highlights the real medical risks posed to the public by chatbots,” Shaw told reporters.
He urged people to rely on trusted health sources such as the UK’s National Health Service.
The study was supported by the data company Prolific, the German non-profit Dieter Schwarz Stiftung, and the UK and US governments.

