Medical diagnoses: how AI explanations help doctors

There is increasing discussion around the use of large language models like ChatGPT to support medical diagnosis. These LLMs can summarize information, suggest diagnoses, and justify their assessments in simple language. This represents a key promise of such systems: As well as providing a diagnosis, they can explain why a certain diagnosis is appropriate. But it has not yet been established whether such explanations actually help physicians – and which format is most useful.

Radiological images such as CT and MRI scans were at the heart of the study. This MRI image of a skull shows diffuse contrast-enhancing lesions in the brain. It is the job of radiologists to correctly classify these as, for example, inflammation, a tumor, or multiple sclerosis. With the right clinical questions, AI can provide support in reaching a diagnosis.

Not all forms of AI assistance are equally helpful

A research team from LMU Munich, LMU University Hospital, Karlsruhe Institute of Technology, and the University of Bayreuth has now investigated how different forms of AI explanations influence diagnostic accuracy in radiology. In a randomized experiment, 101 radiologists were asked to review real patient cases with radiological images such as CT or MRI scans and provide a diagnosis for each case in the form of an open-ended text.

“Radiology often involves combining complex imaging findings with clinical information,” explains Boj Friedrich Hoppe from LMU University Hospital. “In principle, language models can support radiologists here. Our study shows, however, that not every form of AI assistance is equally helpful. What’s crucial is whether the physicians can follow the reasoning and critically evaluate the recommendation.”

Diagnosis alone is not enough

Participants were randomly assigned to one of four groups. One group worked without AI support, while the other three received different outputs from a multimodal language model. The AI provided either a diagnosis alone, a differential diagnosis, or a chain-of-thought explanation. The latter laid out imaging characteristics, clinical indications, and exclusion criteria in a verifiable manner, which particularly helped physicians compare the recommendation against their domain knowledge.

“For clinical practice, it’s not enough for an AI system to just give a plausible-sounding answer,” says Hoppe. “Physicians must be able to follow which indications provide grounds for a particular diagnosis and where possible uncertainties exist.”

Step-by-step explanations improve accuracy

The study shows that radiologists obtain the highest diagnostic accuracy with step-by-step AI explanations – the success rate was 12.2 percentage points above that of the control group without AI. Simple diagnostic outputs and differential diagnoses performed less well. Particularly in the case of incorrect AI suggestions, participants followed the differential diagnosis more frequently, which points to automation bias. Step-by-step explanations, by contrast, helped the physicians adopt correct suggestions in a more informed manner while also making them more likely to recognize errors.

The results suggest that the quality of the diagnosis alone is not decisive; the format of the explanation also matters, because it helps physicians critically evaluate the recommendation. Step-by-step justifications make the model’s argumentation more visible and allow doctors to compare it against their domain knowledge.

Differential diagnoses are important in medicine. In conjunction with language models, however, they can give the impression that the various diagnoses they present cover the entire diagnostic space. When dealing with rare or complex cases, this can make physicians less likely to think beyond the diagnoses provided by the AI.

Significance beyond medicine

Although the study focuses on radiology, its results apply well beyond this field, according to Professor Stefan Feuerriegel of the LMU Munich School of Management, the corresponding author of the study. Systems like ChatGPT are increasingly being used for decision-making in everyday personal and professional contexts. “Our results show that people can use such AI systems much more effectively if they do not just ask for an answer, but also for an account of the reasoning.”

What matters here is not only the capabilities of the models but also the type of interaction. Users should actively assess AI answers, notes Feuerriegel: “A good AI answer is not just correct, but verifiable.”

Errors that sound convincing

The researchers emphasize that language models can make errors, both in their diagnoses and in the justifications they give for them. Accordingly, AI systems should not be used as a substitute for medical expertise, but as tools to support physicians.

Step-by-step explanations in particular can render the AI’s assumptions visible and help doctors critically evaluate recommendations. The study demonstrates that AI improves diagnostic performance above all when its suggestions are presented along with explanations of its reasoning. By contrast, short answers and unelaborated lists can foster misplaced confidence in the AI’s suggestions.

Philipp Spitzer, Daniel Hendriks, Jan Rudolph, Sarah Schlaeger, Jens Ricke, Niklas Kühl, Boj Friedrich Hoppe & Stefan Feuerriegel. The effect of medical explanations from large language models on diagnostic accuracy in radiology. In: npj Digital Medicine, 2026.