Large language models excel on tests yet struggle to guide real patient decisions

A randomized study of 1,298 UK adults found that while large language models perform well on medical tasks in isolation, they do not improve, and can even worsen, decision-making when used by members of the public. The failures stem from breakdowns in human–AI interaction, showing that benchmark accuracy does not predict safe or effective real-world medical support.