Surgical and Radiologic Anatomy, vol. 48, no. 1, 2026 (SCI-Expanded, Scopus)
Purpose: The integration of large language models (LLMs) into medical education has gained significant momentum in recent years. These models have performed strongly on medical board examination questions; however, their ability to comprehend, analyze, and reason through information has not yet been evaluated using medical riddles as an alternative assessment approach. Therefore, the aim of this study was to assess the performance of commercially available, general-purpose LLMs in solving medical riddles.

Methods: Responses generated by ChatGPT-5, ChatGPT-4, AnatomyGPT, Gemini 2.5, Claude, and DeepSeek for 20 neuroanatomy-related riddles were evaluated across two trials. Additionally, the riddles were presented in Turkish as well as English to assess the impact of linguistic variation. Statistical analyses were conducted using Cochran's Q test and chi-square tests to compare the performance of the models. Response consistency between trials was assessed using McNemar's test and Cohen's kappa coefficient.

Results: All models performed strongly on the riddles. Accuracy was near-perfect when the models were tested in English (ChatGPT-5, ChatGPT-4, AnatomyGPT, Gemini 2.5, and DeepSeek 100%; Claude 95%). When tested in Turkish, Gemini 2.5 (80%) and DeepSeek (85%) showed relatively lower accuracy; however, overall correct-response rates remained high across models. In terms of response consistency, five models demonstrated high agreement between trials, while only Gemini 2.5 (κ = 0.347) showed moderate agreement.

Conclusion: This study demonstrates that LLMs can successfully solve medical riddles at comparable levels of performance. These findings provide valuable insights into the current capabilities of LLMs in understanding, analyzing, and reasoning through domain-specific problem-solving tasks.
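To make the analytic workflow in Methods concrete, the following is a minimal Python sketch of the four tests named there (Cochran's Q, chi-square, McNemar's test, Cohen's kappa), using scipy, statsmodels, and scikit-learn. The binary correctness data below are illustrative placeholders generated at random, not the study's results; the scoring of each response as 1 (correct) or 0 (incorrect) is an assumption about how the authors coded their data.

    import numpy as np
    from scipy.stats import chi2_contingency
    from sklearn.metrics import cohen_kappa_score
    from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

    rng = np.random.default_rng(0)
    models = ["ChatGPT-5", "ChatGPT-4", "AnatomyGPT",
              "Gemini 2.5", "Claude", "DeepSeek"]

    # Hypothetical trial-1 scores: 20 riddles (rows) x 6 models (columns),
    # coded 1 = correct, 0 = incorrect. Placeholder data only.
    trial1 = rng.integers(0, 2, size=(20, 6))

    # Cochran's Q: do correct-response rates differ across the six models
    # when all models answer the same 20 riddles?
    q = cochrans_q(trial1)
    print(f"Cochran's Q = {q.statistic:.2f}, p = {q.pvalue:.3f}")

    # Chi-square test on a 2 x 6 table of correct/incorrect counts per model.
    correct = trial1.sum(axis=0)
    table = np.vstack([correct, 20 - correct])
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
    for name, c in zip(models, correct):
        print(f"  {name}: {c}/20 correct")

    # Response consistency for one model: McNemar's test on the 2 x 2
    # cross-tabulation of trial-1 vs trial-2 correctness, plus Cohen's kappa.
    t1 = trial1[:, 3]                     # e.g. Gemini 2.5, trial 1
    t2 = rng.integers(0, 2, size=20)      # hypothetical trial-2 scores
    xtab = np.array(
        [[np.sum((t1 == 1) & (t2 == 1)), np.sum((t1 == 1) & (t2 == 0))],
         [np.sum((t1 == 0) & (t2 == 1)), np.sum((t1 == 0) & (t2 == 0))]])
    m = mcnemar(xtab, exact=True)
    kappa = cohen_kappa_score(t1, t2)
    print(f"McNemar p = {m.pvalue:.3f}, Cohen's kappa = {kappa:.3f}")

On real data, a non-significant McNemar p-value together with a high kappa would indicate the trial-to-trial consistency the abstract reports for five of the six models; a kappa near 0.347, as reported for Gemini 2.5, reflects weaker agreement between the two trials.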