ChatGPT response consistency to the 2025 ESC/EACTS guidelines for the management of valvular heart disease: A test–retest study using binary and multiple-choice questions

Mirzaoğlu, Çetin; ULUTAŞ, ZEYNEP; Karaca, Yücel

doi:10.1177/20552076261458145

Mirzaoğlu Ç., ULUTAŞ Z., Karaca Y.

Digital Health, vol. 12, 2026 (SCI-Expanded, SSCI, Scopus)

Publication Type: Article / Full Article
Volume: 12
Publication Date: 2026
DOI: 10.1177/20552076261458145
Journal Name: Digital Health
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Social Sciences Citation Index (SSCI), Scopus, Directory of Open Access Journals, Health Research Premium Collection (ProQuest)
Keywords: artificial intelligence, ChatGPT, clinical decision support, ESC/EACTS guidelines, valvular heart disease
Affiliated with İnönü University: Yes

Abstract

Background/Objectives: This study aimed to evaluate the variability and temporal instability of responses generated by the artificial intelligence-based model ChatGPT-5.2 to structured clinical questions derived from the 2025 ESC/EACTS Guidelines for the Management of Valvular Heart Disease (GMVHD).

Methods: This prospective observational study employed a test–retest design. A structured set of 100 guideline-based questions, comprising 60 binary (true/false) and 40 multiple-choice items, was developed by two cardiologists (Ç.M. and Z.U.) in accordance with the 2025 ESC/EACTS GMVHD. The question set was administered to ChatGPT-5.2 on two occasions, at baseline (T1) and at a second assessment (T2), 14 days apart. The model was instructed to provide answers only, without any explanatory commentary. ChatGPT-generated responses were independently evaluated and coded as correct or incorrect by two cardiologists (Y.K. and Ç.M.). Numerical changes in responses were assessed using McNemar's test, and test–retest reliability was evaluated using Cohen's kappa coefficient.

Results: For binary questions, ChatGPT achieved an accuracy of 96.7% in both assessments. Accuracy for multiple-choice questions increased from 75.0% at baseline to 87.5% at the second assessment. When all questions were analyzed together, overall accuracy improved from 88.0% to 93.0%. This numerical increase between T1 and T2 was not statistically significant (p > 0.05). Cohen's kappa analysis indicated moderate agreement for binary questions and low agreement for multiple-choice questions.

Conclusion: When answering guideline-based clinical questions on valvular heart disease, ChatGPT-5.2 showed a short-term numerical improvement in accuracy without a statistically significant temporal difference. However, the relatively high initial error rate in multiple-choice questions represents a limitation for clinical reliability. At present, AI systems may be considered supportive tools for guideline-based information retrieval and clinical education.
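The test–retest analysis described in the Methods can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: the exact (binomial) form of McNemar's test is used, Cohen's kappa is computed on the paired correct/incorrect codings at T1 and T2, and the ten-item vectors `t1` and `t2` are hypothetical toy data rather than the study's results. The function names are illustrative only.

```python
from math import comb


def accuracy(coded):
    """Proportion of items coded as correct (1) in one assessment."""
    return sum(coded) / len(coded)


def mcnemar_exact(t1, t2):
    """Two-sided exact McNemar test on paired correct/incorrect codings.

    Only discordant pairs carry information: b counts items that were
    correct at T1 but incorrect at T2, c counts the reverse.
    """
    b = sum(1 for x, y in zip(t1, t2) if x and not y)
    c = sum(1 for x, y in zip(t1, t2) if not x and y)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of change
    # Binomial(n, 0.5) lower tail up to min(b, c), doubled for two sides.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


def cohen_kappa(t1, t2):
    """Cohen's kappa for agreement between two binary codings."""
    n = len(t1)
    p_observed = sum(1 for x, y in zip(t1, t2) if x == y) / n
    p1, p2 = sum(t1) / n, sum(t2) / n
    p_expected = p1 * p2 + (1 - p1) * (1 - p2)
    if p_expected == 1:
        return 1.0  # both codings are constant and identical
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy data (NOT from the study): 1 = correct, 0 = incorrect.
t1 = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
t2 = [1, 1, 1, 1, 1, 0, 1, 1, 1, 0]

b, c, p_value = mcnemar_exact(t1, t2)
kappa = cohen_kappa(t1, t2)
print(f"accuracy T1={accuracy(t1):.2f}, T2={accuracy(t2):.2f}")
print(f"discordant pairs b={b}, c={c}, McNemar p={p_value:.3f}")
print(f"Cohen's kappa={kappa:.3f}")
```

The toy data show how accuracy can rise between assessments while McNemar's test remains non-significant and kappa stays low: with few discordant pairs the test has little power, and kappa penalizes items whose coding flips in either direction.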