Predicting HFA 30–2 Visual Fields with Deep Learning from Multimodal OCT–Fundus Feature Fusion and Structure–Function Discordance Analysis


Fırat İ. T., Fırat M., ERBALI H., Tuncer T.

Journal of Imaging Informatics in Medicine, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Publication Date: 2026
  • DOI Number: 10.1007/s10278-025-01798-8
  • Journal Name: Journal of Imaging Informatics in Medicine
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
  • Keywords: Humphrey 30–2, Optical Coherence Tomography, Structure–function Discordance, Vision Transformer, Visual Field
  • İnönü University Affiliated: Yes

Abstract

Glaucoma is a leading cause of irreversible vision loss. During clinical follow-up, visual field (VF) testing (Humphrey Field Analyzer 30–2) assesses functional loss, while optical coherence tomography (OCT) and fundus imaging provide structural information. However, VF measurement can be subjective, exhibit test–retest variability, and sometimes show structure–function discordance (SFD). Predicting VF values from structural images may therefore support clinical decision-making. This study aimed to estimate Humphrey 30–2 measures (mean deviation (MD), pattern standard deviation (PSD), and point-wise threshold sensitivity (TS)) in glaucoma/ocular hypertension (OHT) using a ViT-B/32-based feature-fusion approach on OCT and fundus images, and to examine the effect of SFD via sensitivity analysis. Visual features were extracted from color optic disc photographs, red-free fundus images, retinal nerve fiber layer (RNFL) thickness maps, and circular RNFL plots using Vision Transformer (ViT-B/32)-based models. These features were combined with demographic and clinical data to form a multimodal artificial intelligence model. Global VF indices (MD, PSD) were estimated with a probabilistic regression that accounts for uncertainty, and point-wise TS values were predicted using a location-aware network. In a separate analysis, eyes exhibiting SFD were identified and excluded to assess model performance under OCT–VF concordance. Mean absolute errors (MAEs) were 2.26, 1.42, and 2.96 dB for MD, PSD, and mean TS, respectively, and the proportions of predictions within ±2 dB were 59.65%, 75.44%, and 57.90%. After excluding SFD eyes, MAEs decreased to 1.82, 1.30, and 2.12 dB, and the proportions within ±2 dB increased to 66.7%, 76.5%, and 62.7%, respectively. These findings indicate that discordance affects performance and that predictions are more reliable in clinically concordant cases.
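The feature-fusion approach described in the abstract can be sketched as a forward pass: per-modality ViT embeddings are concatenated with clinical features, one head emits a mean and log-variance for a global index (probabilistic regression), and another emits one output per 30–2 test location. This is a minimal illustrative sketch, not the published implementation; the 768-dim embedding size is ViT-B's hidden dimension and 76 is the number of 30–2 test points, but the clinical-feature count, random linear heads, and Gaussian likelihood form are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: each of the four structural modalities (disc photo, red-free
# fundus, RNFL thickness map, circular RNFL plot) has already been encoded by a
# ViT-B/32 backbone into a 768-dim embedding (768 = ViT-B hidden size).
N_MODALITIES, D_VIT, D_CLIN = 4, 768, 8  # D_CLIN: hypothetical clinical-feature count
N_POINTS = 76                            # HFA 30-2 test locations

def fuse(vit_feats, clin):
    """Concatenate per-modality ViT embeddings with demographic/clinical features."""
    return np.concatenate([vit_feats.ravel(), clin])

# Hypothetical linear heads (random weights here; learned in practice).
d_fused = N_MODALITIES * D_VIT + D_CLIN
W_md = rng.normal(0, 0.01, (2, d_fused))         # -> (mean, log-variance) for MD
W_ts = rng.normal(0, 0.01, (N_POINTS, d_fused))  # -> one TS output per location

def gaussian_nll(y, mean, log_var):
    """Negative log-likelihood of a heteroscedastic Gaussian regression head."""
    return 0.5 * (log_var + (y - mean) ** 2 / np.exp(log_var) + np.log(2 * np.pi))

# One synthetic eye: random stand-ins for real embeddings and clinical data.
vit_feats = rng.normal(size=(N_MODALITIES, D_VIT))
clin = rng.normal(size=D_CLIN)
fused = fuse(vit_feats, clin)

md_mean, md_log_var = W_md @ fused    # probabilistic MD prediction
ts_pred = W_ts @ fused                # 76 point-wise threshold sensitivities (dB)
loss = gaussian_nll(-4.2, md_mean, md_log_var)  # -4.2 dB: made-up target MD
```

The log-variance output lets the model attach an uncertainty estimate to each prediction, which is what the abstract means by "probabilistic regression that accounts for uncertainty."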
ViT-B/32-based deep feature fusion offers clinically meaningful accuracy for predicting VF metrics from multimodal structural images. SFD was frequently detected among the lowest-performing cases, so this possibility should be considered when interpreting low-accuracy predictions.
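The two accuracy measures used throughout the abstract (mean absolute error and the proportion of eyes predicted within ±2 dB of the measured value) can be computed as in this sketch; the MD values below are made up for illustration, not taken from the study.

```python
import numpy as np

def vf_metrics(y_true, y_pred, tol_db=2.0):
    """Return MAE (dB) and the fraction of predictions within +/- tol_db."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = np.abs(y_pred - y_true)
    return err.mean(), (err <= tol_db).mean()

# Illustrative, made-up MD values (dB)
md_true = np.array([-2.1, -5.4, -10.0, -1.0])
md_pred = np.array([-3.0, -4.8, -7.5, -1.5])
mae, frac_within = vf_metrics(md_true, md_pred)
# mae -> 1.125 dB; frac_within -> 0.75 (3 of 4 eyes within +/- 2 dB)
```

The same function applies unchanged to PSD or to per-point TS values flattened across eyes.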