Machine Learning-Driven Metabolomic Biomarker Discovery for PCOS: An Interpretable Approach Using Random Forest and SHAP


YAŞAR Ş.

Medical records-international medical journal (Online), vol.7, no.3, pp.763-767, 2025 (TRDizin)

Abstract

Aim: This study aimed to predict Polycystic Ovary Syndrome (PCOS) using follicular fluid metabolomic data and the Random Forest algorithm, and to interpret the contributions of the most influential metabolites using SHapley Additive exPlanations (SHAP) analysis.

Material and Method: An untargeted metabolomic dataset of follicular fluid from 35 PCOS patients and 37 age-matched controls was utilized. The dataset was partitioned into 70% training and 30% testing subsets using stratified sampling. A Random Forest algorithm was employed, with hyperparameter optimization performed using RandomizedSearchCV. Model performance was evaluated using accuracy, sensitivity, specificity, F1 score, balanced accuracy, and Brier score. SHAP analysis was then applied to interpret the model's predictions and identify key contributing metabolites.

Results: The Random Forest model achieved robust classification performance, with an accuracy of 0.86, sensitivity of 0.82, specificity of 0.91, F1 score of 0.86, balanced accuracy of 0.85, and a Brier score of 0.13. SHAP analysis identified L-Histidine, L-Glutamine, and L-Tyrosine as the top three most influential metabolites. Specifically, decreased levels of L-Histidine and L-Tyrosine, and elevated levels of L-Glutamine, were associated with an increased risk of PCOS.

Conclusion: Our findings demonstrate the potential of integrating machine learning with explainable AI to accurately predict PCOS based on metabolomic profiles. The identified metabolites, particularly alterations in amino acid metabolism, offer novel insights into the metabolic underpinnings of PCOS and highlight their promise as diagnostic biomarkers, paving the way for more precise and interpretable diagnostic strategies.
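The six evaluation metrics named in the abstract can be sketched as follows. This is a minimal illustration using standard textbook definitions, not the author's code: the function name `classification_metrics`, the 0.5 decision threshold, and the toy labels and probabilities are all assumptions made for the example and are not taken from the study's data.

```python
def classification_metrics(y_true, y_prob, threshold=0.5):
    """Compute binary classification metrics (1 = PCOS, 0 = control).

    Hypothetical helper for illustration; assumes both classes are present
    and at least one sample is predicted positive.
    """
    # Convert predicted probabilities to hard labels (threshold is an assumption).
    y_pred = [1 if p >= threshold else 0 for p in y_prob]

    # Confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)

    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        # Brier score: mean squared error of the predicted probabilities.
        "brier": sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true),
    }


# Toy data invented for this example (NOT from the study).
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_probs = [0.9, 0.8, 0.6, 0.3, 0.1, 0.2, 0.4, 0.7]
metrics = classification_metrics(toy_labels, toy_probs)
```

Unlike the other five metrics, the Brier score is computed from the predicted probabilities rather than the thresholded labels, so lower values indicate better-calibrated predictions.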