Comparative analysis of machine learning algorithms for biomedical text document classification: A case study on cancer-related publications

Küçük, Ekrem; Balikci Cicek, İpek; Küçükakçalı, Zeynep; Yetiş, Cihan

doi:10.5455/medscience.2023.10.209

Medicine Science, vol.13, no.1, pp.171-174, 2024 (TRDizin)

Publication Type: Article / Full Article
Volume: 13 Issue: 1
Publication Date: 2024
DOI: 10.5455/medscience.2023.10.209
Journal Name: Medicine Science
Journal Indexes: TR DİZİN (ULAKBİM)
Page Numbers: pp.171-174
Open Archive Collection: AVESİS Open Access Collection
Affiliated with İnönü University: Yes

Abstract

Biomedical text document classification is an essential task within Natural Language Processing (NLP), with applications ranging from sentiment analysis to authorship identification. Despite advancements in traditional machine-learning algorithms like Support Vector Machines (SVM) and Logistic Regression, challenges such as data sparsity and high dimensionality persist. Recent years have seen a surge in the use of deep learning models to mitigate these issues. This study aims to conduct a comparative analysis of various machine-learning algorithms for classifying biomedical text documents. The study employs the "Medical Text Dataset - Cancer Doc Classification" from Kaggle, comprising 7570 biomedical text documents labeled into three types of cancer (colon, lung, and thyroid). A preprocessing pipeline involving tokenization, stop-word removal, and Term Frequency-Inverse Document Frequency (TF-IDF) vectorization is applied. Algorithms including Logistic Regression, SVM, and Multinomial Naive Bayes are evaluated through 5-fold cross-validation. Performance metrics like accuracy, precision, recall, F1 score, and area under the ROC curve (AUC ROC) are employed. Logistic Regression outperforms the other algorithms with an accuracy of 78.3% and an AUC ROC of 88.59%. SVM and Multinomial Naive Bayes follow with lower performance metrics. Hyperparameter tuning further enhances the performance of the algorithms, particularly Logistic Regression. The study makes a significant contribution to the field of biomedical text classification by systematically comparing machine-learning algorithms. Logistic Regression emerges as the most effective, emphasizing the importance of algorithm selection and hyperparameter tuning in machine learning applications within this domain.
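
The evaluation protocol described in the abstract (TF-IDF features, three classifiers, 5-fold cross-validation) could be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' code: the toy corpus, the `TOPIC_WORDS` vocabulary, the helper names, the choice of `LinearSVC` for the SVM, and all hyperparameters are assumptions made only so the example runs on its own. The Kaggle dataset is not loaded, and AUC ROC is left out because it needs probability or decision scores handled per classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical vocabulary used to build a toy corpus (NOT the Kaggle dataset).
TOPIC_WORDS = {
    "colon": ["colon", "colorectal", "polyp", "bowel"],
    "lung": ["lung", "pulmonary", "bronchial", "smoking"],
    "thyroid": ["thyroid", "papillary", "nodule", "iodine"],
}


def make_toy_corpus(n_per_class=10):
    """Build a small labeled corpus with three cancer-type labels."""
    docs, labels = [], []
    for label, words in TOPIC_WORDS.items():
        for i in range(n_per_class):
            first = words[i % len(words)]
            second = words[(i + 1) % len(words)]
            docs.append(f"The study of {first} cancer and {second} in patients")
            labels.append(label)
    return docs, labels


# The three classifier families named in the abstract; settings are assumed.
CLASSIFIERS = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
    "Multinomial Naive Bayes": MultinomialNB(),
}


def evaluate_models(docs, labels, n_splits=5):
    """Return mean cross-validated accuracy and macro F1 for each classifier."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    results = {}
    for name, clf in CLASSIFIERS.items():
        # TfidfVectorizer tokenizes, lowercases, and drops English stop words;
        # fitting it inside the pipeline keeps test folds out of the vocabulary.
        pipeline = make_pipeline(
            TfidfVectorizer(lowercase=True, stop_words="english"), clf
        )
        scores = cross_validate(
            pipeline, docs, labels, cv=cv, scoring=["accuracy", "f1_macro"]
        )
        results[name] = {
            "accuracy": float(scores["test_accuracy"].mean()),
            "f1_macro": float(scores["test_f1_macro"].mean()),
        }
    return results


docs, labels = make_toy_corpus()
results = evaluate_models(docs, labels)
for name, metrics in results.items():
    print(f"{name}: accuracy={metrics['accuracy']:.3f}, f1_macro={metrics['f1_macro']:.3f}")
```

Scores on this toy corpus say nothing about the figures reported in the abstract. The hyperparameter tuning mentioned there could plausibly be approached by wrapping each pipeline in scikit-learn's `GridSearchCV`, though the abstract does not state which search method or parameter grid was used.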