Comparative analysis of machine learning algorithms for biomedical text document classification: A case study on cancer-related publications


Küçük E., BALIKCI CICEK I., KÜÇÜKAKÇALI Z., Yetiş C.

Medicine Science, vol.13, no.1, pp.171-174, 2024 (Peer-Reviewed Journal)

  • Publication Type: Article
  • Volume: 13 Issue: 1
  • Publication Date: 2024
  • Doi Number: 10.5455/medscience.2023.10.209
  • Journal Name: Medicine Science
  • Journal Indexes: TR DİZİN (ULAKBİM)
  • Page Numbers: pp.171-174
  • Inonu University Affiliated: Yes

Abstract

Biomedical text document classification is an essential task within Natural Language Processing (NLP), a field whose applications range from sentiment analysis to authorship identification. Despite advances in traditional machine-learning algorithms such as Support Vector Machines (SVM) and Logistic Regression, challenges such as data sparsity and high dimensionality persist, and recent years have seen a surge in the use of deep learning models to mitigate them. This study presents a comparative analysis of machine-learning algorithms for classifying biomedical text documents. It uses the "Medical Text Dataset - Cancer Doc Classification" from Kaggle, which comprises 7570 biomedical text documents, each labeled with one of three cancer types (colon, lung, or thyroid). A preprocessing pipeline of tokenization, stop-word removal, and Term Frequency-Inverse Document Frequency (TF-IDF) vectorization is applied. Logistic Regression, SVM, and Multinomial Naive Bayes are evaluated with 5-fold cross-validation using accuracy, precision, recall, F1 score, and area under the ROC curve (AUC ROC). Logistic Regression outperforms the other algorithms, with an accuracy of 78.3% and an AUC ROC of 88.59%; SVM and Multinomial Naive Bayes follow with lower scores. Hyperparameter tuning further improves performance, particularly for Logistic Regression. By systematically comparing these algorithms, the study shows that Logistic Regression is the most effective of the three on this dataset and underscores the importance of algorithm selection and hyperparameter tuning in biomedical text classification.
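
The preprocessing and evaluation steps named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' code: the abstract does not state which tools or settings were used, so the stop-word list, the smoothed IDF formula, the contiguous (unshuffled) fold layout, and every name below (STOP_WORDS, tokenize, tfidf_vectors, k_fold_indices, macro_f1) are assumptions made for illustration only.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the study's list).
STOP_WORDS = {"the", "of", "in", "and", "a", "is", "to", "with", "was"}


def tokenize(text):
    """Lowercase, keep alphabetic runs, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return (sorted vocabulary, one sparse {term: weight} dict per document)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        weights = {t: c * idf[t] for t, c in Counter(tokens).items()}
        # L2-normalise so document length does not dominate the weights.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return sorted(idf), vectors


def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous folds.

    Real experiments would normally shuffle or stratify first; that step
    is omitted here to keep the sketch deterministic.
    """
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

In this sketch, a term that occurs in every document (for example "cancer" in a corpus made up entirely of cancer abstracts) receives the lowest IDF, so class-specific terms such as "colon" or "thyroid" carry more weight. In a full experiment, the sparse vectors from tfidf_vectors would be fed to a classifier trained on each train split and scored on the matching test split, with the per-fold metrics averaged across the five folds.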