The classification of cancer subtype based on machine learning
Liu, Zhihao (2025)
Bachelor's thesis
School of Engineering Science, Information Technology
The permanent address of the publication is
https://urn.fi/URN:NBN:fi-fe2025052654678
Abstract
Effective classification of cancer subtypes plays a crucial role in designing personalized treatment plans and improving patient outcomes. Traditional classification methods may not reflect the complexity and diversity of cancer types. This study uses gene expression profiles from The Cancer Genome Atlas (TCGA) to investigate how machine learning techniques can be improved to identify cancer subtypes.
The study used three machine learning algorithms, random forest (RF), support vector machine (SVM), and k-nearest neighbors (KNN), as well as a stacked ensemble that uses logistic regression as the meta-classifier. It processed gene expression data, together with cancer type labels, from 10,446 TCGA samples spanning 33 cancer types. Model performance was evaluated using overall accuracy and the macro-averaged and weighted-averaged precision, recall, and F1 score.
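The stacking setup described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the data are synthetic (generated with make_classification, not TCGA), and the hyperparameters, feature scaling, and 5-fold stacking are choices made only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for an expression matrix (samples x genes); NOT TCGA data.
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=10, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Base learners: RF, SVM, and KNN. Scaling before SVM and KNN is an assumption
# of this sketch, since both are sensitive to feature scale.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVC())),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
]

# Logistic regression as the meta-classifier. It is trained on out-of-fold
# outputs of the base learners (class probabilities for RF and KNN, decision
# function values for the SVM, which is what stack_method="auto" selects).
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

y_pred = stack.predict(X_test)
accuracy = stack.score(X_test, y_test)
print(f"stacked accuracy on the synthetic hold-out set: {accuracy:.3f}")
```

Training the meta-classifier on out-of-fold predictions keeps it from simply trusting whichever base learner overfits the training set most.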
The results show that the stacked ensemble model outperforms the individual classifiers, achieving an accuracy of 0.7836, a macro-averaged precision of 0.7257, a macro-averaged recall of 0.6746, and a macro-averaged F1 score of 0.6860, with weighted-averaged precision, recall, and F1 score of 0.7713, 0.7763, and 0.7637, respectively. The ensemble combines the different decision boundaries learned by RF, SVM, and KNN, which yields more reliable classification of less common subtypes. These results highlight the benefits of ensemble learning methods in bioinformatics and present a scalable solution for cancer subtype prediction. Future work could enhance this framework by incorporating multi-omics datasets and validating results with external clinical resources.
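The gap between the macro-averaged and weighted-averaged scores reported above comes from how per-class scores are combined. The following standard-library sketch shows the two averaging schemes on a made-up label set. The function name averaged_scores and the ten toy labels are hypothetical and are not taken from the thesis. Classes with no predicted or no true samples are scored as 0, which is a convention assumed here.

```python
from collections import Counter


def averaged_scores(y_true, y_pred):
    """Return accuracy plus macro- and support-weighted precision/recall/F1."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true samples per class
    n = len(y_true)
    macro = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    weighted = {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        for name, value in (("precision", precision), ("recall", recall), ("f1", f1)):
            macro[name] += value / len(labels)           # every class counts equally
            weighted[name] += value * support[label] / n  # weighted by class size

    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    return accuracy, macro, weighted


# Toy labels: one common class, one medium class, one rare class that is missed.
y_true = ["BRCA"] * 6 + ["LUAD"] * 3 + ["UVM"]
y_pred = ["BRCA"] * 6 + ["LUAD", "LUAD", "BRCA"] + ["BRCA"]

accuracy, macro, weighted = averaged_scores(y_true, y_pred)
print(accuracy, macro, weighted)
```

In this toy case the missed rare class pulls the macro-averaged recall well below the weighted-averaged recall, which is why macro scores are the more sensitive indicator of performance on less common subtypes.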