Comparative Analysis of Classical ML Baselines on Noisy and Imbalanced Occupational Lung Disease Data in Vietnam

Phuong Luong-Thi-Bich; Quan Nguyen-Minh; Hung Vo-Tri

Research Article

Journal of Advanced Artificial Intelligence

Foundation of Computer Science (FCS), NY, USA

Volume 2 - Issue 4

Published: January 2026

Phuong Luong-Thi-Bich, Quan Nguyen-Minh, Hung Vo-Tri. Comparative Analysis of Classical ML Baselines on Noisy and Imbalanced Occupational Lung Disease Data in Vietnam. Journal of Advanced Artificial Intelligence. 2, 4 (January 2026), 20-26.

@article{ placeholder_doi,
  author    = { Phuong Luong-Thi-Bich and Quan Nguyen-Minh and Hung Vo-Tri },
  title     = { Comparative Analysis of Classical ML Baselines on Noisy and Imbalanced Occupational Lung Disease Data in Vietnam },
  journal   = { Journal of Advanced Artificial Intelligence },
  year      = { 2026 },
  volume    = { 2 },
  number    = { 4 },
  pages     = { 20-26 },
  publisher = { Foundation of Computer Science (FCS), NY, USA }
}

%0 Journal Article
%D 2026
%A Phuong Luong-Thi-Bich
%A Quan Nguyen-Minh
%A Hung Vo-Tri
%T Comparative Analysis of Classical ML Baselines on Noisy and Imbalanced Occupational Lung Disease Data in Vietnam
%J Journal of Advanced Artificial Intelligence
%V 2
%N 4
%P 20-26
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Occupational lung disease is one of the most serious health problems affecting the global workforce, and early prediction of disease risk is important for prevention and medical intervention. This study compares four classical machine learning models, Random Forest (RF), XGBoost, Logistic Regression (LR), and Support Vector Machine (SVM), on the same occupational lung disease dataset, which was manually processed and encoded. The experimental results show that XGBoost achieves the best performance, with an accuracy of 98.34% and a macro F1-score of 0.7996, followed by LR, RF, and SVM. In addition, the feature analysis shows that each model relies on different factors, suggesting that combining multiple models could improve predictive performance.
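
The abstract reports accuracy and macro F1 side by side, and the two can diverge sharply on imbalanced data. The sketch below is a minimal illustration of that effect in plain Python. The labels, class counts, and predictions are invented for illustration and are not taken from the study's dataset or results; the helper names (`f1_for_class`, `macro_f1`) are likewise hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1 for one class, treating `label` as positive and the rest as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so a rare class counts as much as a common one."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Invented toy data (assumption): 980 healthy (0) and 20 diseased (1) workers.
y_true = [0] * 980 + [1] * 20
# Hypothetical classifier: 5 false alarms among the healthy,
# and only 10 of the 20 diseased cases detected.
y_pred = [0] * 975 + [1] * 5 + [1] * 10 + [0] * 10

acc = accuracy(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.4f}, macro F1 = {macro:.4f}")
```

In this toy case the majority class dominates accuracy, while the minority class F1 (10 true positives, 5 false positives, 10 false negatives) pulls the macro average down. This is why macro F1 is usually the more informative of the two figures when classes are imbalanced.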

Index Terms

Computer Science

Information Sciences

Keywords

Occupational lung disease, classical machine learning, Random Forest, XGBoost, Logistic Regression, SVM