COVID-19, a disease first reported in December 2019, spreads from person to person through contact and causes symptoms such as cough, fever, and muscle pain. Diagnosis is usually made with a polymerase chain reaction (PCR) test on samples collected from the nasopharyngeal area. Machine learning and deep learning are now used to analyze data such as confirmed cases or mortality and to distinguish X-ray images of COVID-19 patients from those of other patients. Few earlier studies, however, have predicted the important features that influence COVID-19 infection, so this study mainly addresses the influence of related features. Our data include demographic, geographic, and severity information for Toronto. The experiment proceeded in this order: data import, label encoding, correlation matrix, train-test split, min-max normalization, machine learning models, GridSearchCV, and feature importance. We applied a boosting algorithm and the light gradient boosting machine (LGBM) to increase accuracy and speed, GridSearchCV to find the best hyperparameters for the models, and a feature importance function to find the importance of each variable. Of the two experiments, the first, which used a feature-selected model, identified outbreak associated, FSA (forward sortation area), and classification as important features, with 88 percent accuracy. The second, which used all features without selection, identified neighborhood name, FSA, and age group as important features, with accuracy mostly around 89 percent. The data contained mostly geographical rather than personal information, which may have influenced both the accuracy and the finding that geographical features are key features of infection. Nevertheless, the model used in the experiment offers fast computation and low memory usage, and it showed strong performance.
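The preprocessing steps named above (label encoding, train-test split, min-max normalization) can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the study: the toy records, the helper names (`label_encode`, `train_test_split`, `fit_min_max`, `apply_min_max`), the 25 percent test fraction, and the seed are all assumptions made for the example, and the model-fitting steps (LGBM, GridSearchCV, feature importance) are left out.

```python
import random


def label_encode(values):
    """Map each distinct category to an integer code, assigned in sorted order."""
    classes = sorted(set(values))
    mapping = {c: i for i, c in enumerate(classes)}
    return [mapping[v] for v in values], mapping


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle row indices with a fixed seed and hold out a fraction as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


def fit_min_max(rows):
    """Per-column minimum and maximum, learned from the given (training) rows only."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Rescale each column relative to the fitted range; constant columns map to 0."""
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# Hypothetical toy records: (age group, FSA, outbreak associated).
# These are made-up values for illustration, not the Toronto dataset.
records = [
    ("20-29", "M5V", "Sporadic"),
    ("60-69", "M1B", "Outbreak Associated"),
    ("40-49", "M5V", "Sporadic"),
    ("20-29", "M9W", "Outbreak Associated"),
    ("80-89", "M1B", "Outbreak Associated"),
    ("40-49", "M9W", "Sporadic"),
    ("60-69", "M5V", "Sporadic"),
    ("20-29", "M1B", "Sporadic"),
]

# Label-encode each column independently, then reassemble rows of integer codes.
encoded_columns = [label_encode(list(col))[0] for col in zip(*records)]
encoded_rows = [list(row) for row in zip(*encoded_columns)]

# Split first, then fit the scaler on the training rows and reuse it on the test rows.
train_rows, test_rows = train_test_split(encoded_rows, test_fraction=0.25, seed=0)
mins, maxs = fit_min_max(train_rows)
train_scaled = apply_min_max(train_rows, mins, maxs)
test_scaled = apply_min_max(test_rows, mins, maxs)
```

Fitting the minimum and maximum on the training rows only follows the order given in the abstract (split, then normalization) and keeps information from the test rows out of the scaler. One consequence is that scaled test values can fall outside the range 0 to 1 when a test row lies beyond the training range.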
Published in | American Journal of Mathematical and Computer Modelling (Volume 6, Issue 3) |
DOI | 10.11648/j.ajmcm.20210603.11 |
Page(s) | 43-49 |
Creative Commons | This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited. |
Copyright | Copyright © The Author(s), 2021. Published by Science Publishing Group |
Keywords | COVID-19, Machine Learning, Random Forest, LGBM, Artificial Intelligence |
APA Style
Yein Choi. (2021). Predicting Important Features That Influence COVID-19 Infection Through Light Gradient Boosting Machine: Case of Toronto. American Journal of Mathematical and Computer Modelling, 6(3), 43-49. https://doi.org/10.11648/j.ajmcm.20210603.11
ACS Style
Yein Choi. Predicting Important Features That Influence COVID-19 Infection Through Light Gradient Boosting Machine: Case of Toronto. Am. J. Math. Comput. Model. 2021, 6(3), 43-49. doi: 10.11648/j.ajmcm.20210603.11
AMA Style
Yein Choi. Predicting Important Features That Influence COVID-19 Infection Through Light Gradient Boosting Machine: Case of Toronto. Am J Math Comput Model. 2021;6(3):43-49. doi: 10.11648/j.ajmcm.20210603.11
@article{10.11648/j.ajmcm.20210603.11,
  author = {Yein Choi},
  title = {Predicting Important Features That Influence COVID-19 Infection Through Light Gradient Boosting Machine: Case of Toronto},
  journal = {American Journal of Mathematical and Computer Modelling},
  volume = {6},
  number = {3},
  pages = {43-49},
  doi = {10.11648/j.ajmcm.20210603.11},
  url = {https://doi.org/10.11648/j.ajmcm.20210603.11},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ajmcm.20210603.11},
  year = {2021}
}
TY  - JOUR
T1  - Predicting Important Features That Influence COVID-19 Infection Through Light Gradient Boosting Machine: Case of Toronto
AU  - Yein Choi
Y1  - 2021/07/13
PY  - 2021
N1  - https://doi.org/10.11648/j.ajmcm.20210603.11
DO  - 10.11648/j.ajmcm.20210603.11
T2  - American Journal of Mathematical and Computer Modelling
JF  - American Journal of Mathematical and Computer Modelling
JO  - American Journal of Mathematical and Computer Modelling
SP  - 43
EP  - 49
PB  - Science Publishing Group
SN  - 2578-8280
UR  - https://doi.org/10.11648/j.ajmcm.20210603.11
VL  - 6
IS  - 3
ER  -