Classification of class-imbalanced data has attracted growing attention from researchers in recent years. The class imbalance problem arises when one of two classes has many more samples than the other. The class with more samples is called the major class, while the other is referred to as the minor class. Most classification or prediction models focus on classifying the major class correctly and largely ignore the minor class. In this paper, various data pre-processing approaches for improving model accuracy are reviewed, with an application to terminated pregnancy data. The data were extracted from the 2013 Nigeria Demographic and Health Survey (NDHS). The response variable is “terminated pregnancy” (women of reproductive age were asked whether they had ever experienced a terminated pregnancy), which has two possible classes (“YES” or “NO”) that exhibit class imbalance. The major class (“NO”) accounts for 86.82% of the sample and represents Nigerian women aged 15–49 years who had never experienced a terminated pregnancy, while the minor class (“YES”) accounts for 13.18%. Different resampling techniques were therefore applied to handle the imbalance and improve model performance. Among the resampling techniques considered, the Synthetic Minority Oversampling Technique (SMOTE) improved the model the most. The following socio-demographic factors were significantly associated with having a terminated pregnancy in Nigeria: age, age at first birth, residential area, region, and the woman's education level.
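The two ideas in the abstract can be illustrated with a small sketch. The first function shows why raw accuracy is misleading at the reported 86.82% / 13.18% split: a model that always predicts “NO” is right 86.82% of the time yet never identifies a minor-class case. The second function is a simplified, SMOTE-like oversampler that creates synthetic minor-class points by interpolating between a minor-class point and one of its nearest minor-class neighbours. This is not the authors' implementation, which is not shown in this section; the function names, the sample size of 10,000, and the toy feature values are all assumptions made for illustration.

```python
import random


def majority_baseline_metrics(n_major, n_minor):
    """Accuracy and minor-class recall of a model that always predicts the major class."""
    total = n_major + n_minor
    accuracy = n_major / total  # every major-class case is right, every minor-class case is wrong
    minority_recall = 0.0       # no minor-class case is ever predicted
    return accuracy, minority_recall


def smote_like_oversample(minor_points, n_new, k=3, seed=0):
    """Create n_new synthetic minor-class points (simplified SMOTE-style sketch).

    Each synthetic point lies on the line segment between a randomly chosen
    minor-class point and one of its k nearest minor-class neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minor_points)
        # Rank the other minor-class points by squared Euclidean distance to base.
        others = [p for p in minor_points if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Class shares from the abstract, scaled to a hypothetical sample of 10,000 women.
acc, rec = majority_baseline_metrics(n_major=8682, n_minor=1318)
print(f"always-NO accuracy = {acc:.4f}, minor-class recall = {rec:.1f}")

# Toy two-feature minor-class points (made-up values, not NDHS data).
minor = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_like_oversample(minor, n_new=6)
print(len(new_points), "synthetic minor-class points")
```

In practice the synthetic points would be appended to the training set only, so that the test set keeps the original class proportions, and the model would be judged with measures that reflect the minor class (for example recall or the area under the ROC curve) rather than accuracy alone.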
Published in: International Journal of Data Science and Analysis (Volume 5, Issue 6)
DOI: 10.11648/j.ijdsa.20190506.13
Page(s): 123-127
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright: Copyright © The Author(s), 2019. Published by Science Publishing Group
Keywords: Classification, Class Imbalanced, Resampling Techniques, Logistic Model, Terminated Pregnancy
APA Style
Samuel Adewale Aderoju, Emmanuel Teju Jolayemi. (2019). Issues of Class Imbalance in Classification of Binary Data: A Review. International Journal of Data Science and Analysis, 5(6), 123-127. https://doi.org/10.11648/j.ijdsa.20190506.13
@article{10.11648/j.ijdsa.20190506.13,
  author = {Samuel Adewale Aderoju and Emmanuel Teju Jolayemi},
  title = {Issues of Class Imbalance in Classification of Binary Data: A Review},
  journal = {International Journal of Data Science and Analysis},
  volume = {5},
  number = {6},
  pages = {123-127},
  doi = {10.11648/j.ijdsa.20190506.13},
  url = {https://doi.org/10.11648/j.ijdsa.20190506.13},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ijdsa.20190506.13},
  year = {2019}
}
TY  - JOUR
T1  - Issues of Class Imbalance in Classification of Binary Data: A Review
AU  - Samuel Adewale Aderoju
AU  - Emmanuel Teju Jolayemi
Y1  - 2019/11/17
PY  - 2019
N1  - https://doi.org/10.11648/j.ijdsa.20190506.13
DO  - 10.11648/j.ijdsa.20190506.13
T2  - International Journal of Data Science and Analysis
JF  - International Journal of Data Science and Analysis
JO  - International Journal of Data Science and Analysis
SP  - 123
EP  - 127
PB  - Science Publishing Group
SN  - 2575-1891
UR  - https://doi.org/10.11648/j.ijdsa.20190506.13
VL  - 5
IS  - 6
ER  -