Peer-Reviewed

Mixture Model Clustering Using Variable Data Segmentation and Model Selection: A Case Study of Genetic Algorithm

Received: 23 August 2019     Accepted: 6 September 2019     Published: 23 September 2019
Abstract

This study proposes a genetic algorithm for mixture model clustering based on variable data segmentation and model selection. The principle of the method is demonstrated on mixture model clustering of the Ruspini data set. The number of segments for each variable in the data set is determined, and the variables are converted into categorical variables. Variable data segmentation is shown to determine the number and structure of the cluster centers in the data. Genetic algorithms are used to determine the number of finite mixture models. The total number of mixture models, and the possible candidate mixture models among them, are calculated from the cluster centers formed by variable data segmentation. Mixtures of normal distributions are used for mixture model clustering. Maximum likelihood, AIC, and BIC values are obtained for each candidate mixture model from the parameters estimated from the data. The candidate mixture models are built from the sample means and variance-covariance matrices of the data set in order to determine the number and structure of the clusters. The best mixture model for model-based clustering of the data is selected from the possible candidate mixture models according to the information criteria. The number of components in the best mixture model corresponds to the number of clusters, and its components correspond to the structure of the clusters in the data set.
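
For reference, the quantities named in the abstract can be written in their standard textbook form. The notation below (G components, p variables, n observations, k free parameters) is an assumption of this note and may differ from the notation in the full paper; the parameter count assumes unrestricted component covariance matrices.

```latex
% Finite mixture of G multivariate normal components
f(x) = \sum_{g=1}^{G} \pi_g \, \phi(x; \mu_g, \Sigma_g),
\qquad \pi_g > 0, \quad \sum_{g=1}^{G} \pi_g = 1

% Log-likelihood of n observations
\log L = \sum_{i=1}^{n} \log f(x_i)

% Information criteria (smaller is better)
\mathrm{AIC} = -2 \log \hat{L} + 2k
\qquad
\mathrm{BIC} = -2 \log \hat{L} + k \log n

% Free parameters: mixing weights, means, covariance matrices
k = (G - 1) + Gp + G \, \frac{p(p+1)}{2}
```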

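The workflow described in the abstract can be illustrated with a minimal sketch. Several choices here are assumptions rather than the authors' method: each variable is cut into equal-width segments, every non-empty combination of segments is treated as one candidate component, and a small exhaustive search over segment counts stands in for the genetic algorithm. The function names `segment_labels` and `mixture_criteria` are hypothetical, and the data are synthetic, not the Ruspini data set. The criteria used are AIC = -2 log L + 2k and BIC = -2 log L + k log n.

```python
import itertools
import numpy as np


def segment_labels(X, n_segments):
    """Cut variable j into n_segments[j] equal-width intervals (an assumption)
    and return one integer cell id per row of X."""
    cell = np.zeros(len(X), dtype=int)
    for j, k in enumerate(n_segments):
        edges = np.linspace(X[:, j].min(), X[:, j].max(), k + 1)
        # Interior edges only, so the segment codes run from 0 to k - 1.
        codes = np.searchsorted(edges[1:-1], X[:, j], side="right")
        cell = cell * k + codes
    return cell


def mixture_criteria(X, labels):
    """Log-likelihood, AIC and BIC of a normal mixture whose components are
    described by the sample mean and covariance of each group in labels."""
    n, p = X.shape
    groups = np.unique(labels)
    dens = np.zeros(n)
    for g in groups:
        Xg = X[labels == g]
        if len(Xg) <= p:
            return None  # too few points for a usable covariance matrix
        mean = Xg.mean(axis=0)
        cov = np.cov(Xg, rowvar=False, bias=True).reshape(p, p)
        diff = X - mean
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
        _, logdet = np.linalg.slogdet(cov)
        log_phi = -0.5 * (p * np.log(2 * np.pi) + logdet + maha)
        dens += (len(Xg) / n) * np.exp(log_phi)  # weight = group proportion
    loglik = np.log(dens).sum()
    G = len(groups)
    k = (G - 1) + G * p + G * p * (p + 1) // 2  # free parameters
    return {"G": G, "loglik": loglik,
            "aic": -2 * loglik + 2 * k,
            "bic": -2 * loglik + k * np.log(n)}


# Synthetic two-variable data with four well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [0.0, 10.0], [10.0, 0.0], [10.0, 10.0]])
X = np.vstack([c + rng.normal(size=(40, 2)) for c in centers])

# Exhaustive comparison of candidate segmentations (stand-in for the GA).
results = []
for n_segments in itertools.product([1, 2, 3], repeat=2):
    crit = mixture_criteria(X, segment_labels(X, n_segments))
    if crit is not None:
        results.append((crit["bic"], n_segments, crit["G"]))

best_bic, best_segments, best_G = min(results)
print(best_segments, best_G, round(best_bic, 1))
```

Candidates that contain a cell with too few points are skipped rather than fitted, which is one simple way to avoid singular covariance matrices. A genetic algorithm would replace the exhaustive loop when the number of candidate segmentations is too large to enumerate.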
Published in Mathematics Letters (Volume 5, Issue 2)
DOI 10.11648/j.ml.20190502.12
Page(s) 23-32
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2019. Published by Science Publishing Group

Keywords

Cluster Centers, Data Clustering, Data Mining, Genetic Algorithm, Information Criteria, Mixture Model Clustering, Model Selection, Variable Data Segmentation

Cite This Article
  • APA Style

    Maruf Gogebakan, Hamza Erol. (2019). Mixture Model Clustering Using Variable Data Segmentation and Model Selection: A Case Study of Genetic Algorithm. Mathematics Letters, 5(2), 23-32. https://doi.org/10.11648/j.ml.20190502.12


    ACS Style

    Maruf Gogebakan; Hamza Erol. Mixture Model Clustering Using Variable Data Segmentation and Model Selection: A Case Study of Genetic Algorithm. Math. Lett. 2019, 5(2), 23-32. doi: 10.11648/j.ml.20190502.12


    AMA Style

    Maruf Gogebakan, Hamza Erol. Mixture Model Clustering Using Variable Data Segmentation and Model Selection: A Case Study of Genetic Algorithm. Math Lett. 2019;5(2):23-32. doi: 10.11648/j.ml.20190502.12


  • @article{10.11648/j.ml.20190502.12,
      author = {Maruf Gogebakan and Hamza Erol},
      title = {Mixture Model Clustering Using Variable Data Segmentation and Model Selection: A Case Study of Genetic Algorithm},
      journal = {Mathematics Letters},
      volume = {5},
      number = {2},
      pages = {23-32},
      doi = {10.11648/j.ml.20190502.12},
      url = {https://doi.org/10.11648/j.ml.20190502.12},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ml.20190502.12},
     year = {2019}
    }
    


  • TY  - JOUR
    T1  - Mixture Model Clustering Using Variable Data Segmentation and Model Selection: A Case Study of Genetic Algorithm
    AU  - Maruf Gogebakan
    AU  - Hamza Erol
    Y1  - 2019/09/23
    PY  - 2019
    N1  - https://doi.org/10.11648/j.ml.20190502.12
    DO  - 10.11648/j.ml.20190502.12
    T2  - Mathematics Letters
    JF  - Mathematics Letters
    JO  - Mathematics Letters
    SP  - 23
    EP  - 32
    PB  - Science Publishing Group
    SN  - 2575-5056
    UR  - https://doi.org/10.11648/j.ml.20190502.12
    VL  - 5
    IS  - 2
    ER  - 


Author Information
  • Department of Maritime Business and Administration, Maritime Faculty, Bandirma Onyedi Eylul University, Bandirma, Turkey

  • Department of Computer Engineering, Faculty of Engineering, Mersin University, Mersin, Turkey
