In most research fields, the amount of data produced is growing rapidly. The analysis of big data offers potentially unlimited opportunities for information discovery. However, high dimensionality and the presence of outliers call for a suitable dimensionality reduction (DR) algorithm. By performing dimensionality reduction, we can learn low-dimensional embeddings that capture most of the variability in the data. This study proposes Neighbourhood Components Analysis (NCA), a nearest-neighbour-based, non-parametric method for learning low-dimensional linear embeddings of labelled data; that is, the approach uses class labels to guide the DR process. NCA learns a low-dimensional linear projection of the feature space that improves the performance of a nearest-neighbour classifier in the projected space. Because the method avoids parametric assumptions about the data, it can work well with complex or multi-modal data, as is the case with most real-world data. We evaluated the efficiency of the method by comparing the classification errors and class separability of the embedded data with those of Principal Component Analysis (PCA). The results show a substantial reduction in the dimensionality of the data, from 754 to 55 dimensions, and NCA outperformed PCA in classification error across a range of embedding dimensions. Analysis of real and simulated datasets showed that the proposed algorithm is generally insensitive to increases in the number of outliers and irrelevant features and consistently outperformed the classical PCA method.
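As a concrete illustration of the comparison described in the abstract, the following is a minimal sketch (not the authors' code) using scikit-learn's NeighborhoodComponentsAnalysis and PCA as the two embedding methods ahead of a k-NN classifier. The synthetic dataset, embedding dimension, and noise levels below are illustrative assumptions only; they do not reproduce the paper's actual data, which was reduced from 754 to 55 dimensions.

```python
# Minimal sketch of the NCA-vs-PCA comparison, under assumed data settings.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic labelled data with many irrelevant features, mimicking the
# high-dimensional setting the paper targets (values are illustrative).
X, y = make_classification(
    n_samples=1000,
    n_features=200,      # total dimensions
    n_informative=20,    # features that actually carry class signal
    n_redundant=30,
    n_classes=3,
    flip_y=0.05,         # a few mislabelled points act as outliers
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

n_components = 10  # target embedding dimension (illustrative)

for name, reducer in [
    ("NCA", NeighborhoodComponentsAnalysis(n_components=n_components, random_state=0)),
    ("PCA", PCA(n_components=n_components, random_state=0)),
]:
    # NCA uses the class labels to learn its linear projection;
    # PCA ignores them and keeps only directions of maximum variance.
    model = make_pipeline(
        StandardScaler(), reducer, KNeighborsClassifier(n_neighbors=5)
    )
    model.fit(X_train, y_train)
    error = 1.0 - model.score(X_test, y_test)
    print(f"{name}: k-NN classification error = {error:.3f}")
```

Because NCA's projection is trained on the class labels while PCA's is not, the supervised embedding typically yields a lower k-NN error when many features are irrelevant, which mirrors the behaviour the abstract reports.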
Published in: International Journal of Data Science and Analysis (Volume 8, Issue 3)
DOI: 10.11648/j.ijdsa.20220803.11
Page(s): 72-81
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright: Copyright © The Author(s), 2022. Published by Science Publishing Group
Keywords: Dimensionality Reduction, Neighbourhood Components Analysis (NCA), Principal Component Analysis (PCA), Outlier Detection
APA Style
Kariuki, H., Mwalili, S., & Waititu, A. (2022). Dimensionality Reduction of Data with Neighbourhood Components Analysis. International Journal of Data Science and Analysis, 8(3), 72-81. https://doi.org/10.11648/j.ijdsa.20220803.11
ACS Style
Kariuki, H.; Mwalili, S.; Waititu, A. Dimensionality Reduction of Data with Neighbourhood Components Analysis. Int. J. Data Sci. Anal. 2022, 8(3), 72-81. doi: 10.11648/j.ijdsa.20220803.11
@article{10.11648/j.ijdsa.20220803.11,
  author  = {Hannah Kariuki and Samuel Mwalili and Anthony Waititu},
  title   = {Dimensionality Reduction of Data with Neighbourhood Components Analysis},
  journal = {International Journal of Data Science and Analysis},
  volume  = {8},
  number  = {3},
  pages   = {72-81},
  year    = {2022},
  doi     = {10.11648/j.ijdsa.20220803.11},
  url     = {https://doi.org/10.11648/j.ijdsa.20220803.11},
  eprint  = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ijdsa.20220803.11}
}
TY  - JOUR
T1  - Dimensionality Reduction of Data with Neighbourhood Components Analysis
AU  - Hannah Kariuki
AU  - Samuel Mwalili
AU  - Anthony Waititu
Y1  - 2022/05/10
PY  - 2022
DO  - 10.11648/j.ijdsa.20220803.11
T2  - International Journal of Data Science and Analysis
JF  - International Journal of Data Science and Analysis
JO  - International Journal of Data Science and Analysis
SP  - 72
EP  - 81
VL  - 8
IS  - 3
SN  - 2575-1891
UR  - https://doi.org/10.11648/j.ijdsa.20220803.11
PB  - Science Publishing Group
ER  -