Protein evolutionary classification from amino acid sequences is an active research topic in computational biology and bioinformatics. The composition and arrangement of amino acids in a protein sequence carry hints about its evolutionary origins, yet extracting features that map an amino acid sequence to a numerical vector remains a challenging problem. Traditional feature methods extract sequence information from either individual amino acids or kmers, and their classification accuracy is limited. To improve the accuracy of protein evolutionary classification, six new features defined on separated amino acid pairs are proposed, in which composition, arrangement and physical properties are considered for the different combinations of separated amino acid pairs. Unlike the usual treatment of amino acid pairs, the new features describe pairs of amino acids separated by spatial intervals in the sequence, which may better reflect the spatial relationships and characteristics of the amino acids in each pair. To test the performance of the new features, five standard protein evolutionary classification examples are used, and the new features are compared with classical protein sequence features such as averaged property factors (APF), natural vector (NV) and pseudo amino acid composition (PseAAC), as well as kmer versions of these features. Analysis of the area under the precision-recall curve (AUPRC) shows that the new features are effective in evolutionary classification and outperform traditional protein sequence features based on individual amino acids and kmers. Parameter analysis of the novel separated amino acid pair features and the kmer features shows that some medium or longer amino acid pair intervals and kmer lengths may achieve higher classification accuracy. From this analysis, the newly proposed separated amino acid pairs with spatial intervals are shown to be efficient units for extracting protein sequence features, which may capture richer evolutionary information than individual amino acids and kmers.
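To make the idea of a separated amino acid pair concrete, the following is a minimal Python sketch of one plausible composition-style feature: the normalized frequency of every ordered residue pair whose members are separated by a fixed number of intervening positions. The abstract does not give the definitions of the six proposed features, so the function name `separated_pair_composition`, the `gap` parameter and the normalization used here are illustrative assumptions, not the authors' method.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, in a fixed order that defines the vector layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def separated_pair_composition(sequence, gap):
    """Return a 400-dimensional vector of separated-pair frequencies.

    Hypothetical illustration: a pair is (sequence[i], sequence[i + gap + 1]),
    so gap=0 gives adjacent pairs and gap=2 skips two residues in between.
    Pairs containing a non-standard residue are ignored.
    """
    if gap < 0:
        raise ValueError("gap must be non-negative")
    step = gap + 1
    counts = Counter(
        (a, b)
        for a, b in zip(sequence, sequence[step:])
        if a in AMINO_ACIDS and b in AMINO_ACIDS
    )
    total = sum(counts.values())
    # Normalize counts to frequencies; an empty pair set yields all zeros.
    return [
        counts[(a, b)] / total if total else 0.0
        for a, b in product(AMINO_ACIDS, repeat=2)
    ]


if __name__ == "__main__":
    vector = separated_pair_composition("MKVLAAGIVGLLLA", gap=2)
    print(len(vector))  # one entry per ordered pair of the 20 amino acids
```

Varying `gap` corresponds loosely to the pair intervals examined in the parameter analysis; vectors for several gaps could be concatenated before being passed to a classifier, although whether the paper does so is not stated in the abstract.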
Published in | Computational Biology and Bioinformatics (Volume 12, Issue 1) |
DOI | 10.11648/j.cbb.20241201.13 |
Page(s) | 18-31 |
Creative Commons |
This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited. |
Copyright |
Copyright © The Author(s), 2024. Published by Science Publishing Group |
Protein Sequence, Features, Amino Acid Pair, Evolutionary Classification
APA Style
Wan, X. G., Tan, X. Y., Cao, J. (2024). A Study on Novel Amino Acid Pair Features for Protein Evolutionary Classifications. Computational Biology and Bioinformatics, 12(1), 18-31. https://doi.org/10.11648/j.cbb.20241201.13
ACS Style
Wan, X. G.; Tan, X. Y.; Cao, J. A Study on Novel Amino Acid Pair Features for Protein Evolutionary Classifications. Comput. Biol. Bioinform. 2024, 12(1), 18-31. doi: 10.11648/j.cbb.20241201.13
AMA Style
Wan XG, Tan XY, Cao J. A Study on Novel Amino Acid Pair Features for Protein Evolutionary Classifications. Comput Biol Bioinform. 2024;12(1):18-31. doi: 10.11648/j.cbb.20241201.13
@article{10.11648/j.cbb.20241201.13,
  author = {Xiao geng Wan and Xin ying Tan and Jun Cao},
  title = {A Study on Novel Amino Acid Pair Features for Protein Evolutionary Classifications},
  journal = {Computational Biology and Bioinformatics},
  volume = {12},
  number = {1},
  pages = {18-31},
  doi = {10.11648/j.cbb.20241201.13},
  url = {https://doi.org/10.11648/j.cbb.20241201.13},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.cbb.20241201.13},
  year = {2024}
}
TY - JOUR
T1 - A Study on Novel Amino Acid Pair Features for Protein Evolutionary Classifications
AU - Xiao geng Wan
AU - Xin ying Tan
AU - Jun Cao
Y1 - 2024/09/23
PY - 2024
N1 - https://doi.org/10.11648/j.cbb.20241201.13
DO - 10.11648/j.cbb.20241201.13
T2 - Computational Biology and Bioinformatics
JF - Computational Biology and Bioinformatics
JO - Computational Biology and Bioinformatics
SP - 18
EP - 31
PB - Science Publishing Group
SN - 2330-8281
UR - https://doi.org/10.11648/j.cbb.20241201.13
VL - 12
IS - 1
ER -