Gene sequence alignment, used to recognize the homology and variability in different species, is an important part of Bioinformatics. Creating indexes is a crucial step of gene sequence alignment algorithm. Usual algorithms of creating indexes are divided into two types. The first is algorithm based on hash table, while another is based on suffix tree or suffix array, among which BWT (Burrows-Wheeler Transform) index is a significant index structure. Currently, BWT index needs several hours’ serial computing in building a large genome sequence (such as human genome sequence). A parallel computing method based on Hadoop is presented is this paper to build suffix array and BWT index. Map Reduce is adopted as a type of data processing function, cutting suffix array into block, which will be handled separately. Ultimately, totally ordered suffix array and BWT index are output, reducing the time in building index. Meanwhile, verifying the effectiveness of the algorithm by experiments.
Published in | International Journal of Genetics and Genomics (Volume 4, Issue 3) |
DOI | 10.11648/j.ijgg.20160403.13 |
Page(s) | 24-30 |
Creative Commons |
This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited. |
Copyright |
Copyright © The Author(s), 2016. Published by Science Publishing Group |
Gene Sequence, BWT Index, Suffix Array, Hadoop
[1] | ZHAO Ya-Nan Research on BWT index and algorithms in biological sequence alignment [D], University of Science and Technology of China.)2015. |
[2] | MCKENNA A, HANNA M, BANKS E, et al. The Genome Analysis Toolkit: a Map Reduce framework for analyzing next-generation DNA sequencing data. [J]. Genome research, 2010, 20(9):1297-1303. |
[3] | Niek B, Inna D, Lior P. AVID: a global alignment Program. Genome Research [J]. 2003, 13(1): 97–102. |
[4] | Delcher A L, Phillippy A, Carlton J, et al. Fast algorithms for large-scale genome alignment and comparison [J]. Nucleic Acids es, 2002, 30(11): 2478–2483. |
[5] | Kurtz S, Phillippy A L, Smoot M, et al. ersatile and open software for comparing large genomes[J]. Genome Biol, 2004, 5(2): R12. |
[6] | FERRAGINA P, MANZINI G., MAKINEN V, et al. AN Alphabet-Friendly FM-index[C]//APOSTOLICO A, MELUCCI M.11th International |
[7] | LANGMEAD B, TRAPNELL C, POP M, et al. Ultrafast and memory-efficient alignment of short DNA sequence to the human genome [J]. Genome Biology, 2009, 10(3): R25. |
[8] | Li H, Durbin R. Fast and accurate long-read alignment with” Burrow Wheeler” transform [J]. Bioinformatics, 2010, 26(5):589-595. |
[9] | LAMT W, SUNG WK, TAM SL, et al. Compressed indexing and local alignment of DNA [J]. Bioinformatics, 2008, 24(6): 791-797. |
[10] | Burrows M. A block-sorting lossless data compression algorithm [J]. Technical Report Digital Src Research Report, 1994, 57(4): 425. |
[11] | GAO Jing, JIAO Ya ZHANG Wen-Guang Overview of Sequence Alignment for HIGH-throughput Sequencing Data[J] LIFE SCIENCE RESEARCH 2014(05). |
[12] | Li Bing, Long Bing-jie, Liu Yong. A Fast Algorithm for Burrows-Wheeler Transform Using Suffix Sorting [J]. Journal of Electronics & Information Technology, 2015(02): 504-508. |
[13] | YANG Xiao-Liang, The Application Case Study of Mapreduce Parallel Computation and the Optimization of Its Runtime Framework [D] Nanjing University, 2012. |
[14] | FOSTER I, YONG ZHAO, RAICU I, et al. Cloud computing and grid computing 360-degree compared [C] Proceedings of the 2008Grid Computing Environments |
[15] | DEAN J, GHEMAWAT S. Map Reduce: simplified data processing on large clusters[C] //Proceedings of the 6th Symposium on Opera-ting System Design and Implementation. New York: ACM, 2004: 137-150 |
[16] | DONG Xicheng. Hadoop Technology Insider [M]. Beijing: China Machine Press), 2013: 29-32 |
[17] | Schatz M C. Cloud Burst: highly sensitive read mapping with Map Reduce. [J]. Bioinformatics, 2009, 25(11): 1363–1369. |
[18] | “ftp://ftp.ncbi.nih.gov/” |
[19] | White T. Hadoop: The Definitive Guide [J]. O’reilly Media Inc Gravenstein Highway North, 2010, 215(11):1-4. |
[20] | MIAO Su-Chao, An Anchor-based Algorithm for Multiple Genome Alignment [D], Xidian University), 2010. |
APA Style
Nan Li, Jing Gao, Bailong Feng. (2016). The Method of a Gene Sequence Alignment BWT Index Based on Hadoop. International Journal of Genetics and Genomics, 4(3), 24-30. https://doi.org/10.11648/j.ijgg.20160403.13
ACS Style
Nan Li; Jing Gao; Bailong Feng. The Method of a Gene Sequence Alignment BWT Index Based on Hadoop. Int. J. Genet. Genomics 2016, 4(3), 24-30. doi: 10.11648/j.ijgg.20160403.13
AMA Style
Nan Li, Jing Gao, Bailong Feng. The Method of a Gene Sequence Alignment BWT Index Based on Hadoop. Int J Genet Genomics. 2016;4(3):24-30. doi: 10.11648/j.ijgg.20160403.13
@article{10.11648/j.ijgg.20160403.13, author = {Nan Li and Jing Gao and Bailong Feng}, title = {The Method of a Gene Sequence Alignment BWT Index Based on Hadoop}, journal = {International Journal of Genetics and Genomics}, volume = {4}, number = {3}, pages = {24-30}, doi = {10.11648/j.ijgg.20160403.13}, url = {https://doi.org/10.11648/j.ijgg.20160403.13}, eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ijgg.20160403.13}, abstract = {Gene sequence alignment, used to recognize the homology and variability in different species, is an important part of Bioinformatics. Creating indexes is a crucial step of gene sequence alignment algorithm. Usual algorithms of creating indexes are divided into two types. The first is algorithm based on hash table, while another is based on suffix tree or suffix array, among which BWT (Burrows-Wheeler Transform) index is a significant index structure. Currently, BWT index needs several hours’ serial computing in building a large genome sequence (such as human genome sequence). A parallel computing method based on Hadoop is presented is this paper to build suffix array and BWT index. Map Reduce is adopted as a type of data processing function, cutting suffix array into block, which will be handled separately. Ultimately, totally ordered suffix array and BWT index are output, reducing the time in building index. Meanwhile, verifying the effectiveness of the algorithm by experiments.}, year = {2016} }
TY - JOUR T1 - The Method of a Gene Sequence Alignment BWT Index Based on Hadoop AU - Nan Li AU - Jing Gao AU - Bailong Feng Y1 - 2016/07/12 PY - 2016 N1 - https://doi.org/10.11648/j.ijgg.20160403.13 DO - 10.11648/j.ijgg.20160403.13 T2 - International Journal of Genetics and Genomics JF - International Journal of Genetics and Genomics JO - International Journal of Genetics and Genomics SP - 24 EP - 30 PB - Science Publishing Group SN - 2376-7359 UR - https://doi.org/10.11648/j.ijgg.20160403.13 AB - Gene sequence alignment, used to recognize the homology and variability in different species, is an important part of Bioinformatics. Creating indexes is a crucial step of gene sequence alignment algorithm. Usual algorithms of creating indexes are divided into two types. The first is algorithm based on hash table, while another is based on suffix tree or suffix array, among which BWT (Burrows-Wheeler Transform) index is a significant index structure. Currently, BWT index needs several hours’ serial computing in building a large genome sequence (such as human genome sequence). A parallel computing method based on Hadoop is presented is this paper to build suffix array and BWT index. Map Reduce is adopted as a type of data processing function, cutting suffix array into block, which will be handled separately. Ultimately, totally ordered suffix array and BWT index are output, reducing the time in building index. Meanwhile, verifying the effectiveness of the algorithm by experiments. VL - 4 IS - 3 ER -