| Peer-Reviewed

Simplified Data Processing for Large Cluster: A MapReduce and Hadoop Based Study

Received: 29 May 2021     Accepted: 21 June 2021     Published: 9 July 2021
Abstract

With the rapid development of computing technologies, the volume of data is growing at an ever-increasing rate. Data scientists are overwhelmed by this flood of data, which requires ever more processing capacity, and the central concern for large-scale data is how to support the decision-making process. In this study we apply the MapReduce programming model and its associated implementation introduced by Google. The model expresses a computation as two functions, Map and Reduce; the MapReduce library automatically parallelizes the computation and handles complex tasks such as data distribution, load balancing, and fault tolerance. Together with Google's original formulation, the open-source implementation, Hadoop, aims to run such computations on large clusters of commodity machines. Using the MapReduce and Hadoop frameworks, we discuss how terabytes and petabytes of data can be stored and processed by thousands of machines working in parallel at the same time, so that large-scale processing and manipulation of big data can be carried out effectively. This study presents the basics of MapReduce programming and of the open-source Hadoop framework, and shows that the Hadoop system can speed up the handling of big data and respond very quickly.
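
To make the programming model concrete, the sketch below is the canonical word-count job from the Hadoop MapReduce tutorial; it is a minimal illustration of the two-function style described above, not code taken from this paper. The Map function emits a (word, 1) pair for every token in its input split, the Reduce function sums the counts for each word, and the framework itself handles splitting the input across HDFS, scheduling the parallel map and reduce tasks, and re-running tasks on machines that fail.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: emit (word, 1) for every token in the input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce: sum the per-word counts; also usable as a combiner
      // for local pre-aggregation before the shuffle phase.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged as a jar, such a job is typically launched with "hadoop jar wordcount.jar WordCount <input> <output>", where both paths live in HDFS; the same two-function program runs unchanged whether the cluster has one node or thousands.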

Published in Advances in Applied Sciences (Volume 6, Issue 3)
DOI 10.11648/j.aas.20210603.11
Page(s) 43-48
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2021. Published by Science Publishing Group

Keywords

Google MapReduce Processes, Hadoop, Parallel Data Processing, HDFS, Cloud Computing, Large Cluster Data Processing

References
[1] J. R. Swedlow et al., "Channeling the data deluge," Nature Methods, vol. 8, pp. 463-465, 2011.
[2] S. Maitrey and C. K. Jha, "An Integrated Approach for CURE Clustering using Map-Reduce Techniques," in Proceedings of Elsevier, vol. 2, 2013.
[3] D. DeWitt and M. Stonebraker, "MapReduce: A major step backwards," The Database Column, 2008.
[4] Y. Kim and K. Shim, "Parallel Top-K Similarity Join Algorithms Using MapReduce," Arlington, VA, USA, 2012.
[5] J. Shafer, S. Rixner and A. L. Cox, "The Hadoop distributed filesystem: Balancing portability and performance," White Plains, NY, USA, 2010.
[6] C. A. Moturi et al., "Use of MapReduce for Data Mining and Data Optimization on a Web Portal," International Journal of Computer Applications, vol. 56, no. 7, 2012.
[7] S. Maitrey and C. K. Jha, "MapReduce: Simplified Data Analysis of Big Data," Procedia Computer Science, vol. 57, pp. 563-571, 2015.
[8] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Proc. USENIX OSDI, vol. 4, pp. 137-149, 2004.
[9] R. M. Yoo, A. Romano and C. Kozyrakis, "Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system," Austin, TX, USA, 2009.
[10] K.-H. Lee et al., "Parallel data processing with MapReduce: a survey," ACM SIGMOD Record, vol. 40, no. 4, 2012.
[11] B. Panda, J. S. Herbach, S. Basu and R. J. Bayardo, "PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce," PVLDB, vol. 2, no. 2, pp. 1426-1437, 2009.
[12] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, 2008.
[13] J. Ekanayake, S. Pallickara and G. Fox, "MapReduce for Data Intensive Scientific Analyses," Indianapolis, IN, USA, 2008.
[14] A. Alam and J. Ahmed, "Hadoop Architecture and Its Issues," Las Vegas, NV, USA, 2014.
[15] R. Vijayakumari et al., "Comparative analysis of Google File System and Hadoop Distributed File System," International Journal of Advanced Trends in Computer Science and Engineering, vol. 3, no. 1, pp. 553-558, 2014.
[16] F. Wang et al., "Hadoop high availability through metadata replication," in Proc. First International Workshop on Cloud Data Management, pp. 37-44, 2009.
[17] H.-C. Yang, A. Dasdan, R.-L. Hsiao and D. S. Parker, "Map-reduce-merge: simplified relational data processing on large clusters," in Proc. ACM SIGMOD International Conference on Management of Data, 2007.
Cite This Article
  • APA Style

    Abdiaziz Omar Hassan, Abdulkadir Abdulahi Hasan. (2021). Simplified Data Processing for Large Cluster: A MapReduce and Hadoop Based Study. Advances in Applied Sciences, 6(3), 43-48. https://doi.org/10.11648/j.aas.20210603.11


  • ACS Style

    Abdiaziz Omar Hassan; Abdulkadir Abdulahi Hasan. Simplified Data Processing for Large Cluster: A MapReduce and Hadoop Based Study. Adv. Appl. Sci. 2021, 6(3), 43-48. doi: 10.11648/j.aas.20210603.11


  • AMA Style

    Abdiaziz Omar Hassan, Abdulkadir Abdulahi Hasan. Simplified Data Processing for Large Cluster: A MapReduce and Hadoop Based Study. Adv Appl Sci. 2021;6(3):43-48. doi: 10.11648/j.aas.20210603.11


  • @article{10.11648/j.aas.20210603.11,
      author = {Abdiaziz Omar Hassan and Abdulkadir Abdulahi Hasan},
      title = {Simplified Data Processing for Large Cluster: A MapReduce and Hadoop Based Study},
      journal = {Advances in Applied Sciences},
      volume = {6},
      number = {3},
      pages = {43-48},
      doi = {10.11648/j.aas.20210603.11},
      url = {https://doi.org/10.11648/j.aas.20210603.11},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.aas.20210603.11},
      abstract = {With the rapid development of computing technologies, the volume of data is growing at an ever-increasing rate. Data scientists are overwhelmed by this flood of data, which requires ever more processing capacity, and the central concern for large-scale data is how to support the decision-making process. In this study we apply the MapReduce programming model and its associated implementation introduced by Google. The model expresses a computation as two functions, Map and Reduce; the MapReduce library automatically parallelizes the computation and handles complex tasks such as data distribution, load balancing, and fault tolerance. Together with Google's original formulation, the open-source implementation, Hadoop, aims to run such computations on large clusters of commodity machines. Using the MapReduce and Hadoop frameworks, we discuss how terabytes and petabytes of data can be stored and processed by thousands of machines working in parallel at the same time, so that large-scale processing and manipulation of big data can be carried out effectively. This study presents the basics of MapReduce programming and of the open-source Hadoop framework, and shows that the Hadoop system can speed up the handling of big data and respond very quickly.},
      year = {2021}
    }
    


  • TY  - JOUR
    T1  - Simplified Data Processing for Large Cluster: A MapReduce and Hadoop Based Study
    AU  - Abdiaziz Omar Hassan
    AU  - Abdulkadir Abdulahi Hasan
    Y1  - 2021/07/09
    PY  - 2021
    N1  - https://doi.org/10.11648/j.aas.20210603.11
    DO  - 10.11648/j.aas.20210603.11
    T2  - Advances in Applied Sciences
    JF  - Advances in Applied Sciences
    JO  - Advances in Applied Sciences
    SP  - 43
    EP  - 48
    PB  - Science Publishing Group
    SN  - 2575-1514
    UR  - https://doi.org/10.11648/j.aas.20210603.11
    AB  - With the rapid development of computing technologies, the volume of data is growing at an ever-increasing rate. Data scientists are overwhelmed by this flood of data, which requires ever more processing capacity, and the central concern for large-scale data is how to support the decision-making process. In this study we apply the MapReduce programming model and its associated implementation introduced by Google. The model expresses a computation as two functions, Map and Reduce; the MapReduce library automatically parallelizes the computation and handles complex tasks such as data distribution, load balancing, and fault tolerance. Together with Google's original formulation, the open-source implementation, Hadoop, aims to run such computations on large clusters of commodity machines. Using the MapReduce and Hadoop frameworks, we discuss how terabytes and petabytes of data can be stored and processed by thousands of machines working in parallel at the same time, so that large-scale processing and manipulation of big data can be carried out effectively. This study presents the basics of MapReduce programming and of the open-source Hadoop framework, and shows that the Hadoop system can speed up the handling of big data and respond very quickly.
    VL  - 6
    IS  - 3
    ER  - 


Author Information
  • Abdiaziz Omar Hassan, College of Mathematics and Big Data, Anhui University of Science and Technology, Huainan, China

  • Abdulkadir Abdulahi Hasan, College of Mathematics and Big Data, Anhui University of Science and Technology, Huainan, China
