Research Article | Peer-Reviewed

Performance Evaluation of Hybrid Bert Model on Code-mixed for Hausa-English Using Adapted Pre-trained Data

Received: 8 October 2025     Accepted: 27 January 2026     Published: 21 February 2026
Abstract

This research evaluates the potential of the BERT (Bidirectional Encoder Representations from Transformers) language model on Hausa-English code-mixed text using an adapted pre-trained dataset. The main aim was to establish the benefits of pre-trained models for handling code-mixed data, improving language understanding and context sensitivity for Hausa-English. This objective was achieved by developing a BERT model capable of handling a Hausa-English code-mixed dataset, exploring different machine learning language models, and training the chosen model on the adapted Hausa-English code-mixed data. The research was necessitated by the scarcity of corpora in the Hausa-English code-mixed domain, whereas other language pairs, such as Hindi-English code-mixing, have been well explored. The model was developed using the Python Transformers library. The adapted pre-trained dataset was first pre-processed, tokenized, and fine-tuned to fit the BERT model. The dataset was normalized in the context of code-mixed conversation using annotated language labels to distinguish between English and Hausa segments in the code-mixed text. Appropriate training parameters were set with different optimization strategies for fine-tuning, adjusting the learning rate, batch size, and number of training epochs to optimize performance. The model was evaluated on accuracy, F1-score, precision, and recall for code-mixed tasks. HauBERT, our proposed model, achieved more than 90% accuracy, and the result was compared with state-of-the-art BERT language models. The study recommends that this adapted pre-trained model be applied in large language models for language understanding and context sensitivity.
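The abstract evaluates the model on accuracy, precision, recall, and F1-score over code-mixed text with annotated language labels. As an illustration only, the sketch below computes these token-level metrics for a language-identification output; the gold and predicted label sequences are hypothetical examples, not data from the paper.

```python
# Token-level language-identification metrics for code-mixed text.
# "ha" marks Hausa tokens and "en" marks English tokens; both label
# sequences below are hypothetical illustrations.

def precision_recall_f1(gold, pred, positive):
    """Compute precision, recall, and F1 for one language label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = ["ha", "ha", "en", "en", "ha", "en"]   # annotated language labels
pred = ["ha", "en", "en", "en", "ha", "en"]   # model predictions

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
p, r, f1 = precision_recall_f1(gold, pred, positive="ha")
print(f"accuracy={accuracy:.2f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints: accuracy=0.83 precision=1.00 recall=0.67 f1=0.80
```

In practice one would average the per-language scores (e.g. macro-F1 over the Hausa and English labels) rather than report a single language's score.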

Published in Science Discovery Artificial Intelligence (Volume 1, Issue 1)
DOI 10.11648/j.sdai.20260101.13
Page(s) 14-26
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2026. Published by Science Publishing Group

Keywords

Multilingual, Machine Learning, Conversational Code-mixed, Local Large Language Model

Cite This Article
  • APA Style

    Jakwa, A. G., Franscisca, F. N., Ahmad, A. Y., Ibrahim, M. (2026). Performance Evaluation of Hybrid Bert Model on Code-mixed for Hausa-English Using Adapted Pre-trained Data. Science Discovery Artificial Intelligence, 1(1), 14-26. https://doi.org/10.11648/j.sdai.20260101.13


  • ACS Style

    Jakwa, A. G.; Franscisca, F. N.; Ahmad, A. Y.; Ibrahim, M. Performance Evaluation of Hybrid Bert Model on Code-mixed for Hausa-English Using Adapted Pre-trained Data. Sci. Discov. Artif. Intell. 2026, 1(1), 14-26. doi: 10.11648/j.sdai.20260101.13


  • AMA Style

    Jakwa AG, Franscisca FN, Ahmad AY, Ibrahim M. Performance Evaluation of Hybrid Bert Model on Code-mixed for Hausa-English Using Adapted Pre-trained Data. Sci Discov Artif Intell. 2026;1(1):14-26. doi: 10.11648/j.sdai.20260101.13


  • @article{10.11648/j.sdai.20260101.13,
      author = {Ali Garba Jakwa and Faseki Ngozi Franscisca and Abubakar Yunusa Ahmad and Musa Ibrahim},
  title = {Performance Evaluation of Hybrid Bert Model on Code-mixed for Hausa-English Using Adapted Pre-trained Data},
      journal = {Science Discovery Artificial Intelligence},
      volume = {1},
      number = {1},
      pages = {14-26},
      doi = {10.11648/j.sdai.20260101.13},
      url = {https://doi.org/10.11648/j.sdai.20260101.13},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.sdai.20260101.13},
  abstract = {This research evaluates the potential of the BERT (Bidirectional Encoder Representations from Transformers) language model on Hausa-English code-mixed text using an adapted pre-trained dataset. The main aim was to establish the benefits of pre-trained models for handling code-mixed data, improving language understanding and context sensitivity for Hausa-English. This objective was achieved by developing a BERT model capable of handling a Hausa-English code-mixed dataset, exploring different machine learning language models, and training the chosen model on the adapted Hausa-English code-mixed data. The research was necessitated by the scarcity of corpora in the Hausa-English code-mixed domain, whereas other language pairs, such as Hindi-English code-mixing, have been well explored. The model was developed using the Python Transformers library. The adapted pre-trained dataset was first pre-processed, tokenized, and fine-tuned to fit the BERT model. The dataset was normalized in the context of code-mixed conversation using annotated language labels to distinguish between English and Hausa segments in the code-mixed text. Appropriate training parameters were set with different optimization strategies for fine-tuning, adjusting the learning rate, batch size, and number of training epochs to optimize performance. The model was evaluated on accuracy, F1-score, precision, and recall for code-mixed tasks. HauBERT, our proposed model, achieved more than 90% accuracy, and the result was compared with state-of-the-art BERT language models. The study recommends that this adapted pre-trained model be applied in large language models for language understanding and context sensitivity.},
     year = {2026}
    }
    


  • TY  - JOUR
    T1  - Performance Evaluation of Hybrid Bert Model on Code-mixed for Hausa-English Using Adapted Pre-trained Data
    AU  - Ali Garba Jakwa
    AU  - Faseki Ngozi Franscisca
    AU  - Abubakar Yunusa Ahmad
    AU  - Musa Ibrahim
    Y1  - 2026/02/21
    PY  - 2026
    N1  - https://doi.org/10.11648/j.sdai.20260101.13
    DO  - 10.11648/j.sdai.20260101.13
    T2  - Science Discovery Artificial Intelligence
    JF  - Science Discovery Artificial Intelligence
    JO  - Science Discovery Artificial Intelligence
    SP  - 14
    EP  - 26
    PB  - Science Publishing Group
    UR  - https://doi.org/10.11648/j.sdai.20260101.13
    AB  - This research evaluates the potential of the BERT (Bidirectional Encoder Representations from Transformers) language model on Hausa-English code-mixed text using an adapted pre-trained dataset. The main aim was to establish the benefits of pre-trained models for handling code-mixed data, improving language understanding and context sensitivity for Hausa-English. This objective was achieved by developing a BERT model capable of handling a Hausa-English code-mixed dataset, exploring different machine learning language models, and training the chosen model on the adapted Hausa-English code-mixed data. The research was necessitated by the scarcity of corpora in the Hausa-English code-mixed domain, whereas other language pairs, such as Hindi-English code-mixing, have been well explored. The model was developed using the Python Transformers library. The adapted pre-trained dataset was first pre-processed, tokenized, and fine-tuned to fit the BERT model. The dataset was normalized in the context of code-mixed conversation using annotated language labels to distinguish between English and Hausa segments in the code-mixed text. Appropriate training parameters were set with different optimization strategies for fine-tuning, adjusting the learning rate, batch size, and number of training epochs to optimize performance. The model was evaluated on accuracy, F1-score, precision, and recall for code-mixed tasks. HauBERT, our proposed model, achieved more than 90% accuracy, and the result was compared with state-of-the-art BERT language models. The study recommends that this adapted pre-trained model be applied in large language models for language understanding and context sensitivity.
    VL  - 1
    IS  - 1
    ER  - 


Author Information
  • Department of Computer Science, Nigerian Army University Biu, Biu, Nigeria

  • Department of Computer Science, Nigerian Army University Biu, Biu, Nigeria

  • Department of Computer Science, Nigerian Army University Biu, Biu, Nigeria

  • Department of Cyber Security, Nigerian Army University Biu, Biu, Nigeria
