This research evaluates the potential of the BERT (Bidirectional Encoder Representations from Transformers) language model on Hausa-English code-mixed text using an adapted pre-trained dataset. The main aim of the research was to demonstrate the benefits of using pre-trained models to handle code-mixed data, improving language understanding and context sensitivity for Hausa-English. This objective was achieved by developing a BERT model capable of handling a Hausa-English code-mixed dataset, exploring different machine learning language models, and training the chosen model on the adapted Hausa-English code-mixed data. The research was motivated by the scarcity of corpora for Hausa-English code-mixing, whereas other language pairs, such as Hindi-English, have already been explored. The model was developed using the Python Transformers library. The adapted pre-trained dataset was first pre-processed, tokenized, and fine-tuned to fit the BERT model. The data were normalized in the context of code-mixed conversation using annotated language labels that distinguish English and Hausa segments in the code-mixed text. Appropriate training parameters were set, and different optimization strategies were applied during fine-tuning, adjusting the learning rate, batch size, and number of training epochs to optimize performance. The model was evaluated on accuracy, F1-score, precision, and recall for code-mixed tasks. HauBERT, the proposed model, achieved more than 90% accuracy, and its results were compared with state-of-the-art BERT language models. The study recommends applying this adapted pre-trained model within large language models for improved language understanding and context sensitivity.
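The fine-tuning and evaluation pipeline summarised in the abstract can be illustrated with a short sketch. This is not the authors' released code: the base checkpoint (bert-base-multilingual-cased), the CSV file name, the "text" and "label" column names, and the hyperparameter values are all placeholder assumptions, and the task is framed here as sequence-level classification purely for illustration.

```python
# Minimal sketch of the described pipeline, not the authors' released code.
# Checkpoint, file name, column names, and hyperparameters are assumptions.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

BASE_MODEL = "bert-base-multilingual-cased"  # assumed multilingual BERT starting point

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

def tokenize(batch):
    # Tokenize the Hausa-English code-mixed utterances with BERT's WordPiece tokenizer.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

# Placeholder CSV with a "text" column (code-mixed sentence) and an integer "label" column.
data = load_dataset("csv", data_files="hausa_english_codemixed.csv")["train"]
data = data.train_test_split(test_size=0.2, seed=42).map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=2)

def compute_metrics(eval_pred):
    # Accuracy, precision, recall, and F1-score: the metrics reported in the study.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0)
    return {"accuracy": accuracy_score(labels, preds),
            "precision": precision, "recall": recall, "f1": f1}

# Learning rate, batch size, and epoch count are the settings the paper says were
# tuned; the values here are common defaults, not the reported configuration.
args = TrainingArguments(
    output_dir="haubert-finetuned",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())  # accuracy, precision, recall, F1 on the held-out split
```

In the study itself, the learning rate, batch size, and number of epochs were adjusted across runs, and accuracy, precision, recall, and F1-score were reported for the best-performing configuration.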
| Published in | Science Discovery Artificial Intelligence (Volume 1, Issue 1) |
| DOI | 10.11648/j.sdai.20260101.13 |
| Page(s) | 14-26 |
| Creative Commons | This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited. |
| Copyright | Copyright © The Author(s), 2026. Published by Science Publishing Group |
| Keywords | Multilingual, Machine Learning, Conversational Code-mixed, Local Large Language Model |
APA Style
Jakwa, A. G., Franscisca, F. N., Ahmad, A. Y., Ibrahim, M. (2026). Performance Evaluation of Hybrid Bert Model on Code-mixed for Hausa-English Using Adapted Pre-trained Data. Science Discovery Artificial Intelligence, 1(1), 14-26. https://doi.org/10.11648/j.sdai.20260101.13
ACS Style
Jakwa, A. G.; Franscisca, F. N.; Ahmad, A. Y.; Ibrahim, M. Performance Evaluation of Hybrid Bert Model on Code-mixed for Hausa-English Using Adapted Pre-trained Data. Sci. Discov. Artif. Intell. 2026, 1(1), 14-26. doi: 10.11648/j.sdai.20260101.13
@article{10.11648/j.sdai.20260101.13,
author = {Ali Garba Jakwa and Faseki Ngozi Franscisca and Abubakar Yunusa Ahmad and Musa Ibrahim},
title = {Performance Evaluation of Hybrid Bert Model on Code-mixed for Hausa-English Using Adapted Pre-trained Data},
journal = {Science Discovery Artificial Intelligence},
volume = {1},
number = {1},
pages = {14-26},
doi = {10.11648/j.sdai.20260101.13},
url = {https://doi.org/10.11648/j.sdai.20260101.13},
eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.sdai.20260101.13},
year = {2026}
}