This research evaluates the potential of the BERT (Bidirectional Encoder Representations from Transformers) language model on Hausa-English code-mixed text using an adapted pre-trained dataset. The main aim of the research was to demonstrate the benefits of using pre-trained models to handle code-mixed data, improving language understanding and context sensitivity for Hausa-English. This objective was achieved by developing a BERT model capable of handling a Hausa-English code-mixed dataset, exploring different machine learning language models, and training the chosen model on the adapted Hausa-English code-mixed data. The research was motivated by the scarcity of corpora for Hausa-English code-mixing, whereas other language pairs, such as Hindi-English, have already been explored. The model was developed using the Python Transformers library. The adapted pre-trained dataset was first pre-processed, tokenized, and fine-tuned to fit the BERT model. The data were normalized in the context of code-mixed conversation using annotated language labels that distinguish English and Hausa segments in the code-mixed text. Appropriate training parameters were set, and different optimization strategies were applied during fine-tuning, adjusting the learning rate, batch size, and number of training epochs to optimize performance. The model was evaluated on accuracy, F1-score, precision, and recall for code-mixed tasks. HauBERT, the proposed model, achieved more than 90% accuracy, and its results were compared with state-of-the-art BERT language models. The study recommends applying this adapted pre-trained model within large language models for improved language understanding and context sensitivity.
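The fine-tuning and evaluation pipeline summarised in the abstract can be illustrated with a short sketch. This is not the authors' released code: the base checkpoint (bert-base-multilingual-cased), the CSV file name, the "text" and "label" column names, and the hyperparameter values are all placeholder assumptions, and the task is framed here as sequence-level classification purely for illustration.

```python
# Minimal sketch of the described pipeline, not the authors' released code.
# Checkpoint, file name, column names, and hyperparameters are assumptions.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

BASE_MODEL = "bert-base-multilingual-cased"  # assumed multilingual BERT starting point

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

def tokenize(batch):
    # Tokenize the Hausa-English code-mixed utterances with BERT's WordPiece tokenizer.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

# Placeholder CSV with a "text" column (code-mixed sentence) and an integer "label" column.
data = load_dataset("csv", data_files="hausa_english_codemixed.csv")["train"]
data = data.train_test_split(test_size=0.2, seed=42).map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=2)

def compute_metrics(eval_pred):
    # Accuracy, precision, recall, and F1-score: the metrics reported in the study.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0)
    return {"accuracy": accuracy_score(labels, preds),
            "precision": precision, "recall": recall, "f1": f1}

# Learning rate, batch size, and epoch count are the settings the paper says were
# tuned; the values here are common defaults, not the reported configuration.
args = TrainingArguments(
    output_dir="haubert-finetuned",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())  # accuracy, precision, recall, F1 on the held-out split
```

In the study itself, the learning rate, batch size, and number of epochs were adjusted across runs, and accuracy, precision, recall, and F1-score were reported for the best-performing configuration.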
| Published in | Science Discovery Artificial Intelligence (Volume 1, Issue 1) |
| DOI | 10.11648/j.sdai.20260101.13 |
| Page(s) | 14-26 |
| Creative Commons | This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited. |
| Copyright | Copyright © The Author(s), 2026. Published by Science Publishing Group |
| Keywords | Multilingual, Machine Learning, Conversational Code-mixed, Local Large Language Model |
APA Style
Jakwa, A. G., Franscisca, F. N., Ahmad, A. Y., Ibrahim, M. (2026). Performance Evaluation of Hybrid Bert Model on Code-mixed for Hausa-English Using Adapted Pre-trained Data. Science Discovery Artificial Intelligence, 1(1), 14-26. https://doi.org/10.11648/j.sdai.20260101.13
ACS Style
Jakwa, A. G.; Franscisca, F. N.; Ahmad, A. Y.; Ibrahim, M. Performance Evaluation of Hybrid Bert Model on Code-mixed for Hausa-English Using Adapted Pre-trained Data. Sci. Discov. Artif. Intell. 2026, 1(1), 14-26. doi: 10.11648/j.sdai.20260101.13
@article{10.11648/j.sdai.20260101.13,
author = {Ali Garba Jakwa and Faseki Ngozi Franscisca and Abubakar Yunusa Ahmad and Musa Ibrahim},
title = {Performance Evaluation of Hybrid Bert Model on Code-mixed for Hausa-English Using Adapted Pre-trained Data},
journal = {Science Discovery Artificial Intelligence},
volume = {1},
number = {1},
pages = {14-26},
doi = {10.11648/j.sdai.20260101.13},
url = {https://doi.org/10.11648/j.sdai.20260101.13},
eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.sdai.20260101.13},
year = {2026}
}