Expanded Entity Coverage and Machine-Annotated Pre-Training for Urdu Named Entity Recognition
DOI:
https://doi.org/10.62019/gypnn177Keywords:
Automatic, text summarization, natural language processing, deep learning.Abstract
Named Entity Recognition (NER) for low-resource languages remains challenging due to limited annotated corpora and complex linguistic characteristics. Urdu is a morphologically rich Indo-Aryan language written in a cursive right-to-left script, increases these challenges in contrast with high-resource languages in NER performance. This paper presents a data-efficient framework for Urdu NER that combines large-scale machine-generated annotations with high-quality human-annotated data to mitigate annotation scarcity. Firstly, a new gold-standard human-annotated Urdu NER corpus is built comprising 49,040 tokens and 5,839 named entities across 13 fine-grained entity categories, including newly introduced types such as Sports, Food, and Color. To complement this dataset, a large machine-annotated corpus was created using a bootstrapped ensemble of Conditional Random Field models with confidence-based filtering. A two-stage training strategy in which multilingual transformer models, Multilingual BERT (mBERT) and XLM-RoBERTa (XLM-R) are pre-trained on the machine-annotated corpus and subsequently fine-tuned on the human-annotated data. Experimental results demonstrate that machine-annotation pre-training consistently improves NER performance for both models, yielding micro-averaged F1-score gains from 0.84 to 0.86 for mBERT and from 0.86 to 0.88 for XLM-R. Detailed per-class and confusion matrix analyses further show notable improvements for low-frequency and specialized entity types. The findings confirm that integrating weakly supervised machine annotation with multilingual transformer-based learning provides a practical and scalable solution for improving NER in Urdu and other low-resource languages.
Keywords: NER, machine-annotated corpus, low-resource languages, Urdu NER corpus
References
Abdi, H., & Williams, L. J. (2010). Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4), 433–459. DOI: https://doi.org/10.1002/wics.101
Ahmed, A., Huang, D., & Arafat, S. Y. (2024). Enriching Urdu NER with BERT Embedding, Data Augmentation, and Hybrid Encoder-CNN Architecture. ACM Transactions on Asian and Low-Resource Language Information Processing, 23(4). https://doi.org/10.1145/3648362 DOI: https://doi.org/10.1145/3648362
Anam, R., Anwar, M. W., Jamal, M. H., Bajwa, U. I., De la Torre Diez, I., Alvarado, E. S., Flores, E. S., & Ashraf, I. (2024). A deep learning approach for Named Entity Recognition in Urdu language. PLoS ONE, 19(3 March). https://doi.org/10.1371/journal.pone.0300725 DOI: https://doi.org/10.1371/journal.pone.0300725
Aziz, K., Ahmed, N., Yu, Y., Hadi, H. J., Alshara, M. A., Tariq, U., & Ji, D. (2025). Advancing Urdu named entity recognition: deep learning for aspect targeting. Complex & Intelligent Systems, 11(12), 489. DOI: https://doi.org/10.1007/s40747-025-02066-6
Basir, N., Hakro, D. N., Khoumbati, K. U. R., & Bhatti, Z. (2025). Leveraging machine-labeled data and cross-lingual transfer for NER in Urdu and sindhi. J. Inf. Commun. Technol.—(JICT), 19, 1–8. DOI: https://doi.org/10.1109/ICET66147.2025.11321232
Chen, S., Pei, Y., Ke, Z., & Silamu, W. (2021). Low-resource named entity recognition via the pre-training model. Symmetry, 13(5), 786. DOI: https://doi.org/10.3390/sym13050786
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2019). Unsupervised Cross-lingual Representation Learning at Scale. ArXiv Preprint ArXiv:1911.02116. https://github.com/facebookresearch/cc DOI: https://doi.org/10.18653/v1/2020.acl-main.747
Daud, A., Khan, W., & Che, D. (2017). Urdu language processing: a survey. Artificial Intelligence Review, 47(3), 279–311. DOI: https://doi.org/10.1007/s10462-016-9482-x
Farrugia, K., & Wahlberg, F. (2022). Multilingual Transformer Models for Maltese Named Entity Recognition. Uppsala University.
Gligic, L., Kormilitzin, A., Goldberg, P., & Nevado-Holgado, A. (2020). Named Entity Recognition in Electronic Health Records Using Transfer Learning Bootstrapped Neural Networks. Neural Networks, 121, 132–139. DOI: https://doi.org/10.1016/j.neunet.2019.08.032
Jahangir, F., Anwar, W., Ijaz Bajwa, U., & Wang, X. (2012). N-gram and Gazetteer List Based Named Entity Recognition for Urdu: A Scarce Resourced Language. Proceedings of the 10th Workshop on Asian Language Resources, 95–104.
Jain, S. M. (2022). Hugging face. In Introduction to transformers for NLP: With the hugging face library and models to solve problems (pp. 51–67). Springer. DOI: https://doi.org/10.1007/978-1-4842-8844-3_4
Jiang, J., Shu, Y., Wang, J., & Long, M. (2022). Transferability in deep learning: A survey. ArXiv Preprint ArXiv:2201.05867.
Kamran Malik, M., & Mansoor Sarwar, S. (2016). Named Entity Recognition System for Postpositional Languages: Urdu as a Case Study. (IJACSA) International Journal of Advanced Computer Science and Applications, 7(10). www.ijacsa.thesai.org DOI: https://doi.org/10.14569/IJACSA.2016.071019
Kazi, S., Rahim, M., & Khoja, S. (2023). A deep learning approach to building a framework for Urdu POS and NER. Journal of Intelligent and Fuzzy Systems, 44(2), 3341–3351. https://doi.org/10.3233/JIFS-211275 DOI: https://doi.org/10.3233/JIFS-211275
Khairunnisa, S. O., Chen, Z., & Komachi, M. (2023). Dataset Enhancement and Multilingual Transfer for Named Entity Recognition in the Indonesian Language. ACM Transactions on Asian and Low-Resource Language Information Processing, 22(6), 1–21. https://doi.org/10.1145/3592854 DOI: https://doi.org/10.1145/3592854
Khan, W., Daud, A., Alotaibi, F., Aljohani, N., & Arafat, S. (2020). Deep recurrent neural networks with word embeddings for Urdu named entity recognition. ETRI Journal, 42(1), 90–100. https://doi.org/10.4218/etrij.2018-0553 DOI: https://doi.org/10.4218/etrij.2018-0553
Kim, J., Ko, Y., & Seo, J. (2020). Construction of Machine-Labeled Data for Improving Named Entity Recognition by Transfer Learning. IEEE Access, 8, 59684–59693. https://doi.org/10.1109/ACCESS.2020.2981361 DOI: https://doi.org/10.1109/ACCESS.2020.2981361
Kim, J., Kwon, S., Ko, Y., & Seo, J. (2017). A Method to Generate a Machine-Labeled Data for Biomedical Named Entity Recognition with Various Sub-Domains. Roceedings of the International Workshop on Digital Disease Detection Using Social Media 2017 (DDDSM-2017), 47–51.
Loshchilov, I., & Hutter, F. (2019). Decoupled Weight Decay Regularization. 7th International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=Bkg6RiCqY7
Malik, M. G. A., Boitet, C., & Bhattacharyya, P. (2010). Analysis of Noori Nasta’leeq for major Pakistani languages. SLTU, 95–103.
Malik, M. K. (2017). Urdu Named Entity Recognition and Classification system using Artificial Neural Network. ACM Transactions on Asian and Low-Resource Language Information Processing, 17(1). https://doi.org/10.1145/3129290 DOI: https://doi.org/10.1145/3129290
McKinney, W., & others. (2011). pandas: a foundational Python library for data analysis and statistics. Python for High Performance and Scientific Computing, 14(9), 1–9.
Ming, H., Yang, J., Liu, S., Jiang, L., & An, N. (2025). Harnessing high-quality pseudo-labels for robust few-shot nested named entity recognition. Engineering Applications of Artificial Intelligence, 156, 110992. DOI: https://doi.org/10.1016/j.engappai.2025.110992
Nakayama, H. (2018). seqeval: A Python framework for sequence labeling evaluation. https://github.com/chakki-works/seqeval
Naz, S., Iqbal Umar, A., & Razzak, M. I. (2015). A hybrid approach for NER system for scarce resourced language-URDU: Integrating n-gram with rules and gazetteers. Mehran University Research Journal of Engineering & Technology, 34(4), 349–358. https://doi.org/10.3316/informit.153267579605416
Oprea, S. V., & Bâra, A. (2022). Why Is More Efficient to Combine BeautifulSoup and Selenium in Scraping For Data Under Energy Crisis. Ovidius University Annals, Economic Sciences Series, 22(2), 146–152. https://www.sas.com/en_ca/insights/articles/analytics/using-big-data-to-predictsuicide-risk- DOI: https://doi.org/10.61801/OUAESS.2022.2.19
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., & others. (2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12, 2825–2830.
Pires, T., Schlinger, E., & Garrette, D. (2019). How multilingual is multilingual BERT? ArXiv Preprint ArXiv:1906.01502. DOI: https://doi.org/10.18653/v1/P19-1493
Qian, K., Raman, P. C., Li, Y., & Popa, L. (2020). Learning Structured Representations of Entity Names using Active Learning and Weak Supervision. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 6376–6383. http://arxiv.org/abs/2011.00105 DOI: https://doi.org/10.18653/v1/2020.emnlp-main.517
Riaz, F., Anwar, M. W., & Muqades, H. (2020, February 1). Maximum Entropy based Urdu Named Entity Recognition. 2020 International Conference on Engineering and Emerging Technologies, ICEET 2020. https://doi.org/10.1109/ICEET48479.2020.9048203 DOI: https://doi.org/10.1109/ICEET48479.2020.9048203
Riaz, K. (2010). Rule-based Named Entity Recognition in Urdu. Proceedings of the 2010 Named Entities Workshop, 126–135.
Riaz, K. H. (2018). Improving Search via Named Entity Recognition in Morphologically Rich Languages-A Case Study in Urdu [Doctoral dissertation]. UNIVERSITY OF MINNESOTA.
Seow, W. L., Chaturvedi, I., Hogarth, A., Mao, R., & Cambria, E. (2025). A review of named entity recognition: from learning methods to modelling paradigms and tasks. Artificial Intelligence Review, 58(10), 315. DOI: https://doi.org/10.1007/s10462-025-11321-8
Sharma, S., Singh, P. P., & others. (2025). Named Entity Recognition for Hindi Current Landscape and Emerging Trends. Journal of Information Technology, Cybersecurity, and Artificial Intelligence, 2(2), 133–144. DOI: https://doi.org/10.70715/jitcai.2025.v2.i2.021
Singh, U., Goyal, V., & Lehal, G. S. (2012). Named Entity Recognition System for Urdu. Proceedings of COLING 2012, 2507–2518.
Ullah, Fida, Gelbukh, A., Zamir, M. T., Riverόn, E. M. F., & Sidorov, G. (2024). Enhancement of Named Entity Recognition in Low-Resource Languages with Data Augmentation and BERT Models: A Case Study on Urdu. Computers, 13(10). https://doi.org/10.3390/computers13100258
Ullah, F, Gelbukh, A., Zamir, M. T., Riverόn, E. M. F., & Sidorov, G. (2024). Enhancement of Named Entity Recognition in Low-Resource Languages with Data Augmentation and BERT Models: A Case Study on Urdu. Computers, 13. Article. DOI: https://doi.org/10.3390/computers13100258
Ullah, F., Ullah, I., & Kolesnikova, O. (2022). Urdu named entity recognition with attention bi-lstm-crf model. Mexican International Conference on Artificial Intelligence, 3–17. DOI: https://doi.org/10.1007/978-3-031-19496-2_1
Ullah, F., Zeeshan, M., Ullah, I., Alam, M. N., & Al-Absi, A. A. (2021). Towards Urdu Name Entity Recognition Using Bi-LSTM-CRF with Self-attention. International Conference on Smart Computing and Cyber Security: Strategic Foresight, Security Challenges and Innovation, 403–407. DOI: https://doi.org/10.1007/978-981-16-9480-6_38
Waskom, M. L. (2021). seaborn: statistical data visualization. Journal of Open Source Software, 6(60), 3021. DOI: https://doi.org/10.21105/joss.03021
Downloads
Published
Issue
Section
License
Copyright (c) 2026 Dr.Nazish Basir, Dr.Mumtaz Qabulio, Dr.Muhammad Suleman Memon, Mr.Danish Nazir Arain, Dr.Sehrish Basir Nizamani, Dr.Saad Nizamani

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
