Expanded Entity Coverage and Machine-Annotated Pre-Training for Urdu Named Entity Recognition

DOI:

https://doi.org/10.62019/gypnn177

Keywords:

Automatic, text summarization, natural language processing, deep learning.

Abstract

Named Entity Recognition (NER) for low-resource languages remains challenging due to limited annotated corpora and complex linguistic characteristics. Urdu, a morphologically rich Indo-Aryan language written in a cursive right-to-left script, compounds these challenges, and its NER performance lags behind that of high-resource languages. This paper presents a data-efficient framework for Urdu NER that combines large-scale machine-generated annotations with high-quality human-annotated data to mitigate annotation scarcity. First, a new gold-standard human-annotated Urdu NER corpus was built, comprising 49,040 tokens and 5,839 named entities across 13 fine-grained entity categories, including newly introduced types such as Sports, Food, and Color. To complement this dataset, a large machine-annotated corpus was created using a bootstrapped ensemble of Conditional Random Field (CRF) models with confidence-based filtering. A two-stage training strategy was then applied, in which the multilingual transformer models Multilingual BERT (mBERT) and XLM-RoBERTa (XLM-R) are pre-trained on the machine-annotated corpus and subsequently fine-tuned on the human-annotated data. Experimental results demonstrate that machine-annotation pre-training consistently improves NER performance for both models, raising the micro-averaged F1-score from 0.84 to 0.86 for mBERT and from 0.86 to 0.88 for XLM-R. Detailed per-class and confusion matrix analyses further show notable improvements for low-frequency and specialized entity types. The findings confirm that integrating weakly supervised machine annotation with multilingual transformer-based learning provides a practical and scalable solution for improving NER in Urdu and other low-resource languages.

Keywords: NER, machine-annotated corpus, low-resource languages, Urdu NER corpus
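
The abstract does not state how the ensemble's predictions are filtered, so the following is only a minimal sketch of one plausible form of confidence-based filtering. It assumes each model emits a (label, confidence) pair per token, and it keeps a sentence only when the models agree and are confident on every token. The function names, the agreement rule, and both thresholds are hypothetical illustrations, not the authors' settings, and plain tuples stand in for real CRF outputs.

```python
from collections import Counter
from statistics import mean


def ensemble_label(token_votes):
    """Combine one token's votes; token_votes holds one (label, confidence) per model."""
    counts = Counter(label for label, _ in token_votes)
    label, n_votes = counts.most_common(1)[0]
    # Share of models that voted for the winning label.
    agreement = n_votes / len(token_votes)
    # Mean confidence of only those models that voted for the winning label.
    confidence = mean(c for l, c in token_votes if l == label)
    return label, agreement, confidence


def filter_sentence(model_outputs, min_agreement=1.0, min_confidence=0.9):
    """Return the ensemble labels for a sentence, or None if any token is unreliable.

    model_outputs: one list of (label, confidence) pairs per model, aligned by token.
    The thresholds are assumed values chosen for illustration.
    """
    labels = []
    for token_votes in zip(*model_outputs):
        label, agreement, confidence = ensemble_label(token_votes)
        if agreement < min_agreement or confidence < min_confidence:
            return None  # discard the whole sentence
        labels.append(label)
    return labels


# Three hypothetical models tagging a two-token sentence.
model_a = [("B-PER", 0.97), ("O", 0.99)]
model_b = [("B-PER", 0.95), ("O", 0.98)]
model_c = [("B-PER", 0.93), ("O", 0.96)]
model_d = [("B-LOC", 0.60), ("O", 0.96)]  # disagrees on the first token

kept = filter_sentence([model_a, model_b, model_c])
dropped = filter_sentence([model_a, model_b, model_d])
print(kept, dropped)
```

Discarding whole sentences rather than single tokens is a design choice made here to keep each retained training example internally consistent; a token-level mask would be an equally reasonable alternative.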

Published

2026-02-02

How to Cite

Expanded Entity Coverage and Machine-Annotated Pre-Training for Urdu Named Entity Recognition. (2026). The Asian Bulletin of Big Data Management, 6(1), 77-93. https://doi.org/10.62019/gypnn177
