Advancing NLP for Underrepresented Languages: A Data-Driven Study on Shahmukhi Punjabi to Retrieve NER, Using RNN and LSTM

Syed Muhammad Hassan Zaidi; Syeda Nazia Ashraf; Adnan Ahmed; Basit Hasan; Irfan M. Leghari

doi:10.62019/abbdm.v4i4.238

Authors

Syed Muhammad Hassan Zaidi, Department of Artificial Intelligence and Mathematical Sciences, Sindh Madressatul Islam University, Karachi, Pakistan.
Syeda Nazia Ashraf, Department of Computer Science, Sindh Madressatul Islam University, Karachi, Pakistan.
Adnan Ahmed, Department of Computer Science, Bahria University Karachi Campus, Pakistan.
Basit Hasan, Department of Software Engineering, Sindh Madressatul Islam University, Karachi, Pakistan.
Irfan M. Leghari, Faculty of Computer Science & IT, University Malaysia Sarawak, Malaysia.

Abstract

Shahmukhi Punjabi, a language spoken primarily in Pakistan, remains underrepresented in computational linguistics, although recent advances in natural language processing (NLP) have drawn growing attention to such languages. Its spelling conventions and linguistic nuances pose distinct challenges, and because most current research centers on widely spoken languages, little is known about which techniques apply to it. This paper addresses that gap by applying Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) models to Named Entity Recognition (NER) on a Shahmukhi Punjabi dataset. Model performance is evaluated with loss graphs and confusion matrices alongside standard accuracy measurements. The LSTM model achieved 82% accuracy, while the RNN model achieved 82.57%. These results demonstrate the potential of advanced NLP models for processing and understanding Shahmukhi Punjabi. By focusing on this underrepresented language, the research contributes to the broader goal of making NLP tools more inclusive and effective across diverse linguistic landscapes. The findings also highlight the importance of tailored approaches that handle the unique characteristics of individual languages, so that technological advances benefit a wider range of linguistic communities.
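
As an illustration of the two evaluation measures named in the abstract, the sketch below computes token-level accuracy and a confusion matrix for NER tags in plain Python. It is a minimal sketch, not the authors' evaluation code: the tag set (`O`, `PER`, `LOC`), the toy gold and predicted sequences, and the function names `token_accuracy` and `confusion_matrix` are all assumptions made for the example.

```python
from collections import Counter


def token_accuracy(gold, pred):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)


def confusion_matrix(gold, pred, labels):
    """Return a matrix where rows are gold tags and columns are predicted tags."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    counts = Counter(zip(gold, pred))
    return [[counts[(g, p)] for p in labels] for g in labels]


# Hypothetical tag set and toy data, chosen only for illustration.
LABELS = ["O", "PER", "LOC"]
gold = ["PER", "O", "O", "LOC", "O", "PER"]
pred = ["PER", "O", "LOC", "LOC", "O", "O"]

if __name__ == "__main__":
    print(f"accuracy = {token_accuracy(gold, pred):.4f}")
    for label, row in zip(LABELS, confusion_matrix(gold, pred, LABELS)):
        print(label, row)
```

Reading the matrix row by row shows which entity types a model confuses with one another, which a single accuracy figure cannot reveal. In NER data the `O` tag usually dominates, so accuracy alone can look high even when entity tags are often missed.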