Advancing NLP for Underrepresented Languages: A Data-Driven Study on Shahmukhi Punjabi to Retrieve NER, Using RNN and LSTM
DOI:
https://doi.org/10.62019/abbdm.v4i4.238
Abstract
Shahmukhi Punjabi, an underrepresented language in computational linguistics, is gaining attention as natural language processing (NLP) extends its focus to historically underrepresented languages. Spoken primarily in Pakistan, it presents unique challenges due to its spelling and linguistic nuances. Most current research centers on widely spoken languages, leaving a gap in our understanding of techniques applicable to Shahmukhi Punjabi. This paper addresses that gap by evaluating Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) models for Named Entity Recognition (NER) on a Shahmukhi Punjabi dataset. The study reports confusion matrices and loss graphs alongside standard accuracy measurements to provide a comprehensive view of model performance. Our LSTM model achieved 82% accuracy, while the RNN model achieved 82.57%. These results demonstrate the potential of advanced NLP models in processing and understanding Shahmukhi Punjabi. By focusing on this underrepresented language, the research contributes to the broader goal of making NLP tools more inclusive and effective across diverse linguistic landscapes. The findings also highlight the importance of developing tailored approaches to handle the unique characteristics of different languages, ensuring that technological advancements benefit a wider range of linguistic communities.

License
Copyright (c) 2024 Syed Muhammad Hassan Zaidi, Syeda Nazia Ashraf, Adnan Ahmed, Basit Hasan, Irfan M. Leghari

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.