Part-Of-Speech Tagging for Balochi Language: A Data driven application of Conditional Random Fields

Sami Ullah; Najma Imtiaz Ali; Shah Murad Chandio; Imtiaz Ali Brohi; Barkat Ali Laghari

doi:10.62019/abbdm.v4i1.111

Authors

Sami Ullah, Institute of Mathematics and Computer Science, University of Sindh, Jamshoro, Pakistan.
Najma Imtiaz Ali, Institute of Mathematics and Computer Science, University of Sindh, Jamshoro, Pakistan.
Shah Murad Chandio, Institute of Mathematics and Computer Science, University of Sindh, Jamshoro, Pakistan.
Imtiaz Ali Brohi, Government College University Hyderabad, Pakistan.
Barkat Ali Laghari, Government College University Hyderabad, Pakistan.

DOI:

https://doi.org/10.62019/abbdm.v4i1.111

Abstract

Part-of-speech (POS) tagging assigns the correct part of speech, or lexical category, to each word in a natural-language sentence. The task is significant in the field of Natural Language Processing (NLP) and finds use across a variety of NLP applications. It is commonly the initial phase of a natural language processing pipeline; subsequent stages may include tasks such as chunking and parsing. Balochi is the predominant language of Balochistan and ranks as the fourth most prevalent language in Pakistan. Natural language processing for Balochi is still in its nascent stages. In this research, we introduce an algorithm for Balochi part-of-speech tagging based on machine learning. The core of our approach is a Conditional Random Field (CRF) model. Features for the CRF are selected carefully, taking into account the linguistic characteristics of Balochi. Balochi is currently considered a resource-poor language, and the available manually tagged data consists of only approximately 1500 sentences. The tagset used in this study was created for research purposes and consists of 16 different tags. The learning process uses the manually tagged data. The algorithm achieves an accuracy of 86.78% when applied to Balochi texts. The training corpus comprises 40,000 words, while the test corpus contains 10,000 words.
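
The abstract does not list the features or the evaluation procedure, so the following is only a minimal sketch of how per-token features for a CRF tagger and a token-level accuracy score might be computed. The feature names (`prefix2`, `suffix3`, `prev_word`, and so on), the helper functions, the placeholder tokens, and the tag labels are all assumptions made for illustration; they are not the feature set or the 16-tag tagset used in the study.

```python
from typing import Dict, List


def token_features(sentence: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for the token at position i.

    The feature set here is hypothetical: surface form, short affixes,
    a digit flag, and the neighbouring words.
    """
    word = sentence[i]
    features: Dict[str, object] = {
        "bias": 1.0,
        "word": word,
        "prefix2": word[:2],
        "suffix2": word[-2:],
        "suffix3": word[-3:],
        "is_digit": word.isdigit(),
        "length": len(word),
    }
    # Context features: previous and next word, or sentence-boundary flags.
    if i > 0:
        features["prev_word"] = sentence[i - 1]
    else:
        features["BOS"] = True
    if i < len(sentence) - 1:
        features["next_word"] = sentence[i + 1]
    else:
        features["EOS"] = True
    return features


def sentence_features(sentence: List[str]) -> List[Dict[str, object]]:
    """Feature dicts for every token in one sentence."""
    return [token_features(sentence, i) for i in range(len(sentence))]


def token_accuracy(gold: List[List[str]], predicted: List[List[str]]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    total = 0
    correct = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        for g, p in zip(gold_sent, pred_sent):
            total += 1
            correct += int(g == p)
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Placeholder tokens and tags, not real Balochi data.
    sentence = ["alpha", "beta", "42"]
    for feats in sentence_features(sentence):
        print(feats)
    gold = [["NOUN", "VERB"], ["NOUN"]]
    pred = [["NOUN", "NOUN"], ["NOUN"]]
    print(token_accuracy(gold, pred))
```

One list of feature dicts per sentence, paired with a list of gold tags, is the input shape that common CRF toolkits such as sklearn-crfsuite accept. Under this sketch, an accuracy figure like the one reported would be the value of `token_accuracy` over the held-out test corpus.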