American Journal of Computing Research Repository, 2014, Vol. 2, No. 1, 1-7 Available online at http://pubs.sciepub.com/ajcrr/2/1/1 © Science and Education Publishing DOI:10.12691/ajcrr-2-1-1 Word Segmentation Model for Sindhi Text Zeeshan Bhatti*, Imdad Ali Ismaili, Waseem Javaid Soomro, Dil Nawaz Hakro Institute of Information and Communication Technology, University of Sindh, Jamshoro *Corresponding author: zeeshan.bhatti@usindh.edu.pk Received November 25, 2013; Revised December 15, 2013; Accepted January 01, 2014 Abstract This research addresses the problem of Sindhi word segmentation and discusses various techniques for solving it. Word segmentation is the preliminary phase of any tool based on Natural Language Processing (NLP): for a system to understand written text, it must first break that text into individual tokens for processing. Sindhi, written in a cursive, ligature-based Perso-Arabic script, is complex and rich, with a large number of characters, each of which takes multiple glyphs depending on its position in the text. In this paper a Sindhi word tokenization model is proposed, implementing various algorithms that tokenize Sindhi text into individual words for corpus building and for creating a word repository for Sindhi spell checkers, grammar checkers, and other NLP applications. The tokenization problem is resolved by first identifying sentence boundaries and extracting each sentence into an isolated list, where each list element is a complete sentence. The segregated sentences are then broken into words, with the hard space character used as the word boundary; soft spaces are treated as part of the word and are therefore not segmented at. Finally, each word is filtered to remove special characters, and each word is converted and saved as a token after validation.
Keywords: word segmentation, Sindhi tokenization, Sindhi language, Sindhi spell checker Cite This Article: Zeeshan Bhatti, Imdad Ali Ismaili, Waseem Javaid Soomro, and Dil Nawaz Hakro, “Word Segmentation Model for Sindhi Text.” American Journal of Computing Research Repository 2, no. 1 (2014): 1-7. doi: 10.12691/ajcrr-2-1-1. 1. Introduction The process of segregating a sentence into individual word tokens is termed word segmentation or tokenization [1]. In Natural Language Processing (NLP), tokenization or word segmentation is deemed the most fundamental task [2]. Almost every NLP application requires, at some stage, that its text be broken into individual tokens for processing, for example in Machine Translation (MT) and spell checking [2,3]. Tokenization is done by identifying word boundaries; in languages like English, punctuation marks or white spaces are used to segregate words [3]. The scanning routines usually include algorithms for handling morphology in a language-dependent manner. Even for a language like English, which is very lightly inflected, phenomena such as contraction and possessives need to be handled within the word extraction routines [4,5]. Sindhi, like other Asian languages such as Urdu, Arabic, and Persian, suffers the same text segmentation problems of space omission and insertion. Sindhi is the official state language of the Sindh province of Pakistan and is spoken by approximately 34.4 million people in Pakistan and around 2.8 million in India [6]. The Sindhi script is based on the Perso-Arabic script, written in the Arabic Naskh style, from right to left, with a cursive ligature system [6]. The script is cursive in its written form, with subsequent characters in a word joined to each other, as shown in Figure 1.
Its cursive nature and its Aerab (diacritic) marks make Sindhi text difficult to process in NLP applications. For any NLP application it is vital that a standard corpus of the language be built, so that text can be processed and compared through statistical analysis [7]. The need to develop a formal Sindhi corpus is therefore evident, and a model is needed for the tokenization of Sindhi words. This paper discusses a Sindhi word segmentation technique for developing a Sindhi corpus and tokenizing Sindhi text, in order to build a repository of Sindhi words for NLP applications such as spell checkers. Sindhi word boundaries within the text are identified by finding the hard space character. Sindhi, being a very complex language, possesses fifty-two characters in its script, each with separate glyph shapes depending on the character's position in a string. This creates ambiguity in the Sindhi script, as the language contains two types of letters, connectors and non-connectors. Sindhi text therefore uses soft space as well as hard space characters, as shown in Figure 1. Figure 1. Sindhi text 1.1. Related Work Most modern languages already have tools and techniques for segmenting their written text and documents for spell checking and correction. Apart from the languages of European countries, word tokenization algorithms have been implemented for various other languages spoken in Asian countries. Relevant work includes word segmentation for Arabic [8,9,10,11], Bangla [12], Hindi [13], Nepali [14], Tamil [15], and Urdu [16,17]. These are a few examples; unfortunately, however, very little work has been done in this regard for the Sindhi language.
For segmenting Arabic text, Shaikh et al. propose segmenting Arabic words and sub-words into characters using primary and secondary strokes with vertical projection graphs [18], but only for OCR systems, not for digital text. Similarly, Shaikh et al. use a Height Profile Vector (HPV) to segment Sindhi characters [19], but again for printed or handwritten scanned Sindhi text; their approach addresses segmentation for OCR systems, not digital text. Durrani N. and Hussain S. address the orthographic and linguistic features of the Urdu language for word segmentation, employing a hybrid solution of n-gram ranking with rule-based matching heuristics [3]. Akram M., in his thesis, discusses a statistical solution for Urdu word segmentation [20], but again for OCR systems. Mahar et al. develop five algorithms based on a lexicon-driven approach for segmenting Sindhi words into possible morpheme sequences [21]. The most relevant work on Sindhi text segmentation, also by Mahar et al. [1], presents a layer-based model for Sindhi text segmentation using three layers, where each layer segments words of increasing intricacy, from simple and compound to complex Sindhi words. In contrast to these techniques, we address the problem of segmenting Sindhi words that are already in digital or textual form, taken from the internet or typed into a word processor, for the purposes of corpus building, constructing a word repository, machine translation, spell checking, grammar checking, text-to-speech systems, etc. Our technique works by identifying sentence boundaries, tokenizing the words in each sentence, and validating each isolated word for accuracy. 1.2. Character Glyph and Space Types in Sindhi Text The 52 characters of the Sindhi script have multiple shape or glyph representations according to their position in a word.
There are four categories of shape that a character may take with respect to its placement in text: initial (start), medial (middle), final (end), and standalone (isolated), as shown in Figure 2. Soft spaces in the Sindhi script separate certain characters that do not have cursive, context-sensitive shapes for all four positions. For example, in Table 1 the character ‘ذ’ {Dhaal} has two basic types of cursive shape: an isolated-and-start shape, and a middle-and-end shape. When ذ is used at the start of a word, it does not join with the following character, so a soft space appears to indicate the separation. This soft space does not mark a word boundary, because it is not a character, unlike the hard space, which is itself a character with no glyph or shape that nevertheless occupies space. Figure 2. Different shapes of characters according to position in a word Table 1. Shape group of the ‘ذ’ character Isolated Start Middle End راذ ذ ذيذل ذلذي Because the hard space is an individual character with a designated ASCII (32) and Unicode (U+0020) code that occupies memory and screen space, it provides an easy-to-identify marker for word boundaries in the segmentation process. The soft space, by contrast, arises purely from character shape groups and placement: it has no ASCII or Unicode representation and occupies no memory. The soft space is therefore not considered an individual character and is not used to identify word boundaries. 2. Architecture of the System The system under study is divided into two main sections: in the first, Sindhi words are segmented into tokens; in the second, each generated token is verified. The results are validated against a prebuilt word repository. A previously developed Sindhi word processor (by the authors [6]) is used as the primary tool for working with Sindhi text and showing the results.
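Since validation against the prebuilt repository reduces to a membership lookup, the verification step can be sketched as follows. This is a hedged Python approximation; the repository words and function names here are placeholders for illustration, not the authors' actual implementation.

```python
# Hypothetical sketch of the verification step: known Sindhi words are
# loaded into a hash-based set, and each generated token is verified by
# a constant-time membership lookup. The words below are placeholders.
repository = {"سنڌي", "ٻولي", "لفظ"}

def verify_token(token, repo):
    """Return True when the token matches a known, correctly spelled word."""
    return token in repo

print(verify_token("سنڌي", repository))  # a word present in the repository
print(verify_token("abc", repository))   # a word absent from the repository
```

In practice the repository would be loaded from the Sindhi wordlist file rather than written inline; the hash-table structure itself is discussed in Section 3.4.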
In the first section, a set of algorithms and routines scans the text and extracts Sindhi words, and each word is then compared against a repository of Sindhi words for verification. The scanning routines evaluate each word against a set of rules to decide whether it is a valid token. If a token is found to be invalid, it is marked as incorrect and simply discarded into a list of ignored and unwanted words. This development methodology is illustrated in Figure 3, which shows the various stages of the system. These stages of the system architecture are explained in the subsections below. Figure 3. Various development stages of the system 3. Scanning and Extraction of Sindhi Words The first stage of the Sindhi word segmentation model works at four levels: 1. Text segmentation into sentences. 2. Sentence segmentation into words. 3. Token creation. 4. Token matching. At each level, a set of routines parses the Sindhi text, extracting sentences and creating valid tokens. These routines are discussed in the subsequent sections. 3.1. Text Segmentation into Sentences Initially, Sindhi text written in a Sindhi word processor document is read and scanned. The text is then separated into sentences by identifying sentence boundaries, each running from the start of the text to the next full stop (.), as shown in Figure 4 below. Sentence boundaries are identified by a full stop (.) and a question mark (?); boundaries are also marked at the end of a paragraph and the start of a new line. Figure 4. Scanning and identifying sentence boundaries The input text is scanned from the beginning, starting from the initial position of the sentence. The end-of-sentence identifier, either a full stop (.) or a question mark (?), is then searched for and marked.
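The boundary scanning just described can be sketched in a few lines; this is a simplified Python approximation (the authors' system works inside a word-processor document and tracks explicit index markers). Including the Arabic-script sentence marks '۔' and '؟' alongside the ASCII ones is our assumption; the paper itself names only '.' and '?' plus paragraph and line breaks.

```python
import re

def split_sentences(text):
    """Split text into sentences at full stops, question marks, and
    line breaks, which the system treats as sentence boundaries.
    The Arabic-script variants '۔' and '؟' are an added assumption."""
    parts = re.split(r"[.?؟۔\n]+", text)
    return [s.strip() for s in parts if s.strip()]

sentences = split_sentences("پهريون جملو. ٻيو جملو؟\nٽيون")
print(len(sentences))  # three sentences found
```

Each element of the returned list is a complete sentence, matching the isolated-list form described in the abstract.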
The system then fetches the sentence from the start index to the end index. After the first sentence is isolated, the next sentence is searched for and the start and end index markers are reset. The following are the implementation details for scanning Sindhi text and identifying sentence boundaries. 3.2. Sentence Segmentation into Words To segment and create Sindhi words from sentences, word boundaries have to be identified and marked. Word boundaries are determined by the space character before and after each word. As discussed earlier, the Sindhi script possesses two types of space character in written or typed form: hard spaces and soft spaces. For word token generation, the hard space is used in this system as the word boundary identifier. Soft spaces are ignored and counted as part of the word, as shown in Figure 5. Figure 5. Sindhi text with word boundaries marked at hard spaces It should be noted that our system uses the terms token and word differently. We consider a word to be the general form separated by spaces. A generated word may be misspelled or incorrectly typed, and it may contain punctuation marks or special symbols such as “@, &, *, !, #, etc.”. At this point all such words are considered to be general words, correctly segmented and segregated. Tokens, in contrast, are correctly spelled words verified against a repository of correctly spelled Sindhi words. At the beginning of the process we set the locale environment variable to Arabic (‘AR’) so that our compiler and interpreter can read and process the text in right-to-left order for the Sindhi script. A break iterator object, which breaks out each word at the index of each hard space, is then declared.
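In Python terms (a sketch, not the authors' locale-aware break-iterator implementation), the hard-space break amounts to splitting on U+0020. Soft spaces have no code point, so they never appear in the underlying string and are automatically preserved inside each word:

```python
def split_words(sentence):
    """Break a sentence into word candidates at hard spaces (U+0020).
    Soft spaces are a rendering effect with no code point, so they
    remain inside each word without any special handling."""
    return [w for w in sentence.split("\u0020") if w]

candidates = split_words("سنڌي ٻولي جا لفظ")
print(len(candidates))  # four hard-space-separated candidates
```

The candidates produced here are still "words" in the paper's terminology; they become tokens only after the filtering and repository validation of Section 3.3.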
Each word is isolated for further verification and validation before it is saved as a token. This segmentation process differentiates between words and characters that are not part of words; such characters are ignored and skipped to improve the accuracy of the tokens formed. The ignored characters, which include spaces, tabs, punctuation marks, and most symbols, have word boundaries on both sides. Algorithm 2 provides the implementation details for segmenting sentences into words. 3.3. Sindhi Word Tokens After the identification of word boundaries, each word is isolated and put into a hash list. Each word from the list is then retrieved and analyzed for validity. A token is created by identifying the correctly spelled word and removing any additional, unnecessary characters that may be part of the original word. This filtration process involves traversing each character of the word and removing all special characters such as @, #, $, %, ^, &, *, (, ), _, +, -, =, {, }, |, [, ], :, ;, <, >, ?, etc. Any of these characters may be part of the string, having been attached to the word in the previous stage. Hard space characters, newlines, and new-paragraph symbols are also trimmed out. More importantly, the filtration process removes any letters of the English alphabet, since English words commonly appear at certain places in a Sindhi document or article. In the last stage, each word is compared against a known repository of Sindhi words for final validation. Figure 6 shows the tokens created. Figure 6. Word tokens created from a sentence Algorithm 3 shows the implementation of token creation. Here the ‘filterChar’ array contains the information needed to filter invalid characters from the word.
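A hedged sketch of this filtration step, with a Unicode range check standing in for the paper's filterChar array: a character is kept only if it falls in the Arabic Unicode blocks used by Sindhi. This is a simplification, since Arabic-block punctuation would need its own filter entries; Arabic-Indic digits are excluded explicitly because the system treats numerals as invalid.

```python
def is_sindhi_letter(ch):
    """Keep characters from the Arabic block (U+0600-U+06FF) and the
    Arabic Presentation Forms; Arabic-Indic digits are excluded."""
    cp = ord(ch)
    if 0x0660 <= cp <= 0x0669 or 0x06F0 <= cp <= 0x06F9:
        return False  # Arabic-Indic digits count as invalid characters
    return 0x0600 <= cp <= 0x06FF or 0xFB50 <= cp <= 0xFEFF

def filter_token(word):
    """Strip special characters, digits, English letters, and spaces
    from a word candidate, leaving only Sindhi letters."""
    return "".join(ch for ch in word if is_sindhi_letter(ch))

print(filter_token("سنڌي!@# abc123"))  # only the Sindhi letters survive
```

The real filterChar table would enumerate invalid characters directly rather than testing code-point ranges; the effect on a word candidate is the same.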
For the Sindhi word segmentation system, invalid characters are those that are not in the Sindhi alphabet, such as punctuation marks, extra spaces, numerals, braces, etc. 3.4. Token Matching After each token is created from the Sindhi text, the tokens are searched for and matched against the Sindhi wordlist in the repository. The system uses hash tables to store the wordlist, so the basic operations of searching and matching become very simple. We use the same technique of matching a hash key to search for a token in the hash-table-based repository as discussed in [22]. The system uses a hash structure algorithm for fast and efficient searching. The analysis and verification of words also involves validating the error patterns and trends in spelling mistakes that occur while typing Sindhi text, the results of which have already been published in [23]. Figure 7 shows the basic structure of the token matching done by the system. Figure 7. A hash table structure for Sindhi words 4. Results The proposed model has been tested on various corpora of Sindhi text collected from the internet (general articles and news articles) and from publishers (book chapters and a digital dictionary). The detailed tokenization report is shown in Table 2. The articles taken from Sindhi literature books were initially typed into the Sindhi word processor using a Sindhi keyboard designed for Sindhi typing, as used in [24]. A total of 157,509 words were generated by the proposed tokenization model, among which 146,132 were verified and marked as correct, valid Sindhi word tokens, giving the model a cumulative accuracy of 92.78%. Some 4,645 generated tokens were considered invalid by the system, and 7,732 tokens were ignored entirely due to anomalies such as special characters or unknown Unicode literals in them. Table 2.
Results of the proposed segmentation model

Source                     Paragraphs  Lines  Words    Incorrect Tokens  Tokens Ignored  Total Words
15 Articles                14          1189   13,622   1,877             1,332           16,831
100 News Articles          224         2363   30,267   1,534             745             32,546
Book Chapters (5)          104         1055   19,589   1,234             1,689           22,512
Sindhi Digital Dictionary  ---         ---    82,654   ---               2,966           85,620
Total                                         146,132  4,645             7,732           157,509

Figure 8(a) below shows a diagrammatic comparison of the tokens generated from the various corpora of Sindhi script, and Figure 8(b) shows the cumulative accuracy compared against the incorrect and ignored tokens. Figure 8(a). Bar graph showing the tokens generated Figure 8(b). Pie chart showing the cumulative accuracy of the system The overall accuracy of token generation by the proposed word segmentation model is given in Table 3, with the graph in Figure 9 illustrating the accuracy differences between the various corpora of Sindhi text and how tokenization varies among them. Table 3. Accuracy of the Proposed Model

Source                     Accuracy %
15 Articles                80.94
100 News Articles          93
Book Chapters (5)          87.02
Sindhi Digital Dictionary  96.54

Figure 9. Graphical representation of the calculated accuracy of the proposed word segmentation model 5. Conclusion The work presented in this paper shows the technique and algorithms used to tokenize Sindhi words from a given Sindhi text document. Each algorithm has been discussed and its implementation given. The results were analyzed by checking each token generated by the system against a list of words from a prebuilt Sindhi word repository (for spell checking). Each token identified by the system as correct is kept, and all invalid tokens are marked as incorrect and ignored. The results of the proposed model are substantial and accurate. The algorithm can be further utilized to segment Sindhi words for other NLP purposes such as machine translation, spell checking, grammar checking, and text-to-speech systems. References [1] Mahar, J.
A., Shaikh, H., Memon, G. Q., “A Model for Sindhi Text Segmentation into Word Tokens”, Sindh University Research Journal (Science Series), Vol. 44 (1), pp. 43-48, 2012. [2] Haruechaiyasak, C., Kongyoung, S., Dailey, M., “A comparative study on Thai word segmentation approaches,” 5th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON 2008), vol. 1, pp. 125-128, 14-17 May 2008. [3] Nadir D. and Sarmad H., “Urdu word segmentation,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT '10), Association for Computational Linguistics, Stroudsburg, PA, USA, 528-536, 2010. [4] “A fast morphological algorithm with unknown word guessing induced by a dictionary for web search engine”, Source: http://company.yandex.ru/articles/iseg-las-vegas.xml, Retrieved on: 12 June 2011. [5] Hull, D. A., “Stemming Algorithms: A Case Study for Detailed Evaluation,” (Rank Xerox Research Centre), JASIS vol. 47, 1996. [6] Ismaili, I. A., Bhatti, Z., Shah, A. A., “Design and Development of Graphical User Interface for Sindhi Language (GUISL)”, Mehran University Research Journal of Engineering & Technology, Volume 30, No. 4, October 2011. [7] Rahman, M. U., “Towards Sindhi Corpus Construction,” Conference on Language and Technology, Lahore, Pakistan, 2010. [8] Shaalan, K., “Arabic GramCheck: A Grammar Checker for Arabic”, Software Practice and Experience, John Wiley & Sons Ltd., UK, 35(7):643-665, June 2005. [9] Zribi, C. B. O. and Ben Ahmed, M., “Efficient automatic correction of misspelled Arabic words based on contextual information,” in Proceedings of the 7th International Conference on Knowledge-Based Intelligent Information and Engineering Systems (KES’03), V. Palade, R. J. Howlett, and L. Jain, Eds., Oxford, Springer, 770-777, 2003. [10] Farghaly, A., Shaalan, K.
“Arabic Natural Language Processing: Challenges and Solutions,” ACM Transactions on Asian Language Information Processing (TALIP), the Association for Computing Machinery (ACM), 8(4):1-22, December 2009. [11] Shaalan, K., Allam, A., Gohah, A., “Towards Automatic Spell Checking for Arabic”, Conference on Language Engineering, ELSE, Cairo, Egypt, 2003. [12] Uzzaman, N. and Khan, M., “A Double Metaphone Encoding for Bangla and its Application in Spelling Checker”, Proc. 2005 IEEE Natural Language Processing and Knowledge Engineering, Wuhan, China, October 2005. [13] Chaudhuri, B. B., “Towards Indian Language Spell-checker Design,” Language Engineering Conference (LEC'02), pp. 139, 2002. [14] Bal, K. B. et al., “Nepali Spellchecker”, PAN Localization Working Papers 2004-2007, Centre for Research in Urdu Language Processing, National University of Computer and Emerging Sciences, Lahore, Pakistan, pp. 316-318. [15] Dhanabalan, T., Parthasarathi, R., & Geetha, T. V., “Tamil Spell Checker”, Resource Center for Indian Language Technology Solutions – Tamil, School of Computer Science and Engineering, Anna University, Chennai, India, pp. 18-27, 2003. [16] Naseem, T., & Hussain, S., “Spelling Error Corrections for Urdu”, published online: 26 September 2007, © Springer Science Business Media B.V. 2007; PAN Localization Working Papers 2007, Centre for Research in Urdu Language Processing, National University of Computer and Emerging Sciences, Lahore, Pakistan, pp. 117-128. [17] Naseem, T. and Hussain, S., “A Novel Approach for Ranking Spelling Mistakes in Urdu”, Language Resources and Evaluation, 41:117-128, 2007. [18] Shaikh, N. A., Shaikh, Z. A., & Ali, G., “Segmentation of Arabic text into characters for recognition,” in Wireless Networks, Information Processing and Systems (pp. 11-18), Springer Berlin Heidelberg, 2009. [19] Shaikh, N. A., Mallah, G. A., & Shaikh, Z. A. (2009).
Character Segmentation of Sindhi, an Arabic Style Scripting Language, using Height Profile Vector. Australian Journal of Basic and Applied Sciences, 3(4), 4160-4169. [20] Akram, M., “Word Segmentation for Urdu OCR System”, Master’s Thesis, Department of Computer Science, National University of Computer & Emerging Sciences, Lahore, Pakistan, 2009. [21] Mahar, J. A., Memon, G. Q., Danwar, H. S., “Algorithms for Sindhi Word Segmentation Using Lexicon Driven Approach”, International Journal of Academic Research, Vol. 3, No. 3, May 2011. [22] Ismaili, I. A., Bhatti, Z., Shah, A. A., “Development of Unicode based bilingual Sindhi-English Dictionary”, Mehran University Research Journal of Engineering & Technology, Volume 31, No. 1, January 2012. [23] Bhatti, Z., Ismaili, I. A., Shaikh, A. A., Soomro, W. J., “Spelling Error Trends and Patterns in Sindhi”, Journal of Emerging Trends in Computing and Information Sciences, Vol. 3, No. 10, 2012. [24] Bhatti, Z., Ismaili, I. A., Khan, W., Nizamani, A. S., “Development of Unicode based Sindhi Typing System”, Journal of Emerging Trends in Computing and Information Sciences, Vol. 4, No. 3, 2013.