STANDARD CODE TABLE FOR URDU
Dr. Khaver ZIA
FAST Institute of Computer Science, Lahore, Pakistan
This paper reports on the issues involved in preparing a standard code table for Urdu. First, the characteristics of the Urdu language are highlighted. Then the requirement specifications of an Urdu code table are listed. Based on these, a code table is proposed. The features of the code table are then discussed. Finally, the direction of future enhancements is identified.
character codes, character encoding, code table, multilingual computing, Urdu
1. INTRODUCTION
The need for a standard code table for computerized language processing cannot be overemphasized. A standard code table has the same relation to computerized language processing as a standard keyboard has to typewriting. Unfortunately, until the present development, a standard code table for the Urdu language did not exist. This had a number of detrimental effects. A programmer developing application software for Urdu had no guidelines on which binary code to assign to each alphabetical character and had to formulate a customized code table. This severely restricted the portability of application software. Further, Urdu language data and information could not be ported from one platform to another or across a computer network.
A committee comprising linguists and computer experts was formed to rectify this situation. The committee examined all conceivable issues and problems relating to the design of the code table. After extensive deliberations a standard code table was proposed. Later on, the Urdu National Language Authority approved and presented it to the Government of Pakistan for its regulation.
This paper reports on the salient issues that were considered in formulating the code table along with some of its important features.
2. CHARACTERISTICS OF URDU LANGUAGE
Urdu is the national language and lingua franca of Pakistan. Its distinguishing characteristics are enumerated here for the benefit of the unacquainted reader. Urdu has its origin in the Arabic and Persian languages but is also influenced by Turkish and Sanskrit. Its alphabet is a superset of the Arabic and Persian alphabets and contains 39 characters. Figure 1 shows the alphabet of Urdu. Urdu is written from right to left. Unlike English, the characters do not have upper and lower case.
Further, the shape assumed by a character in a word is context sensitive, i.e. the shape differs depending on whether the character is at the beginning, in the middle or at the end of the constituent word. This generates three shapes, the fourth being the independent shape of the character. To be precise, the above is true for all except eleven characters. Ten of these have only two shapes, the independent and the terminating shape, while one character, namely Hamza, also has two shapes, the independent and the middle one. Figure 2 gives these four shapes for the second character of Urdu, i.e. the character Bay.
Urdu is traditionally written in Nastaleeq, a script rich in calligraphic content. Owing to the complexities of rendering, the basic shapes identified above are unable to render the language in an acceptable form of Nastaleeq.
The characters of Urdu also need diacritics to help in the proper pronunciation of the constituent word. The diacritics appear above or below a character to define a vowel or emphasize a particular sound. There are a number of diacritics, the common ones being Zabar, Zeir, and Pesh. Figure 3 shows the character Bay marked with these diacritics. Figure 4 shows Urdu text in Nastaleeq script with diacritics placed on the respective characters. Diacritics, though part of the language, are sparingly used. They are, however, essential for the removal of ambiguities, for natural language processing and for speech synthesis.
Thus, (a) the multiple shapes of characters, (b) the complexities of the traditional script of Urdu and (c) the existence of diacritics, are major factors that contributed to the difficulties in formulating a standard code table for Urdu.
3. INTERNAL AND EXTERNAL REPRESENTATION
Prior to computer processing, all of the shapes of a character had to be made available to the user of printing machinery. The typewriter keyboard had all these shapes and so had the "letter type" printing presses. This complicated text entry and reduced the speed of output. The computer, by contrast, can generate these shapes through software. Hence a fundamental decision was taken to represent internally only the independent shape of a character and to accommodate only these shapes in the proposed code table. The context sensitive shapes and fonts were considered an external representation to be handled by software.
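The division of labour described above can be illustrated with a minimal sketch. It is not part of the proposed standard; it only assumes that text is stored as a sequence of independent characters and that rendering software picks the context sensitive shape. The romanized character names, the NON_FORWARD_JOINING subset and the function select_shapes are hypothetical, and the special case of Hamza is deliberately left out.

```python
from enum import Enum


class Shape(Enum):
    INDEPENDENT = "independent"
    INITIAL = "initial"
    MIDDLE = "middle"
    FINAL = "final"


# Illustrative (incomplete) subset of the two-shape characters, i.e. those
# that never join to the character that follows them. The names are
# hypothetical romanizations, not identifiers from the code table.
NON_FORWARD_JOINING = {"Alif", "Dal", "Ray", "Wao"}


def select_shapes(word):
    """Return the external shape for each internally stored character."""
    shapes = []
    for i, ch in enumerate(word):
        # A character joins backwards only if its predecessor joins forwards.
        joins_prev = i > 0 and word[i - 1] not in NON_FORWARD_JOINING
        # A character joins forwards only if it is not last and can join.
        joins_next = i < len(word) - 1 and ch not in NON_FORWARD_JOINING
        if joins_prev and joins_next:
            shapes.append(Shape.MIDDLE)
        elif joins_prev:
            shapes.append(Shape.FINAL)
        elif joins_next:
            shapes.append(Shape.INITIAL)
        else:
            shapes.append(Shape.INDEPENDENT)
    return shapes
```

Under these assumptions the stored text never changes when a font or script style changes; only the shape selection step does, which is the motivation for keeping shapes out of the code table.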
4. REQUIREMENT SPECIFICATIONS FOR CODE TABLE
At the outset, certain guidelines were defined for formulating the code table. Firstly, it was laid down that the code table should fulfill the linguistic requirements and peculiarities of the language, rather than the language or its style being modified to ease the design of the code table. Secondly, the code table should implement the sorting sequence as specified by the Urdu Language Authority. Thirdly, it should support diacritics. Fourthly, it should support the calligraphic traditions of the language. Fifthly, it should be compatible with Unix and Windows platforms. Sixthly, it should facilitate application development and information exchange. Finally, the code table should be able to support future enhancements.
5. SORTING ISSUES
The code table must fulfill the basic need to sort the words of its parent language. In the design of the code table, a two-pass sorting scheme prescribed by the National Language Authority is followed. In the first pass, sorting is done on the basis of the characters alone, and the diacritics are ignored. The position of the characters in the code table controls sorting at this level. In the second pass, diacritics control any additional sorting, if required. The second pass, however, has to be carried out through software. For example, if two words are identical as regards characters but differ in diacritics, then their relative sort order is determined by the diacritics, with the comparison of diacritics beginning from the first character in each of the two words. This implies that the code table by itself is able to sort text without diacritics correctly, but needs software to sort text that includes diacritics. A similar situation exists in the case of English. The ASCII table is capable of sorting according to the ASCII collating sequence, but an algorithm is required to implement the sequence of a dictionary (i.e. lexical sort).
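The two-pass scheme can be sketched as a single composite sort key, given here only as an illustration and not as the algorithm prescribed by the Authority. A word is assumed to be a list of (character, diacritic) pairs. The orderings in CHAR_ORDER and DIACRITIC_ORDER are hypothetical stand-ins for the positions defined in the code table, and the rule that an unmarked character sorts before a marked one is likewise an assumption.

```python
# Hypothetical character order; in practice this would follow the
# positions of the characters in the code table.
CHAR_ORDER = ["Alif", "Bay", "Noon"]
# Hypothetical diacritic order; None means "no diacritic" and is assumed
# to sort first.
DIACRITIC_ORDER = [None, "Zabar", "Zeir", "Pesh"]

CHAR_RANK = {name: rank for rank, name in enumerate(CHAR_ORDER)}
DIACRITIC_RANK = {name: rank for rank, name in enumerate(DIACRITIC_ORDER)}


def sort_key(word):
    """Build a two-level key for a word given as (character, diacritic) pairs.

    Level 1 compares the characters alone (first pass). Level 2 compares
    the diacritics from the first character onwards, and only matters
    when the characters of two words are identical (second pass).
    """
    chars = tuple(CHAR_RANK[ch] for ch, _ in word)
    marks = tuple(DIACRITIC_RANK[mark] for _, mark in word)
    return (chars, marks)


def sort_words(words):
    """Sort words by characters first, then by diacritics."""
    return sorted(words, key=sort_key)
```

Because Python compares tuples element by element, the diacritic tuple is consulted only when the character tuples are equal, which mirrors the relationship between the first and second pass described above.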
6. FEATURES OF CODE TABLE
In the past, Urdu diacritics were not supported either by mechanical printing systems or by computer text processing systems. Representation of diacritics is necessary for machine translation, natural language processing and speech synthesis. For instance, the characters Bay and Noon can be joined to form the word Bay-Noon, which may be pronounced [bW n] (become), [bin] (without) or [bun] (sew). The reader knows the correct sound and meaning of such words from the context. (An example from English is the word "read" in the two sentences: I have read the book. I will read it.) For machine translation, however, it is essential to incorporate the diacritics. Similarly, for speech synthesis it is mandatory to include diacritics to enable the machine to generate the correct sound.
The proposed code table employs an 8-bit code and is specific to Urdu in that it is not designed to coexist with the ASCII table. A language-shift toggle is introduced that can switch the entry of text to any other language. There are two control character blocks in the code table, each of 32 characters, in positions compatible with ASCII. This enables the lower 128 characters and the upper 128 characters each to be represented by a 7-bit code. Setting the most significant bit to 0 accesses the lower block and setting it to 1 accesses the upper block.
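The relation between the 8-bit code and the two 7-bit halves can be sketched as follows. The function names are hypothetical. The check in is_control additionally assumes that each control block occupies the first 32 positions of its half (codes 0 to 31 and 128 to 159); the text above states only that the positions are compatible with ASCII, so this layout is an assumption.

```python
def split_code(code):
    """Split an 8-bit code into (block, 7-bit code).

    block is 0 for the lower 128 characters and 1 for the upper 128.
    """
    if not 0 <= code <= 0xFF:
        raise ValueError("code must fit in 8 bits")
    block = code >> 7          # most significant bit selects the block
    seven_bit = code & 0x7F    # remaining seven bits index into the block
    return block, seven_bit


def join_code(block, seven_bit):
    """Rebuild the 8-bit code from a block selector and a 7-bit code."""
    if block not in (0, 1) or not 0 <= seven_bit <= 0x7F:
        raise ValueError("block must be 0 or 1 and code must fit in 7 bits")
    return (block << 7) | seven_bit


def is_control(code):
    """Assumed layout: the first 32 positions of each half are controls."""
    _, seven_bit = split_code(code)
    return seven_bit < 32
```

For example, under this sketch the codes 0x41 and 0xC1 share the same 7-bit value 0x41 and differ only in the block selector.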
Another feature of the code table is the provision of vendor and expansion areas. The vendor can define custom characters but these shall not form part of the standard. The expansion block provides for authorized addition of new characters, if needed.
7. DIVISIONS OF CODE TABLE
The complete code table, shown in Figure 5, contains 256 characters divided into 10 blocks as under:
The standardization of the Urdu code table is a major development. It is the first time that all the requirements of the language have been addressed in such a comprehensive manner. It is now important for software developers to use the new code table for application development. The Government and standardization agencies need to take measures to implement the code table. For the future, work needs to be done to propose an Urdu code table for the 16-bit Unicode standard.
The author acknowledges the contribution by other members of the Committee in the formalization of the code table, namely: Mr. Asad Kamal Abbasi, Dr. Attash Durrani, Mr. Humayun Qureshi, Dr. Mohammad Afzal, Mr. Nadeem Malik, Dr. Sarmad Hussain, Mr. Tahir Mufti, Mr. Tariq Hameed and Mr. Tauqir Ghani. The author also acknowledges the cooperation and support extended by a number of organizations including FAST, KRL and National Language Authority, Pakistan.
Figure 1 - Figure 4