The world is turning local. Are you?
An Indian perspective to Unicode and Localisation

Unicode: a smarter, more powerful encoding

By Balendu Sharma Dadhich 17/04/06

Different people use computers for different purposes. The computer is an astonishing machine that can carry out tasks entirely different in nature, from mathematical computation to video processing, from exchanging email to developing graphics and animation, and even controlling complex factories and plants. Whatever we do with a computer, however, it essentially deals with data: data is what it takes as input, processes, stores, displays and outputs.

To us, data can take different forms such as sound, text, pictures, video and animation, but to a computer it is all numbers. To assign these numbers to different forms of data, and to interpret them, we use a system called encoding. For example, in ASCII encoding the value ‘65’ is assigned to the character ‘A’, which is stored on the computer’s hard disk as the binary digits ‘01000001’. Similarly, the digit ‘1’ is assigned the number ‘49’ and is stored as the binary digits ‘00110001’.
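The mapping described above can be checked directly. The following is a minimal sketch in Python; the helper name `ascii_bits` is made up for this illustration and is not part of any library.

```python
def ascii_bits(ch):
    """Return the number assigned to a character and its 8-digit binary form."""
    number = ord(ch)               # the number the encoding assigns to ch
    bits = format(number, "08b")   # the same number written as eight binary digits
    return number, bits

for ch in ("A", "1"):
    number, bits = ascii_bits(ch)
    print(ch, number, bits)

# The mapping also works in reverse: from number back to character.
print(chr(65))
```

Here `ord` and `chr` are Python built-ins that convert between a character and its number.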

Unicode is one such encoding standard, for characters (there are other kinds of encoding as well, such as video and audio encoding). It provides a mechanism to store text in almost all languages of the world in a common, systematic and cohesive format. Unicode was originally designed as a 16 bit encoding with room for 65,536 characters, but its code space has since been extended to 1,114,112 code points (U+0000 to U+10FFFF), of which Unicode 5.0.0 assigns about 99,000 to characters. Since this space is large enough to hold the characters of the world’s writing systems, Unicode accommodates them in a single standard, which has paved the way for the development of multilingual systems and solutions.
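To see that characters from different scripts each get their own number in the same standard, we can ask Python's standard `unicodedata` module for a character's number and official name. This is only a sketch; the function name `describe` and the sample characters are chosen for illustration.

```python
import unicodedata

def describe(ch):
    """Return the character's number in U+XXXX notation and its Unicode name."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)

# One Latin letter, one Devanagari letter and one commercial symbol.
for ch in ("A", "अ", "€"):
    print(ch, *describe(ch))
```

The names and numbers come from the Unicode character database bundled with the Python version in use, so very recently added characters may be missing on older installations.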

Until a few years ago, we had many encoding standards, including the popular ASCII, to assign numbers to characters. All of them limited the number of characters they could support. Seven bit ASCII can represent at most 128 characters, and eight bit encodings such as the ANSI code pages or ISCII at most 256. That is enough for one small alphabet at a time, but not for languages with large character sets, let alone for several languages together. A computer using them was a self-contained powerhouse, limited in effect to one language. Other languages could be used on it, but only through ad hoc workarounds.
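The limits mentioned above are easy to demonstrate. The sketch below, with a hypothetical helper `fits`, tries to store a Hindi word in a seven bit and an eight bit encoding and then in a Unicode encoding.

```python
def fits(text, encoding):
    """Return True if every character of text can be stored in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(2 ** 7, 2 ** 8)            # how many values 7 and 8 bits can hold

hindi = "हिंदी"                    # the word "Hindi" in Devanagari
for encoding in ("ascii", "latin-1", "utf-8"):
    print(encoding, fits("Hello", encoding), fits(hindi, encoding))
```

Only the Unicode encoding (UTF-8 here) accepts both the English and the Hindi text.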

Different encodings sometimes interpreted the same number as different characters. The existence of multiple, conflicting encodings created innumerable problems, because data transferred between computers that used different encoding standards lost its meaning.
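As a small illustration of such a conflict, the sketch below reads one and the same stored byte under two older eight bit encodings. The choice of Latin-1 and Windows-1251 is ours, made only because both ship with Python.

```python
raw = bytes([0xE9])                  # one stored number: 233

as_western = raw.decode("latin-1")   # read as Western European text
as_cyrillic = raw.decode("cp1251")   # read as Cyrillic text

print(as_western, as_cyrillic)       # two different letters from one number

# In Unicode the two letters have different numbers, so they cannot be confused.
print(f"U+{ord(as_western):04X}", f"U+{ord(as_cyrillic):04X}")
```

The receiving computer has no way to tell from the byte alone which letter was meant, which is exactly how transferred data lost its meaning.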

Unicode is the solution to these problems. It provides a unique number, called a code point, for every character it encodes, across virtually all languages (a few exceptions do exist, however), and these are not just a few modern languages. Unicode even supports ancient languages that are no longer in regular use, such as Pali and Prakrit, through the scripts in which they are written. It also covers non-linguistic symbols such as mathematical, scientific and commercial symbols, and has room for adding new scripts in the future. Surely, Unicode is the standard for the future.

Compared to older ways of handling character and string data, Unicode simplifies the localization of software and improves multilingual text processing. By using Unicode to represent character and string data in your applications, you enable universal data exchange and can serve a global market with a single binary that handles every character code, instead of a separate build for each language. Some features that make Unicode preferable are:

-Unicode allows any combination of characters, drawn from any combination of scripts and languages, to co-exist in a single document.

-Unicode defines semantics for each character.

-Unicode standardizes script behaviour.

-Unicode provides a standard algorithm for bi-directional text.

-Unicode defines cross-mappings to other standards.

-Unicode defines multiple encoding forms of its single character set: UTF-8, UTF-16 and UTF-32 (UTF-7 is a separate, mail-safe encoding defined outside the standard). Conversion of data among these encoding forms is lossless.
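Two of the features listed above can be tried out with a few lines of Python: mixed-script text survives conversion between the Unicode encodings unchanged, and every character carries properties such as its category and its direction for bi-directional text. This is a minimal sketch; the sample string and the variable names are our own.

```python
import unicodedata

text = "Unicode यूनिकोड"    # English and Hindi in one string

# Store the same text in three Unicode encodings and read it back.
encoded = {name: text.encode(name) for name in ("utf-8", "utf-16-le", "utf-32-le")}
for name, data in encoded.items():
    assert data.decode(name) == text      # nothing is lost on the way back
    print(name, len(data), "bytes")

# Converting from one encoding to another gives the same result as encoding directly.
via_utf8 = encoded["utf-8"].decode("utf-8").encode("utf-16-le")
assert via_utf8 == encoded["utf-16-le"]

# Per-character properties: general category and bi-directional class.
# "\u05d0" is the Hebrew letter alef, written right to left.
for ch in ("A", "अ", "\u05d0"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.bidirectional(ch))
```

The byte counts differ from one encoding to the next, but the text that comes back is identical in every case.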

An effort to promote unhindered use of Indian languages in Information Technology
Copyright:
localisationlabs.com. 2006. Since: March, 2006.
A website by Balendu Sharma Dadhich.