Unicode




Need for Unicode


As computers became a popular tool for doing all kinds of data processing across the world, their usage could not remain limited to English-language users only. Hence, people started developing computer systems that could allow interaction and processing of data in the local languages of users (e.g., Hindi, Japanese, Chinese, Korean, etc.). This required support for local-language characters and other language-specific symbols on these computer systems. ASCII and EBCDIC did not have enough bits to accommodate all the characters and language-specific symbols of a local language in addition to English alphabet characters and special characters. Hence, different encoding systems were designed to cater to the requirements of different local languages. In the process, hundreds of different encoding systems came into existence. Although this looked fine initially, it later led to a chaotic state of affairs due to the following reasons:

1: No single encoding system had enough bits and an adequate mechanism to support characters of all types of languages used in the world. Hence, supporting characters from multiple languages on a single computer system became a tedious job, since it required supporting multiple encoding systems on the computer. With hundreds of different encoding systems in use across the world, it became almost impossible to support all of them on a single system.

2: Different encoding systems, developed independently of each other, obviously conflicted with one another. That is, two encoding systems often used the same code for two different characters, or used different codes for the same character. Due to this problem, whenever data transfer took place between computer systems or software using different encoding systems, the data was at the risk of corruption. As a result, it became difficult to exchange text files internationally.

The Unicode standard was designed to overcome these problems. It is a universal character-encoding standard used for representation of text for computer processing. The official Unicode website (www.unicode.org/standard/WhatIsUnicode.html) states that it is an encoding system that provides "a unique number for every character, no matter what the platform, no matter what the program, no matter what the language".

Unicode Features

Today, Unicode is a universally accepted character-encoding standard because:
1: It provides a consistent way of encoding multilingual plain text. This enables data transfer through different systems without the risk of corruption.
2: It defines codes for characters used in all major languages of the world used for written communication. This enables a single software product to target multiple platforms, languages, and countries without re-engineering.
3: It also defines codes for special characters (such as various types of punctuation marks), mathematical symbols, technical symbols, and diacritics. Diacritics are modifying character marks, such as the tilde (~), that are used in conjunction with base characters to represent accented letters (indicating a different sound; for example, ñ).
4: It has the capacity to encode as many as a million characters. This is large enough for encoding all known characters, including all historic scripts of the world, as well as common notational systems.
5: It assigns each character a unique numeric value and name, keeping character coding simple and efficient.
6: It reserves a part of the code space for private use, to enable users to assign codes for their own characters and symbols.
7: It affords the simplicity and consistency of ASCII. Unicode characters that correspond to the familiar ASCII character set have the same byte values as those of ASCII. This enables use of Unicode in a convenient and backward-compatible manner in environments designed entirely around ASCII, like UNIX. Hence, Unicode is usable with existing software without extensive software rewrites (see the sketch after this list).
8: It specifies an algorithm for presentation of text with bi-directional behavior. For example, it can deal with a text containing a mixture of English (which uses a left-to-right script) and Arabic (which uses a right-to-left script). For this, it includes special characters to specify changes in direction when scripts of different directions are mixed. For all scripts, Unicode stores text in logical order within memory representation, corresponding to the order of typing on the keyboard.
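
A few of these features are easy to observe in practice. The following Python sketch (illustrative only; it uses the standard unicodedata module) checks features 3, 5, and 7 directly:

    import unicodedata

    # Feature 5: each character has a unique numeric value (code point) and name.
    print(ord("A"))                       # 65
    print(unicodedata.name("A"))          # LATIN CAPITAL LETTER A

    # Feature 3: a base character plus a combining diacritic (here, a combining
    # tilde, U+0303) composes to the accented letter ñ (U+00F1).
    print(unicodedata.normalize("NFC", "n\u0303") == "\u00F1")   # True

    # Feature 7: characters in the ASCII range keep their ASCII byte values
    # when encoded in UTF-8, so ASCII-based environments can read them as-is.
    print("A".encode("utf-8"))            # b'A' (the single byte 65, same as ASCII)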

As mentioned earlier, Unicode has a lot of room to accommodate new characters. Moreover, its growth process is strictly additive, in the sense that new characters can be added easily but existing characters cannot be removed. This feature ensures that data once encoded in the Unicode standard will be interpreted in the same way by all future implementations that conform to the original or later versions of the Unicode standard.

Unicode Encoding Forms

In addition to defining the identity of each character and its numeric value (also known as its code point), character-encoding standards also define internal representations (also known as encoding forms) of each character, i.e., how its value is represented in bits. The Unicode standard defines the following three encoding forms for each character:

1: UTF-8 (Unicode Transformation Format-8). This is a byte-oriented format having all Unicode characters represented as a variable-length encoding of one, two, three, or four bytes (remember, 1 byte = 8 bits). This form is useful for dealing with environments designed entirely around ASCII, because the Unicode characters that correspond to the familiar ASCII character set have the same byte values as those of ASCII. This form is also popular for HTML and similar protocols.

2: UTF-16 (Unicode Transformation Format-16). This is a word-oriented format having all Unicode characters represented as a variable-length encoding of one or two words (remember, 1 word = 16 bits). This form is useful for environments that need to balance efficient access to characters with economical use of storage. This is because all the heavily used characters can be represented by, and accessed via, a single word (16-bit code unit), while all other characters are represented by, and accessed via, a pair of words. Hence, this encoding form is reasonably compact and efficient, yet provides support for a larger number of characters.

3: UTF-32 (Unicode Transformation Format-32). This is a double-word-oriented format having all Unicode characters represented as a fixed-length encoding of two words (remember, 1 word = 16 bits). That is, a double word (32-bit code unit) encodes each character. This form is useful for environments where memory space is not a concern but fixed-width (single code unit) access to characters is desired.

Notice that at most 4 bytes (32 bits) are required for each character in all three forms of encoding. With the three forms of encoding supported, transmission of the same data in a byte, word, or double-word format (i.e., in 8, 16, or 32 bits per code unit) is possible, depending on its usage environment. The Unicode standard also provides the facility to transform Unicode-encoded data from one form to another without loss of data.
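
To make the difference concrete, here is a small Python sketch (illustrative; Python's built-in codecs implement all three forms, and the "-le" codec variants are used here only to fix the byte order and suppress the byte-order mark):

    # Count code units needed by each encoding form for three characters:
    # "A" (U+0041), "€" (U+20AC), and "😀" (U+1F600).
    for ch in ("A", "€", "😀"):
        print("U+%04X:" % ord(ch),
              len(ch.encode("utf-8")), "byte(s) in UTF-8,",
              len(ch.encode("utf-16-le")) // 2, "word(s) in UTF-16,",
              len(ch.encode("utf-32-le")) // 4, "double word(s) in UTF-32")
    # U+0041: 1 byte(s) in UTF-8, 1 word(s) in UTF-16, 1 double word(s) in UTF-32
    # U+20AC: 3 byte(s) in UTF-8, 1 word(s) in UTF-16, 1 double word(s) in UTF-32
    # U+1F600: 4 byte(s) in UTF-8, 2 word(s) in UTF-16, 1 double word(s) in UTF-32

    # Transformation between encoding forms is lossless: decode one form,
    # encode another, and the original text is preserved.
    text = "नमस्ते"
    assert text.encode("utf-8").decode("utf-8") == text
    assert text.encode("utf-16-le").decode("utf-16-le") == text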


COLLATING SEQUENCE
Data processing operations often use alphanumeric and alphabetic data elements. Obviously, we do not intend to perform any arithmetic operations on such data, but we frequently perform comparison operations on them, such as arranging them in some desired sequence. Now, if we want a computer to compare the alphabetic values A and B, which one will the computer treat as greater? Answering such questions necessitates having some assigned ordering among the characters used by a computer. This ordering is known as the collating sequence. Data processing operations mostly adopt a case-sensitive dictionary collating sequence.
A collating sequence may vary from one computer system to another, depending on the type of character coding scheme (computer code) used by the computer. To illustrate this, let us consider the computer codes already discussed in this chapter. Observe from Figures 4.2 and 4.3 that the zone values of characters A through 9 decrease in BCD code from the equivalent of decimal 3 down to 0, while the zone values of characters A through 9 increase in EBCDIC from the equivalent of decimal 12 to 15. Hence, a computer that uses BCD code will treat alphabetic characters (A, B, ..., Z) to be greater than numeric characters (0, 1, ..., 9), while a computer that uses EBCDIC will treat numeric characters to be greater than alphabetic characters. Similarly, observe from Figure 4.6 that during an ascending sort, a computer using ASCII will place numbers ahead of letters, because numbers have values less than those for letters. For a similar reason, it will place uppercase letters ahead of lowercase letters.
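
As a quick illustration, Python's default string comparison follows code point order, which matches the ASCII collating sequence for these characters:

    # Digits sort before uppercase letters, which sort before lowercase letters.
    print(sorted(["apple", "Apple", "42", "Zebra"]))
    # ['42', 'Apple', 'Zebra', 'apple']
    print(ord("A") < ord("B"))    # True: 'A' (65) sorts before 'B' (66)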

However, whatever may be the type of computer code used, the following rules are observed in most collating sequences:

1:  Letters are considered in alphabetic order (A < B < C < ... < Z) and (a < b < c < ... < z)
2:  Digits are considered in numeric order (0 < 1 < 2 < ... < 9)

EXAMPLE 4.8
A computer uses EBCDIC as its internal representation of characters. In which order will this computer sort the strings 23, A1, and 1A?
 

Solution:

In EBCDIC, numeric characters have values greater than alphabetic characters.
Hence, the said computer will place numeric characters after alphabetic characters, causing sorting of the given strings as:

                                        A1 < 1A < 23

Therefore, the sorted sequence will be: A1, 1A, and 23.
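
This result can be reproduced with a short Python sketch. Here, cp037 (an IBM EBCDIC code page shipped with Python's standard codecs) stands in for the EBCDIC table discussed above; sorting by the encoded bytes yields the EBCDIC collating order, while plain string sorting yields the ASCII order:

    strings = ["23", "A1", "1A"]

    # EBCDIC order: compare the cp037-encoded bytes of each string.
    print(sorted(strings, key=lambda s: s.encode("cp037")))
    # ['A1', '1A', '23']  -- letters collate before digits in EBCDIC

    # ASCII order: Python's default comparison uses code point values.
    print(sorted(strings))
    # ['1A', '23', 'A1']  -- digits collate before letters in ASCII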
