Need for Unicode in Computers
Need for Unicode
As computers became a popular tool for doing all kinds of data processing across the world, their usage could not remain limited to English-language users only. Hence, people started developing computer systems that could allow interaction and processing of data in the local languages of users (e.g., Hindi, Japanese, Chinese, Korean, etc.). This required support for local language characters and other language-specific symbols on these computer systems. ASCII and EBCDIC did not have enough bits to accommodate all the characters and language-specific symbols of a local language in addition to English alphabet characters and special characters. Hence, different encoding systems were designed to cater to the requirements of different local languages. In the process, hundreds of different encoding systems came into existence.
Although this looked fine initially, it later led to a chaotic state of affairs due to the following reasons:
1: No single encoding system had enough bits and an adequate mechanism to support the characters of all the languages used in the world. Hence, supporting characters from multiple languages on a single computer system became a tedious job since it required supporting multiple encoding systems on the same computer. With hundreds of different encoding systems in use across the world, it became almost impossible to support all of them on a single system.
2: Different encoding systems, developed independently of each other, obviously conflicted with one another. That is, two encoding systems often used the same code for two different characters, or used different codes for the same character. Due to this problem, whenever data transfer took place between computer systems or software using different encoding systems, the data was at the risk of corruption. As a result, it became difficult to exchange text files internationally.
The Unicode standard was designed to overcome these problems. It is a universal character-encoding standard used for representation of text for computer processing. The official Unicode website (www.unicode.org/standard/WhatIsUnicode.html) states that Unicode provides "a unique number for every character, no matter what the platform, no matter what the program, no matter what the language".
Unicode Features
Today, Unicode is a universally accepted character-encoding standard because:
1: It provides a consistent way of encoding multilingual plain text. This enables data transfer through different systems without the risk of corruption.
2: It defines codes for characters used in all the major languages of the world used for written communication. This enables a single software product to target multiple platforms, languages, and countries without re-engineering.
3: It also defines codes for special characters (such as various types of punctuation marks), mathematical symbols, technical symbols, and diacritics. Diacritics are modifying character marks, such as the tilde (~), that are used in conjunction with base characters to represent accented letters, indicating a different sound (for example, ñ).
4: It has the capacity to encode as many as a million characters. This is large enough for encoding all known characters, including all historic scripts of the world, as well as common notational systems.
5: It assigns each character a unique numeric value and name, keeping character coding simple and efficient.
6: It reserves a part of the code space for private use to enable users to assign codes for their own characters and symbols.
7: It affords the simplicity and consistency of ASCII. Unicode characters that correspond to the familiar ASCII character set have the same byte values as those of ASCII (see the sketch after this list). This enables use of Unicode in a convenient and backward-compatible manner in environments designed entirely around ASCII, such as UNIX. Hence, Unicode is usable with existing software without extensive software rewrites.
8: It specifies an algorithm for presentation of text with bi-directional behavior. For example, it can deal with text containing a mixture of English (which uses a left-to-right script) and Arabic (which uses a right-to-left script). For this, it includes special characters to specify changes in direction when scripts of different directions are mixed. For all scripts, Unicode stores text in logical order within the memory representation, corresponding to the order of typing on the keyboard.
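To make features 5 and 7 concrete, here is a minimal sketch in Python 3, using only the built-in unicodedata module (the sample characters are illustrative):

```python
import unicodedata

# Feature 5: each character has a unique numeric value (code point) and a name.
for ch in ["A", "ñ", "अ"]:
    print(f"{ch!r}: U+{ord(ch):04X} {unicodedata.name(ch)}")

# Feature 7: characters in the ASCII range keep their ASCII byte values
# when encoded in UTF-8, so Unicode is backward compatible with ASCII.
text = "Hello"
assert text.encode("utf-8") == text.encode("ascii")
print(text.encode("utf-8"))  # b'Hello' -- identical to the ASCII bytes
```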
As mentioned earlier, Unicode has a lot of room to accommodate new characters. Moreover, its growth process is strictly additive in the sense that new characters can be added easily but existing characters cannot be removed. This feature ensures that data once encoded in the Unicode standard will be interpreted in the same way by all future implementations that conform to the original or later versions of the Unicode standard.
Unicode Encoding Forms
In addition to defining the identity of each character and its numeric value (also known as code point), character-encoding standards also define the internal representation (also known as encoding form) of each character, that is, how its value is represented in bits. The Unicode standard defines the following three encoding forms for each character:
1: UTF-8 (Unicode Transformation Format-8). This is a byte-oriented format having all Unicode characters represented as a variable-length encoding of one, two, three, or four bytes (remember, 1 byte = 8 bits). This form is useful for dealing with environments designed entirely around ASCII because the Unicode characters that correspond to the familiar ASCII character set have the same byte values as those of ASCII. This form is also popular for HTML and similar protocols.
2: UTF-16 (Unicode Transformation Format-16). This is a word-oriented format having all Unicode characters represented as a variable-length encoding of one or two words (remember, 1 word = 16 bits). This form is useful for environments that need to balance efficient access to characters with economical use of storage. This is because all the heavily used characters can be represented by, and accessed via, a single word (16-bit code unit), while all other characters are represented by, and accessed via, a pair of words. Hence, this encoding form is reasonably compact and efficient, yet provides support for a larger number of characters.
3: UTF-32 (Unicode Transformation Format-32). This is a double-word-oriented format having all Unicode characters represented as a fixed-length encoding of two words (remember, 1 word = 16 bits). That is, a double word (32-bit code unit) encodes each character. This form is useful for environments where memory space is not a concern but fixed-width (single code unit) access to characters is desired.
Notice that at most 4 bytes (32 bits) are required for each character in all three encoding forms. With the three encoding forms supported, transmission of the same data in a byte, word, or double-word format (i.e., in 8, 16, or 32 bits per code unit) is possible depending on its usage environment. The Unicode standard also provides the facility to transform Unicode-encoded data from one form to another without loss of data.
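The following minimal Python 3 sketch illustrates the three encoding forms and the lossless transformation between them (the sample characters are illustrative; the little-endian codec variants are used so that Python's byte-order mark does not affect the byte counts):

```python
# Each character below needs 1, 2, 3, and 4 bytes respectively in UTF-8.
for ch in ["A", "ñ", "€", "𝄞"]:
    utf8 = ch.encode("utf-8")        # variable length: 1-4 bytes
    utf16 = ch.encode("utf-16-le")   # one or two 16-bit words
    utf32 = ch.encode("utf-32-le")   # always a single 32-bit code unit
    print(f"U+{ord(ch):04X}: UTF-8={len(utf8)}B UTF-16={len(utf16)}B UTF-32={len(utf32)}B")

# Transformation between forms is lossless: re-encode through another form
# and back, and the original data is recovered exactly.
data = "Grüße, 世界".encode("utf-8")
roundtrip = data.decode("utf-8").encode("utf-32-le").decode("utf-32-le").encode("utf-8")
assert roundtrip == data
```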
COLLATING SEQUENCE
Data processing operations often use alphanumeric and alphabetic data elements. Obviously, we do not intend to perform any arithmetic operation on such data, but we frequently perform comparison operations on them, such as arranging them in some desired sequence. Now, if we want a computer to compare the alphabetic values A and B, which one will the computer treat as greater? Answering such questions necessitates having some assigned ordering among the characters used by a computer. This ordering is known as the collating sequence. Data processing operations mostly adopt a case-sensitive dictionary collating sequence.
A collating sequence may vary from one computer system to another depending on the type of character coding scheme (computer code) used by the computer. To illustrate this, let us consider the computer codes already discussed in this chapter. Observe from Figures 4.2 and 4.3 that the zone values of characters A through 9 decrease in BCD code from the equivalent of decimal 3 down to 0, while the zone values of characters A through 9 increase in EBCDIC from the equivalent of decimal 12 to 15. Hence, a computer that uses BCD code will treat alphabetic characters (A, B, ..., Z) as greater than numeric characters (0, 1, ..., 9), while a computer that uses EBCDIC will treat numeric characters as greater than alphabetic characters. Similarly, observe from Figure 4.6 that during an ascending sort, a computer using ASCII will place numbers ahead of letters because numbers have values less than those of letters. For a similar reason, it will place uppercase letters ahead of lowercase letters.
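As a quick illustration, here is a sketch assuming Python 3, whose default string comparison follows code-point order and therefore matches the ASCII collating sequence for these characters:

```python
# Digits sort ahead of uppercase letters, which sort ahead of lowercase
# letters, because of their ASCII/Unicode code-point values.
items = ["banana", "Apple", "42", "apple", "Banana", "7zip"]
print(sorted(items))
# ['42', '7zip', 'Apple', 'Banana', 'apple', 'banana']
```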
However, whatever the type of computer code used, most collating sequences observe the following rules:
1: Letters are considered in alphabetic order (A < B < C < ... < Z) and (a < b < c < ... < z).
2: Digits are considered in numeric order (0 < 1 < 2 < ... < 9).
EXAMPLE 4.8
A computer uses EBCDIC as its internal representation of characters. In which order will this computer sort the strings 23, A1, and 1A?
Solution:
In EBCDIC, numeric characters have values greater than alphabetic characters. Hence, the said computer will place numeric characters after alphabetic characters, causing sorting of the given strings as:
A1 < 1A < 23
Therefore, the sorted sequence will be: A1, 1A, and 23.
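We can verify this with a small Python 3 sketch, using the built-in cp500 codec (one common EBCDIC variant) to sort by EBCDIC byte values rather than by ASCII/Unicode code points:

```python
strings = ["23", "A1", "1A"]

# In EBCDIC, letters have smaller byte values than digits ('A' = 0xC1,
# '1' = 0xF1), so sorting by the encoded bytes gives the EBCDIC order.
ebcdic_sorted = sorted(strings, key=lambda s: s.encode("cp500"))
ascii_sorted = sorted(strings)  # default code-point (ASCII) order

print("EBCDIC order:", ebcdic_sorted)  # ['A1', '1A', '23']
print("ASCII order: ", ascii_sorted)   # ['1A', '23', 'A1']
```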