Han unification



         



Han unification is the process used by the authors of Unicode and the Universal Character Set to map the multiple character sets of the CJK languages into a single set of unified characters. Chinese characters are common to Chinese (where they are called hanzi), Japanese (where they are called kanji), and Korean (where they are called hanja). Modern Chinese, Japanese and Korean typefaces may represent a given Han character with somewhat different glyphs. In the formulation of Unicode, however, these differences were folded together. This process is referred to as "Han unification", and the resulting character repertoire is sometimes referred to as Unihan.

An article by IBM explains the issue well:

The problem stems from the fact that Unicode encodes characters rather than "glyphs," which are the visual representations of the characters. There are four basic traditions for East Asian character shapes: traditional Chinese, simplified Chinese, Japanese, and Korean. While the Han root character may be the same for CJK languages, the glyphs in common use for the same characters may not be.
For example, the traditional Chinese glyph for "grass" uses four strokes for the "grass" radical, whereas the simplified Chinese, Japanese, and Korean glyphs use three. But there is only one Unicode point for the grass character (草, U+8349) regardless of writing system. Another example is the ideograph for "one" (壹 壱 一), which is different in Chinese, Japanese, and Korean. Many people think that the three versions should be encoded differently.
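The code points mentioned in the quotation can be inspected directly. The following is a minimal Python sketch, not part of the IBM article, that uses only the standard library; the names `grass`, `ones` and `code_point` are illustrative choices made here.

```python
import unicodedata

# The unified "grass" character: a single code point, whatever the writing system.
grass = "\u8349"  # 草

# The three forms of "one" cited in the quotation: 壹, 壱, 一.
ones = ["\u58f9", "\u58f1", "\u4e00"]


def code_point(ch: str) -> str:
    """Format a single character in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


print(grass, code_point(grass))
for ch in ones:
    # Each form is looked up separately, so distinct code points show up here.
    print(ch, code_point(ch), unicodedata.name(ch))
```

Run as written, this reports one code point for the grass character and a separate code point for each of the three forms of "one". The glyph variation in the grass radical is therefore left entirely to the font, while the forms of "one" can be told apart at the encoding level.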

A slight difference in rendering can be a serious problem. Beyond the simple nuisance of Japanese text looking like Chinese, a name might be displayed with a different character: the same character as far as the encoding is concerned, but a different one in the eyes of users. This rendering problem is often cited to criticize Westerners for being unaware of subtle distinctions.

The process of Han unification was very controversial, with most of the opposition coming from the Japanese. Opponents of Han unification state that it steamrolls over thousands of years of cultural tradition, misses many of the subtleties that are among the most important features of these languages, and renders serious literature and academic research in these languages impossible. Proponents state that the unified characters in the Unicode BMP are "good enough" for almost all everyday uses of the languages that use these scripts, that Unicode 3.1 greatly extends this repertoire for academic and literary needs, and that other encodings are also available for specialist academic purposes.

Much of the controversy surrounding Han unification seems to be based on a misunderstanding of what the Unicode standard defines. Unicode defines what it calls graphemes, which are logical characters, as opposed to glyphs, which are particular visual representations of a character. One grapheme may be represented by many glyphs. An Arabic letter, for example, is represented by a single Unicode character but is displayed differently depending on whether it occurs at the beginning, middle or end of a word, or by itself. Unicode publishes charts with pictures for each character, but these are illustrations only and do not mandate the character's shape. Some objections seem to assume that what the Unicode standard pictures is how each character must be displayed, and protest when it does not match the local appearance of the character.

The way things are supposed to work is that a Japanese user will have a font with Japanese-style characters, a Chinese user will have a font with Chinese-style characters, and so on, so that everyone sees the "right" characters for them. Problems arise when several languages must be represented in the same document. This can be worked around with higher-level markup defining the language used for each string of characters, although this is cumbersome and may not always work correctly; see the demonstration below.
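The Arabic case of one logical character with several glyphs can be illustrated with a minimal Python sketch. It relies on an assumption worth stating: Unicode also carries legacy "presentation form" code points for the contextual shapes of Arabic letters, kept for compatibility with older encodings, and NFKC normalization folds them back to the logical letter. The letter BEH and the helper name `fold_to_character` are choices made here for illustration.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the logical character used in normal text

# Compatibility code points for the isolated, final, initial and medial shapes.
PRESENTATION_FORMS = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]


def fold_to_character(ch: str) -> str:
    """Map a presentation form back to its logical character via NFKC."""
    return unicodedata.normalize("NFKC", ch)


for form in PRESENTATION_FORMS:
    folded = fold_to_character(form)
    print(unicodedata.name(form), "->", unicodedata.name(folded))
```

All four shapes fold to the same character, which is the sense in which the shape is a rendering decision rather than part of the encoded text. Han unification applies the same principle to regional glyph variants, except that the choice of shape depends on language and font rather than on position in a word.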

Note that most of the opposition to Han unification appears to be Japanese; there has been very little opposition from Chinese speakers. Unlike either of the current systems (GB 2312 and Big5), Unicode covers both simplified and traditional characters and encodes them separately (e.g. the ideograph for "dragon" is 龍, U+9F8D, in Traditional Chinese and 龙, U+9F99, in Simplified Chinese), so it avoids the politically charged issue of simplified versus traditional characters. Unlike the alternatives, Unicode is also widely seen as politically neutral.
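The "dragon" example can be checked with a short Python sketch. It assumes that Python's bundled `gb2312` and `big5` codecs are a fair stand-in for the two legacy systems named above; the helper `can_encode` is a name introduced here for illustration.

```python
TRADITIONAL_DRAGON = "\u9f8d"  # 龍
SIMPLIFIED_DRAGON = "\u9f99"   # 龙


def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` is representable in the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for label, ch in [("traditional", TRADITIONAL_DRAGON),
                  ("simplified", SIMPLIFIED_DRAGON)]:
    # Report which of the three encodings can represent this form.
    support = {enc: can_encode(ch, enc) for enc in ("gb2312", "big5", "utf-8")}
    print(label, f"U+{ord(ch):04X}", support)
```

The two forms have distinct code points and both are representable in UTF-8. One would expect each legacy codec to accept only the form used in its own region, which is why a document mixing both forms is awkward to store in either of them.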

Specialist character sets developed to address, or regarded as not suffering from, these perceived deficiencies include:

However, none of these alternative standards have been as widely adopted as Unicode.


Check your browser:

The following table contains identical characters in all five rows, but each row is marked (via an HTML attribute) as being in a different language: Chinese (3 varieties - unmarked "Chinese", simplified characters, and traditional characters), Japanese, or Korean. So ideally your browser should select fonts and glyphs that suit each language better. See if it really happens.


Chinese (generic)
Chinese (Simplified)
Chinese (Traditional)
Japanese
Korean
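A table of this kind can be generated with a few lines of Python. This is a sketch under stated assumptions rather than the markup of the original page: the language tags (`zh`, `zh-Hans`, `zh-Hant`, `ja`, `ko`) are plausible BCP 47 tags chosen here, and the sample character passed in is arbitrary.

```python
from html import escape

# Row labels from the table, paired with assumed BCP 47 language tags.
ROWS = [
    ("Chinese (generic)", "zh"),
    ("Chinese (Simplified)", "zh-Hans"),
    ("Chinese (Traditional)", "zh-Hant"),
    ("Japanese", "ja"),
    ("Korean", "ko"),
]


def build_table(sample: str) -> str:
    """Return an HTML table that repeats `sample` once per language tag."""
    lines = ["<table>"]
    for label, tag in ROWS:
        # Only the lang attribute differs between rows; the text is identical.
        lines.append(
            f"  <tr><th>{escape(label)}</th>"
            f'<td lang="{escape(tag)}">{escape(sample)}</td></tr>'
        )
    lines.append("</table>")
    return "\n".join(lines)


print(build_table("\u8349"))  # the "grass" character, U+8349
```

Because every cell holds the same code points, any visible difference between rows comes from the browser choosing a font per `lang` value. If no suitable fonts are installed, all five rows will look the same.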
 










This article is from Wikipedia. All text is available under the terms of the GNU Free Documentation License.