
Linguistic 'Characters'

By Cathy Wissink & Michael S.Kaplan - Windows Globalization, Microsoft Corporation

A linguistic character is, for lack of a better term, a character that a speaker recognizes as belonging to the language. (Technically, this is a grapheme; for simplicity's sake, this paper calls it a linguistic character.) It is important to note that not every linguistic character is a single code point in an encoding (including Unicode), as the following European examples demonstrate:

  • Spanish CH: consists of two code points, C (U+0043) and H (U+0048), and sorts as a unique character between C and D;
  • Hungarian DZ: sorts as a unique character between D (U+0044) and DZS (note that DZS is also a unique character for Hungarian in its own right, sorting before E and after DZ);
  • Norwegian Ø (U+00D8): sorts as a unique character after Z (U+005A) and Ä (U+00C4). (This particular character could also be rendered as a combination of O [U+004F] and a non-spacing slash [U+0338].)
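To make the Spanish CH case concrete, here is a minimal Python sketch of a sort key that treats a two-code-point sequence as one linguistic character. The names `CONTRACTIONS`, `PRIMARY`, `linguistic_characters`, and `sort_key` are hypothetical, and the alphabet is simply the 26 basic Latin letters with CH inserted after C. This is an illustration of the idea only, not how Windows or any particular collation library implements it.

```python
import string

# Assumption: a toy traditional-Spanish alphabet, i.e. a-z with "ch"
# inserted as its own linguistic character between "c" and "d".
CONTRACTIONS = {"ch"}
_alphabet = list(string.ascii_lowercase)
_alphabet.insert(_alphabet.index("c") + 1, "ch")

# Position in the alphabet stands in for the primary weight.
PRIMARY = {unit: weight for weight, unit in enumerate(_alphabet)}


def linguistic_characters(word):
    """Split a word into linguistic characters, keeping "ch" together."""
    word = word.lower()
    units = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in CONTRACTIONS:
            units.append(pair)      # two code points, one linguistic character
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


def sort_key(word):
    """Map each linguistic character to its (toy) primary weight."""
    # Characters outside the toy alphabet sort after everything else.
    return [PRIMARY.get(unit, len(PRIMARY)) for unit in linguistic_characters(word)]


words = ["dama", "chico", "cuna", "casa"]
print(sorted(words))                # code point order: chico lands between casa and cuna
print(sorted(words, key=sort_key))  # CH as its own character: chico lands after cuna
```

With this key, every word beginning with CH follows every word beginning with a plain C, which is the grouping the example above describes.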

Speakers of a language expect words (or strings) in a list to be grouped according to the linguistic characters of their language. These characters may vary in form (for example, a letter may be upper or lower case, or may carry a diacritic), but they remain the core characters by which words in the language are ordered. To clarify the concept of these core characters, consider the following sorted list:
apple
Apple
Are
ban
banana
Banana

Notice that the difference between the letters A and B is significant enough that all the strings beginning with A come before all the strings beginning with B, regardless of case. In collation, this difference is referred to as weight. The weight of a linguistic character, often called a primary weight, takes precedence over all other weights discussed later in the paper.
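The ordering of the list above can be reproduced with a small Python sketch. It assumes, purely for illustration, that the case-folded string approximates the sequence of primary weights and that case acts only as a tie-breaker with lower case first; the function name `collation_key` is hypothetical, and real collation tables are considerably richer than this.

```python
def collation_key(word):
    """Toy two-level key: primary weights first, case only as a tie-breaker."""
    # Assumption: the case-folded text stands in for the primary weights.
    primary = word.casefold()
    # Assumption: lower case sorts before upper case when primaries are equal.
    case_level = [ch.isupper() for ch in word]
    return (primary, case_level)


words = ["Banana", "ban", "Are", "apple", "banana", "Apple"]

# Plain code point order groups by case, not by linguistic character.
print(sorted(words))

# Primary weight dominates: every A word precedes every B word.
print(sorted(words, key=collation_key))
```

Because the primary component is compared first, case never moves a word across the boundary between A and B; it only orders words whose primary weights are identical.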

It is also important to note that what constitutes a linguistic character (and, correspondingly, what receives a primary weight in collation) depends on the writing system and the specific language. In alphabets, the linguistic character is generally a letter. In ideographic languages (e.g., Chinese, Japanese, Korean), the linguistic character is determined by a number of factors far beyond the scope of this paper; suffice it to say that primary weighting for the linguistic characters in these languages can be based on the main radical within the character, the stroke count of the character, or even pronunciation (e.g., Taiwanese "Bopomofo"). Some languages whose scripts have inherent vowels (e.g., Devanagari-script languages, Tamil) use particular modifiers to determine linguistic characters and primary weights.

©2003-2007 Microsoft Corporation. All rights reserved.