
Internationalization in HTML and XML

By Aparna Ravindran

The World Wide Web allows information to flow freely across national borders. Internet surfers now expect information on the Web to be available in their own language and culture, yet the standards used to exchange information on the Web are built on English. Hyper Text Markup Language (HTML) is one such standard: its tags and attributes are English words or English abbreviations. eXtensible Stylesheet Language (XSL), which expresses the transformation between two eXtensible Markup Language (XML) documents, is another example.
Hence, unless they are internationalized, HTML and XML pages are geared toward presenting Web content in English. The solution is to internationalize the HTML and XML pages.
HTML internationalization
HTML internationalization can be addressed based on the following:
    • Character set encoding
    • Forms internationalization
    • Language-specific presentation
HTML documents can be encoded in any character set appropriate to the language, provided that the targets (browsers, palmtops, mobile phones, and so on) can interpret it. The targets must, however, be told which character set was chosen so that they can display the information correctly. It is therefore best to choose a standardized and widely accepted character set or encoding, such as Unicode (UTF-8, UTF-16), ISO 646, the ISO 8859 series, or ISO 10646 (the characters of the ISO 8859 series are a subset of Unicode).
You can specify the document's character encoding in a <META> tag, a child of the <HEAD> element, as follows:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
where UTF-8 is an encoding of Unicode (ISO 10646) that covers the Indic scripts. You can replace UTF-8 with the MIME name of your character set. The "text/html" part specifies that the content of the Web page is in HTML format. Not all protocols were designed to carry HTML, however. E-mail protocols such as SMTP and POP3 originally handled only plain text; the MIME standard was developed to extend e-mail so that it can carry any type of data. Apart from declaring the encoding inside the document, you can also declare it externally, for example in the charset parameter of the HTTP Content-Type header.
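A program that receives a page can read the declared encoding before decoding the content. The following is a minimal sketch of that idea in Python, using only the standard library. The class name, the helper function, and the sample page are illustrative assumptions, and the sample declares UTF-8 as its charset.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaCharsetFinder(HTMLParser):
    """Records the charset declared in a <META HTTP-EQUIV="Content-Type"> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "content-type":
            return
        # CONTENT looks like "text/html; charset=UTF-8".
        for part in (attributes.get("content") or "").split(";"):
            name, _, value = part.strip().partition("=")
            if name.strip().lower() == "charset":
                self.charset = value.strip()


def declared_charset(raw_page: bytes) -> Optional[str]:
    """Return the charset named in the META tag, or None if there is none."""
    finder = MetaCharsetFinder()
    # The META tag itself is plain ASCII, so a lossy ASCII decode is
    # enough to locate it even when the body uses another encoding.
    finder.feed(raw_page.decode("ascii", errors="replace"))
    return finder.charset


# Hypothetical sample page containing Hindi text, stored as UTF-8 bytes.
PAGE = (
    '<HTML><HEAD><META HTTP-EQUIV="Content-Type" '
    'CONTENT="text/html; charset=UTF-8"></HEAD>'
    "<BODY><P>नमस्ते</P></BODY></HTML>"
).encode("utf-8")

charset = declared_charset(PAGE)
text = PAGE.decode(charset)  # decode the whole page with the declared charset
```

Real browsers apply more rules than this (for example, an HTTP header takes precedence over the META tag), so the sketch covers only the in-document declaration.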
Forms must be internationalized as well. When data or files are submitted to the Web, the HTML forms mechanism should allow the character encoding of the submitted data to be negotiated with the server. To enable this, the <FORM> tag has an attribute named ACCEPT-CHARSET (supported from HTML 4.0), which specifies the character encodings that the server accepts. Its value is a list of one or more MIME charset names.
The main problem when submitting a form is telling the server how the submitted data is encoded. A form can be sent to the server by using either the POST method or the GET method. With POST, a MIME type describes the submitted data. You can specify the following header field before sending any data using the POST method:
Content-Type: application/x-www-form-urlencoded
With GET, the data is appended to the URL that is sent to the server. This method provides no mechanism for indicating the character encoding of the data carried in the URL.
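The effect of the missing encoding information can be seen with a short Python sketch. It percent-encodes a form field the way application/x-www-form-urlencoded data is written, then decodes it once with the right charset and once with the wrong one. The field name and value are illustrative assumptions.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical form field holding Hindi text.
fields = {"name": "भारत"}

# Percent-encode the field from its UTF-8 bytes. The result is pure
# ASCII and can travel in a POST body or after "?" in a GET URL.
body = urlencode(fields, encoding="utf-8")

# The server gets the original text back only if it decodes with the
# same charset that the sender used.
decoded_ok = parse_qs(body, encoding="utf-8")["name"][0]

# Decoding with a different charset raises no error here. It simply
# produces the wrong characters.
decoded_wrong = parse_qs(body, encoding="iso-8859-1")["name"][0]
```

Because the encoded string itself does not name its charset, sender and server have to agree on it by some other means, such as the encoding of the page that contained the form.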
Apart from character encoding and form submission, the language of the content itself must also be identified. HTML 4.0 introduced the LANG attribute for this purpose. It can be used with almost all HTML elements to specify the language of the content inside them. For example, consider the following <P> element:
<P LANG="hi"> ………</P>
The language of the <P> element is specified as Hindi, which tells the browser that the content of this element is in Hindi. Once the language is known, writing style, quotation marks, alignment, hyphenation, and similar conventions can be handled appropriately, which helps in internationalizing the document.
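To illustrate how software can use the LANG attribute, the following Python sketch pairs each piece of text in a page with the language declared on its nearest enclosing element. The class name and the sample markup are illustrative assumptions, not part of any standard API.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Pairs each run of text with the LANG value that applies to it."""

    def __init__(self):
        super().__init__()
        self._stack = []  # (tag, effective language) for each open element
        self.runs = []    # (language or None, text) in document order

    def handle_starttag(self, tag, attrs):
        # An element without its own LANG inherits the language of its parent.
        inherited = self._stack[-1][1] if self._stack else None
        self._stack.append((tag, dict(attrs).get("lang") or inherited))

    def handle_endtag(self, tag):
        # Close the nearest matching open element and anything left open inside it.
        for index in range(len(self._stack) - 1, -1, -1):
            if self._stack[index][0] == tag:
                del self._stack[index:]
                break

    def handle_data(self, data):
        if data.strip():
            language = self._stack[-1][1] if self._stack else None
            self.runs.append((language, data.strip()))


# Hypothetical sample: an English page with one Hindi paragraph.
collector = LangCollector()
collector.feed('<BODY LANG="en"><P>Hello</P><P LANG="hi">नमस्ते</P></BODY>')
runs = collector.runs
```

A renderer or spell checker could use such pairs to choose a font, hyphenation rules, or a dictionary for each run of text.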
XML internationalization
XML is flexible: it lets you store any kind of data, such as corporate information and system documents, and exchange that data between different environments. Hence XML internationalization focuses on:
    • Character set
    • Digits
    • Writing direction
Writing systems fall into different kinds, of which alphabets, ideographs, and syllabic notations are the most widely used. Writing systems also differ in how they represent digits, and the digit shapes vary significantly from language to language. The writing direction varies too: some scripts run left to right, others right to left.
XML documents can be internationalized with the help of eXtensible Stylesheet Language Transformations (XSLT). A stylesheet lets you translate the document from the source language to any target language according to the entries of an auxiliary dictionary document.
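The differences in digits and writing direction are recorded in the Unicode character database, which Python exposes through its unicodedata module. The sketch below is illustrative: the two helper functions are assumptions, and the direction check is a simple "first strong character" heuristic rather than the full Unicode bidirectional algorithm.

```python
import unicodedata


def to_int(digits: str) -> int:
    """Convert a run of decimal digits from any script to a number."""
    # int() accepts any Unicode decimal digits, not only ASCII 0-9.
    return int(digits)


def base_direction(text: str) -> str:
    """Guess the writing direction from the first strongly directional character."""
    for character in text:
        bidi_class = unicodedata.bidirectional(character)
        if bidi_class in ("R", "AL"):  # Hebrew, Arabic, and similar scripts
            return "rtl"
        if bidi_class == "L":          # Latin, Devanagari, Tamil, and so on
            return "ltr"
    return "ltr"  # no strong character found: fall back to left to right


devanagari_number = to_int("१२३")    # Devanagari digits one, two, three
arabic_number = to_int("١٢٣")        # Arabic-Indic digits one, two, three
hindi_direction = base_direction("हिन्दी")
urdu_direction = base_direction("اردو")
```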
The following steps describe the methods to translate an XML document using XSLT:
  1. Declare the stylesheet along with the encoding format of the target language.
  2. Specify the source and the target language along with the namespace declaration.
  3. Specify the output format for the target language.
  4. Specify the template tag to access the elements and their child elements. Each node accessed is looked up in the auxiliary dictionary, which identifies the source-language entry and replaces it with its target-language equivalent.
  5. Specify the apply-templates element when a template pattern needs to be applied to nested elements. You also need to use the call-template element. The difference between the two is that apply-templates processes whichever nodes it selects, each of which becomes the new context, whereas call-template invokes a named template within the current context.
  6. Specify the translate-element template to translate the attribute name and its values from the dictionary.
  7. Specify the filter template for each value of the attribute. If the value happens to be a function, then this template will replace every occurrence of the function with its target-language equivalent.
As a result, the XML document is converted into a specific target language.
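The dictionary lookup at the heart of these steps can be sketched outside XSLT. Python's standard library does not include an XSLT processor, so the example below is not a stylesheet: it walks the tree with xml.etree.ElementTree and renames elements and attributes from a dictionary. The dictionary entries, the function name, and the sample document are all illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical auxiliary dictionary: English names mapped to Hindi names.
DICTIONARY = {"book": "पुस्तक", "title": "शीर्षक", "lang": "भाषा"}


def translate_element(source, dictionary):
    """Return a copy of source with element and attribute names translated."""
    # Names missing from the dictionary are kept unchanged.
    target = ET.Element(dictionary.get(source.tag, source.tag))
    for name, value in source.attrib.items():
        target.set(dictionary.get(name, name), value)
    target.text = source.text
    target.tail = source.tail
    # Recurse into the children, much as nested templates are applied in XSLT.
    for child in source:
        target.append(translate_element(child, dictionary))
    return target


source = ET.fromstring('<book lang="en"><title>Ramayana</title></book>')
translated = translate_element(source, DICTIONARY)
output = ET.tostring(translated, encoding="unicode")
```

This sketch translates names only. Attribute values and text content are copied as they are, whereas the steps above also translate attribute values through the dictionary.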
HTML and XML share some common concerns, such as character set and writing direction. Their specific features allow you to internationalize each of them according to your requirements.

©2003-2007 Microsoft Corporation. All rights reserved.