BhashaIndia.com :: Chat Transcript #1


BhashaIndia Expert Chat #1
Chat Topic : Encoding (legacy and Unicode)
Chat Expert : Bob Eaton (MVP)

The first of the BhashaIndia Expert Chats took place on the 4th of November 2005, between 11 A.M. and 1:00 PM.

Bob Eaton, one of Microsoft’s Most Valuable Professionals, fielded a number of queries related to the following topics.

Encoding (legacy and Unicode) and encoding conversions (with respect to Devanagari issues)
How to create COM objects for use in scripting languages like Microsoft Word VBA?
How to make a VC++(nominally version 6.0) program support Unicode?
The chat was conducted from the Microsoft facility in Bangalore, India

Below is the transcript of the chat:

BhashaIndiaModerator (Expert) : It's my privilege to introduce Bob Eaton (Microsoft MVP), an expert who has answered many a query posed by you all on the Microsoft BhashaIndia Discussion forums. Just a small introduction before he starts fielding your questions and queries. Bob is a graduate in Electrical Engineering with a Master's Degree in Computer Science; he began his career as an embedded systems developer. A chance visit to India captivated him and planted the seed for a sustained interest in linguistics and its merger with computing. With the advent of Unicode, he immersed himself in converting programs to support Unicode. Currently pursuing his Ph.D. in Linguistics at the University of Texas, Arlington, with Kangri as his research subject, Bob has been actively involved with Indic computing. He has been instrumental in developing SIL Converters, which help users convert text in legacy-encoded fonts to Unicode.

BhashaIndiaModerator (Expert) : Mr. Eaton will be fielding queries on three primary issues in this BhashaIndia Expert Chat.
First and mainly:

Encoding (legacy and Unicode) and encoding conversions (with respect to Devanagari issues)

Following that we will discuss:

How to create COM objects for use in scripting languages like Microsoft Word VBA?
How to make a VC++(nominally version 6.0) program support Unicode?

BhashaIndiaModerator (Expert) : Some basic rules:

BhashaIndiaModerator (Expert) : Please refrain from sending any private messages to the expert during the chat

BhashaIndiaModerator (Expert) : This chat will last for one hour. During this hour, our Experts will respond to as many questions as they can. Please understand that there may be some questions we cannot respond to due to lack of information or because the information is not yet public. We encourage you to submit questions for our Experts.

BhashaIndiaModerator (Expert) : We ask that you stay on topic for the duration of the chat. This helps the Guests and Experts follow the conversation more easily. We invite you to ask off topic questions after this chat is over

BhashaIndiaModerator (Expert) : Please use the radial button "submit a question " to ask any questions to the expert

BhashaIndiaModerator (Expert) : Dear all, Please use the radial button "submit a question " to ask any questions to the expert

BhashaIndiaModerator (Expert) : The lower right corner, next to the chat input box, is a check box.

BhashaIndiaModerator (Expert) : Please use that for your questions to be logged as a question to the expert

Bob Eaton (Expert) :
Q: Are you there Bob?
A: Yes, I'm here. We're waiting for questions via the "Ask the Expert" check box to the right of your send button

Bob Eaton (Expert) :
A: On the question: What're the encodings available for Indian Languages? Unfortunately, there are many, many. This is because an "encoding" means a mapping between characters and code points. So in the font I used to use (a.k.a. a "Legacy" font), the letter क is at code point 0045; in the Susha font, it might be at code point 0072; in TTYogesh, it might be at code point 0088, etc. So text created in one font won’t display correctly in another. You have to have the font that was originally used to create the data.

Bob Eaton (Expert) :
A: Hi Mohan

Bob Eaton (Expert) :
A: With the Unicode "encoding", *every* font has that letter at the code point 0x0915

Bob Eaton (Expert) : This way, you can switch from any Unicode enabled font that you want and your text will still mean the same thing

Bob Eaton (Expert) : In terms of "conversion", the issue is how do you go from 0045 or 0072, or 0088 to 0915?
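
The conversion question above can be sketched in Python. This is a minimal illustration only: the per-font tables hold just the one letter and the code points quoted in the chat (which were themselves given as "might be" values), and the names FONT_TABLES and to_unicode are hypothetical, not part of any real converter.

```python
# Hypothetical one-letter tables: each legacy font puts "ka" at a different
# code point, while Unicode always uses U+0915.
FONT_TABLES = {
    "MyLegacyFont": {0x45: "\u0915"},  # illustrative value from the chat
    "Susha": {0x72: "\u0915"},         # illustrative value from the chat
    "TTYogesh": {0x88: "\u0915"},      # illustrative value from the chat
}


def to_unicode(codes, font_name):
    """Map a sequence of legacy code points to Unicode using that font's table."""
    table = FONT_TABLES[font_name]
    return "".join(table[code] for code in codes)


if __name__ == "__main__":
    # The same letter, stored three different ways, ends up as one code point.
    for font, code in (("MyLegacyFont", 0x45), ("Susha", 0x72), ("TTYogesh", 0x88)):
        print(font, hex(ord(to_unicode([code], font))))
```

A real converter needs a full table per font, and for Indic scripts a table alone is usually not sufficient.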

Bob Eaton (Expert) : Unfortunately, the only Indic encoding that has a code page is ISCII

Bob Eaton (Expert) : This means that if you have text in Susha, or TTYogesh or Annapurna, you have to have a "non-code page" solution for encoding conversion.

Bob Eaton (Expert) : Some folks in SIL have developed several tools for doing this and I've put "COM/.Net" wrappers around them to make them easier to use from programming tools, Word VBA, and such...

Bob Eaton (Expert) : This is how I got involved in the two main topics of this chat: encodings, conversion, VC++ conversion to support Unicode, Word VBA, etc.

Bob Eaton (Expert) : I've noticed that some of you have questions about Java (which to me is just what I drink in the morning) and other topics... unfortunately, that's outside my area of understanding. Perhaps there'll be another expert chat in a short while that will be able to address that need.

Bob Eaton (Expert) : Okay, so now let me answer some of your other questions.

Bob Eaton (Expert) :
Q: Will that work for DB Solutions say ISCII to Unicode on the fly or one time conversion (migration)
A: Is your DB accessible from Access?

Bob Eaton (Expert) :
Q: What's a Code page?
A: A code page is a table that can be used to convert from a single-byte encoding (or a double-byte encoding, such as Japanese) into a Unicode encoding (which is typically 2 bytes per character also). It is (in all but one case) for converting between a "legacy" encoding and Unicode. There is a code page to convert from UTF-8 to UTF-16 as well... but that's not a code page in its proper sense.

Bob Eaton (Expert) :
Q: We primarily use SQL Server 2000
A: Unfortunately, I've never used SQL Server, but the reason I asked about Access is because with Access, you can write "VBA" code to do manipulation of string in the database. On BhashaIndia, there are some sample snippets of VBA code in Access that show how to convert data in one encoding into Unicode. If you search the BhashaIndia forum archive for the string "CreateObject", you'll find an answer from me on how to do a conversion in Access. If Access can see your SQL tables, then you should be able to do the conversion from ISCII to Unicode easily enough

Subhashini (Moderator) :
Q: We did raise some questions just before in Guest Chat Area
A: Thanks for your questions; unfortunately due to few tech issues we lost them. Hence, request you to post your questions again.

Bob Eaton (Expert) :
Q: That way you say that "Code Page" more similar to a look up table for conversions?
A: Yes, in fact, we just had this discussion because I asked the Unicode list last week why they didn't provide "code page" support for Indic languages (e.g. Hindi). They said that the languages (or more properly, encodings) that have "code page support" are those that use a lookup table for the conversion. But the conversion from ISCII to Unicode is more algorithmic and complex than a lookup table. For example, Indic languages tend to have reordering issues (e.g. reph comes after the vowel it is pronounced before, ikar comes before the character it is pronounced after). In Unicode, the order is logical, so 'book' would be k - i - t - aa - b (kitaab) but in your legacy encoding it is probably, i - k - t - aa - b, so the re-ordering makes the "lookup table" approach not work.
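
The ikar reordering described above can be sketched as a small Python function. It assumes the legacy data has already been mapped letter by letter to Unicode characters but is still in visual (typed) order, and that the ikar is followed by a single consonant; conjuncts and reph are deliberately not handled. The function name reorder_ikar is hypothetical.

```python
I_MATRA = "\u093f"  # DEVANAGARI VOWEL SIGN I (the "ikar")


def reorder_ikar(visual: str) -> str:
    """Move each ikar from before its consonant (visual order) to after it (logical order)."""
    out = []
    i = 0
    while i < len(visual):
        ch = visual[i]
        if ch == I_MATRA and i + 1 < len(visual):
            # Emit the following consonant first, then the vowel sign.
            out.append(visual[i + 1])
            out.append(I_MATRA)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    # i - k - t - aa - b (visual order) becomes k - i - t - aa - b (kitaab).
    visual = "\u093f\u0915\u0924\u093e\u092c"
    print(reorder_ikar(visual))
```

Because the output position of a character depends on its neighbours, this step cannot be expressed as a one-to-one lookup table.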

Bob Eaton (Expert) :
Q: what is the difference between UTF8, UTF16 & Unicode?
A: In UTF16, the Unicode characters are represented (normally) by two bytes (e.g. as I mentioned 0915 for 'ka' in Devanagari). There is no odd number of bytes for a character in UTF-16. The problem with this encoding is that for ASCII letters (whose high byte is '0'), some programs will think this is a string termination. If they deal with strings of *bytes* for characters (rather than strings of *words*), they can't work in UTF-16.

Bob Eaton (Expert) :
Q: what is the difference between UTF8, UTF16 & Unicode?
A: In UTF-8, the *same* Unicode characters are represented in a "string of bytes" form. For example, the Devanagari 'k' (which is 0915 in UTF-16) is the three-byte sequence e0 a4 95. This is mainly for programs whose internal structure expects "narrow" strings of data rather than "wide" strings... Most programs that support Unicode will do so with UTF-16, but some might use UTF-8.
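
The difference between the two forms can be checked with Python's built-in codecs. This is a small sketch; the helper name encoded_forms is hypothetical, and "utf-16-le" is used so that no byte-order mark is added.

```python
def encoded_forms(ch: str) -> dict:
    """Return the code point and the UTF-16 (little-endian) and UTF-8 bytes of one character, as hex."""
    return {
        "code_point": format(ord(ch), "04x"),
        "utf16le": ch.encode("utf-16-le").hex(),
        "utf8": ch.encode("utf-8").hex(),
    }


if __name__ == "__main__":
    # Devanagari "ka": one 16-bit unit in UTF-16, three bytes in UTF-8.
    print(encoded_forms("\u0915"))
    # ASCII "A": the UTF-16 form contains a zero byte, which byte-oriented
    # programs can mistake for a string terminator.
    print(encoded_forms("A"))
```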

Bob Eaton (Expert) :
Q: Is Still numerals are considered as strings in Excel?
A: I'm not an expert in Excel, so I'm not sure, but I think numerals are stored in Excel documents as binary values, not strings. I say this because if I *want* the numeral to be represented as a string, I have to prefix it with a ' character.

Bob Eaton (Expert) :
Q: Are you talking about your SIL Converters?
A: I can answer any questions you might have about SIL Converters. I'll give a brief overview now in case it answers your question:

Bob Eaton (Expert) :
Q: Are you talking about your SIL Converters?
A: As I mentioned, some colleagues in SIL created a package called "TECkit" for creating a table and programming to convert from one encoding to another. There was another tool used for many years called "Consistent Changes". The problem with these tools is two fold: 1) they are rather complex to create and write and 2) the user-interface for using them is not very nice. So my piece of the puzzle was to create a COM/.Net wrapper around them. This wrapper has a primary function called "Convert" which takes an input string and does the conversion (using the underlying conversion engine) and returns the converted string. Currently the "conversion engines" available are: "TECkit", "CC", and the IBM ICU converters/transliterators. In the next release, I've added a Python, Perl, and Regular Expression plug-in

Bob Eaton (Expert) :
Q: Is there any way to get VB6 apps use IME, can we monitor the keyboard keys and then convert it to Hindi characters?
A: I'm pretty sure the answer is "no". Just like a VC++ program that is built without the _UNICODE compiler define, the Windows objects in the application are registered using the "Ansi" (a.k.a. narrow) methods. When you register a window as an Ansi window, it can't receive Unicode characters. In fact, the problem is that the Windows OS *thinks* such a program is a "narrow" app and it tries to convert the keyboard input (which is otherwise Unicode) into Ansi by using the default system code page. Unfortunately for Indic data, there is no "code page" support to convert those characters correctly. So even though your VB6 program can correctly *display* Unicode data (I've done it before), it can't take character input of the same.
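
What is lost in that wide-to-narrow step can be imitated in Python. This is only an analogy, not the Windows API: it assumes a Western European system code page (cp1252) and uses Python's "replace" error handler to stand in for the substitution character that a code page conversion produces. The function name to_ansi is hypothetical.

```python
def to_ansi(text: str, code_page: str = "cp1252") -> bytes:
    """Imitate a wide-to-narrow conversion: characters missing from the code page become '?'."""
    return text.encode(code_page, errors="replace")


if __name__ == "__main__":
    # A European accented letter has a slot in the code page and survives.
    print(to_ansi("caf\u00e9"))
    # Devanagari "ka" has no slot in this code page, so the character is lost.
    print(to_ansi("\u0915"))
    # A script with its own code page (Japanese, cp932) round-trips cleanly.
    print("\u65e5\u672c".encode("cp932").decode("cp932") == "\u65e5\u672c")
```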

Bob Eaton (Expert) :
Q: Which method/plug-in/language is best for conversion?
A: Depends on the application. For encoding conversion from, say, Susha to Unicode, there's a TECkit map available, so it is the 'latest' conversion tool and would therefore be preferred. Another application in the SIL Converters package is called "Spelling Fixer", which is used as a programmatic search and replace tool. If my language helper types the word "long" as लम्बा, and I want to use the anusvar instead (i.e. लंबा), I can make a search and replace to replace all "half-m" with anusvar. For this application, I would use CC, because it can very easily be added to programmatically. If I had "search and replace" needs, I might use Perl, which I don't know well, but which I'm becoming quite impressed with. In fact, this highlights to me one of the main benefits of SIL Converters: from a programming point of view, it really doesn't matter what underlying conversion engine you use. They all have the same interface (i.e. the "Convert" method I mentioned earlier). The wrapper classes take care of all the function
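
The "half-m to anusvar" replacement can be sketched with a plain string replace in Python. This is not how Spelling Fixer or CC is implemented; it is a minimal illustration of the rule, and the function name use_anusvara is hypothetical. A blanket replace like this would also change words where the half-m should stay, so a real rule would need to be narrower.

```python
HALF_M = "\u092e\u094d"  # DEVANAGARI LETTER MA followed by VIRAMA (the "half-m")
ANUSVARA = "\u0902"      # DEVANAGARI SIGN ANUSVARA


def use_anusvara(word: str) -> str:
    """Replace every half-m with the anusvar."""
    return word.replace(HALF_M, ANUSVARA)


if __name__ == "__main__":
    # "lamba" spelled with a half-m becomes the spelling with the anusvar.
    print(use_anusvara("\u0932\u092e\u094d\u092c\u093e"))
```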

BhashaIndiaModerator (Expert) : long as lambaa

Bob Eaton (Expert) :
Q: Is it possible convert date values to another language? Because Tamil has different set of Tamil months
A: I'm not sure, but if it is possible, I'll bet that IBM's ICU can do it.

Bob Eaton (Expert) :
Q: Hi Bob - I have seen an online converter - which converts from Legacy encoding (ASCII - Font Based Solution for Tamil) to Unicode. It uses find and replace method. Trying to simulate it in .NET. Is there any other better method other than find and replace?
A: If your Tamil "legacy" font doesn't have code page support (which it doesn't unless it follows the ISCII encoding standard), then you'll have to have another solution: TECkit or CC would also be possible if you have a serious need and don't mind spending the time learning the syntax and how to program in it. See the http://scripts.sil.org site for details about the TECkit and CC technologies. .Net only really helps you if there's a code page encoding involved.

Bob Eaton (Expert) :
Q: As I heard, You need to use FORMS2.0 Control for VB6 to display Unicode data. Isn't possible to enable in someway or other to get Unicode INPUT
A: I'm pretty sure that though the Forms 2.0 control will display Unicode data, it is not possible to get Unicode input with it. I think this is because the window that is getting the keyboard input initially (perhaps buried in the internals of VB) says it's an "Ansi" application and so the system will automatically convert the Unicode input to Ansi (which you don't want)--even though the control will otherwise support it. We've had this topic on BhashaIndia before and I'm pretty sure that this analysis is correct: there is no way to get Unicode input with VB6--even if you use controls that otherwise support Unicode.

BhashaIndiaModerator (Expert) : We solicit more questions. The forum is open

BhashaIndiaModerator (Expert) : Dear BhashaIndia Chat participants, we are now on the subject of how to make a VC++(nominally version 6.0) program support Unicode.

BhashaIndiaModerator (Expert) : So go ahead and post your queries

BhashaIndiaModerator (Expert) : Bob will provide a brief about the subject

Bob Eaton (Expert) :
Q: Is there any way to get VB6 apps use IME, can we monitor the keyboard keys and then convert it to Hindi characters?
A: Sorry for not reading your question more carefully: yes, if you write your own keyboard handler which knows how to map the "Ansi" character you receive from a keystroke and convert it to Unicode manually, then you could have Unicode input in VB6. For example, if you could associate some particular code page with your keyboard input, then you could "monitor" the input characters and convert

Bob Eaton (Expert) :
Q: Is there any way to get VB6 apps use IME, can we monitor the keyboard keys and then convert it to Hindi characters?
A: For example, you could take the 'k' key (English), which is code point 006B, and turn that into 0915, and then you would have it... but in that case, it isn't really Unicode input. It's Ansi input that you are converting to Unicode. This would work, but you'd basically have to create your own IME! (not a simple job, to be sure)

Subhashini (Moderator) : Hi all , this brings us close to the time up! We have 10 mins to conclude the chat. Please feel free to email Bob at pete_dembrowski@hotmail.com for any additional queries.

Bob Eaton (Expert) :
Q: Hi Bob....Where exactly could you use Unicode in MFC application and how to go about it?
A: Typically, all string data in MFC is in a CString object. In VC6, this object "changes shape" depending on whether you've built the application as a Unicode app (i.e. using the _UNICODE precompiler define) or an Ansi app (i.e. using the default _MBCS define). If you define _UNICODE, then the CString will be a string of wide, UTF-16 encoded text. If you use _MBCS, then the CString will be a string of narrow, Ansi (or perhaps UTF-8) characters. The thing to worry about is when the strings get converted at certain interfaces. For example, strcpy works only on narrow strings, so if you build with _UNICODE, the 'strcpy' must be changed to use the wide version, wcscpy. Better yet, use the encoding-independent version '_tcscpy' so it will also follow the compiler define (in _UNICODE, it becomes wcscpy and in _MBCS, it becomes _mbscpy, the multibyte counterpart of strcpy).

Bob Eaton (Expert) :
Q: Is it possible to list out the classes that deals with languages
A: Do you mean in .Net?

Subhashini (Moderator) : Hi Anbirkiniyan , I leave it to the mercy of the expert Bob to decide on that

Bob Eaton (Expert) :
Q: I created my own Tamil converter and that application (small piece of program) submitted to BhashaIndia also. I asked for the comments. No Reply Yet! .Please Reply if it is bad also!
A: Unfortunately, Tamil is "all Greek to me" (i.e. I can't even read it). What is the encoding that it is supposed to be for? I have some colleagues who might know the answer, but I'd need to know what font it is for to see if they are using it (they probably can't test it if it isn't one they use).

BhashaIndiaModerator (Expert) : Well, Bob is ready to chat for another day. So, half an hour is fine by him. Thanks Bob!

Bob Eaton (Expert) :
Q: Anyway Hope learning TECKIT would not cost money
A: It costs no money (it's freely downloadable), but it might cost in time
By the way, though SIL Converters contains the "run-time engine" of TECkit, it doesn't have the documentation on how to write a TECkit map. If you need to learn how to do that, then you'll want to download the TECkit package that includes documentation from http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=TECkitDownloads

BhashaIndiaModerator (Expert) : He has, Anbirkiniyan. The chat is on for another half an hour. We will end the chat at 12:45.

BhashaIndiaModerator (Expert) : Dear BhashaIndia Chat participant, we request you to refrain from asking the same question repeatedly. Please wait your turn while Mr. Eaton finishes answering the query at hand.

BhashaIndiaModerator (Expert) : Thanks

Bob Eaton (Expert) :
Q: I used Ansi only! But now trying for Unicode also (not yet submitted). But the layout used is the same that of the Tamilnadu state government. Why Microsoft itself try to adapt this layout (Provided by Government). Instead of everyone making trials
A: "Ansi" means "ASCII" (which is English a-z and A-Z plus punctuation, which are all in the lower 128 of a byte) plus some other combined letters and diacritics for European languages (in the "upper 128" of a byte). So when you say, that you are using "Ansi", what you mean is that you are using a legacy encoding for Tamil. There are no Tamil characters in "Ansi" (which is the problem that Unicode is trying to solve). If this comment doesn't answer your question, please ask it again.

Bob Eaton (Expert) :
Q: Bob if you wish I could give the Online Converter URL. Legacy Tamil font uses very similar approach to Susha. Also Tamil has some Encodings TAB, TAM, and TSCII
A: Please do mention the online conversion site on BhashaIndia. It'll be good to have a record. Do you know whether TSCII and the other Tamil encodings have code page support? If so, then conversion should be easy.

Bob Eaton (Expert) :
Q: Thanks Bob. We would give a TRY on TECKIT. Will bug you at the BhashaIndia forum, or by mail later. Hope you would not mind.
A: Not at all

BhashaIndiaModerator (Expert) :
Q:Thanks BhashaIndia Moderator. Hope half an hour more would be fine for now. We have more queries however. Especially the VC++ support though we have not worked in VC++. We would be happy to chat with Bob later if he could not stay more
A:Do start posting your queries on VC++ now

BhashaIndiaModerator (Expert) : Dear BhashaIndia Chat participants, we are now on the subject of how to make a VC++(nominally version 6.0) program support Unicode.

Bob Eaton (Expert) :
Q: Thanks Subhashini. Hope Bob would consider the request to stay little more..
A: On the topic of VC++ 6 to support Unicode, the issue can be stated as:

Change (or add) configurations for a Unicode build in which the default _MBCS pre-compiler define is replaced by _UNICODE (I make a copy of the "Release" and "Debug" configurations and call them ReleaseU and DebugU).
Fix all the compiler errors. This means:
Wrapping all string literals with the _T() macro (so each becomes a wide string in the _UNICODE build and a narrow string in an _MBCS build).
Converting all functions that take narrow strings (e.g. strcpy) to use the "build-independent" form (i.e. _tcscpy). This includes functions like atoi (to _ttoi), itoa (to _itot), etc...
Changing any prototypes like "const char *" to "LPCTSTR", etc...
Ask me offline for a more complete list. Are there any other questions on this topic?

Bob Eaton (Expert) :
Q: Thanks for taking time for the VC++. As we're not more familiar with VC++, can you start yourself? How different it's from VB6? Will that 'ANSI' window problem not applicable to VC++?
A: You can make a VC++ version 6 program support Unicode directly by defining the _UNICODE precompiler define as discussed before. When you do this there are two side effects that you might be happy to know:

It *will* take Unicode input from IMEs (because the windows will be registered as Unicode windows when you use the _UNICODE switch) and
It will only work on an NT-based OS (e.g. NT 5, Windows 2000, 2003, XP, etc). That is, without extra work, a VC++ v6 program built with the _UNICODE switch will not work on Win9x. Actually, there is a way to make it work, but it isn't likely to work too well with Indic (since it doesn't have good code page support) and is a tad bit complicated. Ask me offline if you want more details (or look up "MSLU" on the MSDN website--http://msdn.microsoft.com). Unfortunately, there is no equivalent in VB6 for the _UNICODE switch.

Bob Eaton (Expert) :
Q: Is it possible to input Unicode chars in VC++ program?
A: Yes, if you build it with the _UNICODE precompiler define. It receives Unicode characters by default. One thing to beware of is that when you use the _UNICODE define, your CStrings (in v6) are 'wide strings'. If you have an interface that requires a narrow interface, then you have to convert from 'wide' to 'narrow'. This is easily possible by using certain macros (in V6, it is the 'T2A' and 'A2T' macros; in VS.Net 2002 and newer, it is CT2A and CA2T classes). But beware that this conversion will use a code page (in VC6, it uses the default system code page; in VS.Net 2002 and newer, you can specify the code page to use). But this doesn't help us with Indic, because there are no code pages for Indic languages (except ISCII, and as far as I know, no fonts follow the ISCII standard).

Bob Eaton (Expert) :
Q: Heard about MSLU - Is it Microsoft layer for Unicode or something else? Is it supported now?
A: Yes. It allows you to build an application with the _UNICODE switch and yet interact with the OS using the "narrow" interfaces. At each of these interfaces, the 'wide' to 'narrow' conversion is done. But again, for Indic, this doesn't really help us. It works fine for ranges of Unicode that have code page support (e.g. Japanese, Korean, Turkish, etc), but not the others

BhashaIndiaModerator (Expert) :
A: Dear BhashaIndia Chat participants, we are heading towards the end of this session. We request you to keep the queries as concise as possible, since time is running short

Bob Eaton (Expert) :
Q: Bob: Can we have some code snippets and more help links?
A: Yes, most of what I've written is "open source" (or at least freely available). Unfortunately, the server where I've got it stored is not open... But if you tell me something in particular, I can send snippets (offline).

Bob Eaton (Expert) :
Q: That VC++ program should work anyway in Windows 98 may be without Unicode INPUT. Right?
A: No... if you build with _UNICODE (and don't use MSLU), it will link with certain kernel functions that don't exist on Win9x, so it really doesn't even run. It gives DLL linkage errors. If you use MSLU, then it will work, but as you mentioned, it won't take IME input. It'll display in Unicode correctly if you *manually* call output functions like DrawTextW rather than the default DrawTextA.

BhashaIndiaModerator (Expert) : Do check out Bob Eaton's interview on :
http://www.bhashaindia.com/Patrons/SuccessStories/bobeaton.aspx?lang=en

Bob Eaton (Expert) :
Q: It's really unfortunate MS doesn't provide similar switch in VB 6?
A: I think it's a matter of encouraging folks to switch to .Net for which all these problems go away. It doesn't take too much more to learn (.Net) and I find that I'm at least 100% more productive in .Net than I was in VC++ or VB (conservatively)

BhashaIndiaModerator (Expert) :
A: Some links of interest:
http://www.bhashaindia.com/Patrons/SuccessStories/bobeaton.aspx?lang=en
BI Moderator2:
for Solution Directory http://www.bhashaindia.com/community/solutiondirectory/knowmore.aspx

Bob Eaton (Expert) :
Q: Can you explain little on COM objects also? Do scripting languages support Unicode better?
A: Yes. VBScript, VBA, and .Net (and I'm sure JScript as well), support Unicode strings natively. Even 'String' variables in VB6 programs are UTF-16 (though if the data is from a legacy font it'll be "hacked-UTF16" and not true Unicode). And all strings that go thru COM channels have always been UTF-16 so they support it very well.

Bob Eaton (Expert) :
Q: Where to get standard Hindi text for GUI, like "File", "OK", "Cancel" ?
A: Good question! I'll see if MS will be willing to publish what they've used in their Hindi LIP and in Hindi office. It would really be great if we all used the same values and as far as I know, there is no "community" where the Hindi is listed (like with other Indic languages).

BhashaIndiaModerator (Expert) : We will be concluding the chat now!
We at the BhashaIndia team certainly hope that the chat was useful and informative for all of you. Thanks for taking the time to participate so enthusiastically in this BhashaIndia Expert Chat.

Thank you all for your active participation. Do look forward to more such interactive sessions on BhashaIndia.

And a big thanks to Bob for hosting this chat.

Please do find the chat transcript on the BhashaIndia site: http://www.bhashaindia.com

The transcript will also be hosted on:
http://www.microsoft.com/india/communities/chat/Transcripts.aspx

For more updates on similar activities related to Indic language computing do visit and register at www.bhashaindia.com

Subhashini (Moderator) : Thanks to all of you for participating in the chat today
And a special thanks to Bob for this informative session !

Subhashini (Moderator) : Id Mubarak! Have a great weekend

BhashaIndiaModerator (Expert) : Thanks for the opportunity to share and hope to see you all on BhashaIndia! Goodbye!
Thanks Bob, it was really an enlightening session for all of us.

The Report of the chat will be available at the BhashaIndia website