iconv, a GNU utility, can help convert text documents back and forth between various encoding schemes. It is of particular interest to us Tamil-speaking folks because it can convert from UTF-8 to TSCII and back.
If you wanted to convert hello.utf8 from UTF-8 encoding into TSCII, you would name the source and target encodings with iconv's -f and -t options;
in the Linux shell environment you can then redirect the output into the TSCII-encoded file.
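As a minimal sketch of the conversion described above, the same iconv call can be driven from Python. The helper names `iconv_command` and `convert_file` and the output file name hello.tscii are made up for illustration; the sketch also assumes the installed iconv was built with TSCII support.

```python
import subprocess


def iconv_command(src, from_enc="UTF-8", to_enc="TSCII"):
    """Build the iconv argument list: -f is the source encoding, -t the target."""
    return ["iconv", "-f", from_enc, "-t", to_enc, str(src)]


def convert_file(src, dst, from_enc="UTF-8", to_enc="TSCII"):
    """Run iconv and send its standard output into dst, like `> dst` in the shell.

    Assumes an `iconv` binary on PATH that knows both encodings;
    check=True raises CalledProcessError if the conversion fails.
    """
    with open(dst, "wb") as out:
        subprocess.run(iconv_command(src, from_enc, to_enc), stdout=out, check=True)
```

Calling `convert_file("hello.utf8", "hello.tscii")` corresponds to the shell line `iconv -f UTF-8 -t TSCII hello.utf8 > hello.tscii`; swapping the two encoding arguments converts back.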
Developers: Someday I hope volunteers will add more historical Tamil encodings, primarily TAM, TAB, and other font-based encoding schemes, to libiconv. Please start development using the git repository at GNU sources.
Hello everyone! Hope this finds you in good spirits. In North America, Boston especially, the cold weather and holiday season are our Deepavali, as it were. The cold weather and festive season bring a lot of joys and challenges, but today I want to talk about the challenges of exchanging Tamil text and information on the Internet.
History of Tamil encodings
Tamil encoding has a long history, starting with 8-bit extended-ASCII encodings: TSCII and the TAM (Tamil Monolingual) and TAB (Tamil Bilingual) encodings of the late 80s and early 90s.
Then enter Windows. Microsoft Windows, together with Microsoft Word, let some Tamil software vendors introduce font-based encodings. This is probably the most egregious of all the Tamil encoding methods invented, IMO. Still, this would show the books in fonts like Latha, Lohit etc. You needed the right font map, with glyphs assigned independently of any standard encoding, to read the text. Otherwise the text would end up garbled, a mish-mash of ‘?’ or tofu-block characters.
Finally the Tamil computing community, software vendors, and members of INFITT (among others Mr. Chellappan of Palaniappa Brothers, Chennai, Prof. Ponnavaiko, and Dr. K. Kalyanasundaram of EPFL) sat down with the pioneering people at the Unicode Consortium and hashed out a chunk of the Unicode standard's code space for Tamil letters, which is what we have today. So now you know: if Android, iOS and Windows support Tamil text, it is most likely thanks to years of work from this motley crew of genteel anonymous strangers. Thanks everyone!
By now the web had matured since the 1990s, and Unicode support in blogs and in input method editors (IMEs) on Linux, Windows and Mac enabled the growth and exchange of Tamil content on the Internet. Unicode encoded in UTF-8 became the de facto standard of the Tamil community online, despite diktats from the Tamilnadu government and other standards agencies, which were left behind in the shadows of the stand-alone computing world. Welcome, Internet. Now change was not an option.
TACE v. UTF-8
Today there is an alternative proposal, and has been for many years, I think. The TACE standard is being championed by people in the Tamilnadu government and a few publishing agencies. Prof. Ponnavaiko is rumoured to be on some of these committees. The Wikipedia article summarizes their case for the TACE16 encoding standard, which rests on some computational ease in its representation. I still think UTF-8 suits general-purpose computation, especially on the web.
I take exception to pushing TACE16, particularly because of the not-invented-here syndrome. Unicode is not just for Tamil; it is shared with English, Arabic, Chinese, Cantonese and other tonal and syllabic languages. Still yet some languages are – so we don’t have the direct glyph per code mapping. The situation is somewhat similar for Hindi and other Indic languages.
Another reason is the advent of libraries, one of which I helped create and open-source: open-tamil. I’m sure other Tamil developers have their own versions, behind corporate or closed doors, to pre-process UTF-8 and then do text manipulation.
I guess time will pick the technology based on how capabilities evolve and on the utility of existing tools. I hope we don’t return to the font-based encodings of the 90s and 2000s, and instead live in the saner world of Unicode. If TACE16 should become a new standard, I hope someone makes the converters from and to UTF-8.
Now the contents of the webpage <BODY> tag are stored in the variable tatext.
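One way to end up with such a variable is sketched below, using only the Python standard library. This is an assumption about the approach, not the listing this post originally used; the class `BodyText`, the helper `body_text`, and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class BodyText(HTMLParser):
    """Collect the text found inside <body>, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BODY> arrives here as "body".
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            self.chunks.append(data)


def body_text(html):
    """Return the concatenated text content of the page body."""
    parser = BodyText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Hypothetical page; in practice the HTML would be downloaded and decoded as UTF-8.
page = "<html><head><title>t</title></head><BODY><p>தமிழ்</p></BODY></html>"
tatext = body_text(page)
```

For a live page, the HTML string could come from `urllib.request.urlopen(url).read().decode("utf-8")`, where `url` stands for whichever page you are reading.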
Parsing Tamil text
This is where the open-tamil library really shines: you can pull out the letters, in the right order, from a Tamil string stored in the multi-byte UTF-8 encoding. In other words, you can write programs at the level of Tamil letters instead of worrying about byte ordering, uyirmei grouping, etc.
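To see why letter-level access matters, here is a rough standard-library sketch of the grouping problem. It is not open-tamil's implementation; `split_letters` is a made-up helper that simply attaches each combining mark (vowel sign or pulli) to the consonant before it, and it ignores special cases such as conjuncts.

```python
import unicodedata


def split_letters(text):
    """Group each base character with the combining marks that follow it."""
    letters = []
    for ch in text:
        # Tamil vowel signs and the pulli are in Unicode categories Mc or Mn.
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


word = "தமிழ்"
print(len(word.encode("utf-8")))  # 15 bytes in UTF-8
print(len(word))                  # 5 code points
print(split_letters(word))        # ['த', 'மி', 'ழ்'], i.e. 3 letters
```

The same word is 15 bytes, 5 code points, or 3 letters depending on the level you look at, which is exactly the bookkeeping a letter-level API takes off your hands.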
Get the Tamil letters from the text using the ‘get_letters’ API,