Story of Tamil encodings: TACE, UTF-8 and open-tamil

Hello everyone! Hope this finds you in good spirits. In North America, Boston especially, the cold weather and holiday season is our Deepavali, as it were. The cold weather and festive season bring a lot of joys and challenges, but today I want to talk about the challenges of exchanging Tamil text and information on the Internet.

Christmas tree lighting ceremony at Boston Common. Dec, 2014.

History of Tamil encodings

Tamil encoding has a long history, starting with the 8-bit extended-ASCII encodings of the late 80s and early 90s: TSCII, TAM (Tamil Monolingual) and TAB (Tamil Bilingual).

Then enter Windows. Microsoft Windows, together with Microsoft Word, let some Tamil software vendors introduce font-based encodings. This is probably the most egregious of all the Tamil encoding methods invented, IMO. Still, this is how books were displayed, in fonts like Latha, Lohit etc. You needed the right font map, whose glyph assignments followed no common encoding, to read the text. Otherwise the text would end up garbled, a mish-mash of ‘?’ or [] (tofu block) characters.

Finally the Tamil computing community, software vendors and members of INFITT (among others Mr. Chellappan of Palaniappa Brothers, Chennai, Prof. Ponnavaiko, and Dr. K. Kalyanasundaram of EPFL) sat down with the pioneering people at the Unicode Consortium and hashed out a chunk of the Unicode standard's code space for Tamil letters, which is what we have today. So now you know: if Android, iOS and Windows support Tamil text, it is most likely thanks to years of work from this motley crew of genteel anonymous strangers. Thanks everyone!

By now the web had matured since the 1990s, and Unicode support in blogs and in input method editors (IMEs) on Linux, Windows and Mac enabled the growth and exchange of Tamil content on the Internet. Unicode encoded as UTF-8 became the de facto standard of the Tamil community online, despite diktats from the Tamil Nadu government and other standards agencies, which were left behind in the shadows of the stand-alone computing world. Welcome, Internet. Change was no longer optional.

TACE v. UTF-8

Today there is an alternative proposal, and I think it has been around for many years. The TACE standard is being championed by people in the Tamil Nadu government and a few publishing agencies. Prof. Ponnavaiko is rumoured to be on some of these committees. The Wikipedia article summarizes their case for the TACE16 encoding standard, which rests on some computational ease in how it represents Tamil text. I still think UTF-8 suits general-purpose computation, especially on the web.

I take exception to the push for TACE16, particularly because it smacks of not-invented-here syndrome. Unicode is not just for Tamil; it is shared with English, Arabic, Chinese, Cantonese and other tonal and syllabic languages. For some of these languages, Tamil included, we don’t have a direct glyph-per-code mapping. The situation is somewhat similar for Hindi and other Indic languages.
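To make the missing glyph-per-code mapping concrete, here is a minimal sketch using only the Python 3 standard library. The helper name `describe` is mine, purely for illustration. It shows that the single Tamil letter கொ is stored as two Unicode code points, which become six bytes in UTF-8.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) for every code point in text."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in text]

# The single Tamil letter கொ, written with explicit escapes so the
# example does not depend on how this page itself is encoded.
ko = "\u0b95\u0bca"

for code, name in describe(ko):
    print(code, name)

print("code points:", len(ko))                  # 2
print("UTF-8 bytes:", len(ko.encode("utf-8")))  # 6
```

A reader sees one letter, the string holds two code points, and the wire carries six bytes. That gap is the kind of inconvenience that motivates TACE16-style proposals, and also the kind a small library can paper over.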

Another reason is the advent of libraries, one of which I helped create and open-source: open-tamil. I’m sure other Tamil developers have their own versions, behind corporate or closed doors, to pre-process UTF-8 and then do text manipulation.
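As a rough idea of what such pre-processing looks like, here is a simplified sketch that groups code points into Tamil letters. It is not open-tamil's actual code; `split_letters` is a hypothetical helper, and it assumes that attaching every combining mark (Unicode categories starting with "M") to the preceding base character is good enough.

```python
import unicodedata

def split_letters(text):
    """Group each base character with the combining marks that follow it."""
    letters = []
    for ch in text:
        # Tamil vowel signs and the pulli have categories Mc or Mn.
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters

word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # தமிழ்
letters = split_letters(word)
print(len(word), "code points ->", len(letters), "letters")
print(letters)
```

Once the text is a list of letters, operations like counting, reversing or sorting behave the way a Tamil reader expects. This sketch does not handle conjuncts such as க்ஷ, which it would split in two, so treat it as an illustration rather than a replacement for a real library.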

I guess time will pick the technology based on how capabilities evolve and on the utility of existing tools. I hope we don’t return to the font-based encodings of the 90s and 2000s, and that we keep living in the saner world of Unicode. If TACE16 should become a new standard, I hope someone makes the converters to and from UTF-8.