Tamilscript in development – writing code in a web browser

Recently I have been trying to put together a system for writing code in the browser, on the client side, by translating it directly into JavaScript. Today I would like to show a demo of a simple translator for Tamilscript.


Demo of Tamilscript for use within a web browser. Tamilscript can leverage jQuery and the power of modern web browsers.

Comments and suggestions welcome. I will release the code if we reach any sizable level.
Regards,
-Muthu

Open-Tamil porting for Python 3 and Python 2

The Open-Tamil package has long been envisioned to support both Python 3 and Python 2. Over the last weekend (during a harsh Boston winter, what better than to sit indoors and code away?) I made the following changes:

  1. the source code is compatible with both Python 3 and Python 2,
  2. the unit tests pass on Python 2.6, 2.7, 3.3 and 3.4 under Travis-CI, and run successfully on Windows and Linux.
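As a rough illustration of what "source compatible" can mean, here is a minimal sketch of one common single-source pattern. It is not taken from the open-tamil code base; the names `PY3`, `text_type` and `to_text` are hypothetical and chosen only for this example.

```python
import sys

# True on Python 3.x, False on Python 2.x.
PY3 = sys.version_info[0] >= 3

# The Unicode text type on each major version. The name `unicode` is only
# looked up when running on Python 2, so this line is safe on Python 3.
text_type = str if PY3 else unicode  # noqa: F821


def to_text(value, encoding="utf-8"):
    """Return `value` as Unicode text, decoding byte strings if needed."""
    if isinstance(value, text_type):
        return value
    return value.decode(encoding)


# UTF-8 bytes of the Tamil letter அ (U+0B85), decoded the same way on 2 and 3.
word = to_text(b"\xe0\xae\x85")
```

The `b"..."` and `u"..."` literal prefixes are accepted by Python 2.6, 2.7, 3.3 and 3.4, which is one reason those versions can share a single source tree.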

This is still bleeding-edge development software, so please feel free to poke and play, and file bugs. Let us know if you are using open-tamil in your work.

-Muthu

Release of the book “Write Code in Tamil – Ezhil Tamil Programming Language”

Greetings, everyone.

You have come to this “Ezhil” site out of an interest in learning to write computer software programs in Tamil. Our best wishes to you! Thank you for your time and interest in the Ezhil project.

You can download our book, “Write Code in Tamil – Ezhil Tamil Programming Language”, for free at this link: http://ezhillang.org/koodam/book/register

Please share your comments and suggestions with us.

With regards,
Ezhil Language Foundation,
Boston, USA

Story of Tamil encodings: TACE, UTF-8 and open-tamil

Hello everyone! Hope this finds you in good spirits. In North America, Boston especially, the cold weather and holiday season is our Deepavali, as it were. Cold weather and the festive season bring a lot of joys and challenges – but today I want to talk about the challenges of exchanging Tamil text and information on the Internet.

Christmas tree lighting ceremony at Boston Common. Dec, 2014.

History of Tamil encodings

Tamil encoding has a long history, starting with the 8-bit extended-ASCII encodings of the late 80s and early 90s: TSCII, and TAM (Tamil Monolingual)/TAB (Tamil Bilingual).

Then enter Windows. Microsoft Windows and Microsoft Word let some Tamil software vendors introduce font-based encodings. This is probably the most egregious of all the Tamil encoding methods invented, IMO. Still, this would show the books in fonts like Latha, Lohit etc. You needed the right font map, with glyphs independent of encoding, to read the text. Otherwise the text would end up garbled, a mish-mash of ‘?’ or [] (tofu block) characters.

Finally the Tamil computing community, software vendors, and members of INFITT (among others Mr. Chellappan of Palaniappa Brothers, Chennai, Prof. Ponnavaiko, and Dr. K. Kalyanasundaram of EPFL) sat down with the pioneering people at the Unicode Consortium and hashed out a chunk of the Unicode standard's code space for Tamil letters, which is what we have today. So now you know: if Android, iOS and Windows support Tamil text, it is most likely thanks to years of work from this motley crew of genteel, anonymous strangers. Thanks everyone!

By now the web has matured since the 1990s, and Unicode support in blogs and in input method editors (IMEs) on Linux, Windows and Mac has enabled the growth and exchange of Tamil content on the Internet. Unicode encoded as UTF-8 became the de facto standard of the Tamil community online, despite diktats from the Tamilnadu government and other standards agencies, which were left behind in the shadows of the stand-alone computing world. Welcome, Internet. Now change was not an option.

TACE v. UTF-8

Today there is an alternative proposal, and has been for many years, I think. The TACE standard is being championed by people in the Tamilnadu government and a few publishing agencies. Prof. Ponnavaiko is rumoured to be on some of these committees. The Wikipedia article summarizes their case for the TACE16 encoding standard, which rests on some computational ease in the representation. I still think UTF-8 suits general-purpose computation, especially on the web.

I take exception to pushing TACE16, particularly because of the not-invented-here syndrome. Unicode is not just for Tamil; it is shared with English, Arabic, Chinese, Cantonese and other tonal and syllabic languages, so we don’t have a direct glyph-per-code mapping. The situation is somewhat similar for Hindi and other Indic languages.
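To make the glyph-per-code point concrete, here is a small Python sketch using only built-in string methods. It shows that the single Tamil letter கு is stored in Unicode as two code points, and as six bytes once encoded in UTF-8. The variable names are just for illustration.

```python
# The letter கு is the consonant க (U+0B95) followed by the vowel sign ு (U+0BC1).
letter = "\u0b95\u0bc1"

# One visible letter, but two Unicode code points.
code_points = [hex(ord(ch)) for ch in letter]

# Every code point in the Tamil block (U+0B80 to U+0BFF) takes 3 bytes in UTF-8.
utf8_bytes = letter.encode("utf-8")

print(len(letter), code_points)   # number of code points and their values
print(len(utf8_bytes))            # number of bytes in UTF-8
```

This is the kind of pre-processing gap that a library has to bridge before doing letter-level text manipulation.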

Another reason is the advent of libraries, one of which I helped create and open-source: open-tamil. I’m sure other Tamil developers have their own versions, behind corporate or closed doors, that pre-process UTF-8 and then do text manipulation.

I guess time will pick the technology based on how capabilities evolve and on the utility of existing tools. I hope we don’t return to the font-based encodings of the 90s and 2000s, and instead live in the saner world of Unicode. If TACE16 is to be a new standard, I hope someone makes the converters to and from UTF-8.

Simple Tamil Text to Speech engine

Hello friends and Tamil aficionados – it is winter in Boston, the start of three months of bitter cold but beautiful skylines. Here is a little treat for surviving the low-sunlight hours and gray, cold days.

A simple Tamil Text to Speech (TTS) engine developed by Prof. Vasu Renganathan has been open-sourced through the lobbying work of T. Shrinivasan.

What is a TTS?

TTS converts text to speech. That was easy. The hard part is how the computer does that. We will investigate how the language structure of Tamil can be used to find a suitable algorithm: a set of rules that, when applied systematically, can generate speech from text. But first we need to understand what a phonetic language is.

Phonetic Languages

The key is to recognize that Tamil is a phonetic language; there are some simple ways of seeing this. Basically, if we write “முத்து” we say it as “மு + த் + து” phonetically. English is not the same: “Muthu” is written as such but pronounced in groupings of “Mu” + “th” + “u”, which is 3 syllables but 5 letters. In Tamil, முத்து has 3 letters and 3 syllables, and the number of letters and syllables mostly stays the same.

OK, now how do we split a word into its constituent phonemes? Maybe by using open-tamil!
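One way to see the letter split for “முத்து” is the sketch below. It is only an approximation written with Python's standard `unicodedata` module, not open-tamil's own API, and the function name `split_letters` is made up for illustration. It assumes that every combining mark (a vowel sign or the pulli) belongs to the letter just before it.

```python
import unicodedata


def split_letters(word):
    """Group code points into Tamil letters: a base character plus its marks."""
    letters = []
    for ch in word:
        # Unicode categories Mn/Mc cover Tamil vowel signs and the pulli.
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch      # attach the mark to the previous letter
        else:
            letters.append(ch)     # a vowel or consonant starts a new letter
    return letters


parts = split_letters("முத்து")
print(" + ".join(parts))
```

A real library handles more cases (for example Grantha conjuncts such as க்ஷ), so treat this only as a picture of the idea.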

Basic Algorithm

So I propose a simple-minded algorithm to generate speech from text.

  1. Split the given Tamil text into words, and for each word apply steps 2 – 5.
  2. Split the word into phonemes (for Tamil, phonemes = syllables).
  3. For each syllable find the corresponding phoneme (the phoneme is its pronunciation) in the form of a sound clip. In Tamil this has been called a ‘மாத்திரை’.
  4. Concatenate all these phonemes into the word, and apply a linear smoothing filter with a window. This is simple signal processing theory.
  5. Add a pause, or beat, based on word spacing or sentence/paragraph spacing.

This algorithm is simplistic, but it does capture the essence of a text-to-speech engine. The output could sound fairly mechanical or ‘robotic’, so the sound would have to be made better.
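The five steps above can be sketched in a few lines of Python. This is a toy under stated assumptions, not Prof. Renganathan's engine: sound clips are plain lists of numbers instead of real audio, the smoothing filter is a simple moving average, and the names `split_letters`, `smooth`, `synthesize` and the `clips` table are all hypothetical.

```python
import unicodedata


def split_letters(word):
    """Step 2: group code points into letters (base character plus marks)."""
    letters = []
    for ch in word:
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


def smooth(samples, window=3):
    """Step 4: centered moving average; the edges use a shorter window."""
    half = window // 2
    out = []
    for i in range(len(samples)):
        chunk = samples[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / float(len(chunk)))
    return out


def synthesize(text, clips, window=3, pause_len=4):
    """Turn text into one list of samples using a letter-to-clip table."""
    pause = [0.0] * pause_len                      # silence between words
    output = []
    for word in text.split():                      # step 1: words
        samples = []
        for letter in split_letters(word):         # step 2: letters
            samples.extend(clips.get(letter, []))  # step 3: skip unknown letters
        output.extend(smooth(samples, window))     # step 4: join and smooth
        output.extend(pause)                       # step 5: pause after the word
    return output


# Toy clip table: two samples per letter stand in for recorded audio.
clips = {"மு": [1.0, 1.0], "த்": [0.0, 0.0], "து": [1.0, 1.0]}
audio = synthesize("முத்து", clips)
```

Smoothing only at the joins between clips, rather than over the whole word, would likely keep more of each recording intact; the whole-word filter here is just the shortest way to show the idea.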

Please share your comments, and views. For those in cold countries, stay warm!

 

Programming in Tamil – Lisp-like language in Clojure

Hello readers, and friends – a little blog post to restart things here, after the INFITT-2014 series of posts.

Today I’d like to highlight an interesting effort by Elango that was brought to my attention by my collaborator and friend Shrinivasan (of http://goinggnu.wordpress.com blog fame). This effort allows writing programs in Tamil in a Lisp-like language built on Clojure. Thanks for the email!

 

Papers selected for the INFITT2014 conference!

Good news – we are happy to announce that all of our papers/research reports have been selected for the INFITT2014 conference. The papers are on the list.

Our team will be presenting at the INFITT 2014 conference.

Thanks,

-Muthu