Some weeks ago I started playing with Easel JS and made a bunch of alphabet nomogram-style pictures. It's interesting to think of the possibilities.
Canonical Tamil has 12 + 1 vowels [உயிர், uyir], 18 consonants [மெய், mei] and 12×18 = 216 conjugate letters [உயிர்மெய், uyirmei]. Together they can be arranged in a table with named columns [12 for vowels] and named rows [18 for consonants], with the conjugate letter sitting in the cell at each row-column intersection.
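To make the table concrete, here is a minimal sketch in plain Python that builds the 18×12 grid from Unicode code points. It is only an illustration and is not taken from open-tamil; the names `UYIR`, `MEI`, `BASES`, `SIGNS` and `build_table` are made up for this example, and I am assuming the "+ 1" vowel is the aytham (ஃ).

```python
# Code points are used instead of typed letters so the sketch does not
# depend on how an editor normalizes Tamil text.
UYIR = [chr(c) for c in (0x0B85, 0x0B86, 0x0B87, 0x0B88, 0x0B89, 0x0B8A,
                         0x0B8E, 0x0B8F, 0x0B90, 0x0B92, 0x0B93, 0x0B94)]  # 12 vowels
AYTHAM = chr(0x0B83)  # assumed to be the "+ 1"; it is not part of the table

# Base consonant glyphs (each carries an inherent "a" sound).
BASES = [chr(c) for c in (0x0B95, 0x0B99, 0x0B9A, 0x0B9E, 0x0B9F, 0x0BA3,
                          0x0BA4, 0x0BA8, 0x0BAA, 0x0BAE, 0x0BAF, 0x0BB0,
                          0x0BB2, 0x0BB5, 0x0BB4, 0x0BB3, 0x0BB1, 0x0BA9)]
PULLI = chr(0x0BCD)                  # the dot that marks a pure consonant
MEI = [b + PULLI for b in BASES]     # 18 consonants, used as row names

# Dependent vowel signs, in the same order as UYIR; the first vowel needs no sign.
SIGNS = [""] + [chr(c) for c in (0x0BBE, 0x0BBF, 0x0BC0, 0x0BC1, 0x0BC2,
                                 0x0BC6, 0x0BC7, 0x0BC8, 0x0BCA, 0x0BCB, 0x0BCC)]


def build_table():
    """Return {mei: {uyir: uyirmei}} with 18 rows of 12 cells each."""
    return {
        mei: {uyir: base + sign for uyir, sign in zip(UYIR, SIGNS)}
        for mei, base in zip(MEI, BASES)
    }


table = build_table()
# Row for the first consonant, column for the second vowel.
print(table[MEI[0]][UYIR[1]])
```

Looking up a cell by its row and column names is the same operation as picking a consonant and a vowel and composing the conjugate letter.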
I posted several images on Twitter; the first one is based on an arrangement of the letters in 3 concentric circles.
Another image is based on a sunflower spiral:
A third is based on a logarithmic spiral:
Another image illustrates vowels and consonants as an interactive widget, where you select uyir and mei letters from the outer and inner circles to form the uyirmei conjugate letter in the center.
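For anyone who wants to try a similar layout, the sunflower-spiral picture can be approximated with Vogel's model, where the n-th point sits at radius proportional to √n and is rotated by the golden angle. This is a sketch under my own assumptions, in Python rather than Easel JS: the function name, the `spacing` value and the count of 247 letters (12 + 1 + 18 + 216) are illustrative, and the original pictures may have used different parameters.

```python
import math

# Golden angle in radians (about 137.5 degrees).
GOLDEN_ANGLE = math.pi * (3 - math.sqrt(5))


def sunflower_points(count, spacing=12.0):
    """Return (x, y) positions for `count` items laid out on a sunflower spiral."""
    points = []
    for n in range(count):
        radius = spacing * math.sqrt(n)   # radius grows with the square root of n
        theta = n * GOLDEN_ANGLE          # each item turns by the golden angle
        points.append((radius * math.cos(theta), radius * math.sin(theta)))
    return points


# One position per letter: 12 + 1 vowels, 18 consonants, 216 conjugate letters.
points = sunflower_points(12 + 1 + 18 + 216)
print(len(points), points[1])
```

Each (x, y) pair can then be used as the anchor for drawing one letter on the canvas.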
Sometimes it's good to look around and learn from what's happening in other realms of Indian language processing. In my limited experience, computing efforts for Indian languages revolve around the Dravidian languages, Bengali, Marathi or Hindi. Computational linguistics should not end up like riding a horse inside a small round pot (a Tamil idiom for working in a cramped space), whether for us individually or across languages.
This post reviews some good open-source projects in Hindi language processing. [There are also Hindi counterparts to the open-tamil API, e.g. a get_letters-like function provided by the tokenizer project here (with the caveat that it is only a small function compared to the expansive open-tamil), but this post covers the ML/AI-focused projects.]
A Hindi word embedding called Hindi2vec (along the lines of the word2vec project). The idea is to associate similar words (e.g. 'பல்' (tooth), 'நாக்கு' (tongue), 'வாய்' (mouth)) with similar vectors that lie within a neighborhood of each other, using concepts from linear algebra: vector spaces and matrices. So when a user searches, mistypes, or enters text you want to classify, there is a neighborhood of known words close to the potentially unknown input; identifying that known neighborhood can help decision making and drive various learning, classification or dialogue systems.
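The neighborhood idea can be shown with a toy example. The vectors below are invented by hand for illustration and do not come from Hindi2vec or any trained model; the fourth word, 'மலை' (mountain), is added only as an unrelated outlier. Closeness is measured with cosine similarity, a common choice for word vectors.

```python
import math

# Hand-made 3-dimensional vectors, purely illustrative (not trained embeddings).
EMBEDDINGS = {
    "பல்": [0.90, 0.80, 0.10],      # tooth
    "நாக்கு": [0.85, 0.75, 0.20],   # tongue
    "வாய்": [0.80, 0.90, 0.15],     # mouth
    "மலை": [0.10, 0.20, 0.95],      # mountain, an unrelated word
}


def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, values near 0 mean unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, k=2):
    """Return the k known words whose vectors are closest to `word`."""
    query = EMBEDDINGS[word]
    scored = [(other, cosine(query, vec))
              for other, vec in EMBEDDINGS.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


for neighbor, score in nearest("பல்"):
    print(neighbor, round(score, 3))
```

With real embeddings the vectors have hundreds of dimensions and are learned from a corpus, but the lookup step is the same.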
The Hindi Transliteration Model project and the DeepTrans project: these are really cool. They developed a reference data set of English-to-Hindi pairs and trained a model to transliterate user input from English to Hindi.
We can do this in Tamil too, as we have many transliteration schemes set out in open-tamil. But even the same user is not going to follow one scheme strictly, nor do different users follow the same scheme; in all these cases a machine-learning model may be more robust by virtue of learning the underlying rules. It is a very interesting project, and should be fairly simple to implement for Tamil using the open-tamil transliterate module and scikit-learn or other frameworks, with a high (95%) correct prediction rate.
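To illustrate the problem of users not sticking to a scheme, here is a small stand-in that uses fuzzy string matching from the Python standard library. To be clear, this is not a trained model and not the open-tamil transliterate API; the `LEXICON` and its romanized spellings are hypothetical, and a real system would learn the mapping from a reference data set as the Hindi projects did.

```python
import difflib

# Hypothetical reference pairs: one romanized spelling per Tamil word.
LEXICON = {
    "vanakkam": "வணக்கம்",
    "nandri": "நன்றி",
    "tamil": "தமிழ்",
}


def transliterate_fuzzy(user_input, cutoff=0.6):
    """Map a loosely spelled romanized word to the closest known Tamil word.

    Returns None when nothing in the lexicon is similar enough.
    """
    matches = difflib.get_close_matches(
        user_input.lower(), list(LEXICON), n=1, cutoff=cutoff
    )
    return LEXICON[matches[0]] if matches else None


# Users rarely type the reference spelling exactly.
for typed in ("vanakam", "Nanri", "xyz"):
    print(typed, "->", transliterate_fuzzy(typed))
```

A whole-word lookup like this only covers words already in the lexicon; a learned character-level model is what would generalize to unseen words.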
A Hindi-English parallel dictionary, 8 MB in size (probably 500,000 words or so, I imagine), is available here. Something like this would be a good jump-start for translation projects if it existed for Tamil; e.g., can we have an English-Tamil parallel dictionary for the simple TVU word list/dictionary?
The Hindi Sentiment Analysis project does a ternary [good, bad, neutral] classification of text. They do this using a CDAC model, which makes me very curious; maybe CDAC-India (Pune) has a Tamil POS-Tagger too? Probably they do.
Tamil POS-Taggers are widely reported: AU-KBC Chennai has a POS-Tagger, probably the best for Tamil, and Dr. Vasu Renganathan has a POS-Tagger. Neither of these works is currently available for open-source use; however, their techniques are openly shared via their papers at INFITT conferences.
The Sorkandu project can also be revived to make an open-source POS-Tagger.
The Emotion Recognition in Hindi Speech project: this work from IIT KGP students builds a reference audio data set with known emotion labels and trains some kind of machine-learning model on it; they then recognize emotion from speech 5x better than a random coin toss/guess.
We probably don't have any work in this direction in the open, but interestingly NIST in the USA sponsored a Tamil Key Word Search (KWS) challenge, reports of which were published by a Singapore team in academic journals. More interestingly, the KWS challenge released 2 hrs of speech data with tagged information. In the USA, government-released data usually qualifies as public domain (e.g. pictures from NASA), so maybe there is a way to get this data. Only God knows!
We know that Google ASR, YouTube's closed-captioning of English videos in Tamil, translation from foreign languages to Tamil, and transliteration input all use perhaps the most advanced models in TensorFlow on cloud hardware, but none of this technology is directly usable for free (maybe for a price, via the Google Cloud API offerings). We also probably don't know all the details of how they achieved these magical software applications for Tamil; anyone's guess, like mine, is that they use the massive data sets they have from our Tamil newsgroups, emails, websites and user input, plus TensorFlow AI/ML magic. At least we have to be grateful to "Google-aandavar" (Lord Google), as some friends commented on the freetamilcomputing group. 🙂
Surprisingly, to my knowledge, there are no planned, ongoing or completed open-source projects like these in Tamil. Maybe this is another avenue for growth; in this case the Hindi projects (at least in the open-source domain) seem to have forged ahead!
“Ezhil is the first open-source programming language based on the Tamil script; it was released in 2017 for Windows 32-bit and 64-bit, Ubuntu, Fedora Linux and Docker platforms at http://ezhillang.org. Ezhil is a Python-based language interpreter. Development takes place via GitHub.
Open-Tamil is a closely related collection of Tamil language processing tools. The library originally began as an offshoot of the Ezhil language, but quickly grew to cover word filtering, N-gram analysis, sandhi (புணர்ச்சி) grammar, Tamil spell checker creation and more, and Tamil packages were published for several languages, mainly Python, along with Java, Ruby and others. You can use our work on the web at http://tamilpesu.us and in the Kalsee app on the Play Store.”
Thanks to the kind arrangements of friends in Chennai Python and the open-tamil community, I had an opportunity to make a presentation on the Open-Tamil and Ezhil-Lang projects, and completion. The talk was well received, and was delivered in a unique mix of Tamil and English, thanks to the comfort of being in Chennai!
The website (written in Django + Python) is the collective work of our team, and highlights various aspects of open-tamil such as transliteration, numeral generation, encoding converters and a spell checker, among other things. At this time I hope to keep the website running through most of this year, and add features as the git repo https://github.com/Ezhil-Language-Foundation/open-tamil gets updated.
Thanks to Mr. Syed Abuthahir, who many months ago, in the winter of 2017, developed a web interface for open-tamil and shared it with us under GNU Affero GPL terms. Later, it was added as part of the main open-tamil as well.
The Tamil Internet Conference 2018 will take place at TNAU, Coimbatore, India later this year. Please see the call for papers (March 30th deadline) to share your new and upcoming work in Tamil, linguistics and applied computer technology. http://www.tamilinternetconference.org
Please see the email from Prof. Kalyanasundaram, chair of Tamil Internet Conference – 2018.
Email from Prof. Kalyan announcing call for papers for Tamil Internet Conference 2018, at TNAU Coimbatore, India.