A.I./ML for Hindi Language Processing
Sometimes it's good to look around and learn from what's happening in other realms of Indian language processing. In my limited experience, language efforts in computing for Indian languages revolve around the Dravidian languages, Bengali, Marathi or Hindi. As the Tamil saying goes, computational linguistics should not turn into riding a horse inside a round pot (that is, working within needlessly narrow confines), whether in our individual work or across languages.
Some good open-source project efforts in Hindi language processing are reviewed in this blog post. [There are also projects that resemble the open-tamil API for Hindi, e.g. a get_letters-like function provided by the tokenizer project here (with the caveat that it is only a small function compared to the expansive open-tamil), but we talk about the ML/A.I.-focused projects here.]
- A Hindi word embedding called Hindi2vec (along the lines of the word2vec project). The idea is to associate similar words (e.g. ‘பல்’ (tooth), ‘நாக்கு’ (tongue), ‘வாய்’ (mouth)) with similar vectors that lie in a neighborhood of each other, using concepts from linear algebra: vector spaces and matrices. So when a user searches or mistypes, or when you want to classify text, there is a neighborhood of known words close to the potentially unknown input word; identifying that known neighborhood can help decision making and drive various learning, classification or dialogue systems.
- The Hindi Transliteration Model project and the DeepTrans project: this is really cool work, where they developed a reference data set of English-to-Hindi pairs and trained a model to transliterate user input from English to Hindi.
- We can do this in Tamil as well, since we have many transliteration schemes as set out in open-tamil. But even a single user is not going to follow one scheme strictly, nor do different users follow the same scheme; in all these cases a machine learning A.I. model may be more robust by virtue of learning the underlying rules. This is a very interesting project, and it should be fairly simple to implement for Tamil using the open-tamil transliterate module and SciKit Learn or other frameworks; a high correct-prediction rate of around 95% seems achievable.
- A Hindi-English parallel dictionary, 8 MB in size (probably 500,000 words or so, I imagine), here. Something like this would be a good jump-start for translation projects if it existed for Tamil. E.g. can we have a parallel English-Tamil dictionary for the simple TVU word list/dictionary?
- The Hindi Sentiment Analysis project does a ternary [good, bad, neutral] classification of text. They do this using a CDAC model, which makes me super curious: maybe CDAC-India (Pune) has a Tamil POS-Tagger too? Probably they do.
- Tamil POS-Taggers are widely reported: AU-KBC Chennai has a POS-Tagger, probably the best for Tamil, and Dr. Vasu Renganathan has a POS-Tagger. Neither of these works is currently available for open-source use; however, their techniques are openly shared via their papers at INFITT conferences.
- The Sorkandu project can also be revived to make an open-source POS-Tagger.
- The Emotion Recognition in Hindi Speech project: this work from IIT KGP students builds a reference audio data set with known emotion labels and trains some kind of machine learning model on it; they then get about 5x better than a random coin-toss guess at recognizing emotion from speech audio.
- We probably don't have any work in this direction in the open, but interestingly NIST in the USA sponsored a Tamil Key Word Search (KWS) challenge, reports of which were published by a Singapore team in academic journals. More interestingly, the KWS challenge released 2 hrs of speech data with tagged information. In the USA, government-released data usually qualifies as public domain (e.g. pictures from NASA), so maybe there is a way to get this data. Only God knows!
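To make the word-embedding idea from the list above a bit more concrete, here is a minimal sketch of a "neighborhood" lookup using cosine similarity. This is not how Hindi2vec itself is implemented; the 3-dimensional vectors and the function names are made up purely for illustration, whereas real embeddings are learned from a large corpus and have hundreds of dimensions.

```python
import math

# Toy vectors invented for this example only (assumption, not learned data).
TOY_VECTORS = {
    "பல்": [0.9, 0.1, 0.0],      # tooth
    "நாக்கு": [0.8, 0.2, 0.1],   # tongue
    "வாய்": [0.85, 0.15, 0.05],  # mouth
    "கணினி": [0.0, 0.9, 0.4],    # computer
    "மொழி": [0.1, 0.3, 0.9],     # language
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_words(query_vector, vectors, k=3):
    """Return the k known words whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), word)
              for word, vec in vectors.items()]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]


# Hypothetical vector for an unknown or mistyped input word.
query = [0.88, 0.12, 0.02]
print(nearest_words(query, TOY_VECTORS))
```

With these toy numbers the three closest words are the tooth/tongue/mouth group, which is the kind of known neighborhood a search or classification system could then act on.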
We know that Google ASR, YouTube's online translation of English videos into Tamil closed captions, foreign-language-to-Tamil translation, and transliteration input all use perhaps the most advanced models in TensorFlow on cloud hardware. But none of this technology is directly usable for free (maybe for a price, via their Google Cloud API offerings), and we probably don't know all the details of how they achieved these magical software applications for the Tamil language. Anyone's guess, like mine, is that they use the massive data sets they have from our Tamil newsgroups, emails, websites and user input, plus TensorFlow A.I./ML magic. At least we have to be grateful to Google-aandavar ("Lord Google"), as some friends commented on the freetamilcomputing group. 🙂
Surprisingly, to my knowledge, there are no planned, ongoing or completed open-source projects like these in Tamil. Maybe this is another avenue for growth; in this case the Hindi projects (at least in the open-source domain) seem to have forged ahead!
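As one possible starting point for the Tamil transliteration idea, here is a rough sketch of learning from examples instead of enforcing one scheme. It is not the method used by the Hindi projects, and it uses neither open-tamil nor SciKit Learn: it is a plain nearest-neighbor lookup over character bigrams, and the handful of training pairs are hand-written for illustration. A real attempt would need a much larger reference data set (perhaps generated with the open-tamil transliterate module) and a proper sequence model.

```python
def char_bigrams(text):
    """Set of character bigrams, with ^ and $ marking the word boundaries."""
    # Lowercasing is a simplifying assumption; some schemes use case (n vs N).
    padded = f"^{text.lower()}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def jaccard(a, b):
    """Overlap between two sets: size of intersection over size of union."""
    return len(a & b) / len(a | b)


# Hand-written toy pairs (romanized spelling, Tamil word); illustration only.
TRAINING_PAIRS = [
    ("vanakkam", "வணக்கம்"),
    ("vanakam", "வணக்கம்"),
    ("nandri", "நன்றி"),
    ("nanri", "நன்றி"),
    ("tamil", "தமிழ்"),
    ("thamizh", "தமிழ்"),
]


def transliterate_word(romanized, training_pairs=TRAINING_PAIRS):
    """Return the Tamil word whose known spelling is closest to the input."""
    query = char_bigrams(romanized)
    _, best_word = max(
        training_pairs,
        key=lambda pair: jaccard(query, char_bigrams(pair[0])),
    )
    return best_word


# Spellings that follow none of the training spellings exactly.
for attempt in ["vannakkam", "tamizh", "nandry"]:
    print(attempt, "->", transliterate_word(attempt))
```

The point of the sketch is only that loosely spelled input still lands on the intended word because the lookup tolerates variation; it can only return words it has already seen, which is exactly the limitation a trained character-level model would remove.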