Classifying Tamil words – part 1

Problem

One problem faced when building a Tamil spell checker, albeit a somewhat marginal one, can be phrased as follows:

Given a sequence of Tamil letters, how do you decide whether it is a true Tamil word (even one not in the dictionary) or a transliterated English word?

e.g. Between the words ‘உகந்த’ and ‘கம்புயுடர்’, can you decide which is a true Tamil word and which is transliterated?

Tools

This is fairly simple with the help of a neural network: given sufficient “features” and “training data”, we can train such a network easily. With the current interest in this area, tools are available to make this task quite easy: Pandas for handling the data, and any of Keras, PyTorch or TensorFlow for the network, may suffice.

Generally, the main thing you need to know about Artificial Intelligence (AI) here is that machines can be trained to do tasks that fall into two distinct kinds of learning problem:

  1. Regression (predicting a continuous value)
  2. Classification (assigning one of a fixed set of labels)

Read more on Wikipedia; the current “problem” is a classification task.

Features

Naturally, for the task of classifying a word, we may take the following features:

  1. Word length
  2. Are all characters unique?
  3. Number of repeated characters
  4. Vowel count, consonant count
    1. In Tamil this information is stored as (Kuril, Nedil, Ayudham) and (Vallinam, Mellinam and Idayinam)
  5. Is the word a palindrome?
  6. We can add bigram data as features as a next step
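
The features listed above can be computed in a few lines of Python. The sketch below is only an illustration: `split_letters` and `word_features` are hypothetical helpers, not Open-Tamil functions. The letter splitter assumes that a Tamil letter is a base character followed by any Unicode combining marks (vowel signs and the pulli), which works for plain Tamil text but does not treat Grantha conjuncts specially.

```python
import unicodedata

TAMIL_VOWELS = set("அஆஇஈஉஊஎஏஐஒஓஔ")  # the uyir (vowel) letters
PULLI = "\u0bcd"  # virama; a letter ending in it is a pure consonant (mei)


def split_letters(word):
    """Group code points into letters: a base character plus the
    combining marks (vowel signs, pulli) that follow it."""
    letters = []
    for ch in word:
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch  # attach the mark to the previous base
        else:
            letters.append(ch)
    return letters


def word_features(word):
    """Return the simple per-word features from the list as a dict."""
    letters = split_letters(word)
    return {
        "length": len(letters),
        "all_unique": len(set(letters)) == len(letters),
        # how many letters are repeats of an earlier letter
        "repeated": len(letters) - len(set(letters)),
        "vowels": sum(1 for l in letters if l in TAMIL_VOWELS),
        "pure_consonants": sum(1 for l in letters if l.endswith(PULLI)),
        "is_palindrome": letters == letters[::-1],
        # adjacent letter pairs, a possible basis for bigram features
        "bigrams": list(zip(letters, letters[1:])),
    }
```

For example, `word_features("உகந்த")["length"]` is 4, because ந் counts as one letter even though it is two code points. Counting letters rather than code points is the design choice here; raw `len()` on the string would overstate the length of most Tamil words.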

Basically, this task can be achieved with new code checked into Open-Tamil 0.7 (dev version), called ‘tamil.utf8.classify_letter’.
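
As a rough illustration of what a letter classifier of this kind does, the sketch below maps a single letter to one of the six classes named in the feature list. It is a hypothetical stand-in, not the code or the signature of `tamil.utf8.classify_letter`; the function name `letter_class` and its return strings are assumptions made for this example.

```python
KURIL = set("அஇஉஎஒ")        # short vowels
NEDIL = set("ஆஈஊஏஐஓஔ")     # long vowels
AYUDHAM = {"ஃ"}
VALLINAM = set("கசடதபற")    # hard consonants
MELLINAM = set("ஙஞணநமன")    # soft (nasal) consonants
IDAYINAM = set("யரலவழள")    # medial consonants

CLASSES = [
    ("kuril", KURIL),
    ("nedil", NEDIL),
    ("ayudham", AYUDHAM),
    ("vallinam", VALLINAM),
    ("mellinam", MELLINAM),
    ("idayinam", IDAYINAM),
]


def letter_class(letter):
    """Classify a letter by its first code point (hypothetical helper).

    Consonants and consonant-vowel compounds are classed by their base
    consonant; anything outside the six sets returns "other".
    """
    base = letter[:1]
    for name, members in CLASSES:
        if base in members:
            return name
    return "other"
```

Counting how many letters of a word fall into each class gives the vowel and consonant features in a form that follows Tamil grammar rather than a plain vowel/consonant split.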


Data sets

To make the data sets, we can use the Tamil VU dictionary as a list of valid Tamil words (label 1); next, we can use a list of words transliterated from English into Tamil as the list of invalid Tamil words (label 0).

Using this 1/0-labeled data, we may use part of the combined data to train the neural network with the gradient descent algorithm, or with any other method for building a supervised learning model, and keep the rest for checking the trained model.
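
The labeling step can be sketched as follows. `build_labeled_dataset` is a hypothetical helper; it takes any two iterables of words (for instance, lines read from the dictionary file and from one of the transliterated files). One assumption made here that the post does not state: a transliterated form that also occurs in the dictionary keeps label 1.

```python
import random


def build_labeled_dataset(tamil_words, transliterated_words, seed=0):
    """Label dictionary words 1 and transliterated words 0, then shuffle."""
    valid = {w.strip() for w in tamil_words if w.strip()}
    # Assumption: words found in the dictionary are never labeled 0.
    invalid = {w.strip() for w in transliterated_words if w.strip()} - valid
    data = [(w, 1) for w in sorted(valid)] + [(w, 0) for w in sorted(invalid)]
    # A fixed seed makes the shuffle reproducible between runs.
    random.Random(seed).shuffle(data)
    return data
```

Shuffling matters because the two sources are concatenated; without it, any later split into training and test portions would put almost all of one label in one portion.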
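
To make the gradient descent step concrete, here is a minimal sketch in plain Python. It trains a single logistic unit (the simplest possible "network"), which is a swapped-in simplification and not the network used in this series; all function names are hypothetical. It also assumes the feature values have been scaled to a comparable range, roughly 0 to 1.

```python
import math


def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme inputs.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def split_dataset(rows, train_fraction=0.8):
    """Split already-shuffled rows into training and held-out parts."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def train_logistic(X, y, lr=1.0, epochs=5000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # prediction error for this sample
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # step against the mean gradient
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 (Tamil) or 0 (transliterated) for one feature vector."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if sigmoid(z) >= 0.5 else 0
```

In use, each word would first be turned into a numeric feature vector, the rows split with `split_dataset`, the model fitted on the training part, and its accuracy measured on the held-out part.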

Building Transliterated Data

The Python code below, together with the data file from the open-tamil repository, builds the transliterated data:

import codecs

# The transliteration tables and the iterative algorithm come from
# open-tamil's 'transliterate' package.
from transliterate import algorithm, azhagi, combinational, jaffna

def jaffna_transliterate(eng_string):
    tamil_tx = algorithm.Iterative.transliterate(jaffna.Transliteration.table, eng_string)
    return tamil_tx

def azhagi_transliterate(eng_string):
    tamil_tx = algorithm.Iterative.transliterate(azhagi.Transliteration.table, eng_string)
    return tamil_tx

def combinational_transliterate(eng_string):
    tamil_tx = algorithm.Iterative.transliterate(combinational.Transliteration.table, eng_string)
    return tamil_tx

# 3 forms of Tamil transliteration for each English word
jfile = codecs.open('english_dictionary_words.jaffna', 'w', 'utf-8')
cfile = codecs.open('english_dictionary_words.combinational', 'w', 'utf-8')
afile = codecs.open('english_dictionary_words.azhagi', 'w', 'utf-8')
with codecs.open('english_dictionary_words.txt', 'r', 'utf-8') as engf:
    for idx, w in enumerate(engf):
        w = w.strip()
        if len(w) < 1:
            continue  # skip blank lines
        print(idx)  # progress indicator
        jfile.write(u"%s\n" % jaffna_transliterate(w))
        cfile.write(u"%s\n" % combinational_transliterate(w))
        afile.write(u"%s\n" % azhagi_transliterate(w))
# close the output files once, after every word has been written
jfile.close()
cfile.close()
afile.close()

Running it gives the following data files (the left pane shows the ‘Jaffna’ transliteration standard, while the right pane shows the source English word list). The full gist is on GitHub at this link.

[Screenshot: transliterated output file next to the source English word list]

In the next blog post I will share the details of training the neural network and building this classifier. Stay tuned!

 

A spell checker built on Open-Tamil

I hope you are all doing well. Last week Boston had Siberian cold, -22°C, but I still found time to contribute to the Open-Tamil project.

This year I resumed work on building a spell checker on top of Open-Tamil. We can implement this checker with a “multi-pass” (பல்-நிலை) approach, as planned earlier. See the code written so far here.
Please set aside a little time and interest for this.
Thanks,
Muthu