Namashkaar!

A.I./ML for Hindi Language Processing

Sometimes it’s good to look around and learn from what’s happening in other realms of Indian language processing. In my limited experience, computing efforts for Indian languages revolve around the Dravidian languages, Bengali, Marathi or Hindi. Computational linguistics should not end up like riding a horse inside a small round pot, as the Tamil saying goes, whether within a single language or across languages.

Some good open-source efforts in Hindi language processing are reviewed in this blog post. There are also Hindi projects similar to the open-tamil API, e.g. a get_letters-like function provided by the tokenizer project here (with the caveat that it is a single small function compared to the expansive open-tamil), but this post focuses on the ML/A.I. projects.

  1. A Hindi word embedding called Hindi2vec (along the lines of the word2vec project). The idea is to associate similar words (e.g. ‘பல்’ (tooth), ‘நாக்கு’ (tongue), ‘வாய்’ (mouth)) with vectors that lie close to each other, using concepts from linear algebra: vector spaces and matrices. So when a user searches, mistypes, or supplies text to classify, a potentially unknown input word can be matched against the known words nearest to it; identifying that neighborhood can help decision making and drive various learning, classification or dialogue systems.
  2. The Hindi Transliteration Model project and the DeepTrans project: this is really cool work, where they developed a reference data set of English-to-Hindi pairs and trained a model that transliterates user input from English to Hindi.
    1. We can do this in Tamil since we have many transliteration schemes set out in open-tamil, but even a single user is not going to follow one scheme strictly, nor do different users follow the same scheme. In all these cases a machine learning model may be more robust by virtue of learning the underlying rules. Very interesting project, and it should be fairly simple to implement for Tamil using the open-tamil transliterate module and scikit-learn or other frameworks, with a high, roughly 95%, correct prediction rate.
  3. A Hindi-English parallel dictionary, 8MB in size (probably 500,000 words or so, I imagine), here. Something like this would be a good jump-start for translation projects if it existed for Tamil; e.g. can we have an English-Tamil parallel dictionary for the simple TVU word list/dictionary?
  4. The Hindi Sentiment Analysis project does a ternary [good, bad, neutral] classification of text. They do this using a CDAC model, which makes me curious: maybe CDAC-India (Pune) has a Tamil POS-Tagger too? Probably they do.
    1. Tamil POS-Taggers are widely reported: AU-KBC Chennai has a POS-Tagger, probably the best for Tamil, and Dr. Vasu Renganathan has one too. Neither is currently available for open-source use; however, their techniques are openly shared via their papers at INFITT conferences.
    2. The Sorkandu project can also be revived to make an open-source POS-Tagger.
  5. Emotion Recognition in Hindi Speech project: this work from IIT KGP students builds a reference audio data set with known emotion labels, trains a machine learning model on it, and gets 5x better than a random coin-toss guess at recognizing emotion from speech.
    1. We probably don’t have any open work in this direction, but interestingly NIST in the USA sponsored a Tamil Key Word Search (KWS) challenge, reports of which were published by a Singapore team in academic journals. More interestingly, the KWS challenge released 2 hrs of speech data with tagged information. In the USA, government-released data usually qualifies as public domain (e.g. pictures from NASA), so maybe there is a way to get this data. God only knows!
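
To make the "neighborhood of similar vectors" idea from item 1 concrete, here is a minimal sketch using cosine similarity. The 3-dimensional vectors are made up for illustration (real embeddings such as word2vec use hundreds of learned dimensions), and the helper names `cosine_similarity` and `nearest_words` are my own, not part of Hindi2vec.

```python
import math

# Hypothetical toy embeddings, invented for illustration only.
embeddings = {
    "பல்": [0.90, 0.80, 0.10],      # tooth
    "நாக்கு": [0.80, 0.90, 0.20],   # tongue
    "வாய்": [0.85, 0.85, 0.15],     # mouth
    "ஆறு": [0.10, 0.20, 0.90],      # river (an unrelated word)
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_words(query_vector, table, k=3):
    """Return the k known words whose vectors are most similar to the query."""
    ranked = sorted(
        table,
        key=lambda word: cosine_similarity(query_vector, table[word]),
        reverse=True,
    )
    return ranked[:k]

# A vector for an unknown or mistyped word that lands near the "mouth" cluster.
query = [0.88, 0.82, 0.12]
neighbours = nearest_words(query, embeddings, k=3)
print(neighbours)
```

The three mouth-related words come back as neighbours while the unrelated word is left out, which is the behaviour a search or classification system would build on.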

We know that Google ASR, YouTube’s online translation of English videos into Tamil closed captions, foreign-language-to-Tamil translation, and transliteration input all probably use the most advanced TensorFlow models on cloud hardware. None of this technology is directly usable for free (maybe for a price via the Google Cloud API offerings), and we probably don’t know all the details of how they achieved these magical software applications for the Tamil language. Anyone’s guess, like mine, is that they use the massive data sets they have from our Tamil news groups, emails, websites and user input, plus TensorFlow A.I./ML magic. At least we have to be grateful to Google-aandavar (“Lord Google”), as some friends commented on the freetamilcomputing group. 🙂

Surprisingly, to my knowledge, there are no planned, ongoing or completed open-source projects like these in Tamil. Maybe this is another avenue for growth; in this case the Hindi projects (at least in the open-source domain) seem to have forged ahead!

Shukriya.

-Muthu

 

 

Classifying Tamil words – part 2

Recap

Continuing from the previous post (see part-1), I am sharing my results on classifying a sequence of Tamil letters as a valid Tamil-like word or an English-like word using a binary classifier.

Prerequisites

You need to install scikit-learn by following the directions on the website here.

pip install -U scikit-learn

This will also install dependencies like NumPy and other Python libraries that scikit-learn relies on.

Next, ensure your installation is okay by typing,

python -c "import sklearn"

which should run without any output if all your settings are okay.

Training the AI Classifier

To train the classifier based on a multi-layer perceptron (in other words, an AI neural network):

  1. we need to represent our input as a CSV file, with each sample encoded as a row of features.
    • for this case the data are CSV files holding features of the Jaffna, Azhagi and Combinational transliterated output of the input words
    • See: files ‘english_dictionary_words.azhagi’ and ‘tamilvu_dictionary_words.txt’ at repo open-tamil/examples/classifier
  2. each word (represented as features) is also given a training label, usually an integer, which forms one column of the CSV file (across all samples); the typical features encoded in the data file are defined in class Field in the file ‘classifier/preprocess.py’;
    • Typically, information for each word such as the number of letters, kurils, nedils, ayutha letters, vallinams, mellinams, idayinams, the first and last letters, and vowels is stored in a feature record within the CSV.
    • We can generate the feature records for the data files by running the code in preprocess.py
  3. next we train the neural network using the scikit-learn API,
    • the key code is in ‘classifier/modelprocess2.py’
    • first we load the CSV feature vectors into Python as NumPy arrays for both class 0 (English words) and class 1 (Tamil)
    • next we set up scaling of the data sets for both classes
    • we pick a test set and a training set; how these are chosen is key to getting a good model and a generalized fit
    • We import various tools from scikit-learn, like the input scaler ‘StandardScaler’, ‘train_test_split’ etc., to keep up with good training conventions
    • Since we are doing classification, both test and training inputs need to be scaled, but not the label data
  4. As the next step we set up a neural network with three hidden layers and the ‘lbfgs’ solver (‘lbfgs’ is the optimizer, not an activation function). We fit it with the X_train data and the corresponding Y_train labels
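    • from sklearn.neural_network import MLPClassifier
      from sklearn.metrics import accuracy_score

      # three hidden layers of 8, 8 and 7 units, trained with the L-BFGS solver
      nn = MLPClassifier(hidden_layer_sizes=(8,8,7), solver='lbfgs')
      nn.fit(X_train, Y_train)

      Y_pred = nn.predict(X_test)

      print("accuracy => ", accuracy_score(Y_test, Y_pred.ravel()))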

  5. The fitted neural network can generate a score (goodness of fit), and is immediately serialized to disk for future reference; we also output diagnostic information like,
    • confusion matrix
    • classification report
  6. Next we use the trained neural network to show the results for a few known inputs.

Fig. 2: Trained classifier with 89% accuracy correctly identifying the word “Hello”; while both inputs are acceptable in native script form, “Hello” is an English-origin word!

  7. The key point for prediction with the ANN is to transform each input into a feature vector before applying it to the classifier
  8. Once the training is complete we see results like those in item [6].
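
The steps in the list above can be sketched end to end roughly as follows. This is not the code in ‘classifier/modelprocess2.py’: the feature rows here are synthetic random numbers standing in for the CSV data, the width of 10 features is arbitrary, the model is round-tripped through pickle in memory instead of being written to disk, and fitting the scaler on the training split only is my own choice.

```python
import pickle

import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the CSV feature rows (assumption: 10 features per word).
X_english = rng.normal(loc=0.0, scale=1.0, size=(200, 10))  # class 0
X_tamil = rng.normal(loc=2.0, scale=1.0, size=(200, 10))    # class 1
X = np.vstack([X_english, X_tamil])
Y = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])

# Hold out a quarter of the samples for testing.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0, stratify=Y
)

# Scale the inputs only; the labels are left untouched.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Three hidden layers (8, 8, 7) trained with the L-BFGS solver.
nn = MLPClassifier(hidden_layer_sizes=(8, 8, 7), solver="lbfgs",
                   max_iter=500, random_state=0)
nn.fit(X_train_s, Y_train)

Y_pred = nn.predict(X_test_s)
accuracy = accuracy_score(Y_test, Y_pred)
cm = confusion_matrix(Y_test, Y_pred)
print("accuracy =>", accuracy)
print(cm)
print(classification_report(Y_test, Y_pred))

# Serialize the fitted model, reload it, and classify one new feature vector.
restored = pickle.loads(pickle.dumps(nn))
new_vector = rng.normal(loc=2.0, scale=1.0, size=(1, 10))
label = int(restored.predict(scaler.transform(new_vector))[0])
print("predicted class =>", label)
```

Note that a new word must go through the same feature extraction and the same fitted scaler before it reaches the classifier, which is the point of item 7 above.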

Finally, we can automatically tell (via a neural network) whether a word such as ‘computer’ is of Tamil or English origin; there is some sensitivity in this decision due to the roughly 10% error. I have a screenshot of the predictions for various words (the feature vectors are written to the output as well).


Fig. 3: Neural Network prediction of Tamil words and English (transliterated into Tamil) words

To conclude, various Artificial Neural Network topologies and hidden-layer sizes were tried, but we chose to stick with the simplest. At this time the trained neural network seems quite satisfying, and even ready to use for practical purposes.

Conclusion

Scikit-learn provides a powerful framework to build and train classification neural networks.

This work has shown easy classification, with about a 10% error rate (roughly 90% accuracy), of various Tamil/English vocabularies, including words outside the training vocabulary. The source code is provided at the open-tamil site, including the various CSV data etc.

Good luck exploring neural networks. Getting beyond 90% in this task seemed hard, and there is a long way to go.
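
As a side note, accuracy, error rate and false-alarm rate are different quantities, even though they are often quoted interchangeably. The sketch below shows how each falls out of a binary confusion matrix; the counts are made up for illustration and are not the results of this work, and `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(tn, fp, fn, tp):
    """Summary rates from a 2x2 confusion matrix (class 1 = Tamil word)."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        # English words wrongly flagged as Tamil, out of all English words.
        "false_alarm_rate": fp / (fp + tn),
        # Tamil words wrongly flagged as English, out of all Tamil words.
        "miss_rate": fn / (fn + tp),
    }

# Hypothetical counts for 500 English and 500 Tamil test words.
metrics = binary_metrics(tn=440, fp=60, fn=40, tp=460)
for name, value in metrics.items():
    print(name, round(value, 3))
```

With these invented counts the accuracy is 90% and the overall error rate 10%, yet the false-alarm rate is 12% and the miss rate 8%, so it is worth stating which one is meant.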

Classifying Tamil words – part 1

Problem

One of the problems faced when building a Tamil spell checker, albeit a somewhat marginal one, can be phrased as follows:

Given a series of Tamil letters, how do you decide if they form a true Tamil word (even one outside the dictionary) or a transliterated English word?

e.g. between the words ‘உகந்த’ and ‘கம்புயுடர்’, can you decide which is a true Tamil word and which is transliterated?

Tools

This is somewhat simple with the help of a neural network; given sufficient “features” and “training data” we can train one of these networks easily. With the current interest in this area, tools are available to make this task quite easy: any of Keras, PyTorch and TensorFlow may suffice, with Pandas for handling the data.

For our purposes, the main thing to know about Artificial Intelligence (AI) is that machines can be trained to do tasks based on two distinct kinds of learning:

  1. Regression,
  2. Classification

Read more on Wikipedia; the current “problem” is a classification task.

Features

Naturally, for the task of classifying a word we may take the following features:

  1. Word length
  2. Are all characters unique ?
  3. Number of repeated characters ?
  4. Vowels count, Consonant count
    1. In Tamil this information is stored as (Kuril, Nedil, Ayudham) and (Vallinam, Mellinam and Idayinam)
  5. Is word palindrome ?
  6. We can add bigram data as features as next step
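
A rough stand-alone sketch of such a feature extractor is given below. It uses only the Python standard library and is not the Open-Tamil implementation: `split_letters` and `word_features` are hypothetical helpers, letters are approximated by attaching Unicode combining marks to the preceding base character, and the Kuril/Nedil vowel counts and bigram features are left out for brevity.

```python
import unicodedata

# The three Tamil consonant classes, written as bare base consonants.
VALLINAM = set("கசடதபற")
MELLINAM = set("ஙஞணநமன")
IDAYINAM = set("யரலவழள")

def split_letters(word):
    """Group each base character with the combining signs that follow it."""
    letters = []
    for ch in word:
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch  # vowel sign or pulli joins the previous letter
        else:
            letters.append(ch)
    return letters

def word_features(word):
    """Return a small dictionary of per-word features."""
    letters = split_letters(word)
    bases = [letter[0] for letter in letters]
    return {
        "length": len(letters),
        "all_unique": len(set(letters)) == len(letters),
        "repeated": len(letters) - len(set(letters)),
        "vallinam": sum(b in VALLINAM for b in bases),
        "mellinam": sum(b in MELLINAM for b in bases),
        "idayinam": sum(b in IDAYINAM for b in bases),
        "palindrome": letters == letters[::-1],
    }

features = word_features("உகந்த")
print(features)
print(word_features("விகடகவி")["palindrome"])
```

Each dictionary can then be flattened into one numeric row of a CSV file, one row per word.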

Basically this task can be achieved with new code checked into Open-Tamil 0.7 (dev version) called ‘tamil.utf8.classify_letter’.

Screen Shot 2017-12-17 at 1.03.03 PM.png

Data sets

To make the data sets we can use the Tamil VU dictionary as a list of valid Tamil words (label 1); next we can use a list of words transliterated from English into Tamil as the list of invalid Tamil words (label 0).

Using this 1/0 labeled data, we may use part of the combined data to train the neural network with a gradient descent algorithm, or any other method for building a supervised learning model.
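
As a minimal sketch of that labeling and splitting step, assuming the two word lists are already loaded into Python lists: the tiny lists below are placeholders rather than the real dictionary files, and `build_labelled_split` is a hypothetical helper name.

```python
import random

def build_labelled_split(tamil_words, english_words, test_fraction=0.25, seed=0):
    """Label Tamil words 1 and transliterated English words 0, then split."""
    samples = [(w, 1) for w in tamil_words] + [(w, 0) for w in english_words]
    random.Random(seed).shuffle(samples)  # seeded so the split is repeatable
    n_test = int(len(samples) * test_fraction)
    return samples[n_test:], samples[:n_test]

# Placeholder word lists; the real ones come from the dictionary files.
tamil_words = ["உகந்த", "அம்மா", "தமிழ்", "மரம்"]
english_words = ["கம்புயுடர்", "ஹலோ", "பஸ்", "டிக்கெட்"]

train, test = build_labelled_split(tamil_words, english_words)
print(len(train), "training samples,", len(test), "test samples")
```

Shuffling before the split matters here because the combined list starts out sorted by label.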

Building Transliterated Data

Using the Python code below and the data file from the open-tamil repository, you can build the transliterated data,

import codecs
from transliterate import algorithm, azhagi, combinational, jaffna

def jaffna_transliterate(eng_string):
  tamil_tx = algorithm.Iterative.transliterate(jaffna.Transliteration.table,eng_string)
  return tamil_tx

def azhagi_transliterate(eng_string):
  tamil_tx = algorithm.Iterative.transliterate(azhagi.Transliteration.table,eng_string)
  return tamil_tx

def combinational_transliterate(eng_string):
  tamil_tx = algorithm.Iterative.transliterate(combinational.Transliteration.table,eng_string)
  return tamil_tx

# 3 forms of Tamil transliteration for each English word
jfile = codecs.open('english_dictionary_words.jaffna','w','utf-8')
cfile = codecs.open('english_dictionary_words.combinational','w','utf-8')
afile = codecs.open('english_dictionary_words.azhagi','w','utf-8')
with codecs.open('english_dictionary_words.txt','r','utf-8') as engf:
  for idx,w in enumerate(engf.readlines()):
    w = w.strip()
    # skip empty lines
    if len(w) < 1:
      continue
    print(idx)
    jfile.write(u"%s\n"%jaffna_transliterate(w))
    cfile.write(u"%s\n"%combinational_transliterate(w))
    afile.write(u"%s\n"%azhagi_transliterate(w))
# close the output files once, after every word has been written
jfile.close()
cfile.close()
afile.close()

to get the following data files (the left pane shows the ‘Jaffna’ transliteration standard, while the right pane shows the source English word list); the full gist is on GitHub at this link

Screen Shot 2017-12-17 at 1.47.42 PM.png

In the next blog post I will share the details of training the neural network and building this classifier. Stay tuned!