Classifying Tamil words – part 2


Continuing from previous post (see part-1) I am sharing my results on classifying a Tamil alphabet sequence as a valid Tamil-like word or English-like word using a binary classifier.


You need to get scikit-learn API installed by following directions on website here.

pip install -U scikit-learn

This will also get dependencies like Numpy and other Python libraries supporting the SciKit learn.

Next ensure your installation is okay by typing,

python -m sklearn

which should run without any output if all your settings are okay.

Training the AI Classifier

To train the classifier based on multi-layer perceptron (in other words – an AI neural network)

  1. we need to represent our input as a CSV file, with each sampled encoded as a feature of rows.
    • for this case the data are in the form of CSV files representing features of Jaffna, Azhagi, Combinational transliterated output of input words
    • See: files ‘english_dictionary_words.azhagi’ and ‘tamilvu_dictionary_words.txt’ at repo open-tamil/examples/classifier
  2. each word (represented as features) will also be given training label usually as integer, forming a column data on CSV file (across all samples); typical features encoded for the data file are defined in class Field under file ‘classifier/’;
    • Typically the information for each word like number of letters, juries, medics, ayutha letters, vallinams, mellinams, idayinams, first, last and vowels are stored in feature record within CSV.
    • We can generate various feature records of the data files by running the code of
  3. next we may train the neural network using the Scikit learn API,
    • this is key code in ‘classifier/’
    • first we load the CSV feature vectors into Python as Numpy array for both class-0 (English words) and class-‘1’ (Tamil)
    • next we setup scaling of data sets for both classes
    • we pick test set, and training set which are key information to getting a good model network and generalized fit
    • We import various tools out of scikit learn like input scaler ‘StandardScalar’, ‘train_test_split’ etc for keeping up with good training conventions
    • Since we are doing classification both test and training inputs need to be scaled but not the label data
  4. Next step we setup a 3-layer neural network with ‘lbfgs’ activation function. We can fit this data with X_train data  and corresponding Y_train labels
    • nn = MLPClassifier(hidden_layer_sizes=(8,8,7),solver=lbfgs),Y_train)

      Y_pred = nn.pred( X_test )

      print(” accuracy => “,accuracy_score(Y_pred.ravel(),Y_test)

  5. The fitted neural network is capable of generating a score (goodness of fit), and immediately serialized into disk for future references; we also output diagnostic informations like,
    • confusion matrix
    • classification report
  6. Next we use the training neural network to show the results of  a few known inputs.

Fig. 2: 89% accuracy trained classifier with correct identification of word “Hello”; while both are acceptable in native script form it is a English origin word!

  1. Key points for this prediction with ANN are to keep the input transformed as a feature vector before applying it to the classifier input
  2. Once the training is complete we see results like in item [6].

Finally we can automatically tell (via a neural network) if computer is a Tamil or English origin word; there is some sensitivity in this decision due to the 10% error. I have a screenshot of the predictions for various words (feature vectors are written as output as well)

Screen Shot 2017-12-20 at 2.28.35 AM.png

Fig. 3: Neural Network prediction of Tamil words and English (transliterated into Tamil) words

Finally we would like to conclude saying various types of Artificial Neural Network topologies and hidden-layer sizes were used but we chose to stick with simplest. At this time this trained neural network seems like a quite satisfying, and even ready to use for practical purposes.


Scikit-learn provides powerful framework to train and build classification neural networks.

This work has shown easy classification with 10% false-alarm rate (or ~90% classification rate) of various Tamil/English vocabularies and out of training vocabulary sets. The source codes are provided at open-tamil site including the various CSV data etc.

Goodluck, to exploring Neural Networks. Getting beyond 90% in this task seemed hard, and long way to go.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s