Recap
Continuing from the previous post (see part-1), I am sharing my results on classifying a sequence of Tamil letters as either a valid Tamil-like word or an English-like word, using a binary classifier.
Prerequisites
You need to install scikit-learn by following the directions on its website:
pip install -U scikit-learn
This will also pull in dependencies such as NumPy and the other Python libraries that scikit-learn relies on.
Next, check that your installation is okay by typing,
python -c "import sklearn"
which should run without any output if all your settings are okay.
Training the AI Classifier
To train a classifier based on a multi-layer perceptron (in other words, an artificial neural network):
- we need to represent our input as a CSV file, with each sample encoded as a row of features.
- for this case the data are CSV files holding features of the Jaffna, Azhagi and Combinational transliterated output of the input words
- See: files ‘english_dictionary_words.azhagi’ and ‘tamilvu_dictionary_words.txt’ at repo open-tamil/examples/classifier
- each word (represented as features) is also given a training label, usually an integer, which forms one column of the CSV file (across all samples); the typical features encoded in the data file are defined in class Field in the file ‘classifier/preprocess.py’;
- Typically, information about each word such as the number of letters, kurils, nedils, ayutha letters, vallinams, mellinams, idayinams, the first and last letters, and vowels is stored in a feature record within the CSV.
- We can generate the feature records for the data files by running the code in preprocess.py
- next we train the neural network using the scikit-learn API,
- this is the key code in ‘classifier/modelprocess2.py’
- first we load the CSV feature vectors into Python as NumPy arrays for both class 0 (English words) and class 1 (Tamil words)
- next we set up scaling of the data sets for both classes
- we pick a test set and a training set; how these are chosen is key to getting a good model and a fit that generalizes
- We import various tools from scikit-learn, such as the input scaler ‘StandardScaler’ and ‘train_test_split’, to keep up with good training conventions
- Since we are doing classification, both the test and training inputs need to be scaled, but not the label data
- In the next step we set up a neural network with three hidden layers and the ‘lbfgs’ solver (an optimizer, not an activation function). We can fit it with the X_train data and the corresponding Y_train labels
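from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# three hidden layers of 8, 8 and 7 units, trained with the L-BFGS solver
nn = MLPClassifier(hidden_layer_sizes=(8, 8, 7), solver='lbfgs')
nn.fit(X_train, Y_train)
Y_pred = nn.predict(X_test)
print(" accuracy => ", accuracy_score(Y_test, Y_pred.ravel()))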
- The fitted neural network can report a score (goodness of fit) and is immediately serialized to disk for future reference; we also output diagnostic information such as,
- confusion matrix
- classification report
- Next we use the trained neural network to show the results for a few known inputs.
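The training steps above can be sketched end to end as follows. This is only a minimal sketch, not the code in ‘classifier/modelprocess2.py’: the synthetic Gaussian features stand in for the two CSV files, and the feature count of 10, the 75/25 split, `max_iter=1000` and the fixed random seeds are all assumptions made for illustration. The scaler here is fitted on the training portion only, which may differ from what the original script does.

```python
import pickle

import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the two CSV feature files. With real data
# each array would be loaded from its CSV file instead.
rng = np.random.default_rng(0)
n_features = 10
X_english = rng.normal(loc=0.0, scale=1.0, size=(300, n_features))  # class 0
X_tamil = rng.normal(loc=1.5, scale=1.0, size=(300, n_features))    # class 1

X = np.vstack([X_english, X_tamil])
Y = np.concatenate([np.zeros(len(X_english), dtype=int),
                    np.ones(len(X_tamil), dtype=int)])

# Hold out a test set, keeping the class balance the same in both parts.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0, stratify=Y
)

# Scale the input features only; the integer labels are left untouched.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Three hidden layers (8, 8, 7 units) trained with the L-BFGS solver.
nn = MLPClassifier(hidden_layer_sizes=(8, 8, 7), solver="lbfgs",
                   max_iter=1000, random_state=0)
nn.fit(X_train, Y_train)
Y_pred = nn.predict(X_test)

accuracy = accuracy_score(Y_test, Y_pred)
print("accuracy =>", accuracy)
print(confusion_matrix(Y_test, Y_pred))
print(classification_report(Y_test, Y_pred, target_names=["English", "Tamil"]))

# Serialize the scaler and model together; writing `blob` to a file would give
# the on-disk copy for later use.
blob = pickle.dumps({"scaler": scaler, "model": nn})
restored = pickle.loads(blob)
```

Keeping the scaler next to the model matters because any new word has to be scaled with the same statistics before it is classified.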

Fig. 2: Trained classifier with 89% accuracy correctly identifying the word “Hello”; while both are acceptable in native-script form, it is a word of English origin!
- The key point for prediction with the ANN is to transform the input into a feature vector before applying it to the classifier
- Once the training is complete we see results like those in item [6].
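To make the "word to feature vector" step concrete, here is a small sketch of how a Tamil word could be turned into numeric features. It is a simplified, hypothetical feature set written with the standard library only; the function names `split_letters` and `word_features` and the choice of seven features are my assumptions, and the real features are those of class Field in the preprocessing script.

```python
# Assumption: simplified stand-in for the real feature extraction.
# Combining marks (vowel signs, pulli, au length mark) attach to the previous letter.
COMBINING = {chr(c) for c in range(0x0BBE, 0x0BCE)} | {"\u0BD7"}
VOWELS = {chr(c) for c in range(0x0B85, 0x0B95)}  # standalone vowels
AYTHAM = "\u0B83"
VALLINAM = set("கசடதபற")    # hard consonants
MELLINAM = set("ஙஞணநமன")    # soft consonants
IDAIYINAM = set("யரலவழள")   # medial consonants


def split_letters(word):
    """Group code points into Tamil letters (base + combining marks)."""
    letters = []
    for ch in word:
        if ch in COMBINING and letters:
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


def word_features(word):
    """Return a fixed-length list of integer features for one word."""
    letters = split_letters(word)
    bases = [letter[0] for letter in letters]
    return [
        len(letters),                           # number of letters
        sum(b in VOWELS for b in bases),        # standalone vowels
        sum(b == AYTHAM for b in bases),        # aytham letters
        sum(b in VALLINAM for b in bases),      # vallinam
        sum(b in MELLINAM for b in bases),      # mellinam
        sum(b in IDAIYINAM for b in bases),     # idaiyinam
        int(bases[0] in VOWELS) if bases else 0,  # does the word start with a vowel?
    ]


print(word_features("தமிழ்"))  # [3, 0, 0, 1, 1, 1, 0]
```

A vector like this would then be scaled with the same scaler used in training and passed to the classifier's `predict` method as a single-row 2-D array; the number and order of features must match what the network was trained on.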
Finally, we can automatically tell (via a neural network) whether a word such as “computer” is of Tamil or English origin; there is some sensitivity in this decision due to the roughly 10% error. I have a screenshot of the predictions for various words (the feature vectors are written to the output as well)

Fig. 3: Neural Network prediction of Tamil words and English (transliterated into Tamil) words
Finally, we would like to conclude by saying that various artificial neural network topologies and hidden-layer sizes were tried, but we chose to stick with the simplest. At this time the trained neural network seems quite satisfactory, and even ready to use for practical purposes.
Conclusion
Scikit-learn provides a powerful framework to train and build neural networks for classification.
This work has shown easy classification, with about a 10% error rate (or ~90% classification rate), of various Tamil/English vocabularies, including sets outside the training vocabulary. The source code is provided at the open-tamil site, along with the various CSV data files.
Good luck exploring neural networks. Getting beyond 90% on this task seemed hard, with a long way to go.