This year I had the chance to speak at my undergraduate institution, a well-recognized engineering school in Trichy, India, about my professional development and my understanding of science, engineering and innovation over a short career as a software developer and scientist-in-training.
My primary goal was to communicate the tournament model, and how we may enjoy our time in educational institutions pursuing a quest for truth regardless of some of the outcomes, precisely because those outcomes are governed by the tournament model.
Consider the task of picking a winner through 2-player games from a group of N players (say 64 or 128, as in a typical tennis tournament, or a smaller number of teams as in the IPL or World Cup cricket tournaments). The goal is to organize the games in a championship format, with league rounds and knock-out rounds leading to an eventual final that decides the winner. This is the tournament model.
In an alternate version, where the number of participating teams/players is not a power of 2, we may set up the model with the following algorithm/pseudocode:
- Enter all teams/players in a double-ended queue [deque]
- Select first-2 teams in queue and let them play;
- Take the winner of this game and enqueue to the end of queue; discard the loser (obviously!)
- Now we have N-1 teams/players in the queue.
- Repeat steps 2-4, till number of players is 1.
- We have a winner!
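The steps above can be sketched in Python with `collections.deque`. This is a minimal sketch, not a definitive implementation: the function names, the (name, rating) player tuples and the "higher rating always wins" match rule are assumptions made only for illustration.

```python
from collections import deque


def run_tournament(players, play_match):
    """Return the last player standing using the queue procedure above."""
    queue = deque(players)
    if not queue:
        raise ValueError("need at least one player")
    while len(queue) > 1:
        # Take the first two players from the front of the queue.
        first = queue.popleft()
        second = queue.popleft()
        # The winner rejoins at the back of the queue; the loser is dropped.
        queue.append(play_match(first, second))
    return queue[0]


def stronger_player_wins(a, b):
    """Hypothetical match rule: the higher rating always wins."""
    return a if a[1] >= b[1] else b


# Hypothetical field of 5 players (not a power of 2) as (name, rating).
field = [("A", 71), ("B", 70), ("C", 69), ("D", 68), ("E", 67)]
champion = run_tournament(field, stronger_player_wins)
print(champion)
```

Each match removes exactly one player, so a field of N players always needs N-1 matches, whatever the value of N.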
The key insight of the tournament model is that small differences between the participating entities can be amplified by the model into winners and losers, and effects like the Matthew effect can ensure that initial advantages snowball over time [especially in industries like entertainment, social networking, etc.].
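To get a feel for this amplification, here is a small worked calculation. It rests on simplifying assumptions that are mine, not part of the talk: a single-elimination bracket, one favoured player who wins every match with the same probability, and all other players evenly matched. The function name is hypothetical.

```python
def title_odds_ratio(p_match, rounds):
    """How many times likelier the favoured player is to win the title
    than an average player, in a single-elimination bracket."""
    favourite = p_match ** rounds   # must win every round
    average = 0.5 ** rounds         # an evenly matched player's chance
    return favourite / average


# A 128-player bracket has 7 rounds; assume a 55% per-match edge.
ratio = title_odds_ratio(0.55, 7)
print(round(ratio, 2))
```

Under these assumptions a 55-to-45 edge per match makes the favoured player nearly twice as likely to take the title as anyone else, even though each individual match is close to a coin toss.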
The tournament model decides the frequency of India vs Pakistan cricket matches, and why Nadal vs Federer is the most likely Grand Slam final match-up; the same system decides the success of professional actors and actresses. Why are Kamal Haasan and Rajinikanth more famous than other talented male actors of their generation (e.g. Sathyaraj, Karthik, Prabhu, etc.)? [Not to mention actresses, which is a whole other question.] Modern-day movie star rivalries are also plentiful, to wit Dhanush vs Simbu in their ascent to fame.
Principles such as randomness of outcomes and regression toward the mean explain these outcomes in retrospect, but none of these techniques can explain the phenomena in the predictive manner one may seek.
Hence, to students approaching a potential lifetime of work in engineering or science, I recommend aspiring to understand the fundamental pieces: learn the instruments, notes, chords and scales of your musical pieces, not just the piece itself, so that in the future you can compose your own orchestral music. That way you can build tools for the future challenges you may face, which will surely differ from the challenges you were taught to resolve, using an open-ended approach to learning.
The tournament model also helps you handle failures, be it in a product, a strategy, or problem areas in life. Usually, losing at something, whether by not making the grade, placing second, or being edged out, means being only marginally “less” in some way, shape or form compared to the competition.
What is your experience with managing technology projects and their outcomes? Leave your comment below.
A.I./ML for Hindi Language Processing
Sometimes it’s good to look around and learn from what’s happening in other realms of Indian language processing. In my limited experience, language efforts in computing for Indian languages revolve around the Dravidian languages, Bengali, Marathi or Hindi. As the Tamil saying goes, computational linguistics should not end up like riding a horse inside a round-bottomed pot (that is, working within a needlessly confined space), whether as individuals or across languages.
Some good open-source project efforts in Hindi language processing are reviewed in this blog. [There are projects like the open-tamil API for Hindi, e.g. a get_letters-like function provided by the tokenizer project here (with the caveat that it is only a small function compared to the expansive open-tamil), but we talk about the ML/A.I.-focused projects here.]
- Hindi word embedding called Hindi2vec (along the lines of the word2vec project). The idea is to associate similar words (e.g. ‘பல்’ [tooth], ‘நாக்கு’ [tongue], ‘வாய்’ [mouth]) with similar vectors that lie within a neighborhood of each other, using concepts of linear algebra: vector spaces and matrices. So when you search, mistype, or want to classify, there is a neighborhood of known words close to the potentially unknown word input by the user; identifying such a known neighborhood can help decision making and drive various learning, classification or dialogue systems.
- Hindi Transliteration Model project and the DeepTrans project: a really cool effort where they developed a reference data set of English to Hindi and trained a model to transliterate user input from English to Hindi.
- We can do this in Tamil as well, since we have many transliteration schemes set out in open-tamil. But even the same user is not going to follow a scheme strictly, nor do different users follow the same scheme; in all these cases a machine learning A.I. model may be more robust by virtue of learning the underlying rules. This is a very interesting project, and should be fairly simple to implement for Tamil from the open-tamil transliterate module and SciKit Learn or other frameworks, with a high correct prediction rate of about 95%.
- Hindi-English parallel dictionary of 8MB size (probably 500,000 words or so, I imagine) here. A similar resource, if it existed for Tamil, could be a good jump-start for translation projects; e.g. can we have an English-Tamil parallel dictionary for the simple TVU word list/dictionary?
- Hindi Sentiment Analysis project does a ternary [good, bad, neutral] classification of text. They do this using a CDAC model, which makes me super curious; maybe CDAC-India (Pune) has a Tamil POS-Tagger too? Probably they do.
- Tamil POS-Taggers are widely reported: AU-KBC Chennai has a POS-Tagger, probably the best for Tamil, and Dr. Vasu Renganathan has a POS-Tagger. Neither work is currently available for open-source use; however, their techniques are openly shared via their papers at INFITT conferences.
- The Sorkandu project can also be revived for making an open-source POS-Tagger.
- Emotion Recognition in Hindi Speech project: this work from IIT KGP students builds a reference audio data set with known emotion labels, builds some kind of machine learning model on it, and gets about 5x better than a random guess at recognizing emotion from speech audio.
- We probably don’t have any work in this direction in the open, but interestingly NIST in the USA sponsored a Tamil Key Word Search (KWS) challenge, reports of which were published by a Singapore team in academic journals. More interestingly, the KWS challenge released 2 hours of speech data with tagged information. In the USA, government-released data (e.g. pictures from NASA) usually qualifies as public domain, so maybe there is a way to get this data. God only knows!
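The word-embedding idea from the first item in the list above, that similar words sit in a neighborhood of each other, can be sketched with cosine similarity. The three-dimensional vectors below are made-up toy values, not output from Hindi2vec or any trained model, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_words(query, embeddings, k=2):
    """Return the k known words whose vectors are closest to the query."""
    scored = [(cosine_similarity(query, vec), word)
              for word, vec in embeddings.items()]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]


# Toy vectors (assumed values): three related words and one unrelated word.
toy_embeddings = {
    "tooth": [0.90, 0.80, 0.10],
    "tongue": [0.80, 0.90, 0.20],
    "mouth": [0.85, 0.85, 0.15],
    "train": [0.10, 0.20, 0.95],
}
# Vector of an unknown or mistyped input word (also an assumed value).
query_vector = [0.88, 0.82, 0.12]
neighbours = nearest_words(query_vector, toy_embeddings, k=3)
print(neighbours)
```

A real system would use vectors with hundreds of dimensions learned from a large corpus, but the neighborhood lookup works the same way.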
We know that Google ASR, YouTube’s online translation of English videos into Tamil closed captions, foreign-language-to-Tamil translation, and transliteration inputs all use perhaps the most advanced models in TensorFlow on cloud hardware. Yet none of this technology is directly usable for free (maybe for a price, via the Google cloud API offerings), and we probably don’t know all the details of how they achieved these magical software applications for the Tamil language. Anyone’s guess, like mine, is that they use the massive data sets they have from our Tamil news groups, emails, websites and user input, plus TensorFlow A.I./ML magic. At least we have to be grateful to “Google-aandavar” (Lord Google), as some friends commented on the freetamilcomputing group. 🙂
Surprisingly, to my knowledge, there are no planned, ongoing or completed open-source projects like these in Tamil. This may be another avenue for growth, and in this case the Hindi projects (at least in the open-source domain) seem to have forged ahead!
I’m somewhat late to hear this news, but recently there was a bug in the iMessage application on iOS and Mac platforms causing it to crash/freeze, as reported here.
The crashing character consists of 5 codepoints.
U+0C1C [Lo] TELUGU LETTER JA
U+0C4D [Mn] TELUGU SIGN VIRAMA
U+0C1E [Lo] TELUGU LETTER NYA
U+200C [Cf] ZERO WIDTH NON-JOINER
U+0C3E [Mn] TELUGU VOWEL SIGN AA
and has a complex ligature form.
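For readers who want to inspect such a sequence themselves, here is a small sketch using Python's standard `unicodedata` module. It lists the code point, general category and name of each character; the helper name `describe` is my own. The string is built from escapes and never rendered, so running it does not depend on how a terminal shapes the ligature.

```python
import unicodedata

# The five code points listed above, written as escapes.
sequence = "\u0C1C\u0C4D\u0C1E\u200C\u0C3E"


def describe(text):
    """Return (code point, general category, name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]


for code, category, name in describe(sequence):
    print(code, f"[{category}]", name)
```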
Sadly, the sparse use of the Telugu Unicode block is a likely reason the bug was discovered so late; the root cause of the bug, however, is some unsightly buffer overflow in Apple’s codebase.
Who knows if Tamil Unicode has any equivalent horrible bugs living in Android/Apple/Mac/Linux/Windows platforms? Hopefully not.
Makkal Selvan word search
Need the answers? It’s easy! Try it yourself here: http://tamilpesu.us/
[This article originally appeared in the 2017 Tamil Internet conference, UT-SC, Toronto, Canada, magazine ]
The current hot trend in the AI revolution is “deep learning”, which is a fancy way of talking about multi-layered neural networks (convolutional networks being a prominent example); this field of study has heralded a new age in computing, extending human capabilities through automation and intelligent machines.
These neural networks aren’t the same as the networks of neurons in your brain! We are talking about artificial neural networks, which reside in computers and try to mimic the biological neural network with its synapses (connections), axons, dendrites and activation potentials. These thinking machines have their beginnings in post-WW-II research, notably Norbert Wiener’s “Cybernetics” at MIT and the perceptron, which was introduced by Frank Rosenblatt and later analyzed in the book “Perceptrons” by Marvin Minsky and Seymour Papert at MIT.
But do we know why there is a sudden interest in these biologically inspired computer models? It is due to GPUs, which have accelerated the complex computations associated with neural networks enough for them to be practical at such a large scale. GPUs allow these networks to operate on gigabytes (or even terabytes) of data, and have reduced the computation time from months to days, days to hours, or hours to minutes, usually by an order of magnitude, which was not possible in an earlier generation of computing. Before we jump into the details, let us understand why we need deep learning and convolutional neural networks in the first place.
Since the times of Leonardo da Vinci, Nicolaus Copernicus, Galileo Galilei, Tycho Brahe, Johannes Kepler and Isaac Newton, science and engineering have traditionally advanced through our ability to understand phenomena in the natural world and describe them mathematically. However, gaining models piecemeal through experimentation and scientific breakthroughs for each problem at hand is a slow process. Outside of physics and mathematics, the scientific method is largely driven by an empirical approach.
It is in such pursuits, building models of unknown processes where the observational data far exceed our human ability to divine an analytical model, that deep learning and GPU-based multi-layered neural networks provide an ad-hoc computable model. System identification for particular classification tasks, image recognition, speech recognition, and even the modern miracle of self-driving cars are all enabled by deep learning technology. All this came about due to the seminal work of many innovators, culminating in the efficient training of deep convolutional neural networks with GPU hardware acceleration by Prof. Geoff Hinton and co-workers.
An original pioneer in the field of AI from before the AI winter, Prof. Geoff Hinton and co-workers recently showed deep learning models that beat the prevailing benchmarks on classification and prediction tasks on the following speech, text and image datasets: Reuters, TIMIT, MNIST, CIFAR and ImageNet. This set off renewed interest in the field of AI from academia and industry giants alike: Google, Microsoft, Baidu and Facebook.
GPU stands for Graphics Processing Unit. GPUs were originally designed in the 1990s for the graphics rendering used in video games. They have a large number of parallel cores, which are very efficient at simple mathematical computations like matrix multiplications. These computations are the fundamental basis for machine learning methods such as deep learning. While the improvement in CPUs has slowed over the years as Moore’s law has hit a bottleneck, GPU performance has continued to increase unabated, showing tremendous improvements over the generations.
Figure. 1 (left): Deep Learning training task times as function of various GPU processors from NVidia. Figure. 2(right): AlexNet training throughput for 20 iterations on various CPU/GPU processing platforms.
Such GPUs, originally invented for shading algorithms, are now applied to training large machine learning models using frameworks like OpenCL or CUDA from the vendors (variants of the C language with constructs for describing parallel execution via threading).
The pioneering hardware vendors include Nvidia, with GPU series like GeForce and Tesla, and AMD, with its Radeon and GPGPU products; Google has entered this race with its TPU (Tensor Processing Unit), and there are some offerings from Intel for ML training applications. Nvidia and AMD are the main players in the GPU space, with Nvidia laying special emphasis on parallel computing and deep learning over the years. Nvidia just announced the new Volta-generation V100 GPU, which is about 2.5x faster than the previous-generation Pascal GP100 announced less than 2 years ago. Compared to CPUs, however, GPUs are more than 50x faster for deep learning. Deep learning training performance across various GPU families is shown in Figure 1, and AlexNet training throughput on various platforms is shown in Figure 2.
If the Harvard-architecture and RISC-architecture CPUs have been the workhorses of the personal computer revolution, then the advent of high-framerate video gaming pushed graphics rendering from CPU + video card based rendering to CPU + GPU, and then to CPU + GPU + GP-GPU (general purpose GPU); some of this overview is shown in Figure 3a, 3b.
GPUs are suitable for large numerical algorithms where data have to be moved through a computational pipeline, often in parallel; such SIMD (single instruction, multiple data) problems, like the genome sequencing shown in Figure 3c, gain the maximum speedup when solved on a GPU. However, there is a fundamental limit to GPU acceleration due to Amdahl’s law: the achievable speedup from parallelization saturates at a ceiling set by the serial bottlenecks in a given computational task.
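Amdahl's law can be written as speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelized and n is the number of parallel workers. The short sketch below evaluates it; the function name and the example value p = 0.95 are my own choices for illustration, not measurements of any GPU.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Overall speedup predicted by Amdahl's law."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Assume 95% of a task parallelizes; the remaining 5% caps the speedup at 20x.
for cores in (10, 100, 1000, 100000):
    print(cores, round(amdahl_speedup(0.95, cores), 2))
```

However many cores are added, the speedup in this example never reaches 1 / 0.05 = 20x, which is why trimming the serial part of a pipeline often matters as much as buying a faster GPU.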
To build a deep learning application, one may use labeled datasets to build a learning model on any of the various frameworks (both open-source and closed) provided by competing vendors in the industry, such as the following:
- TensorFlow: developed by Google; Python API over a C++ engine; low-level API, good for researchers; not commercially supported. Notably, Google is in the process of developing a TPU, an advanced version of a GPU, for direct use with TensorFlow.
- Caffe 2: successor to Caffe, which was developed at UC Berkeley; used at Facebook among other places; focussed on computer vision; Caffe was one of the earlier frameworks to gain significant adoption; Python API over C++ and CUDA code.
- Scikit Learn: Python-based general inference and machine-learning framework.
- Theano: written in Python; the grand-daddy of deep learning frameworks.
Tamil applications for deep learning include providing new solutions, or improving existing ones, for the problems of:
- Tamil Speech Recognition
- Tamil Character Recognition [7,8]
- Natural Language Processing for Tamil
Hardware acceleration and the availability of big data (labeled datasets) will play a key role in the success of applying deep learning techniques to these problems.
[1] Jensen Huang, “Accelerating AI with GPUs: A New Computing Model.”
[2] A. Krizhevsky, I. Sutskever and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems (2012).
[3] Y. LeCun, Y. Bengio and G. E. Hinton, “Deep Learning,” Nature, Vol. 521, pp 436-444 (2015).
[4] GPU definition at PC Magazine Encyclopedia, PC Magazine (2017).
[5] Tesla GPU Application notes from NVidia (2017).
[6] “Comparing deep learning frameworks,” Deeplearning4j.org (2017).
[7] Prashanth Vijayaraghavan, Misha Sra, “Handwritten Tamil Recognition using a Convolutional Neural Network,” NEML Poster (2015).
[8] R. Jagadeesh Kannan, S. Subramanian, “An Adaptive Approach of Tamil Character Recognition Using Deep Learning with Big Data-A Survey,” Proceedings of 49th Annual Convention of Computer Society of India (vol. 1), pp 557-567 (2015).
Continuing from the previous post (see part-1), I am sharing my results on classifying a Tamil alphabet sequence as a valid Tamil-like word or an English-like word using a binary classifier.
You need to get the scikit-learn API installed by following the directions on the website here.
pip install -U scikit-learn
This will also install dependencies like NumPy and the other Python libraries supporting scikit-learn.
Next, ensure your installation is okay by typing,
python -c "import sklearn"
which should run without any output if all your settings are okay.
Training the AI Classifier
To train the classifier based on a multi-layer perceptron (in other words, an AI neural network):
- we need to represent our input as a CSV file, with each sample encoded as a row of features.
- for this case the data are in the form of CSV files representing features of the Jaffna, Azhagi and Combinational transliterated outputs of the input words
- See: files ‘english_dictionary_words.azhagi’ and ‘tamilvu_dictionary_words.txt’ at repo open-tamil/examples/classifier
- each word (represented as features) will also be given a training label, usually an integer, forming a column of data in the CSV file (across all samples); the typical features encoded for the data file are defined in class Field in the file ‘classifier/preprocess.py’;
- Typically, information for each word such as the number of letters, kurils (short vowels), nedils (long vowels), ayutha letters, vallinams, mellinams, idayinams, first, last and vowels is stored in the feature record within the CSV.
- We can generate various feature records of the data files by running the code of preprocessor.py
- next we may train the neural network using the Scikit learn API,
- this is key code in ‘classifier/modelprocess2.py’
- first we load the CSV feature vectors into Python as NumPy arrays for both class 0 (English words) and class 1 (Tamil words)
- next we set up scaling of the data sets for both classes
- we pick a test set and a training set, which are key to getting a good model network and a generalized fit
- We import various tools out of scikit-learn, like the input scaler ‘StandardScaler’, ‘train_test_split’ etc., to keep up with good training conventions
- Since we are doing classification, both test and training inputs need to be scaled, but not the label data
- Next we set up a neural network with 3 hidden layers, trained with the ‘lbfgs’ solver. We can fit it with the X_train data and the corresponding Y_train labels
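from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
nn = MLPClassifier(hidden_layer_sizes=(8, 8, 7), solver="lbfgs")  # three hidden layers
nn.fit(X_train, Y_train)  # X_train holds the scaled training features
Y_pred = nn.predict(X_test)  # predicted class labels for the held-out set
print("accuracy =>", accuracy_score(Y_test, Y_pred.ravel()))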
- The fitted neural network can generate a score (goodness of fit), and is immediately serialized to disk for future reference; we also output diagnostic information like,
- confusion matrix
- classification report
- Next we use the trained neural network to show the results for a few known inputs.
- The key point for prediction with the ANN is to transform the input into a feature vector (scaled the same way as the training data) before applying it to the classifier input
- Once the training is complete we see results like in item .
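As a rough illustration of turning a word into a feature vector, the sketch below counts a few letter classes directly from Unicode code points. This is a simplified stand-in and not the Field class from open-tamil: the feature order, the function name `word_features` and the choice to count code points instead of full Tamil letters are all assumptions made for the example.

```python
# Letter classes as sets of single Tamil code points.
SHORT_VOWELS = set("அஇஉஎஒ")
LONG_VOWELS = set("ஆஈஊஏஐஓஔ")
VALLINAM = set("கசடதபற")    # hard consonants
MELLINAM = set("ஙஞணநமன")    # soft (nasal) consonants
IDAYINAM = set("யரலவழள")    # medial consonants
AYTHAM = "ஃ"


def word_features(word):
    """Return a simplified feature vector for one word:
    [code points, vallinam, mellinam, idayinam, aytham, standalone vowels]."""
    return [
        len(word),
        sum(ch in VALLINAM for ch in word),
        sum(ch in MELLINAM for ch in word),
        sum(ch in IDAYINAM for ch in word),
        word.count(AYTHAM),
        sum(ch in SHORT_VOWELS or ch in LONG_VOWELS for ch in word),
    ]


# "தமிழ்" (Tamil) written with escapes so the code points are unambiguous.
print(word_features("\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"))
```

Whatever features are chosen, the same function (and the same scaler) has to be applied to every word at prediction time, otherwise the classifier sees inputs unlike those it was trained on.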
Finally, we can automatically tell (via a neural network) whether a word such as ‘computer’ is of Tamil or English origin; there is some sensitivity in this decision due to the 10% error. I have a screenshot of the predictions for various words (the feature vectors are written out as well).
To conclude, various artificial neural network topologies and hidden-layer sizes were tried, but we chose to stick with the simplest. At this time the trained neural network seems quite satisfying, and even ready to use for practical purposes.
Scikit-learn provides a powerful framework to train and build classification neural networks.
This work has shown easy classification, with a 10% false-alarm rate (or ~90% classification rate), of various Tamil/English vocabularies, including sets outside the training vocabulary. The source code is provided at the open-tamil site, including the various CSV data etc.
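For readers who want to compute such rates by hand, here is a minimal pure-Python sketch of a binary confusion matrix. The ten labels are made up for illustration (0 = English-like, 1 = Tamil-like) and are not the results reported above; the function names are my own. Strictly speaking, the false-alarm rate (false positives among true negatives) and the overall error rate are different quantities, and the sketch reports both.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def accuracy(counts):
    """Fraction of all samples classified correctly."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


def false_alarm_rate(counts):
    """Fraction of true negatives wrongly flagged as positive."""
    negatives = counts["fp"] + counts["tn"]
    return counts["fp"] / negatives if negatives else 0.0


# Made-up labels for ten words: 0 = English-like, 1 = Tamil-like.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
counts = confusion_counts(y_true, y_pred)
print(counts, accuracy(counts), false_alarm_rate(counts))
```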
Good luck exploring neural networks. Getting beyond 90% in this task seemed hard, and there is a long way to go.