This week, following some interesting gains in the open-tamil project and updates to urbantamil.com, I am writing about a word-frequency analysis of the Tamil vocabulary.
Using the histogram technique, binning the words of the UrbanTamil.com corpus (63222 words) by their length in Tamil letters, we see a chart like the one below.
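
The binning step can be sketched roughly as follows. This is only a minimal illustration, not the script used for this post: `split_letters` and `length_histogram` are hypothetical helper names, the four sample words stand in for the real corpus, and the splitter assumes that a Tamil letter is one base character plus any combining signs that follow it (vowel signs, pulli). open-tamil has its own letter splitter (`tamil.utf8.get_letters`); the standard library is used here only to keep the sketch self-contained.

```python
import unicodedata
from collections import Counter


def split_letters(word):
    """Split text into Tamil letters (hypothetical helper).

    Assumption: a letter is a base character followed by zero or more
    combining marks (Unicode categories Mn/Mc), e.g. vowel signs or pulli.
    """
    letters = []
    for ch in word:
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch   # attach the sign to the preceding base
        else:
            letters.append(ch)  # start a new letter
    return letters


def length_histogram(words):
    """Count how many words fall into each letter-length bin."""
    return Counter(len(split_letters(w)) for w in words)


# Tiny stand-in for the corpus; the real input would be the full word list.
sample_words = ["தமிழ்", "அம்மா", "பூ", "வணக்கம்"]
hist = length_histogram(sample_words)

# Text rendering of the histogram, one row per word length.
for length in sorted(hist):
    print(f"{length:2d} {'#' * hist[length]}")
```

Counted this way, புத்திரபௌத்திரபாரம்பரியம் comes out at 15 letters, which agrees with the figure quoted in this post. Conjuncts such as க்ஷ are counted as two letters by this simple rule, so totals may differ slightly from a splitter that treats them as one.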

You can observe a few things:
- Most Tamil words are between 4 and 7 letters long, roughly 80% of the 63222 words
- Tamil has a lot of 1-letter “words”: 123 to be precise, out of the allowed 247 letters plus the Sanskrit/Grantha letters
- The longest Tamil words in the UrbanTamil dictionary (3 of them) are 15 letters long: 1. புத்திரபௌத்திரபாரம்பரியம், 2. முப்பொழுதுந்திருமேனிதீண்டுவார், 3. ஒதுக்குப்பொதுக்குப்பண்ணுதல்
- This appears to be a power-law distribution, but it is not the same as the Zipf’s law distribution, which we could analyze in a future post.
- We can also compute the letter probability from Tamil texts using the open-tamil library.
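
As a rough illustration of the last point, letter probabilities are just relative frequencies: count each letter and divide by the total number of letters. The sketch below rests on some assumptions: `split_letters` and `letter_probabilities` are hypothetical helpers written with the standard library (in practice open-tamil's `tamil.utf8.get_letters` could do the splitting), only characters from the Tamil Unicode block are counted, and the two-word sample text is made up for the example.

```python
import unicodedata
from collections import Counter


def split_letters(text):
    """Group each base character with the combining signs that follow it."""
    letters = []
    for ch in text:
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


def letter_probabilities(text):
    """Return {letter: relative frequency} over the Tamil letters in text."""
    # Keep only letters whose base character lies in the Tamil block
    # (U+0B80 to U+0BFF); spaces and punctuation are dropped.
    letters = [l for l in split_letters(text) if "\u0b80" <= l[0] <= "\u0bff"]
    counts = Counter(letters)
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}


# Made-up sample text; a real estimate needs a much larger body of text.
probs = letter_probabilities("அம்மா அப்பா")
for letter, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(letter, round(p, 3))
```

Estimates from such a small sample are of course meaningless; the point is only the shape of the computation.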
Thanks for reading. Have a good week ahead.
Thank you, and greetings,
-Muthu
Why is the UrbanTamil site not working at present? Is there some technical problem?