UrbanTamil.com + open-tamil = Tamil Vocabulary

This week, following some gains in the open-tamil project and updates to urbantamil.com, I am writing about a word-frequency analysis of the Tamil vocabulary.

Using a histogram, with the 63222 words of the UrbanTamil.com corpus binned by word length (counted in Tamil letters), we get a chart like the one below.

Tamil word frequencies by word-length, from the UrbanTamil.com corpus of 63222 words.
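The binning can be sketched in a few lines of Python. This is only a rough illustration, not the script used for the chart: the `split_letters` helper is my own stand-in that groups each base character with the combining marks that follow it (vowel signs and the pulli), and the sample word list is made up. open-tamil ships its own letter splitter, which should be preferred for real work; this approximation, for instance, counts க்ஷ as two letters.

```python
import unicodedata
from collections import Counter

def split_letters(word):
    """Approximate Tamil letter segmentation (hypothetical helper):
    attach every combining mark to the base character before it."""
    letters = []
    for ch in word:
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch      # vowel sign or pulli joins the previous letter
        else:
            letters.append(ch)     # a new base letter starts here
    return letters

def length_histogram(words):
    """Count how many words fall into each word-length bin."""
    return Counter(len(split_letters(w)) for w in words)

# Tiny made-up sample; the real analysis used the 63222-word corpus.
sample = ["ஆ", "தமிழ்", "அம்மா", "வணக்கம்"]
hist = length_histogram(sample)
for length in sorted(hist):
    print(length, hist[length])
```

Counting letters rather than Unicode code points matters here: தமிழ் is five code points but only three Tamil letters.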

A few observations:

  1. Most Tamil words are between 4 and 7 letters long: roughly 80% of the 63222 words.
  2. Tamil has a lot of 1-letter “words”: 123 to be precise, out of the 247 standard letters plus the Sanskrit/Grantha letters.
  3. The longest Tamil words in the UrbanTamil dictionary (3 of them) are 15 letters long: 1. புத்திரபௌத்திரபாரம்பரியம், 2. முப்பொழுதுந்திருமேனிதீண்டுவார், 3. ஒதுக்குப்பொதுக்குப்பண்ணுதல்
  4. The distribution looks like a power law, but it is not the same as the Zipf’s law distribution, which we could analyze in a future post.
  5. We can also compute letter probabilities from Tamil texts using the open-tamil library.
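As a rough sketch of the letter-probability idea in the last point, the snippet below counts Tamil letters in a text and divides by the total. It does not use open-tamil: the `letter_probabilities` function and the sample text are my own illustration, and letters are approximated as a base character from the Tamil Unicode block (U+0B80 to U+0BFF) plus any combining marks that follow it.

```python
import unicodedata
from collections import Counter

def letter_probabilities(text):
    """Relative frequency of each Tamil letter in `text` (hypothetical helper).
    Characters outside the Tamil block (spaces, punctuation) are skipped."""
    counts = Counter()
    cluster = ""
    for ch in text:
        if cluster and unicodedata.category(ch).startswith("M"):
            cluster += ch                  # vowel sign or pulli extends the letter
            continue
        if cluster:
            counts[cluster] += 1           # previous letter is complete
        cluster = ch if "\u0b80" <= ch <= "\u0bff" else ""
    if cluster:
        counts[cluster] += 1               # flush the final letter
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}

# Made-up sample text, only to show the call.
probs = letter_probabilities("அம்மா அப்பா")
for letter, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(letter, round(p, 3))
```

On a real corpus the same counts could be fed straight into the word-length analysis, since both rely on the same notion of a letter.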

Thanks for reading. Have a good week ahead.

Thank you and regards,
