UrbanTamil.com + open-tamil = Tamil Vocabulary

This week, following some gains in the open-tamil project and updates to urbantamil.com, I am writing about a word-frequency analysis of the Tamil vocabulary.

Using the histogram technique, binning the 63222 words of the UrbanTamil.com corpus by their length in Tamil letters, we see a chart like the one below.

Tamil word frequencies by word-length, from the UrbanTamil.com corpus of 63222 words.
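The binning behind such a chart can be sketched in a few lines of Python. This is only a minimal sketch, not the code used for the chart above: `split_letters` is a hypothetical stand-in that treats a base character plus its combining marks as one Tamil letter, where the open-tamil library offers its own letter-splitting utilities, and `sample_words` is a tiny made-up list rather than the UrbanTamil.com corpus.

```python
import unicodedata
from collections import Counter

def split_letters(word):
    """Assumed letter rule: a base character plus any combining marks after it."""
    letters = []
    for ch in word:
        # Vowel signs and the pulli are Unicode marks (category Mc or Mn),
        # so they attach to the preceding base character.
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters

def length_histogram(words):
    """Count how many words fall into each word-length bin (in letters)."""
    return Counter(len(split_letters(w)) for w in words)

# Hypothetical sample, standing in for the real word list.
sample_words = ["அம்மா", "தமிழ்", "பூ", "புத்திரபௌத்திரபாரம்பரியம்"]
hist = length_histogram(sample_words)
for length in sorted(hist):
    print(length, hist[length])
```

Counting by letters rather than by code points matters here, since a single Tamil letter such as மா is stored as two Unicode characters.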

A few observations:

  1. Most Tamil words are between 4 and 7 letters long, roughly 80% of the 63222 words.
  2. Tamil has a lot of 1-letter “words”: 123 to be precise, out of the 247 allowed letters plus the Sanskrit/Grantha letters.
  3. The three longest Tamil words in the UrbanTamil dictionary are each 15 letters long: 1. புத்திரபௌத்திரபாரம்பரியம், 2. முப்பொழுதுந்திருமேனிதீண்டுவார், 3. ஒதுக்குப்பொதுக்குப்பண்ணுதல்
  4. This is a power-law distribution, but not the same as the Zipf’s law distribution (which relates a word’s frequency to its rank); we could analyze that in a future post.
  5. We can also compute letter probabilities from Tamil texts using the open-tamil library.
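The letter-probability idea from the last observation can be sketched the same way. This is a hedged illustration and not open-tamil's API: `split_letters` and `letter_probabilities` are hypothetical helpers written in plain Python, and the two sample words are made up for the example.

```python
import unicodedata
from collections import Counter

def split_letters(word):
    """Assumed letter rule: a base character plus any combining marks after it."""
    letters = []
    for ch in word:
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters

def letter_probabilities(words):
    """Relative frequency of each letter across all the given words."""
    counts = Counter()
    for w in words:
        counts.update(split_letters(w))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {letter: n / total for letter, n in counts.items()}

# Hypothetical sample; a real run would read words from a Tamil text.
sample_words = ["அம்மா", "பாப்பா"]
probs = letter_probabilities(sample_words)
for letter, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(letter, round(p, 3))
```

With a real corpus, the same counts could feed the word-length bins above, since both rely on the same letter-splitting step.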

Thanks for reading. Have a good week ahead.

Thanks and regards (நன்றி வணக்கம்),
-முத்து
