UrbanTamil.com + open-tamil = Tamil Vocabulary

This week, following some interesting gains in the open-tamil project and updates to urbantamil.com, I am writing about a word-frequency analysis of the Tamil vocabulary.

Using the histogram technique, binning the Tamil words of the UrbanTamil.com corpus (63222 words) by their length in letters, we see a chart like the one below.

Tamil word frequencies by word-length, from the UrbanTamil.com corpus of 63222 words.
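
The binning step can be sketched in a few lines of Python. This is a minimal sketch, not the code used to produce the chart: `split_letters` is a hypothetical stand-in that attaches vowel signs and the pulli to the preceding base character (open-tamil's `tamil.utf8.get_letters` could be used in its place), and `sample_words` is a made-up toy list, not the UrbanTamil.com corpus.

```python
import unicodedata
from collections import Counter


def split_letters(word):
    """Approximate Tamil letter splitting: attach vowel signs and the
    pulli (Unicode categories Mn/Mc) to the preceding base character."""
    letters = []
    for ch in unicodedata.normalize("NFC", word):
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


def length_histogram(words):
    """Count how many words have each length, measured in letters."""
    return Counter(len(split_letters(word)) for word in words)


def share_in_range(hist, low, high):
    """Fraction of all words whose length lies in [low, high]."""
    total = sum(hist.values())
    in_range = sum(n for length, n in hist.items() if low <= length <= high)
    return in_range / total if total else 0.0


# Toy word list (an assumption for illustration, not the real corpus).
sample_words = ["அம்மா", "தமிழ்", "கடல்", "வணக்கம்", "புத்திரபௌத்திரபாரம்பரியம்"]
hist = length_histogram(sample_words)
for length in sorted(hist):
    print(length, "#" * hist[length])  # one '#' per word of that length
print("share of 4-7 letter words:", share_in_range(hist, 4, 7))
```

Run over the full word list, the same counter would give the bar heights of the chart, and `share_in_range(hist, 4, 7)` would give the share of words that are 4 to 7 letters long.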

A few observations:

  1. Most Tamil words are between 4 and 7 letters long; these make up roughly 80% of the 63222 words.
  2. Tamil has a lot of 1-letter “words”: 123 to be precise, out of the 247 Tamil letters plus the Sanskrit/Grantha letters.
  3. The longest Tamil words in the UrbanTamil dictionary (3 of them) are 15 letters long: 1. புத்திரபௌத்திரபாரம்பரியம், 2. முப்பொழுதுந்திருமேனிதீண்டுவார், 3. ஒதுக்குப்பொதுக்குப்பண்ணுதல்
  4. This is a power-law distribution, but not the same as the Zipf’s law distribution, which we could analyze in a future post.
  5. We can also compute the letter probabilities from Tamil texts using the open-tamil library.
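
As a rough illustration of the letter-probability idea, the sketch below counts letters in a short text and divides by the total. It is an assumption-laden sketch, not open-tamil's own routine: `split_letters` and `is_tamil` are hypothetical helpers written with the standard library only, and `sample_text` is a made-up example.

```python
import unicodedata
from collections import Counter


def split_letters(word):
    """Approximate Tamil letter splitting: attach vowel signs and the
    pulli (Unicode categories Mn/Mc) to the preceding base character."""
    letters = []
    for ch in unicodedata.normalize("NFC", word):
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


def is_tamil(letter):
    """True if the letter starts with a code point in the Tamil block."""
    return "\u0b80" <= letter[0] <= "\u0bff"


def letter_probabilities(text):
    """Relative frequency of each Tamil letter in the given text."""
    counts = Counter(
        letter
        for word in text.split()
        for letter in split_letters(word)
        if is_tamil(letter)  # skip punctuation, digits, Latin text
    )
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()} if total else {}


# Made-up sample text (an assumption for illustration).
sample_text = "அம்மா அப்பா தமிழ்"
probs = letter_probabilities(sample_text)
for letter, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(letter, round(p, 3))
```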

Thanks for reading. Have a good week ahead.

Thank you and greetings,
-Muthu

Tamil Language – Longest word and Lexicography

Hello traveler and Tamil language aficionado. Today I’m researching Tamil lexicography, and I’m sharing the results of my searches through this blog post. It is more on the research side than the demos or expository posts on this blog.

Longest Tamil Word

  1. Senthil Nathan of Arithi.com has blogged about using UTF-8 in Tamil text processing, something we like at the open-tamil project. Check out our Python code if you have not already.
  2. In this article he posits that the longest Tamil word has to be the proper noun “திருவாலவாயுடையார்திருவிலையாடற்புராணம்”. Any comments on that? I think if we looked at verbs or adjectives we may reach the proper answer. Let’s try to answer this question with the open-tamil tools (assuming you have installed them!) by typing the following at the Python shell:

     >>> import tamil
     >>> len(tamil.utf8.get_letters(u'திருவாலவாயுடையார்திருவிலையாடற்புராணம்'))
     20

  3. Now we realize this is only 20 letters long. By comparison, the English word ‘pneumonoultramicroscopicsilicovolcanoconiosis’, a disorder where the lungs are affected by silica particulate matter, measures a whopping 45 letters!

Update #2 – Longest Tamil word! (04/28/2014)

Since the original post I have found a possible candidate word (not a proper noun) that is 15 letters long: “புத்திரபௌத்திரபாரம்பரியம்”. Look up புத்திரபௌத்திரபாரம்பரியம். See also the words:

  1. புத்திரபௌத்திரபாரம்பரியம்
  2. முப்பொழுதுந்திருமேனிதீண்டுவார்
  3. ஒதுக்குப்பொதுக்குப்பண்ணுதல்
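
A search for the longest entries in a word list can be sketched as follows. This is only an illustrative sketch: `split_letters` is a hypothetical standard-library stand-in for open-tamil's `tamil.utf8.get_letters`, `longest_words` is a name made up for this example, and the short `words` list stands in for the full dictionary.

```python
import unicodedata


def split_letters(word):
    """Approximate Tamil letter splitting: attach vowel signs and the
    pulli (Unicode categories Mn/Mc) to the preceding base character."""
    letters = []
    for ch in unicodedata.normalize("NFC", word):
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


def longest_words(words):
    """Return (max_length, [all words tied at that length in letters])."""
    lengths = {word: len(split_letters(word)) for word in words}
    longest = max(lengths.values(), default=0)
    return longest, [word for word, n in lengths.items() if n == longest]


# Stand-in word list (an assumption; the real input would be the dictionary).
words = [
    "தமிழ்",
    "வணக்கம்",
    "புத்திரபௌத்திரபாரம்பரியம்",
    "முப்பொழுதுந்திருமேனிதீண்டுவார்",
    "ஒதுக்குப்பொதுக்குப்பண்ணுதல்",
]
max_len, winners = longest_words(words)
print(max_len, len(winners))
```

Counting letters rather than code points matters here: each of the three candidates has far more code points than letters, because every vowel sign and pulli is a separate code point.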

Lexicographic Order – Dictionary Order

In the English language, ‘AVOCADO’ comes after the word ‘APPLE’ in the dictionary because of the dictionary-order, or lexicographic, convention. It is often perplexing to me that sorting in the Tamil language is not well defined.

  1. Our 12 vowels, ‘அ,ஆ,இ,ஈ, – ஒ,ஓ,ஔ’, together with the aytham ‘ஃ’, are well ordered.
  2. But the 18 consonants, ‘க,ச,ட,த,ப,ற, … ஞ,ங,ண,ந,ம,ன’, are not, because there is more than one ordering in use. What is the norm here?
  3. So in combination, the 247 Tamil letters don’t have a canonical dictionary order.
  4. This lack of a lexicographic ordering convention makes dictionary ordering of Tamil words difficult. Clearly we could make a choice, but what is the norm?
  5. What is your solution? Share your comments and strategies.

Afterword (Update)

  1. It turns out it is not too hard to implement lexicographic ordering in Python/open-tamil.
  2. The code requires you to define a comparison function and use it with the sort() method. The comparison function knows the relative ordering of the Tamil letters, as you can see in this commit, which defines the functions:
    1. def compare_words_lexicographic( word_a, word_b ):
    2. def all_tamil( words ):
  3. It turns out all this is pretty neat stuff for Tamil text processing in the open-tamil/utf-8 package!
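
The comparison-function approach can be sketched with the standard library alone. This is not the open-tamil implementation: `compare_tamil_words`, `letter_key` and `split_letters` are hypothetical names for this sketch, and the ordering itself is an assumption (vowels, then aytham, then consonants in the common school order க ங ச ஞ ட ண த ந ப ம ய ர ல வ ழ ள ற ன, with the pulli form sorting before the vowel forms of each consonant).

```python
import unicodedata
from functools import cmp_to_key

# Assumed letter order; other conventions exist.
VOWELS = unicodedata.normalize("NFC", "அஆஇஈஉஊஎஏஐஒஓஔ")
AYTHAM = "ஃ"
CONSONANTS = "கஙசஞடணதநபமயரலவழளறன"
# Pulli first, then the bare consonant (inherent 'a'), then the vowel signs.
SIGNS = ["\u0bcd", "", "\u0bbe", "\u0bbf", "\u0bc0", "\u0bc1", "\u0bc2",
         "\u0bc6", "\u0bc7", "\u0bc8", "\u0bca", "\u0bcb", "\u0bcc"]

VOWEL_RANK = {ch: i for i, ch in enumerate(VOWELS)}
CONSONANT_RANK = {ch: i for i, ch in enumerate(CONSONANTS)}
SIGN_RANK = {sign: i for i, sign in enumerate(SIGNS)}


def split_letters(word):
    """Attach vowel signs and the pulli to the preceding base character."""
    letters = []
    for ch in unicodedata.normalize("NFC", word):
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


def letter_key(letter):
    """Map one letter to a sortable tuple under the assumed ordering."""
    base, signs = letter[0], letter[1:]
    if base in VOWEL_RANK:
        return (0, VOWEL_RANK[base], 0)
    if base == AYTHAM:
        return (1, 0, 0)
    if base in CONSONANT_RANK:
        return (2, CONSONANT_RANK[base], SIGN_RANK.get(signs, len(SIGNS)))
    return (3, ord(base), 0)  # Grantha letters and anything else go last


def compare_tamil_words(word_a, word_b):
    """Return -1, 0 or 1, comparing letter by letter."""
    key_a = [letter_key(letter) for letter in split_letters(word_a)]
    key_b = [letter_key(letter) for letter in split_letters(word_b)]
    return (key_a > key_b) - (key_a < key_b)


words = ["மலை", "பழம்", "அம்மா", "பவளம்", "ஆறு", "கடல்"]
words.sort(key=cmp_to_key(compare_tamil_words))
print(words)
```

A custom comparison is needed because plain code-point sorting does not follow this order: in Unicode ழ comes before வ, so the built-in `sorted` places பழம் before பவளம், while the comparison above places பவளம் first.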

-Muthu

Learn Tamil from Chrome

UrbanTamil.com - the Tamil social dictionary you already know

Learn Tamil from Chrome and explore your Tamil vocabulary. Learn the Tamil language as well. Tamil dialects, with contributions from various Tamil regions such as Madurai, Chennai and Sri Lanka, along with the classical language, all from the comfort of your browser.

1. To install the Chrome extension you need the files from the Google store; you can go to this website

UrbanTamil Google Chrome extension

Announcement

You may already be familiar with the Tamil social dictionary, UrbanTamil.com. In this blog post I am announcing the UrbanTamil.com Chrome extension.

This is an open-source Chrome extension.

1. To install the Chrome extension you need the files from the Google webstore here

2. If you are a developer, you can get the code for the published version of the extension for the web dictionary via  

Thank you and greetings,
-Muthu