Tamil Language – Longest word and Lexicography

Hello traveler and Tamil language aficionado. Today I’m researching Tamil lexicography, and I’m sharing what I found in this blog post. It leans more toward research than the demos or expository posts on this blog.

Longest Tamil Word

  1. Senthil Nathan of Arithi.com has blogged about using UTF-8 in Tamil text processing, something we like at the open-tamil project. Check out our Python code if you have not already.
  2. In this article he posits that the longest Tamil word has to be the proper noun “திருவாலவாயுடையார்திருவிலையாடற்புராணம்”. Any comments on that? I think if we looked at verbs or adjectives we might reach the proper answer. Let’s try to answer this question with the open-tamil tools (assuming you have installed them!) by typing the following at the Python shell.
>>> import tamil
>>> len(tamil.utf8.get_letters(u'திருவாலவாயுடையார்திருவிலையாடற்புராணம்'))

  3. Now we realize this is only 20 letters long. By comparison, the English word ‘pneumonoultramicroscopicsilicovolcanoconiosis’, a disorder in which the lungs are affected by silica particulate matter, measures a whopping 45 letters!
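
If open-tamil is not installed, the count can be roughly cross-checked with the standard library alone. The sketch below is not how `tamil.utf8.get_letters` works internally; it only assumes that every Tamil letter is one base character optionally followed by combining signs (vowel signs or the pulli), so counting the non-combining characters gives the letter count. The helper name `count_tamil_letters` is made up for this illustration, and Grantha conjuncts such as க்ஷ would be counted as two letters here.

```python
import unicodedata


def count_tamil_letters(word):
    """Approximate letter count: characters that are not combining marks.

    Assumption: each letter is one base character plus optional
    combining signs (Unicode general categories Mn/Mc).
    """
    word = unicodedata.normalize("NFC", word)
    return sum(1 for ch in word
               if not unicodedata.category(ch).startswith("M"))


longest = "திருவாலவாயுடையார்திருவிலையாடற்புராணம்"
print(count_tamil_letters(longest))  # letters: 20
print(len(longest))                  # code points, a larger number
```

The gap between the two numbers is why plain `len()` on a Tamil string does not give the letter count.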

 Update #2 – Longest Tamil word! (04/28/2014)

Since the original post I have found a possible candidate word (not a proper noun) which is 15 letters long, “புத்திரபௌத்திரபாரம்பரியம்”. Look up புத்திரபௌத்திரபாரம்பரியம். See also the words:

  1. புத்திரபௌத்திரபாரம்பரியம்
  2. முப்பொழுதுந்திருமேனிதீண்டுவார்
  3. ஒதுக்குப்பொதுக்குப்பண்ணுதல்

Lexicographic Order – Dictionary Order

In English, ‘AVOCADO’ comes after ‘APPLE’ in the dictionary because of the dictionary-order, or lexicographic, convention. It is often perplexing to me that sorting in Tamil is not well defined.

  1. Our 12 vowels, ‘அ, ஆ, இ, ஈ, … ஒ, ஓ, ஔ’ (with the āytam ‘ஃ’ listed after them), are well ordered.
  2. But the 18 consonants, ‘க, ச, ட, த, ப, ற, … ஞ, ங, ண, ந, ம, ன’, are not, because there is more than one ordering. What is the norm here?
  3. So, in combination, the 247 Tamil letters don’t have a canonical dictionary order.
  4. This lack of a lexicographic ordering convention makes dictionary ordering of Tamil words difficult. Clearly we could make a choice, but what is the norm?
  5. What is your language solution? Share your comments and strategies.
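
One concrete place where two orderings collide is Python's default string sort, which compares Unicode code points. The short sketch below takes the 18 consonants in the alphabet order usually taught (க, ங, ச, ஞ, … ற, ன; treating that as the reference is an assumption of this example) and sorts them by code point to show that the two sequences disagree.

```python
# The 18 consonants in the commonly taught alphabet order
# (assumed here as the reference order for this illustration).
TRADITIONAL = "கஙசஞடணதநபமயரலவழளறன"

# Default sorting compares Unicode code points.
by_code_point = "".join(sorted(TRADITIONAL))

print(TRADITIONAL)
print(by_code_point)
print(by_code_point == TRADITIONAL)  # False

# Positions where the two orders disagree.
mismatches = [(a, b) for a, b in zip(TRADITIONAL, by_code_point) if a != b]
print(mismatches)
```

In code-point order ன follows ந, ற follows ர, and ள precedes ழ, so a plain `sorted()` on Tamil words does not reproduce the taught alphabet order.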

Afterword (Update)

  1. Turns out it is not too hard to implement lexicographic ordering in Python/open-tamil.
  2. The code requires you to define a comparison function and use it with the sort() method. The comparison function knows the relative ordering of the Tamil letters, as you can see in this commit, which defines the functions:
    1. def compare_words_lexicographic( word_a, word_b ):
    2. def all_tamil( words ):
  3. Turns out all this is pretty neat stuff for Tamil text processing in the open-tamil/utf-8 package!
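
For readers without the commit at hand, here is a minimal standalone sketch of the same idea: a comparison function that knows a letter order, plugged into sort(). It is not the open-tamil implementation. The names `split_letters`, `letter_rank` and `compare_words` are invented for this example, and the order it encodes (vowels, then ஃ, then consonants; within a consonant the pulli form first, then the bare consonant, then the vowel signs) is one assumed convention among the possible choices. In Python 3, sort() takes a key rather than a comparison function, so `functools.cmp_to_key` bridges the two.

```python
import unicodedata
from functools import cmp_to_key

# Assumed alphabet order: 12 vowels, aytham, then 18 consonants.
BASES = "அஆஇஈஉஊஎஏஐஒஓஔ" + "ஃ" + "கஙசஞடணதநபமயரலவழளறன"
BASE_RANK = {ch: i for i, ch in enumerate(BASES)}

# Assumed order within one consonant: pulli, bare consonant,
# then the vowel signs aa, i, ii, u, uu, e, ee, ai, o, oo, au.
SIGNS = ["\u0bcd", "", "\u0bbe", "\u0bbf", "\u0bc0", "\u0bc1", "\u0bc2",
         "\u0bc6", "\u0bc7", "\u0bc8", "\u0bca", "\u0bcb", "\u0bcc"]
SIGN_RANK = {s: i for i, s in enumerate(SIGNS)}


def split_letters(word):
    """Group each base character with the combining signs that follow it."""
    letters = []
    for ch in unicodedata.normalize("NFC", word):
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


def letter_rank(letter):
    """Rank a letter as (base position, sign position); unknowns sort last."""
    base, sign = letter[0], letter[1:]
    return (BASE_RANK.get(base, len(BASES) + ord(base)),
            SIGN_RANK.get(sign, len(SIGNS)))


def compare_words(word_a, word_b):
    """Return -1, 0 or 1, comparing the words letter by letter."""
    ranks_a = [letter_rank(l) for l in split_letters(word_a)]
    ranks_b = [letter_rank(l) for l in split_letters(word_b)]
    return (ranks_a > ranks_b) - (ranks_a < ranks_b)


words = ["வாளி", "மரம்", "வாழை", "அம்மா", "கடல்"]
words.sort(key=cmp_to_key(compare_words))
print(words)  # ['அம்மா', 'கடல்', 'மரம்', 'வாழை', 'வாளி']
```

Changing the convention only means editing the `BASES` and `SIGNS` tables; the comparison logic stays the same.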


