Ezhil Language Conference Articles

I’m happy to report that our collaboration has produced three conference submissions for INFITT-2014 in Puducherry, India. While the articles are under review I will share only the titles; pending acceptance, the full papers will be posted soon.

Ezhil-abstracts-2014

1. Learning Ezhil Language via Web

    A technical article on progress made in the Ezhil language and EzhilLang.org

2. Open-Tamil text processing tools

    A report on efforts in the Open-Tamil Python library for Tamil text processing

3. தமிழில் எப்படி நிரல் எழுதுவது ? – எழில் இணைய கருத்துக்கணிப்பு (How to write programs in Tamil? An Ezhil web survey)

    Survey results from a poll conducted on Tamil programming language keywords

We are keeping our fingers crossed, and we will see the outcomes by the end of July 2014.

Thanks to my collaborators @nchokkan, @msathia and @tshrinivasan.

2014 INFITT Tamil Internet Conference hosted in Puducherry

I am very excited to read that INFITT is holding another Tamil Internet Conference in 2014, hosted in Puducherry, India, on Sep 19-21, 2014. Please read the call for papers and consider publishing your latest and greatest Tamil software work. If you choose to submit, hurry! The deadline is June 30th.


Notably, this year the conference has chosen to:

  1. Focus on open-source Tamil applications and software
  2. Focus on mobile platforms: Android, iOS, and Windows 8 software
  3. Allow only in-person presentations at the conference
  4. Set the cost of registration at $40 on or before June 30th, and $50 later
  5. Also, consider becoming an INFITT member.

-Muthu

Read the whole press-release here

UrbanTamil.com + open-tamil = Tamil Vocabulary

This week, following some interesting gains in the open-tamil project and updates to urbantamil.com, I am writing about a word-frequency analysis of the Tamil vocabulary.

Using the histogram technique, binning Tamil words by their length in letters, the UrbanTamil.com corpus of 63222 words gives a chart like the one below.


Tamil word frequencies by word-length, from the UrbanTamil.com corpus of 63222 words.
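The binning step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: `sample_words` is a tiny hypothetical list standing in for the UrbanTamil.com corpus, and `split_letters` is a simplified stand-in for `tamil.utf8.get_letters` from open-tamil. It only attaches vowel signs and the pulli to the preceding character and does not treat Grantha conjuncts specially.

```python
from collections import Counter


def is_sign(ch):
    # Tamil vowel signs and pulli (U+0BBE..U+0BCD) plus the au length mark (U+0BD7).
    return "\u0bbe" <= ch <= "\u0bcd" or ch == "\u0bd7"


def split_letters(word):
    # Simplified stand-in for tamil.utf8.get_letters (an assumption of this
    # sketch): every sign is glued to the letter that precedes it.
    letters = []
    for ch in word:
        if letters and is_sign(ch):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


def length_histogram(words):
    # Map word length (in Tamil letters) to the number of words of that length.
    return Counter(len(split_letters(word)) for word in words)


# Hypothetical sample; the corpus in the post has 63222 words.
sample_words = ["அம்மா", "தமிழ்", "மரம்", "பூ", "வணக்கம்"]
histogram = length_histogram(sample_words)

# Print a crude text chart, one row per word length.
for length in sorted(histogram):
    print(length, "#" * histogram[length])
```

With a real corpus, the same `histogram` could be handed to a plotting library instead of being printed as text.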

You can observe a few things:

  1. Most Tamil words are between 4 and 7 letters long, roughly 80% of the 63222
  2. Tamil has a lot of 1-letter “words”: 123 to be precise, out of the 247 Tamil letters plus the Sanskrit/Grantha letters
  3. The longest Tamil words in the UrbanTamil dictionary (3 of them) are 15 letters long: 1. புத்திரபௌத்திரபாரம்பரியம், 2. முப்பொழுதுந்திருமேனிதீண்டுவார், 3. ஒதுக்குப்பொதுக்குப்பண்ணுதல்
  4. This looks like a power-law distribution, but it is not the same as the Zipf’s law distribution, which we could analyze in a future post.
  5. We can also compute the letter probabilities from Tamil texts using the open-tamil library
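The letter probabilities mentioned in the last point can be estimated by counting letters and dividing by the total. The sketch below assumes a hypothetical three-word sample rather than real Tamil text, and the helper `split_letters` is again a simplified stand-in for `tamil.utf8.get_letters`, not the library's own code.

```python
from collections import Counter


def split_letters(word):
    # Simplified stand-in for tamil.utf8.get_letters (an assumption of this
    # sketch): vowel signs and the pulli join the preceding character.
    letters = []
    for ch in word:
        is_sign = "\u0bbe" <= ch <= "\u0bcd" or ch == "\u0bd7"
        if letters and is_sign:
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


def letter_probabilities(words):
    # Relative frequency of each Tamil letter over all words.
    counts = Counter(letter for word in words for letter in split_letters(word))
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}


# Hypothetical sample text, far too small for meaningful statistics.
sample_words = ["அம்மா", "மரம்", "தமிழ்"]
probabilities = letter_probabilities(sample_words)

# List letters from most to least frequent.
for letter, p in sorted(probabilities.items(), key=lambda item: -item[1]):
    print(letter, round(p, 3))
```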

Thanks for reading. Have a good week ahead.

Thank you and greetings,
-Muthu

Tamil Language – Longest word and Lexicography

Hello traveler and Tamil language aficionado. Today I’m researching Tamil lexicography, and I’m sharing the results of my searches through this blog post. It is more on the research side than the demos or expository posts of the past.

Longest Tamil Word

  1. Senthil Nathan of Arithi.com has blogged about using UTF-8 in Tamil text processing, something we like at the open-tamil project. Check out our Python code if you have not already.
  2. In this article he posits that the longest Tamil word has to be the proper noun “திருவாலவாயுடையார்திருவிலையாடற்புராணம்“. Any comments on that? I think if we looked at verbs or adjectives we might reach the proper answer. Let’s try to answer this question with the open-tamil tools (assuming you have installed them!) by typing the code below at the Python shell.
    >>> import tamil
    >>> len(tamil.utf8.get_letters(u'திருவாலவாயுடையார்திருவிலையாடற்புராணம்'))
    20

  3. Now we see this is only 20 letters long. By comparison the English word ‘pneumonoultramicroscopicsilicovolcanoconiosis‘, a disorder in which the lungs are affected by silica particulate matter, measures a whopping 45 letters!

 Update #2 – Longest Tamil word! (04/28/2014)

Since the original post I have found a possible candidate word (not a proper noun) which is 15 letters long, “புத்திரபௌத்திரபாரம்பரியம்”. See also the words:

  1. புத்திரபௌத்திரபாரம்பரியம்
  2. முப்பொழுதுந்திருமேனிதீண்டுவார்
  3. ஒதுக்குப்பொதுக்குப்பண்ணுதல்
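A search for such candidates can be sketched as follows. This is a rough sketch, not the procedure used for the post: the word list is a hypothetical four-word sample, and `letter_count` approximates `len(tamil.utf8.get_letters(word))` by counting every character that is not a vowel sign or pulli.

```python
def is_sign(ch):
    # Tamil vowel signs and pulli (U+0BBE..U+0BCD) plus the au length mark (U+0BD7).
    return "\u0bbe" <= ch <= "\u0bcd" or ch == "\u0bd7"


def letter_count(word):
    # Approximate number of Tamil letters: signs do not start a new letter.
    return sum(1 for ch in word if not is_sign(ch))


def longest_words(words):
    # Return the maximum letter count and every word that reaches it.
    longest = max(letter_count(word) for word in words)
    return longest, [word for word in words if letter_count(word) == longest]


# Hypothetical word list: the three candidates above plus one short word.
candidates = [
    "புத்திரபௌத்திரபாரம்பரியம்",
    "முப்பொழுதுந்திருமேனிதீண்டுவார்",
    "ஒதுக்குப்பொதுக்குப்பண்ணுதல்",
    "வணக்கம்",
]
length, winners = longest_words(candidates)
print(length, len(winners))
```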

Lexicographic Order – Dictionary Order

In the English language, ‘AVOCADO’ comes after ‘APPLE’ in the dictionary because of the dictionary-order, or lexicographic, convention. It is often perplexing to me that sorting in the Tamil language is not as well defined.

  1. Our 12 vowels, ‘அ, ஆ, இ, ஈ, … ஒ, ஓ, ஔ’, followed by the āytham ‘ஃ’, are well ordered.
  2. But the 18 consonants, ‘க, ச, ட, த, ப, ற, … ஞ, ங, ண, ந, ம, ன’, are not, because there is more than one ordering in use. What is the norm here?
  3. So, in combination, the 247 Tamil letters don’t have a canonical dictionary order.
  4. This lack of a lexicographic ordering convention makes dictionary ordering of Tamil words difficult. Clearly we could make a choice, but what is the norm?
  5. What is your language’s solution? Share your comments and strategies.
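One concrete source of competing orderings shows up in software: Python sorts strings by Unicode code point, and the Tamil block does not place the consonants in the alphabet order usually taught. The sketch below assumes that taught order (`TRADITIONAL` is this example's assumption, not a standard cited in the post) and compares the two.

```python
# The 18 consonants in the order commonly taught (an assumption of this sketch).
TRADITIONAL = "கஙசஞடணதநபமயரலவழளறன"

# Default string sorting in Python compares Unicode code points.
by_code_point = sorted(TRADITIONAL)

# Sorting by position in TRADITIONAL reproduces the taught order.
by_tradition = sorted(TRADITIONAL, key=TRADITIONAL.index)

print("code point :", " ".join(by_code_point))
print("traditional:", " ".join(by_tradition))
print("same order?", by_code_point == by_tradition)

# Show the code points so the difference can be inspected directly.
for letter in by_code_point:
    print(letter, hex(ord(letter)))
```

In the code-point order the letter ன lands before ப, while the taught order places ன last, so a plain `sorted()` call on Tamil words will not match a printed dictionary.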

Afterword (Update)

  1. It turns out it is not too hard to implement lexicographic ordering in Python/Open-Tamil
  2. The code requires you to define a comparison function and use it with the sort() method. The comparison function knows the relative ordering of the letters in the Tamil alphabet, as you can see in this commit, which defines the functions
    1. def compare_words_lexicographic( word_a, word_b ):
    2. def all_tamil( words ):
  3. It turns out all this is pretty neat stuff for Tamil text processing in the open-tamil/utf-8 package!
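The comparison-function approach can be sketched as below. This is not the open-tamil implementation of `compare_words_lexicographic`: the names `compare_words_sketch` and `letter_keys` and both ordering tables are hypothetical, and the rule that the pulli form of a consonant sorts before its vowel forms is an assumption of this example. Note that in Python 3, `sort()` and `sorted()` no longer accept a comparison function directly, so it is wrapped with `functools.cmp_to_key`.

```python
import functools
import unicodedata

# Assumed alphabet order: 12 vowels, aytham, then 18 consonants.
# NFC normalisation guards against a decomposed form of the vowel au.
BASES = unicodedata.normalize("NFC", "அஆஇஈஉஊஎஏஐஒஓஔஃகஙசஞடணதநபமயரலவழளறன")

# Assumed order within one consonant: pulli, no sign (inherent a), then the
# vowel signs aa, i, ii, u, uu, e, ee, ai, o, oo, au.
SIGNS = ["\u0bcd", "", "\u0bbe", "\u0bbf", "\u0bc0", "\u0bc1", "\u0bc2",
         "\u0bc6", "\u0bc7", "\u0bc8", "\u0bca", "\u0bcb", "\u0bcc"]

BASE_RANK = {ch: rank for rank, ch in enumerate(BASES)}
SIGN_RANK = {sign: rank for rank, sign in enumerate(SIGNS)}


def letter_keys(word):
    # Turn a word into a list of (base rank, sign rank) pairs, one per letter.
    keys = []
    for ch in unicodedata.normalize("NFC", word):
        if ch in SIGN_RANK and keys:
            base_rank, _ = keys[-1]
            keys[-1] = (base_rank, SIGN_RANK[ch])
        else:
            # Characters outside the table (e.g. Grantha) sort after all others.
            keys.append((BASE_RANK.get(ch, len(BASES)), SIGN_RANK[""]))
    return keys


def compare_words_sketch(word_a, word_b):
    # Classic three-way comparison: negative, zero, or positive.
    key_a, key_b = letter_keys(word_a), letter_keys(word_b)
    return (key_a > key_b) - (key_a < key_b)


words = ["வனம்", "வம்பு", "அகம்", "அக்கா"]
result = sorted(words, key=functools.cmp_to_key(compare_words_sketch))
print(result)
```

Since the comparison only compares the per-letter keys, passing `key=letter_keys` to `sorted()` would give the same order with less code; the comparison function is kept here to mirror the approach described above.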

-Muthu

Learn Tamil from Chrome

UrbanTamil.com - the Tamil social dictionary you may already know

Learn Tamil from Chrome and explore your Tamil vocabulary. Contributions from many regional varieties of Tamil, including Madurai, Chennai, and Sri Lanka, as well as the classical language, are all available from the comfort of your browser.

1. To install the Chrome extension you need the files from the Google web store; you can go to this link

UrbanTamil Google Chrome extension


அறிவிப்பு (Announcement)

You may already be familiar with the Tamil social dictionary, UrbanTamil.com. In this blog post I am announcing the UrbanTamil.com Chrome extension.

This is an open-source Chrome extension.

1. To install the Chrome extension you need the files from Google webstore here

2. If you are a developer, you can get the code for the published version of the web dictionary extension via

Thank you and greetings,
-Muthu


Do you know Tamil? http://urbantamil.com/jumble

Learn Tamil words. To build your vocabulary, you can use http://urbantamil.com/jumble.


Can you guess the jumbled word in this picture?