Tamil Web 2.0: The Second Step for the Tamil Internet

The first step toward using Tamil on the internet and on computers is complete. The problems with fonts, text layout, and display have been solved.

What will the second step for the Tamil internet, Tamil Web 2.0, look like?

  1. Advanced text applications,
  2. Audio / video applications,
  3. Advanced smartphone applications?

Can we build this software? Can we reach the next level?

60 million Tamil people, and more, are waiting to use this software.

Long live Tamil.

Open-Tamil: Text processing (text analysis)

Extracting words from a document

In this short blog post, I want to show you how to compute the word frequencies of a webpage using

  1. Open-Tamil,
  2. BeautifulSoup

in Python.

Install Open-Tamil package

Open-Tamil provides tools for

  1. parsing and analyzing text in UTF-8 format, in the 'tamil.utf8' package,
  2. elementary 'ngram' modeling in Tamil,
  3. format conversion from TSCII to Unicode, in the 'tamil.tscii' package,
  4. transliteration, under the 'transliterate' package,

but before you can get started you have to install the library.

You can try the easy, recommended method,

    $ pip install open-tamil

or get the latest version from the GitHub repository with

    $ git clone https://github.com/arcturusannamalai/open-tamil.git
    $ cd open-tamil/ && sudo python setup.py install

Getting contents of web page

The first piece uses the BeautifulSoup library, bs4, to parse the contents of a URL (webpage) downloaded with urlopen from the standard Python library (urllib.request in Python 3, urllib2 in Python 2).

    tapage = bs4.BeautifulSoup(urlopen(url), "html.parser")
    tatext = tapage.body.text  # e.g. "உடனே உடனே …"

Now the contents of the webpage <BODY> tag are stored in the variable tatext.
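
To see what this step produces without touching the network, here is a minimal sketch that pulls the body text out of an HTML string using only the standard library's html.parser. It is a stand-in for BeautifulSoup for illustration, not a replacement for it: the class name BodyTextExtractor, the helper body_text, and the sample page are all made up here, and skipping the contents of <script> and <style> is a choice of this sketch.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect text found inside <body>, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is inside <body> and outside script/style.
        if self.in_body and not self.skip_depth:
            self.chunks.append(data)


def body_text(html):
    """Return the concatenated text of the <body> element (hypothetical helper)."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# A made-up page standing in for a downloaded Tamil webpage.
sample = (
    "<html><head><title>t</title></head>"
    "<body><p>உடனே உடனே</p><script>var x;</script></body></html>"
)
print(body_text(sample))
```

For real pages BeautifulSoup is the more forgiving choice, since it copes with malformed markup that a hand-written parser like this one does not try to handle.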

Parsing Tamil text

This is where the open-tamil library really shines: you can pull out the letters of a Tamil string, stored in the multi-byte UTF-8 encoding, in the right order. That is, you can write programs at the level of Tamil letters instead of worrying about byte ordering, uyirmei grouping, and so on.

Get the Tamil letters from the text using the 'get_letters' API,

    taletters = tamil.utf8.get_letters(tatext)  # ["உ", "ட", "னே", "உ", "ட", "னே", …]

which returns a list of letters; there is also a version that works as an iterator, for large text corpora.
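
To illustrate why letter-level access matters, here is a simplified sketch using only the standard library. It is not open-tamil's algorithm, which handles more cases; the function name group_letters is made up for this example, and the only assumption is that Tamil vowel signs and the pulli carry a Unicode "mark" category.

```python
import unicodedata


def group_letters(text):
    """Group each base character with the combining marks that follow it."""
    letters = []
    for ch in text:
        # Unicode categories Mn/Mc cover Tamil vowel signs and the pulli.
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


word = "உடனே"
print(len(word))            # 4 code points, although the word has 3 letters
print(group_letters(word))  # ['உ', 'ட', 'னே']
```

Plain Python indexing works on code points, so word[3] would return a bare vowel sign; grouping first lets the rest of the program work with whole letters.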

Now you can use the 'get_tamil_words' or 'get_words' API call from the library to get Tamil-only words, or all words, from the letters array taletters,

    tamil.utf8.get_tamil_words(taletters)  # ["உடனே", "உடனே", …]

which returns a list of words formed by appropriate grouping of the letters.
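
For a rough idea of what "Tamil-only words" means, the following sketch approximates the filtering with a regular expression over the Tamil Unicode block. This is an assumption-laden approximation, not how the library does it: the helper name tamil_only_words is hypothetical, and the block range also contains Tamil digits and symbols, which this sketch does not separate out.

```python
import re

# Runs of characters from the Tamil Unicode block (U+0B80 to U+0BFF).
TAMIL_RUN = re.compile(r"[\u0B80-\u0BFF]+")


def tamil_only_words(text):
    """Return the runs of Tamil characters in text, in order of appearance."""
    return TAMIL_RUN.findall(text)


print(tamil_only_words("உடனே hello ஒரு கதை, 123"))  # ['உடனே', 'ஒரு', 'கதை']
```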

Word Frequency Analysis

We can use the Python built-in dictionary type 'dict' to build the word-frequency table, using the Tamil word as the key and its count (frequency) as the value. The code goes like,

    # Tamil words only
    frequency = {}
    for pos, word in enumerate(tamil.utf8.get_tamil_words(taletters)):
        print(pos, word)
        frequency[word] = 1 + frequency.get(word, 0)

Finally, an advanced maneuver sorts the dictionary entries by frequency and prints the words from least frequent to most frequent,

    # sort words by ascending order of occurrence
    for word, count in sorted(frequency.items(), key=operator.itemgetter(1)):
        print(word, ":", count)
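
As an alternative to the hand-rolled dictionary, the standard library's collections.Counter does the counting and the sorting in one go. The sketch below is a variation on the code above, not part of the original script; the word list is made-up sample data and top_words is a hypothetical helper name.

```python
from collections import Counter

# Made-up sample standing in for the output of get_tamil_words.
words = ["ஒரு", "கதை", "ஒரு", "உடனே", "ஒரு", "கதை"]
frequency = Counter(words)


def top_words(counter, n=10):
    """Return the n most frequent (word, count) pairs, most frequent first."""
    return counter.most_common(n)


for word, count in top_words(frequency, 2):
    print(word, ":", count)
```

Note that most_common lists the most frequent word first, the reverse of the ascending order used above.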

Results

Running this code on a Tamil webpage (an Ashokamitran interview at The Hindu/Tamil) yields the word-frequency list (showing only the 10 most frequent words):

தொடர்ந்து : 5
என்று : 5
கருத்து : 6
கவிதை : 6
இல்லை : 7
இருக்கும் : 7
கதை : 7
உடனே : 12
ஆனால் : 12
ஒரு : 16

Putting it Together

    import operator
    from urllib.request import urlopen

    import bs4    # BeautifulSoup web scraper
    import tamil  # open-tamil library


    def print_tamil_words(tatext):
        taletters = tamil.utf8.get_letters(tatext)
        # Tamil words only
        frequency = {}
        for pos, word in enumerate(tamil.utf8.get_tamil_words(taletters)):
            print(pos, word)
            frequency[word] = 1 + frequency.get(word, 0)
        # sort words by ascending order of occurrence
        for word, count in sorted(frequency.items(), key=operator.itemgetter(1)):
            print(word, ":", count)


    def demo_tamil_text_filter():
        url = "http://ta.wikipedia.org"
        tapage = bs4.BeautifulSoup(urlopen(url), "html.parser")
        tatext = tapage.body.text
        print_tamil_words(tatext)


    if __name__ == "__main__":
        demo_tamil_text_filter()

Download the code from tafilter.py

Learning Tamil + Vocabulary on Internet

Learning Tamil + Internet

Tamil is a classical language, yet many kids cannot read or write it, and the language is held back by standards and vocabulary written by old men in the pre-computer, digital stone age. It is time for a change. We need a fun, modern way to learn the Tamil language.

Word Puzzles

Educational games in English include puzzles like Scrabble, jumbled words, matching, and word-building games, both for passing leisure time and for particular learning objectives. There are many Tamil applications for Android and iPhone with quizzes and hangman-like games.

UrbanTamil Vocabulary

At the UrbanTamil project we believe there are modern approaches to learning the Tamil language and vocabulary. Keeping classical Tamil alive is very important, but so is keeping the popular language alive. Capturing usage and educating users is possible via the Internet at a low total cost.

  1. Word of the day: via Twitter @urbantamil1 we publish a word and its meaning every day, so that you can learn the vocabulary easily
  2. Jumbled words: via http://urbantamil.com/jumble
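
How the jumble page is built is not described here, so the following is only a hypothetical sketch of how a jumbled-word puzzle could be generated for Tamil. The point it illustrates is that shuffling has to happen at the letter level, so that vowel signs stay attached to their consonants. The function names split_letters and jumble, and the seed, are assumptions of this example.

```python
import random
import unicodedata


def split_letters(word):
    """Split a word into letters, keeping combining marks with their base."""
    letters = []
    for ch in word:
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


def jumble(word, rng=random):
    """Return the word with its letters (not code points) shuffled."""
    letters = split_letters(word)
    rng.shuffle(letters)
    return "".join(letters)


rng = random.Random(2014)  # fixed seed so the puzzle is repeatable
print(jumble("தமிழ்", rng))
```

Shuffling the raw string instead could leave a vowel sign or pulli at the start of the word, which does not render as valid Tamil.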

Hope you have fun with these features, and leave your feedback in comments.

Urbantamil.com – Social Tamil dictionary – web app

Today I'm setting up a new project and writing this blog post to announce urbantamil.com, which is now live. UrbanTamil is like urbandictionary.com for Tamil: it provides a user-centered, social dictionary-building experience, with free reuse of the content.

Usage

  1. You can look up words in the dictionary, using the onscreen keyboard to enter the word; the results of the lookup are then displayed.
  2. You have options for downloading or defining a word: you can download the definition as a text file, or define the word yourself.
  3. Users can add words as they are used in regional dialects (Chennai, Kovai, Madurai, Thanjai etc.); users can tag them with period of usage and parts of speech (peyar sol, vinai sol etc.).
  4. All entries will be contributed back to Tamil Wiktionary under CC SA 3.0.
  5. You can define the word, and add custom information.
  6. A success screen follows a correct upload to the website.
  7. Moreover, users can correct (edit) existing words for spelling mistakes and usage, and define new words.

Goals of the project

  1. Language / vernacular use
  2. Censorship-free, contemporary, natural Tamil: from ஜிகிர்தண்டா (Jigirthanda) to கலாசல் (Kalasal)
  3. Language from Ceylon, Singapore, Canada: Tamil + Punjabi, Tamil + English, and other variants
  4. Build an open vocabulary while having fun.

Conclusion

This project will be fun, and I hope a user community can engage with and grow out of the website urbantamil.com. I always welcome feedback, so let me know what you think of it via email or comments on the blog.

- Muthu Annamalai

Ezhil – Write Code in Tamil! – part 1

With Ezhil, a Tamil programming language, you can now write computer programs in Tamil!

This video explains the concepts of the Ezhil language, and how you can program via the http://ezhillang.org website. Try programs at http://ezhillang.org/ezhil_eval.html