Open-Tamil : Text processing (உரை பகுப்பாய்வு)

Extracting words from a document (ஆவணத்தில் பிரித்தெடுக்கப்பட்ட வார்த்தைகள்)

In this short blog post, I want to show you how to find the word frequencies of a webpage in Python, using two libraries:

  1. Open-Tamil,
  2. BeautifulSoup.

Install Open-Tamil package

Open-Tamil provides tools to:

  1. parse and analyze text in UTF-8 format, in the 'tamil.utf8' package,
  2. build elementary n-gram models of Tamil text, in the 'ngram' package,
  3. convert TSCII to Unicode, in the 'tamil.tscii' package,
  4. transliterate text, in the 'transliterate' package.

Before you can get started, you have to install the library.

The easy, recommended method is to install it with pip:

  1. $ pip install open-tamil

Alternatively, get the latest version from the GitHub repository:

  1. $ git clone https://github.com/arcturusannamalai/open-tamil.git
  2. $ cd open-tamil/ && sudo python setup.py install

Getting contents of web page

The first piece uses the BeautifulSoup library (bs4) to parse the contents of a URL (webpage), downloaded with urlopen from the standard Python 2 library urllib2.

import bs4
from urllib2 import urlopen
tapage = bs4.BeautifulSoup(urlopen(url))
tatext = tapage.body.text  # e.g. u"உடனே உடனே …"

Now the text inside the webpage's <BODY> tag is stored in the variable tatext.
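
For readers who want to try this step without installing anything or touching the network, here is a minimal sketch in Python 3 (unlike the Python 2 snippets in this post). It uses the standard library's html.parser in place of BeautifulSoup, and a hard-coded HTML string in place of a downloaded page. The names BodyTextExtractor and body_text and the sample markup are hypothetical, and the result only approximates what tapage.body.text returns.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect text found inside <body>, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is inside <body> and outside script/style
        if self.in_body and self.skip_depth == 0:
            self.chunks.append(data)


def body_text(html):
    """Return the concatenated text of the <body> element (hypothetical helper)."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Made-up sample page standing in for a real download
sample_html = (
    "<html><head><title>t</title></head>"
    "<body><p>உடனே உடனே</p><script>var x;</script></body></html>"
)
tatext = body_text(sample_html)
print(tatext)
```

The extractor ignores the page title and the script contents, so only the paragraph text ends up in tatext.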

Parsing Tamil text

This is where the Open-Tamil library really shines: you can pull the letters out of a Tamil string in the right order, even though UTF-8 is a multi-byte encoding and a single Tamil letter may span several code points. In other words, you can write programs at the level of Tamil letters instead of worrying about byte ordering, uyirmei grouping, and so on.

Get the Tamil letters from the text using the 'get_letters' API:

taletters = tamil.utf8.get_letters(tatext)  # [u"உ", u"ட", u"னே", u"உ", u"ட", u"னே", …]

This returns a list of letters; there is also a version that works as an iterator, for large text corpora.
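
To make the idea of letter-level access concrete, here is a simplified Python 3 sketch of the kind of grouping involved: each combining sign (a vowel sign or the pulli) is attached to the character before it. The helper name group_letters is hypothetical. This is an illustration built on the standard library only, not Open-Tamil's implementation, and get_letters may handle cases this sketch does not attempt.

```python
import unicodedata


def group_letters(text):
    """Attach combining marks (Unicode categories Mn/Mc) to the preceding character."""
    letters = []
    for ch in text:
        if letters and unicodedata.category(ch) in ("Mn", "Mc"):
            letters[-1] += ch  # vowel sign or pulli joins the previous letter
        else:
            letters.append(ch)  # a new base character starts a new letter
    return letters


word = "உடனே"
print(len(word))            # number of Unicode code points in the word
print(group_letters(word))  # one list entry per written letter
```

The gap between the code-point count and the length of the grouped list is exactly what makes naive indexing into Tamil strings error-prone.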

Now you can use the library's 'get_tamil_words' API call to get Tamil-only words, or 'get_words' to get all words, from the letters array taletters:

tamil.utf8.get_tamil_words(taletters)  # [u"உடனே", u"உடனே", …]

This returns a list of words formed by appropriate grouping of the letters.
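
As a rough illustration of what "Tamil-only words" means, the Python 3 sketch below pulls out runs of characters from the Unicode Tamil block (U+0B80 to U+0BFF) with a regular expression. The function name tamil_words_sketch is hypothetical, and this is a stand-in for get_tamil_words rather than a description of how it works; the library's rules for word boundaries may differ (for example, this pattern also accepts Tamil digits and symbols from the same block).

```python
import re

# One or more consecutive characters from the Unicode Tamil block
TAMIL_RUN = re.compile("[\u0b80-\u0bff]+")


def tamil_words_sketch(text):
    """Return runs of Tamil-block characters, dropping everything else."""
    return TAMIL_RUN.findall(text)


# Made-up sample mixing Tamil words, an English word, punctuation and digits
print(tamil_words_sketch("உடனே hello ஒரு, 2014"))
```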

Word Frequency Analysis

We can use Python's built-in dictionary type 'dict' to build the word-frequency table, with the Tamil word as the key and its count (frequency) as the value. The code goes like this:

frequency = {}
# Tamil words only
for pos, word in enumerate(tamil.utf8.get_tamil_words(taletters)):
    print pos, word
    frequency[word] = 1 + frequency.get(word, 0)

Finally, we sort the dictionary entries by frequency and print the words from least frequent to most frequent:

import operator
# sort words in ascending order of occurrence
for l in sorted(frequency.iteritems(), key=operator.itemgetter(1)):
    print l[0], ':', l[1]
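
The same counting and sorting can also be sketched in Python 3 with collections.Counter from the standard library. The word list below is a made-up sample rather than output from a real page, and word_frequency is a hypothetical helper name.

```python
from collections import Counter


def word_frequency(words):
    """Map each word to the number of times it occurs."""
    return Counter(words)


# Made-up sample standing in for the output of get_tamil_words
sample_words = ["ஒரு", "உடனே", "ஒரு", "கதை", "ஒரு", "உடனே"]
frequency = word_frequency(sample_words)

# Least frequent first, the same ordering this post uses
for word, count in sorted(frequency.items(), key=lambda item: item[1]):
    print(word, ":", count)

# Counter can also return the most frequent entries directly
top_two = frequency.most_common(2)
```

Counter is a dict subclass, so the sorting idiom from the dictionary version carries over unchanged, while most_common saves the manual sort when only the top entries matter.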

Results

Running this code on a Tamil webpage (Ashokamitran interview @ The Hindu/Tamil) yields the following word-frequency list (showing only the 10 most frequent words):

தொடர்ந்து : 5
என்று : 5
கருத்து : 6
கவிதை : 6
இல்லை : 7
இருக்கும் : 7
கதை : 7
உடனே : 12
ஆனால் : 12
ஒரு : 16

Putting it Together

import operator
from urllib2 import urlopen  # Python 2 standard library
import bs4  # BeautifulSoup web scraper
import tamil  # open-tamil library

def print_tamil_words(tatext):
    taletters = tamil.utf8.get_letters(tatext)
    # Tamil words only
    frequency = {}
    for pos, word in enumerate(tamil.utf8.get_tamil_words(taletters)):
        print pos, word
        frequency[word] = 1 + frequency.get(word, 0)
    # sort words in ascending order of occurrence
    for l in sorted(frequency.iteritems(), key=operator.itemgetter(1)):
        print l[0], ':', l[1]

def demo_tamil_text_filter():
    url = u"http://ta.wikipedia.org"
    # download the page and keep only the text inside <BODY>
    tapage = bs4.BeautifulSoup(urlopen(url))
    tatext = tapage.body.text
    print_tamil_words(tatext)

if __name__ == "__main__":
    demo_tamil_text_filter()

Download the code from tafilter.py