ஆவணத்தில் பிரித்தெடுக்கப்பட்ட வார்த்தைகள் –
Extracting words from a document
In this short blog post, I want to show you how to compute the word frequency of a Tamil webpage using the Open-Tamil library in Python.
Install Open-Tamil package
Open-Tamil provides tools to:
- parse and analyze UTF-8 text, in the 'tamil.utf8' package,
- build elementary 'ngram' models for Tamil,
- convert TSCII to Unicode, in the 'tamil.tscii' package,
- transliterate text, in the 'transliterate' package.
Before you can get started, you have to install the library.
The easy, recommended method is
- $ pip install open-tamil
or get the latest from the GitHub repository with
- $ git clone https://github.com/arcturusannamalai/open-tamil.git
- $ cd open-tamil/ && sudo python setup.py install
Getting contents of web page
The first piece uses the BeautifulSoup library (bs4) to parse the contents of a URL (webpage), downloaded using the standard Python library urllib2. The code in this post targets Python 2.
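    import bs4
    from urllib2 import urlopen  # Python 2; on Python 3 use: from urllib.request import urlopen
    url = 'http://tamil.thehindu.com'
    tapage = bs4.BeautifulSoup(urlopen(url))
    tatext = tapage.body.text  # e.g. u"உடனே உடனே …"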
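If BeautifulSoup is not available, the text-extraction step can be approximated with the standard library. The following Python 3 sketch is a stand-in, not the method used in this post: the class `BodyTextExtractor`, the helper `body_text`, and the sample HTML string are made up for illustration. It skips script and style content, which may differ from what BeautifulSoup returns.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect text found inside <body>, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is inside <body> and outside script/style.
        if self.in_body and self.skip_depth == 0:
            self.chunks.append(data)


def body_text(html):
    """Return the concatenated body text of an HTML string."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Hypothetical sample page, used instead of a live download.
sample_html = (
    "<html><head><title>Title</title></head>"
    "<body><p>உடனே உடனே</p><script>var a = 1;</script></body></html>"
)
print(body_text(sample_html))
```

The returned string can then be passed to the letter and word functions described next, in place of tatext.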
Parsing Tamil text
This is where the open-tamil library really shines: it pulls the letters out of a UTF-8 encoded Tamil string in the right order. A single Tamil letter may span several Unicode code points (and even more bytes), so you can write programs at the level of Tamil letters instead of worrying about byte ordering, uyirmei (consonant-vowel) grouping, and so on.
Get the Tamil letters from the text using the 'get_letters' API:
    taletters = tamil.utf8.get_letters(tatext)  # e.g. [u"உ", u"ட", u"னே", u"உ", u"ட", u"னே", …]
which returns a list of letters; there is also a version that works as an iterator, for large text corpora.
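To see what this grouping involves, here is a minimal Python 3 sketch that uses only the standard library. It is not open-tamil's implementation. The helper name `group_letters` is hypothetical, and the sketch assumes that attaching every Unicode combining mark (the vowel signs and the pulli) to the preceding base character is a good enough approximation of a Tamil letter.

```python
import unicodedata


def group_letters(text):
    """Group code points into letters: each combining mark joins the previous base."""
    letters = []
    for ch in text:
        # Tamil vowel signs and the pulli have Unicode category Mc or Mn.
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


word = "உடனே"
print(len(word))            # number of code points in the string
print(group_letters(word))  # letters, with the vowel sign attached to its consonant
```

The code-point count is larger than the letter count, which is exactly the gap that a letter-level API closes.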
Now you can use the 'get_tamil_words' API call to get only the Tamil words, or 'get_words' to get all words, from the letters list taletters:
    tamil.utf8.get_tamil_words(taletters)  # e.g. [u"உடனே", u"உடனே", …]
which returns a list of words formed by grouping the letters appropriately.
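The word-grouping step can be approximated without the library, which may help when checking results. The sketch below (Python 3, standard library only) is an assumption about the behaviour, not open-tamil's code: the hypothetical helper `tamil_only_words` treats any run of characters from the Unicode Tamil block (U+0B80 to U+0BFF) as a word and everything else as a separator. Note that Tamil digits and symbols also sit in that block, so they would be kept.

```python
import re

# One or more consecutive characters from the Unicode Tamil block.
TAMIL_RUN = re.compile("[\u0b80-\u0bff]+")


def tamil_only_words(text):
    """Return runs of Tamil-block characters, dropping everything else."""
    return TAMIL_RUN.findall(text)


print(tamil_only_words("உடனே hello உடனே, 123 ஒரு"))
```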
Word Frequency Analysis
We can use Python's built-in dictionary type 'dict' to build the word-frequency table, with the Tamil word as the key and its count (frequency) as the value. The code goes like this:
    import operator
    # Tamil words only
    frequency = {}
    for pos, word in enumerate(tamil.utf8.get_tamil_words(taletters)):
        print pos, word
        frequency[word] = 1 + frequency.get(word, 0)
    # sort words in ascending order of occurrence (most frequent last)
    for l in sorted(frequency.iteritems(), key=operator.itemgetter(1)):
        print l[0], ':', l[1]
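The same count can be written with `collections.Counter` from the standard library. This is an alternative to the hand-rolled dict above, not what the original script does. The Python 3 sketch below uses a small hard-coded word list (made up for illustration) in place of the output of get_tamil_words; `most_common` returns the words most frequent first.

```python
from collections import Counter

# Stand-in for the list returned by tamil.utf8.get_tamil_words(taletters).
words = ["உடனே", "ஒரு", "உடனே", "ஆனால்", "ஒரு", "ஒரு"]

frequency = Counter(words)

# most_common(n) yields (word, count) pairs sorted by descending count.
for word, count in frequency.most_common(2):
    print(word, ":", count)
```

Passing 10 to `most_common` would give a top-10 list directly, without sorting the whole table by hand.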
Results
Running this code on a Tamil webpage (the Ashokamitran interview at The Hindu/Tamil) yields the word-frequency list below (showing only the 10 most frequent words, in ascending order of count):
தொடர்ந்து : 5
என்று : 5
கருத்து : 6
கவிதை : 6
இல்லை : 7
இருக்கும் : 7
கதை : 7
உடனே : 12
ஆனால் : 12
ஒரு : 16
Putting it Together
    import operator
    import bs4  # BeautifulSoup web scraper
    import tamil  # open-tamil library
    from urllib2 import urlopen  # Python 2 standard library

    def print_tamil_words(tatext):
        taletters = tamil.utf8.get_letters(tatext)
        # Tamil words only
        frequency = {}
        for pos, word in enumerate(tamil.utf8.get_tamil_words(taletters)):
            print pos, word
            frequency[word] = 1 + frequency.get(word, 0)
        # sort words in ascending order of occurrence (most frequent last)
        for l in sorted(frequency.iteritems(), key=operator.itemgetter(1)):
            print l[0], ':', l[1]

    def demo_tamil_text_filter():
        url = u"http://ta.wikipedia.org"
        tapage = bs4.BeautifulSoup(urlopen(url))
        tatext = tapage.body.text
        print_tamil_words(tatext)

    if __name__ == u"__main__":
        demo_tamil_text_filter()
Download the code from tafilter.py