Ezhil Tamil Programming Language

GNU iconv – convert from UTF-8 to TSCII and back

iconv, a GNU utility, converts text documents back and forth between various encoding schemes. It is of particular interest to us Tamil-speaking folks because it can convert from UTF-8 to TSCII and back.

If you wanted to convert hello.utf8 from UTF-8 encoding into TSCII, you could use it as follows,

$ iconv -f utf-8 -t tscii hello.utf8 > hello.tscii

where, in the Linux shell environment, you redirect the output into the TSCII-encoded file. Swapping the -f and -t arguments converts in the other direction.
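The same conversion can be scripted. Here is a minimal Python 3 sketch, assuming the iconv binary is on your PATH and was built with TSCII support; the helper names iconv_command and convert_file are made up for illustration and are not part of any library.

```python
import subprocess


def iconv_command(from_enc, to_enc, path):
    """Build the iconv argument list for converting `path` between encodings."""
    return ["iconv", "-f", from_enc, "-t", to_enc, path]


def convert_file(src, dst, from_enc="utf-8", to_enc="tscii"):
    """Run iconv on `src` and write the converted bytes to `dst`.

    Mirrors the shell redirection `iconv ... src > dst` and raises
    subprocess.CalledProcessError if iconv exits with an error.
    """
    with open(dst, "wb") as out:
        subprocess.run(iconv_command(from_enc, to_enc, src), stdout=out, check=True)
```

Calling convert_file("hello.utf8", "hello.tscii") would correspond to the command above, and convert_file("hello.tscii", "hello.utf8", "tscii", "utf-8") to the way back.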

Developers: Someday I hope volunteers will add more historical Tamil encodings, primarily TAM, TAB, and other font-based encoding schemes, to libiconv. Please start development from the git repository at GNU sources.

open-tamil 0.40

Today I’m releasing open-tamil 0.40, an updated version of the library for Python. You can download it from https://pypi.python.org/pypi/Open-Tamil or via the Python package manager pip,

 $ pip install --upgrade open-tamil

Notable features of this update are,

  1. The major update is that open-tamil now works on Python 3 as well as Python 2.
  2. Bug fixes to the tamil.utf8.get_letters() API
  3. Convert integers to Tamil numerals – tamil.numeral.num2tamilstr() API
  4. Tamil regular expressions API to parse and explore Tamil text via regular expressions.
  5. Santhi rules parser in Open-Tamil for word conjugation

Hope this is another useful update of the Python library for our users, and hope open-tamil can make Tamil software development easier for you.

-Muthu

Story of Tamil encodings: TACE, UTF8 and open-tamil

Hello everyone! Hope this finds you in good spirits. In North America, Boston especially, the cold weather and holiday season are our Deepavali, as it were. The cold weather and festive season bring a lot of joys and challenges – but today I want to talk about the challenges of exchanging Tamil text/information on the Internet.

Christmas tree lighting ceremony at Boston Common. Dec, 2014.

History of Tamil encodings

Tamil encoding has a long history, with the 8-bit extended-ASCII encoding called TSCII and the TAM (Tamil Monolingual)/TAB (Tamil Bilingual) encodings of the late 80s and early 90s.

Then enter Windows. Microsoft Windows, together with Microsoft Word, let some Tamil software vendors introduce font-based encodings. This is probably the most egregious of all the Tamil encoding methods invented, IMO. Still, this would show the books in fonts like Latha, Lohit etc. You needed the right font map, with glyphs independent of the encoding, to read the text. Otherwise the text would end up garbled, a mish-mash of ‘?’ or [] (tofu block) characters.

Finally the Tamil computing community, software vendors, and members of INFITT (among others Mr. Chellappan of Palaniappa Brothers, Chennai, Prof. Ponnavaiko, and Dr. K. Kalyanasundaram of EPFL) sat down with the pioneering people at the Unicode Consortium and hashed out a chunk of the Unicode standard's space for Tamil letters, which is what we have today. So now you know: if Android, iOS and Windows support Tamil text, it is most likely thanks to years of work from this motley crew of genteel anonymous strangers. Thanks everyone!

By now the web had matured since the 1990s, and Unicode support in blogs and in input method editors (IMEs) on Linux, Windows and Mac enabled the growth and exchange of Tamil content on the Internet. Unicode encoded in UTF-8 became the de facto standard of the Tamil community online, despite diktats from the Tamilnadu government and other standards agencies, which were left behind in the shadows of the stand-alone computing world. Welcome, Internet. Now change was not an option.

TACE v. UTF-8

Today there is an alternative proposal, and there has been for many years, I think. The TACE standard is being championed by people in the Tamilnadu government and a few publishing agencies. Prof. Ponnavaiko is rumoured to be on some of these committees. The Wikipedia article summarizes their case for the TACE16 encoding standard, which rests on some computational ease in its representation. I still think UTF-8 suits general-purpose computation, especially on the web.

I take exception to pushing TACE16, particularly because of the not-invented-here syndrome. Unicode is not just for Tamil; it is shared with English, Arabic, Chinese, Cantonese and other tonal and syllabic languages. Still yet some languages are – so we don’t have the direct glyph per code mapping. The situation is somewhat similar for Hindi and other Indic languages.
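To make the "no direct glyph per code mapping" point concrete, here is a small sketch using only the Python 3 standard library. The uyirmei letter கொ is my own choice of example; the sketch lists the Unicode code points and UTF-8 bytes that sit behind that one written letter.

```python
import unicodedata

letter = "\u0b95\u0bca"  # கொ: one written Tamil letter

# One letter on screen is stored as two Unicode code points ...
code_points = [(hex(ord(ch)), unicodedata.name(ch)) for ch in letter]

# ... and each Tamil code point takes three bytes in UTF-8.
utf8_bytes = letter.encode("utf-8")

print(code_points)
print(len(letter), "code points,", len(utf8_bytes), "bytes")
```

With Unicode, grouping code points back into letters is therefore left to software.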

Another reason is the advent of libraries, one of which I helped create and open-source: open-tamil. I’m sure other Tamil developers have their own versions behind corporate or closed doors to pre-process UTF-8 and then do text manipulation.

I guess time will pick the technology based on the evolution of capabilities and the utility of existing tools. I hope we don’t return to the font-based encodings of the 90s and 2000s, and instead live in the saner world of Unicode. If TACE16 should become a new standard, I hope someone makes the converters from and to UTF-8.

Open-Tamil : Text processing (உரை பகுப்பாய்வு)

ஆவணத்தில் பிரித்தெடுக்கப்பட்ட வார்த்தைகள் – 

Extracting words in document

In this short blog post, I want to show you how you can find the word frequency of a webpage using

  1. Open-Tamil,
  2. BeautifulSoup

in the Python framework.

Install Open-Tamil package

Open-Tamil provides tools to,

  1. parse and analyze text in UTF-8 format, in the ‘tamil.utf8’ package,
  2. build elementary n-gram models of Tamil, in the ‘ngram’ package,
  3. convert formats from TSCII to Unicode, in the ‘tamil.tscii’ package,
  4. transliterate text, with utilities under the ‘transliterate’ package

but before you can get started you have to install the library.

You can try the easy/recommended method,

  1. $ pip install open-tamil

or get the latest from the GitHub repository by,

  1. $ git clone https://github.com/arcturusannamalai/open-tamil.git
  2. $ cd open-tamil/ && sudo python setup.py install

Getting contents of web page

The first piece uses the BeautifulSoup library, bs4, to parse the contents of a URL (webpage) downloaded with urlopen from the standard Python library (urllib2 on Python 2, urllib.request on Python 3).

tapage = bs4.BeautifulSoup(urlopen(url), "html.parser")
tatext = tapage.body.text  # e.g. u"உடனே உடனே …"

Now the contents of the webpage <BODY> tag are stored in the variable tatext.
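If you want to try the later steps without installing bs4 or fetching a live page, the sketch below approximates body.text with the standard library's html.parser on an in-memory HTML string. It is a simplified stand-in, not how BeautifulSoup works internally: the class BodyText and the function body_text are hypothetical names, and it skips script and style contents, which may differ from what bs4 returns.

```python
from html.parser import HTMLParser


class BodyText(HTMLParser):
    """Collect the text found between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip = 0  # depth inside <script>/<style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip:
            self.chunks.append(data)


def body_text(html):
    """Return the concatenated text of the document body."""
    parser = BodyText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


sample_html = (
    "<html><head><title>t</title></head><body>"
    "<p>\u0b89\u0b9f\u0ba9\u0bc7 <b>\u0b89\u0b9f\u0ba9\u0bc7</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(body_text(sample_html))
```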

Parsing Tamil text

This is where the open-tamil library really shines: you can pull out the letters, in the right order, from a Tamil string stored in the multi-byte UTF-8 encoding. That is, you can write programs at the level of Tamil letters instead of worrying about byte ordering, uyirmei grouping, etc.

Get the Tamil letters from the text using the ‘get_letters’ API,

 taletters = tamil.utf8.get_letters(tatext)  # [u"உ", u"ட", u"னே", u"உ", u"ட", u"னே", …]

which returns a list of characters, one string per Tamil letter; there is also a version that works as an iterator, for large text corpora.
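To see what pulling out letters "in the right order" involves, here is a rough standard-library approximation of the grouping step, written for Python 3. It is not open-tamil's implementation: it simply attaches every Unicode combining mark (the vowel signs and the pulli) to the character before it, and the name split_letters is made up for this sketch. Special conjuncts such as ஸ்ரீ are not handled.

```python
import unicodedata


def split_letters(text):
    """Group code points so that each combining mark joins the preceding character."""
    letters = []
    for ch in text:
        # Categories Mn/Mc cover the Tamil vowel signs and the pulli (virama).
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


word = "\u0b89\u0b9f\u0ba9\u0bc7"  # உடனே
# bytes in UTF-8, code points, and the grouped letters
print(len(word.encode("utf-8")), len(word), split_letters(word))
```

The word உடனே is 12 bytes and 4 code points, but only 3 letters once the vowel sign is attached to its consonant.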

Now you can use the ‘get_tamil_words’ or ‘get_words’ API call from the library to get the Tamil-only words, or all words, from the letters array taletters,

tamil.utf8.get_tamil_words(taletters)  # [u"உடனே", u"உடனே", …]

which returns a list of words formed by appropriate grouping of the letters.
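For comparison, a Tamil-only word filter can also be approximated directly on the string with a regular expression over the Tamil Unicode block (U+0B80 to U+0BFF). This is a Python 3 standard-library sketch under that assumption, not the get_tamil_words implementation; tamil_words is a hypothetical name, and Tamil digits and symbols in the same block are matched too.

```python
import re

# Runs of characters from the Tamil Unicode block, U+0B80..U+0BFF.
TAMIL_WORD = re.compile("[\u0b80-\u0bff]+")


def tamil_words(text):
    """Return the Tamil-script words in `text`, dropping everything else."""
    return TAMIL_WORD.findall(text)


# உடனே hello, ஒரு 2014 உடனே
sample = "\u0b89\u0b9f\u0ba9\u0bc7 hello, \u0b92\u0bb0\u0bc1 2014 \u0b89\u0b9f\u0ba9\u0bc7"
print(tamil_words(sample))
```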

Word Frequency Analysis

We can use the Python built-in dictionary data type ‘dict’ to build the word-frequency table, with the Tamil word as the key and its count (frequency) as the value. The code goes like,

# tamil words only
frequency = {}
for pos, word in enumerate(tamil.utf8.get_tamil_words(taletters)):
    print(pos, word)
    frequency[word] = 1 + frequency.get(word, 0)

Finally, an advanced maneuver sorts the dictionary by frequency and prints the list of words from least frequent to most frequent,

# sort words by ascending order of occurrence
for l in sorted(frequency.items(), key=operator.itemgetter(1)):
    print(l[0], ':', l[1])
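The same counting can also be written with collections.Counter from the Python standard library, which provides a most_common method for top-N lists. Here is a Python 3 sketch on a hard-coded word list; the sample words are only placeholders standing in for the output of get_tamil_words.

```python
from collections import Counter

# Placeholder input; in the post this would come from get_tamil_words().
words = ["ஒரு", "உடனே", "ஒரு", "கதை", "ஒரு", "உடனே"]

frequency = Counter(words)

# most_common(n) returns (word, count) pairs, highest count first.
for word, count in frequency.most_common(2):
    print(word, ":", count)
```

Note that most_common lists the most frequent word first, the reverse of the ascending order used in this post.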

Results

Running this code on a Tamil webpage (Ashokamitran Interview @ The Hindu/Tamil) yields the word-frequency list (showing only the 10 most frequent words),

தொடர்ந்து : 5
என்று : 5
கருத்து : 6
கவிதை : 6
இல்லை : 7
இருக்கும் : 7
கதை : 7
உடனே : 12
ஆனால் : 12
ஒரு : 16

Putting it Together

from __future__ import print_function  # keeps print() working on Python 2 as well
import operator
import bs4  # BeautifulSoup web scraper
import tamil  # open-tamil library

try:
    from urllib2 import urlopen  # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3


def print_tamil_words(tatext):
    taletters = tamil.utf8.get_letters(tatext)
    # tamil words only
    frequency = {}
    for pos, word in enumerate(tamil.utf8.get_tamil_words(taletters)):
        print(pos, word)
        frequency[word] = 1 + frequency.get(word, 0)
    # sort words by ascending order of occurrence
    for l in sorted(frequency.items(), key=operator.itemgetter(1)):
        print(l[0], ':', l[1])


def demo_tamil_text_filter():
    url = u"http://ta.wikipedia.org"
    tapage = bs4.BeautifulSoup(urlopen(url), "html.parser")
    tatext = tapage.body.text
    print_tamil_words(tatext)


if __name__ == "__main__":
    demo_tamil_text_filter()

Download the code from tafilter.py