open-tamil updates (Jan 2015)

I have the pleasure of introducing a few new features in Open-Tamil this week. Among them are some quite novel ones, such as Tamil regexp for pattern matching and Tamil numerals in the American counting system. As you probably know, the development version, 0.32, supports both Python 2 and Python 3.

  1. tamil.utf8.get_letters bug fixes
    1. The get_letters function had a few subtle bugs and a somewhat fuzzy algorithm. With this update, it completely splits a given UTF-8 string into its constituent letters.
    2. The get_letters_iterable function also received bug fixes, and it works with a smaller memory footprint by using Python iterators.
    3. The split runs in linear time, O(n).
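To make the idea of splitting into letters concrete, here is a minimal sketch in plain Python 3. It is not the open-tamil implementation: the names iter_letters and split_letters are hypothetical, and the rule it uses (attach every Unicode combining mark to the preceding base character) is an assumption that only approximates what tamil.utf8.get_letters does. It makes a single pass over the string, so it is linear in the input length, and the generator form keeps only one letter in memory at a time.

```python
import unicodedata


def iter_letters(text):
    """Yield letters one at a time: a base character plus the combining marks after it.

    Hypothetical helper for illustration only, not part of open-tamil.
    """
    current = ""
    for ch in text:
        # Vowel signs and the pulli have Unicode categories Mn or Mc,
        # so they are glued onto the letter being built.
        if current and unicodedata.category(ch).startswith("M"):
            current += ch
        else:
            if current:
                yield current
            current = ch
    if current:
        yield current


def split_letters(text):
    """List version of iter_letters (hypothetical name)."""
    return list(iter_letters(text))


print(split_letters("தமிழ்"))  # ['த', 'மி', 'ழ்']
```

The word தமிழ் is five code points but three letters, which is why a plain list(word) is not enough for Tamil text.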
  2. Numeral generation from open-tamil in American style
    1. Previously we introduced numeral generation using the Indian convention of lakhs and crores, up to 1 lakh crore: tamil.numeral.num2tamilstr
    2. In this update we can also convert numbers to Tamil words using the million, billion and trillion of the American numeral system: tamil.numeral.num2tamilstr_american. An example of the conversions, taken from our test suite:
      def test_numerals(self):
        var = {0:u"பூஜ்ஜியம்",
        int(1e7):u"பத்து மில்லியன்",
        int(1e9-1):u"தொள்ளாயிரத்து தொன்னூற்றி ஒன்பது மில்லியன் தொள்ளாயிரத்து தொன்னூற்றி ஒன்பது ஆயிரத்தி தொள்ளாயிரத்து தொன்னூற்றி ஒன்பது",
        3060:u"மூன்று ஆயிரத்தி அறுபது",
        21:u"இருபத்தி ஒன்று",
        1051:u"ஓர் ஆயிரத்தி ஐம்பத்தி ஒன்று",
        100000:u"நூறு ஆயிரம்",
        100001:u"நூறு ஆயிரத்தி ஒன்று",
        10011:u"பத்து ஆயிரத்தி பதினொன்று",
        49:u"நாற்பத்தி ஒன்பது",
        55:u"ஐம்பத்தி ஐந்து",
        1000001:u"ஒரு மில்லியன் ஒன்று",
        99:u"தொன்னூற்றி ஒன்பது",
        101:u"நூற்றி ஒன்று",
        1000:u"ஓர் ஆயிரம்",
        111:u"நூற்றி பதினொன்று",
        1000000000000:u"ஒரு டிரில்லியன்",
        1011:u"ஓர் ஆயிரத்தி பதினொன்று"}
        for k,actual_v in var.items():
            v = tamil.numeral.num2tamilstr_american(k)
            print('verifying => # %d'%k)
            # compare the generated words against the expected string
            self.assertEqual(v,actual_v)
    3. There were a few minor bug fixes
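The difference between the two counting conventions comes down to how a number is cut into groups before each group is spelled out. The sketch below shows only that grouping step, with English scale names standing in for the Tamil words. The function group_number and both scale tables are hypothetical names made up for this illustration; this is not how tamil.numeral is implemented.

```python
# Scale tables, largest first (hypothetical names for this sketch)
AMERICAN_SCALES = [(10**12, "trillion"), (10**9, "billion"),
                   (10**6, "million"), (10**3, "thousand"), (1, "")]
INDIAN_SCALES = [(10**7, "crore"), (10**5, "lakh"),
                 (10**3, "thousand"), (1, "")]


def group_number(n, scales):
    """Split a non-negative integer into (count, scale_name) pairs, largest scale first."""
    groups = []
    for value, name in scales:
        count, n = divmod(n, value)
        if count:
            groups.append((count, name))
    return groups


print(group_number(1000001, AMERICAN_SCALES))  # [(1, 'million'), (1, '')]
print(group_number(10**7, AMERICAN_SCALES))    # [(10, 'million')]
print(group_number(10**7, INDIAN_SCALES))      # [(1, 'crore')]
```

The same value, ten million, is read as "பத்து மில்லியன்" in the American style, while the Indian style reads it as one crore. Zero yields an empty list here, so a real converter has to handle it as a special case.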
  3. Tamil regular expression processing
    1. A regular expression describes a pattern that a finite automaton can recognize. Such automata are machines with a finite set of states, which can be used for pattern matching.
    2. We have introduced a new API in the Python module ‘tamil’, under the namespace ‘regexp’, which expands Tamil letters into fully formed regular expressions and works in tandem with the Python re module.
    3. Example: the following ‘pattern’ matches elements 1, 2 and 6 (counting from 0) of the list variable ‘data’.
      pattern = u"^[க-ள].+[க்-ள்]$"
      data = [u"இந்த",u"தமிழ்",u"ரெகேஸ்புல்",u"\"^[க-ள].+[க்-ள்]$\"",\
              u"இத்தொடரில்", u"எதை", u"பொருந்தும்"]
      expected = [1,2,6] # i.e.தமிழ், ரெகேஸ்புல், and பொருந்தும்
    4. Another simple example of Tamil regexp: pattern = u"^ரிச்.*[க்-ழ்]$" matches strings like ரிச்மாண்டின் and ரிச்மண்டில்.
    5. The Tamil Wikipedia article on regular expressions (சுருங்குறித்_தொடர்) has a good explanation.
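Why is an expansion step needed at all? Python's re module reads a character class one code point at a time, and a letter like க் is two code points (க followed by the pulli), so it cannot serve as the endpoint of a range. The sketch below hand-expands such a range into an alternation and then uses plain re. It is only an illustration of the idea: expand_range is a hypothetical helper, not the tamil.regexp API, and the consonant ordering is an assumption of this sketch that may differ from the ordering the library uses.

```python
import re

# Tamil consonants in conventional alphabetical order (assumed for this sketch)
CONSONANTS = ["க", "ங", "ச", "ஞ", "ட", "ண", "த", "ந", "ப",
              "ம", "ய", "ர", "ல", "வ", "ழ", "ள", "ற", "ன"]
PULLI = "\u0bcd"  # the dot that turns a consonant into a pure consonant, e.g. க்


def expand_range(first, last, suffix=""):
    """Return a non-capturing alternation of the consonants first..last, each followed by suffix."""
    i, j = CONSONANTS.index(first), CONSONANTS.index(last)
    return "(?:" + "|".join(c + suffix for c in CONSONANTS[i:j + 1]) + ")"


# Hand-expanded equivalent of ^[க-ள].+[க்-ள்]$
pattern = re.compile(
    "^" + expand_range("க", "ள") + ".+" + expand_range("க", "ள", PULLI) + "$"
)

data = ["இந்த", "தமிழ்", "ரெகேஸ்புல்", "\"^[க-ள].+[க்-ள்]$\"",
        "இத்தொடரில்", "எதை", "பொருந்தும்"]
matched = [i for i, word in enumerate(data) if pattern.match(word)]
print(matched)  # [1, 2, 6]
```

With this ordering the sketch reproduces the expected indices from item 3. Words that begin with a vowel, such as இந்த, fail on the first group, and எதை fails for the same reason.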

You can do a lot more with open-tamil; item #3 above demonstrates one simple example.

Have a nice weekend. Share your comments, and thoughts on open-tamil below.

3 thoughts on “open-tamil updates (Jan 2015)”

  1. Hello Elango,
    Thanks for the pointer. I haven’t gone through the notes, but I invite you to blog about it in a more expository way. You can also guest post on ezhillang.wordpress here.
