Contributing to open-tamil : பங்களிக்கலாம் (you can contribute)
You may have heard of open-tamil, the free Tamil text processing tools library for Python 2 and 3. If you want to join the project, please follow the directions below.
- Create a GitHub account at http://www.github.com, and send me your handle via email (ezhillang -AT- gmail.com)
- Learn about git, the version control system (use Google if you don't follow anything)
- Clone the repository by following the instructions at https://github.com/arcturusannamalai/open-tamil
- You should be all set up now; cd to the repository and try to install open-tamil locally in your Python setup
- First, run the command 'python setup.py build'
- Upon success, run the command 'python setup.py install'
- On Linux you may have to use sudo permissions if you are not using virtualenv
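After the install step, a quick sanity check is to see whether Python can import the package. The sketch below assumes the package is importable under the top-level module name tamil; the helper name is_installed is made up for this illustration and is not part of open-tamil.

```python
import importlib


def is_installed(module_name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# Assumption: open-tamil installs a top-level module called "tamil".
print("open-tamil importable:", is_installed("tamil"))
```

If this prints False, the install most likely went into a different Python environment than the one you are running.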
To use open-tamil and understand its functions, you can read the blog post and the example code:
- Blog post on open-tamil (most important functions) : https://ezhillang.wordpress.com/2014/01/26/open-tamil-text-processing-%E0%AE%89%E0%AE%B0%E0%AF%88-%E0%AE%AA%E0%AE%95%E0%AF%81%E0%AE%AA%E0%AF%8D%E0%AE%AA%E0%AE%BE%E0%AE%AF%E0%AF%8D%E0%AE%B5%E0%AF%81/
- Example code using the open-tamil package: try the example code in the directory called examples under the root of the repo : https://github.com/arcturusannamalai/open-tamil/tree/master/examples
Hopefully these resources are enough to get you started on open-tamil. If not, you can always leave a comment, email ezhillang at gmail.com, or tweet to me @ezhillang.
I have been keen to understand what kinds of articles are read, and written, in Tamil Wikipedia and Wiktionary. So it was time to use data analysis and some programming, and that's what I did! Last weekend I took the Open-Tamil Python library for a spin with the Wikipedia data dumps for the Tamil wiki, and here are the results.
You can find my actual program here, solpattiyal.py
- Install Python 2.7 or Python 3, whichever flavor you prefer, from http://www.python.org
- Get the open-tamil library v0.40 from the Python Package Index
- If you have pip installed on your system, just type,
$ pip install --upgrade open-tamil
- Get the Tamil Wikipedia dumps from the Wikipedia servers
- Download the file solpattiyal.py from the above link, or get the whole of open-tamil from GitHub.
- For small text dumps, of kB size, you can see the output on the terminal,
$ python solpattiyal.py <filename1>
- You can also pass multiple input files,
$ python solpattiyal.py <filename1> <filename2> …
- Then you may want to use output redirection like,
$ python solpattiyal.py demo_file1.xml demo_file2.xml > output
Analysis of Code
- The code in solpattiyal is fairly simple and uses an algorithm to parse out the Tamil letters from each file
- We group letters into words via the static method 'WordFrequency.get_tamil_words' (this method will make it into the next version of open-tamil itself, after this demo)
- We insert each Tamil word into a dictionary and bump up its frequency by 1
- Finally we use Python's built-in sorted() function with a key argument to print the list by frequency, and then we print it again in sorted order.
- The code is written in a particular way to straddle both Python 2.7 and Python 3.
- The code is written to handle multiple files. Wikipedia files are usually large, and I like to use the GNU split utility like this (to split at every 300,000 lines of text),
$ split -l 300000 <filename>
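The counting steps described above can be sketched in a few lines of standard-library Python. This is not the code from solpattiyal.py and does not use open-tamil: as a stand-in for the library's letter parsing, it assumes a Tamil word is any unbroken run of characters from the Tamil Unicode block (U+0B80 to U+0BFF). The function names are made up for this illustration, and the sketch targets Python 3.

```python
import io
import re
from collections import Counter

# Assumption: a "Tamil word" is a maximal run of code points in the
# Tamil Unicode block. This is a simplification, not open-tamil's parser.
TAMIL_WORD = re.compile("[\u0B80-\u0BFF]+")


def count_tamil_words(text, counts=None):
    """Add the Tamil words found in text to counts and return counts."""
    if counts is None:
        counts = Counter()
    counts.update(TAMIL_WORD.findall(text))
    return counts


def count_files(filenames):
    """Count Tamil words across several UTF-8 files, one line at a time."""
    counts = Counter()
    for name in filenames:
        with io.open(name, encoding="utf-8") as handle:
            for line in handle:
                count_tamil_words(line, counts)
    return counts


def by_frequency(counts):
    """Return (word, frequency) pairs, most frequent first, ties by word."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Small usage example on an in-memory string.
sample = "தமிழ் wiki தமிழ் 2015 மொழி"
for word, freq in by_frequency(count_tamil_words(sample)):
    print(word, freq)
```

Because count_files reads each file line by line, memory use in this sketch grows with the number of distinct words rather than with file size.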
- DISCLAIMER :
- This analysis is not a criticism of Tamil Wikipedia.
- I have been a Tamil Wikipedia contributor for the last several years, and a Wikipedian since circa 2005.
- This data analysis is not complete or comprehensive; feel free to point out details
- Sample data from my analysis of a recent Wikipedia title dump file yielded some interesting data on the Tamil Wikipedia article distribution.
- The data file can be found in common-words-ta-wikipedia-data-March-16-2015.
- My recommendations are
- Every Tamil-speaking specialist can begin stub articles or add information to broaden other articles in their fields
- You can think of contributing 1 article every month!!
- Consider broadening Tamil conversations beyond the here-and-now, to the world of science, math, medicine, engineering, arts and philosophy
- Please always send your comments and questions to me at ezhillang in gmail, or via Twitter @ezhillang
- Feel free to improve on this code, and send a pull request on GitHub.