Introduction
I have been keen to understand what kinds of articles are read, and written, in Tamil Wikipedia and Wiktionary. It was time for some data analysis and programming, so that's what I did! Last weekend I took the Open-Tamil Python library for a spin with the Wikipedia data dumps for the Tamil wiki, and here are the results.
You can find my actual program here, solpattiyal.py
Pre-requisites
- Install Python 2.7 or Python 3, whichever flavor you want, from http://www.python.org
- Get open-tamil library v0.40 from Python Package Index
- If you have pip installed in your system just type,
$ pip install --upgrade open-tamil
- Get Wikipedia Tamil dumps from Wikipedia servers
- Download the file solpattiyal.py from the above link, or get whole of open-tamil from github.
Program Usage
- For small text dumps in kB sizes you can see output on terminal,
$ python solpattiyal.py <filename1>
- You can also pass multiple input files,
$ python solpattiyal.py <filename1> <filename2> …
- Then you may want to use output redirection like,
$ python solpattiyal.py demo_file1.xml demo_file2.xml > output
Analysis of Code
- The code in solpattiyal is fairly simple: it parses out the Tamil letters from each file
- We group letters into words via the static method 'WordFrequency.get_tamil_words' (this method will make it into the next version of open-tamil itself, after this demo)
- We insert each Tamil word into a dictionary and bump up its frequency by 1
- Finally we use Python's built-in sorted() function with a key argument to print the list by frequency, and then we print it again in sorted order.
- The code is written to run on both Python 2.7 and Python 3.
- The code handles multiple files. Wikipedia files are usually large, and I like to use the GNU split utility like this (to split at every 300,000 lines of text),
$ split -l 300000 <filename>
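The steps above can be sketched roughly as follows. This is not the code in solpattiyal.py: it is a minimal stand-in that uses only the Python 3 standard library, and it swaps open-tamil's letter parsing for a regular expression over the Tamil Unicode block (U+0B80 to U+0BFF), so Tamil digits and symbols also count as word characters. The names count_tamil_words and by_frequency are made up for this illustration.

```python
import re
import sys
from collections import Counter

# Assumption: a "word" is any maximal run of code points from the Tamil
# Unicode block. solpattiyal.py uses open-tamil's letter parser instead.
TAMIL_WORD = re.compile("[\u0B80-\u0BFF]+")


def count_tamil_words(text, freq=None):
    """Add every Tamil word found in text to the frequency table."""
    if freq is None:
        freq = Counter()
    freq.update(TAMIL_WORD.findall(text))
    return freq


def by_frequency(freq):
    """Return (word, count) pairs, most frequent first, ties in word order."""
    return sorted(freq.items(), key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    totals = Counter()
    # Accept any number of input files, as on the command lines shown earlier.
    for filename in sys.argv[1:]:
        with open(filename, encoding="utf-8") as handle:
            for line in handle:  # read line by line; dump files can be large
                count_tamil_words(line, totals)
    # First listing: ordered by frequency.
    for word, count in by_frequency(totals):
        print(word, count)
    # Second listing: ordered by the words themselves.
    for word in sorted(totals):
        print(word, totals[word])
```

Saved under a hypothetical name such as wordcount_sketch.py, it takes file names and output redirection the same way as the commands in Program Usage.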
Data
- DISCLAIMER :
- This analysis is not a criticism of Tamil Wikipedia.
- I have been a Tamil Wikipedia contributor for the last several years, and a Wikipedian since around 2005.
- This data analysis is not complete/comprehensive – feel free to point out details
- My analysis of a recent Wikipedia title dump file yielded some interesting sample data on the distribution of Tamil Wikipedia articles.
- The data file can be found in common-words-ta-wikipedia-data-March-16-2015.
- My recommendations are
- Every Tamil-speaking specialist can begin stub articles or add information to broaden other articles in their fields
- You can think of contributing 1 article every month!!
- Consider broadening Tamil conversations beyond here-and-now, to world of science, math, medicine, engineering, arts and philosophy
Feedback
- Please send your comments and questions to me at ezhillang in gmail, or via Twitter @ezhillang
- Feel free to improve on this code and send a pull request on GitHub.