Spell Checkers for South Asian Languages : Solthiruthi wiki

Hello everyone! It is finally spring here in Boston. We have warm weather, beautiful magnolias in bloom, and sunny days.

It is hardly good to be the only smart person in the room, so today I started a wiki to collect the good work done by academics, industrial scientists, and engineers in the field of spell checking.

Computer scientists and Tamil linguistics aficionados interested in spell checker architectures, please see the wiki.

This is one more small step towards building an open-source Solthiruthi (spell checker) for Tamil.

Come, let's get acquainted and contribute

Contributing to open-tamil

You may have heard of open-tamil, the free Tamil text processing tools library for Python 2 and 3. If you want to join the project, please follow the directions below.

  1. Create a GitHub account at http://www.github.com, and send me your handle via email (ezhillang -AT- gmail.com)
  2. Learn about git, the version control system (search online for anything you don’t follow)
  3. Clone the repository using the instructions at https://github.com/arcturusannamalai/open-tamil
  4. You should be all set up now; cd to the repository and try to install open-tamil locally in your Python setup
  5. Run the command ‘python setup.py build’ first
  6. Upon success, run the command ‘python setup.py install’
  7. You may need sudo permissions on Linux if you are not using virtualenv

To use open-tamil and understand its functions, you may read the blog post and the example code:

  1. Blog post on open-tamil (most important functions) : https://ezhillang.wordpress.com/2014/01/26/open-tamil-text-processing-%E0%AE%89%E0%AE%B0%E0%AF%88-%E0%AE%AA%E0%AE%95%E0%AF%81%E0%AE%AA%E0%AF%8D%E0%AE%AA%E0%AE%BE%E0%AE%AF%E0%AF%8D%E0%AE%B5%E0%AF%81/
  2. Example code using the open-tamil package: try the example code in the directory called examples under the root of the repo : https://github.com/arcturusannamalai/open-tamil/tree/master/examples

Hopefully you can get started on open-tamil with these resources. If not, leave a comment, send an email to ezhillang -AT- gmail.com, or tweet to me @ezhillang.

Anbudan,

-Muthu

Solthiruthi – Multi-pass spell checker for Tamil language (Draft-1)

Author: Muthiah Annamalai, edited: April 21, 2015

Boston Marathon 2015, mainly SRR. (Ref:  https://www.flickr.com/photos/tfxc/16609255733/in/  | CC by ND NC 2.0 )

Introduction

Following compiler construction theory, we may write a multi-pass spell checker for the Tamil language. Tamil is linguistically classified as an agglutinative language, and has a phonetic script.

In this article I present a conceptual framework, a proposal, for writing a multi-pass solthiruthi spell checker. This effort is a variation on the theme of other spell checkers for Tamil created by the projects myspell, hunspell, navi/vani, and languagetool.

N.B.: I invite interested individuals to contribute to this document, and collaborate with me, under the usual open-source credit/attribution terms.

Problem

  1. Given a Tamil text, urai, we want to find all spelling mistakes in the text
  2. For words which are in error, we want to provide alternatives
  3. For the Tamil language we have to apply santhi rules and punarchi vithigal from grammar
  4. We would also like to provide context-sensitive, A.I.-like help

Solution Overview

I propose a multi-pass spell checker comprising N levels. Each level up to the penultimate (Level N-1) continues to refine the set of words found in error; we call these the analysis levels. The final level forms the suggestions for the words in error; we call this the synthesis level. Such an algorithm can be built with 2 levels and upwards in complexity.

We use a dictionary of valid Tamil words to validate the input. Incoming text is split into words, and each word goes through the N steps.
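The multi-pass structure described above can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions: the names `DICTIONARY`, `split_words`, `dictionary_level`, `check`, and `no_suggestions` are hypothetical and are not part of open-tamil, and the three-word dictionary is a toy stand-in for a real word list.

```python
import re

# Hypothetical toy word list; a real checker would load a large dictionary.
DICTIONARY = {"அம்மா", "அப்பா", "தமிழ்"}

def split_words(urai):
    """Split the input text into words on whitespace and basic punctuation."""
    return re.findall(r"[^\s.,;:!?()]+", urai)

def dictionary_level(suspects):
    """Analysis level: keep only the words missing from the dictionary."""
    return [w for w in suspects if w not in DICTIONARY]

def check(urai, analysis_levels, synthesis_level):
    """Run every analysis level in order, then the synthesis level."""
    suspects = split_words(urai)
    for level in analysis_levels:
        suspects = level(suspects)  # each pass narrows the suspect list
    return {word: synthesis_level(word) for word in suspects}

def no_suggestions(word):
    """Placeholder synthesis level that offers no alternatives."""
    return []

print(check("அம்மா அபபா தமிழ்.", [dictionary_level], no_suggestions))
```

Here the only word that survives the single analysis level is the misspelled அபபா. More analysis levels can be appended to the list without changing `check`.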

Typical levels for analysis include, but are not limited to,

  1. Dictionary based white-list analysis for incorrectly spelled words
  2. Morphological analysis via stemmer – this is a requirement of agglutinative languages
  3. Clean up of proper nouns, peyarechcham, using names of places, people, things
  4. Bayesian classification of errors using bigram, trigram, N-gram letter probabilities
  5. Context sensitive word suggestion by Bayesian classification using N-gram word probabilities
  6. Keyboard models (Bamini, Tamil99, Anjal layouts) and common misspellings in these
  7. Edit-distance based analysis for mistyped words
  8. Santhi checker
  9. Punarchi checker
  10. Detect transliterated English words in Tamil and ignore them
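As an illustration of the letter-level analysis (item 4), here is a rough sketch that counts letter bigrams in a word list and reports the bigrams of a word that were never seen. It is a crude stand-in for a proper probability model, not a Bayesian classifier. The helper `get_letters` is my own approximation that groups each base code point with its following combining marks using the standard `unicodedata` module; all function names and the toy corpus are assumptions for the example.

```python
import unicodedata
from collections import Counter

def get_letters(word):
    """Group each base code point with the combining marks that follow it."""
    letters = []
    for ch in word:
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch  # vowel sign or pulli joins the previous letter
        else:
            letters.append(ch)
    return letters

def train_bigrams(words):
    """Count letter bigrams, with ^ and $ marking word start and end."""
    counts = Counter()
    for word in words:
        letters = ["^"] + get_letters(word) + ["$"]
        counts.update(zip(letters, letters[1:]))
    return counts

def unseen_bigrams(word, counts):
    """Return the letter bigrams of word that never occur in the training data."""
    letters = ["^"] + get_letters(word) + ["$"]
    return [pair for pair in zip(letters, letters[1:]) if counts[pair] == 0]

# Hypothetical toy corpus; real counts would come from a large Tamil text.
counts = train_bigrams(["அம்மா", "அப்பா", "தமிழ்"])
print(unseen_bigrams("அம்பா", counts))
```

A word with one or more unseen bigrams can be flagged for the next level. Replacing the zero-count test with smoothed probabilities would bring this closer to the Bayesian classification named in the list.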

The synthesis step is usually made from the following,

  1.  Norvig algorithm for spelling suggestions, modified for the Tamil arichuvadi (alphabet)
  2.  For Tamil we need to combine Santhi and Punarchi rule-based pruning of suggestions
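A minimal sketch of the first synthesis item follows, in the spirit of Norvig's spelling corrector but operating on Tamil letters rather than code points. Several simplifications are my own assumptions: only edits of distance 1 are generated, the alphabet is derived from the letters seen in the word list instead of enumerating the full arichuvadi, and `word_counts` is a hypothetical frequency table. Santhi and Punarchi pruning is not shown.

```python
import unicodedata

def get_letters(word):
    """Group each base code point with the combining marks that follow it."""
    letters = []
    for ch in word:
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters

def edits1(word, alphabet):
    """All strings one letter-level edit away from word."""
    letters = get_letters(word)
    splits = [(letters[:i], letters[i:]) for i in range(len(letters) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + [b[1], b[0]] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + [c] + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + [c] + b for a, b in splits for c in alphabet]
    return {"".join(x) for x in deletes + transposes + replaces + inserts}

def suggest(word, word_counts):
    """Return known words within one edit, most frequent first."""
    if word in word_counts:
        return [word]
    # Assumption: the alphabet is whatever letters occur in the word list.
    alphabet = sorted({l for w in word_counts for l in get_letters(w)})
    known = edits1(word, alphabet) & set(word_counts)
    return sorted(known, key=lambda w: (-word_counts[w], w))

# Hypothetical frequency table for illustration only.
word_counts = {"அம்மா": 5, "அப்பா": 3, "தமிழ்": 2}
print(suggest("அம்பா", word_counts))
```

Working on whole letters means that a consonant with its vowel sign, such as மா, is inserted, deleted, or replaced as one unit, so the candidates stay well-formed Tamil.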

Pizhai : 2-Level spell checker

The Pizhai spell checker algorithm uses 2 analysis steps, the whitelisted dictionary and the bigram/trigram letter analysis, to flag erroneous words. Finally, we suggest alternate words to the user during synthesis.

Solthiruthi : 3-Level spell checker

The Solthiruthi spell checker algorithm uses 3 analysis steps to flag erroneous words: the whitelisted dictionary, the bigram/trigram letter analysis, and a stemmer. Finally, we suggest alternate words to the user during synthesis.
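To show where a stemmer fits as an extra analysis level, here is a toy suffix-stripping sketch. It is not a real Tamil morphological analyzer: the three suffixes (the plural கள், the locative form யில், and their combination களில்) and the names `SUFFIXES`, `stem`, and `stemmer_level` are assumptions chosen for the example.

```python
# Hypothetical suffix list, longest first so that களில் is tried before கள்.
SUFFIXES = ["களில்", "கள்", "யில்"]

def stem(word):
    """Strip the first matching suffix, if any; otherwise return the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

def stemmer_level(suspects, dictionary):
    """Analysis level: clear suspects whose stem is a dictionary word."""
    return [w for w in suspects if stem(w) not in dictionary]

# Toy dictionary for illustration only.
dictionary = {"பள்ளி", "தமிழ்"}
suspects = ["பள்ளிகள்", "பள்ளியில்", "பளளிகள்"]
print(stemmer_level(suspects, dictionary))
```

The inflected forms of பள்ளி are cleared because their stem is in the dictionary, while பளளிகள் remains a suspect. This is why a plain whitelist alone over-flags words in an agglutinative language.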

Code

All of these ideas can be implemented in Python today using the open-tamil project. We welcome enthusiastic contributors: linguists, language enthusiasts, and developers.

Koodam – an online school for learning Ezhil – 2

Online evaluator - latest version - for Ezhil Language.

With the Python program I wrote around midnight tonight (see this GitHub commit), Ezhil now has a chance at an online school for learning the language. It is made up of several parts; see below.

Feature list
Code in this directory provides the following
* Writing Code, Editing, and Evaluating *
1. Syntax highlighting editor for Ezhil using ACE JavaScript editor
2. Code browser lets the user look at sample Ezhil programs from the ezhil-lang source/testsuite, in the single page app editor
3. Users can run the code on this page, and see the output in the same page.
4. Output of correctly executed code is shown in light yellow; clicking on the output hides it as you work on the next problem.
5. Errors in code or server execution cause your program output to be highlighted in red.
6. Source code is persisted between sessions using cookies

You can use this at http://www.ezhillang.org/koodam/play/eval !