A bipartite graph structure for Tamil

Remark: The Tamil alphabet [which is an abugida, or alphasyllabary, in nature] can be written as a complete bipartite graph G(C+V,E). Both the basic 247 letters [known to have a ring representation] and sequences involving வட மொழி letters can be written in terms of two sets, V – vowels [உயிர்] and C – consonants [மெய்], with the edges E, each joining a consonant to a vowel (e.g.: க் + அ -> க ), being the உயிர்மெய் எழுத்துக்கள். This is the complete bipartite graph K_{18,12}, with 18 × 12 = 216 edges. Strictly speaking we can add the ஆய்த எழுத்து ‘ஃ’ as an isolated node and call the result K_{18,12} + K_1; this is a disconnected graph rather than a forest, since K_{18,12} contains cycles. This may be simply extended to cover the வட மொழி எழுத்துக்கள் [Sanskrit letters optionally used in Tamil]. The full alphabet is obtained as the cumulative sum of edges and vertices: 216 + (18 + 12 + 1) = 247.

Corollary: Most other alphasyllabary (abugida) scripts have a similar bipartite graph representation.
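The counting in the remark above can be checked with a short Python sketch. The list names and the row order of the consonants are my own assumptions for illustration; only the counts (18, 12, 216, 247) come from the text.

```python
PULLI = "\u0bcd"  # the dot that turns a consonant base into a pure mei letter

# V: the 12 uyir (vowel) letters.
VOWELS = ["அ", "ஆ", "இ", "ஈ", "உ", "ஊ", "எ", "ஏ", "ஐ", "ஒ", "ஓ", "ஔ"]
# Vowel signs in the same order; the sign for அ is empty.
VOWEL_SIGNS = [""] + [chr(c) for c in (0x0BBE, 0x0BBF, 0x0BC0, 0x0BC1, 0x0BC2,
                                       0x0BC6, 0x0BC7, 0x0BC8, 0x0BCA, 0x0BCB, 0x0BCC)]
# C: the 18 mei (consonant) letters, written as base + pulli.
BASES = ["க", "ங", "ச", "ஞ", "ட", "ண", "த", "ந", "ப",
         "ம", "ய", "ர", "ல", "வ", "ழ", "ள", "ற", "ன"]
CONSONANTS = [base + PULLI for base in BASES]
AYTHAM = "ஃ"  # the isolated node

# E: one edge per (consonant, vowel) pair, labelled with the uyirmei letter.
EDGES = {
    (base + PULLI, vowel): base + sign
    for base in BASES
    for vowel, sign in zip(VOWELS, VOWEL_SIGNS)
}

alphabet = set(VOWELS) | set(CONSONANTS) | set(EDGES.values()) | {AYTHAM}
print(len(EDGES), len(alphabet))
```

Every consonant is joined to every vowel, so the edge set is that of a complete bipartite graph, and the vertices plus edge labels plus the isolated node give the full letter set.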

Fig. 1: A fully connected Bipartite graph K(5,3). Credit: Wikipedia.

A group structure for Tamil

We can form a group structure for the Tamil alphabet in many ways; most simply, we may apply the residue classes modulo N, or the symmetric group of permutations on N elements, for any cardinality N. However, one interesting group structure with applications is the abstraction of the 247 Tamil letters written on a torus; in this essay I will attempt to describe it and show that it forms a group.

We consider the 247 Tamil letters formed by 1 ayutha letter and 12 uyir letters for 13 vowels, 18 mei letters for 18 consonants, and 216 uyirmei or conjugate letters [247 = 13 + 18 + 216]. Consider a mapping of the 13 vowels to Z13 [residue classes modulo 13] and of the 18 mei letters + the ayutha letter to Z19 [residue classes modulo 19].

Fig. 1: The Cayley table for Z13 can represent Uyir letters.
Fig. 2: The Cayley table for Z19 can represent Mei letters (with modification)


Further, we may represent each uyirmei letter as an index into a 2D table formed by rows of mei letters and columns of uyir letters. So, for example, the letter ‘கூ = க் + ஊ’ can be written as 6 + 1*13 = 19. Uyir letters are all represented by [0-12]; mei letters are represented as multiples of 13, [13, 26, 39, .. 234] for [க், ச், … ல், வ், ழ், ள்]. Uyirmei letters fill everything in between.

The general representation of a letter can be: t = a + 13*b, where a goes from [0-12] and b goes from [0-18]. This representation pegs ‘ஃ’ at the origin. In the direct product of Z13 and Z19 this will be represented as (a,b)

Letter representation in the product group: Z13 x Z19
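As a rough sketch of the indexing t = a + 13*b, the table can be written out in Python. The function names are hypothetical, and the row order (vallinam, then mellinam, then idaiyinam) is an assumption inferred from the mei sequence quoted above, which starts at க், ச் and ends at ழ், ள்.

```python
PULLI = "\u0bcd"

# Columns a = 0..12: the ayutha letter at the origin, then the 12 uyir letters.
COLUMNS = ["ஃ", "அ", "ஆ", "இ", "ஈ", "உ", "ஊ", "எ", "ஏ", "ஐ", "ஒ", "ஓ", "ஔ"]
# Sign attached to a consonant base in column a; column 0 gives the mei form.
SIGNS = [PULLI, ""] + [chr(c) for c in (0x0BBE, 0x0BBF, 0x0BC0, 0x0BC1, 0x0BC2,
                                        0x0BC6, 0x0BC7, 0x0BC8, 0x0BCA, 0x0BCB, 0x0BCC)]
# Rows b = 1..18 (assumed order); row b = 0 holds the vowels themselves.
BASES = ["க", "ச", "ட", "த", "ப", "ற", "ங", "ஞ", "ண",
         "ந", "ம", "ன", "ய", "ர", "ல", "வ", "ழ", "ள"]

def letter_at(a, b):
    """Letter in column a (0..12) and row b (0..18)."""
    if b == 0:
        return COLUMNS[a]
    return BASES[b - 1] + SIGNS[a]

def to_index(a, b):
    """Flatten the pair (a, b) to t = a + 13*b."""
    return a + 13 * b

def from_index(t):
    """Recover (a, b) from t in 0..246."""
    b, a = divmod(t, 13)
    return a, b

print(letter_at(6, 1), to_index(6, 1))
```

Under this ordering the 247 indices 0..246 each name a distinct letter, with the mei letters sitting at the multiples of 13.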


Further, since we showed that uyir and mei letters can be embedded into the Z13 and Z19 residue classes, and we know that 247 factors neatly into the two primes 13 and 19, we may use the Chinese remainder theorem (which guarantees that, given two coprime moduli, every pair of residues corresponds to exactly one residue modulo their product). In our case we are guaranteed that the direct product [direct sum] Z13 x Z19 is isomorphic to the group Z247. This is the key result in this essay:

The 247 Tamil letters have a representation in the group Z247, which is isomorphic to the direct product Z13 x Z19 of the uyir and mei group representations.

Key result – Group representation for Tamil alphabets
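The isomorphism can be verified numerically with a few lines of Python. This is only a sanity check of the Chinese remainder theorem for 13 and 19; the function names are hypothetical, and `pow(x, -1, m)` needs Python 3.8 or later. Note that the residue map below is not the same as the positional index t = a + 13*b: both are bijections, but only the residue map respects addition.

```python
from itertools import product

def crt_map(t):
    """Send t in Z247 to its pair of residues in Z13 x Z19."""
    return (t % 13, t % 19)

def crt_inverse(a, b):
    """Rebuild t in Z247 from residues a (mod 13) and b (mod 19)."""
    # 19 * inv(19 mod 13) is 1 mod 13 and 0 mod 19; likewise for the 13 term.
    return (a * 19 * pow(19, -1, 13) + b * 13 * pow(13, -1, 19)) % 247

# Bijection: 247 inputs give 247 distinct pairs, and the inverse undoes the map.
images = {crt_map(t) for t in range(247)}
round_trip = all(crt_inverse(*crt_map(t)) == t for t in range(247))

# Homomorphism: adding in Z247 matches adding componentwise in Z13 x Z19.
additive = all(
    crt_map((s + t) % 247)
    == ((crt_map(s)[0] + crt_map(t)[0]) % 13, (crt_map(s)[1] + crt_map(t)[1]) % 19)
    for s, t in product(range(247), repeat=2)
)
print(len(images), round_trip, additive)
```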

While the Chinese remainder theorem guarantees a ring structure, at this writing I don’t know of a second operator on the letters that can take the role of the product and make the ring structure meaningful.

Tamil Entry via Keypad – 9XYZ30-த-மி-ழ்

My earlier calculations need revising in terms of the estimates. I will not go into further detail here; my latest estimate shows the number of realizable keyboards to be 264,250,749,803,040, or roughly 264 trillion, which is a bit of an astronomical number.

The money questions are the following:

  1. Given the astronomical number of possible keyboards, is there one that is more easily decodable than the others? Yes, or no?
  2. Is there any decodable keyboard at all?
  3. Is a ‘1-800-FLOWERS‘ type of representation possible, at least for a few words in Tamil?

Today, I was toying with some simple designs and made them into software:

Fig 1: Simple 4×3 keypad layout in iOS

One particular realization of the keyboard is shown in the Excel sheet below, where roughly 20 Tamil letters are mapped onto each key. We also see the canonical 4×3 keypad matrix in rows 20-23, showing the 12 keypad positions into which the letters are mapped.

Fig 2: Mapping first 20 letters of Tamil alphabet set into a 4×3 keypad.

We show how the phone number “9XYZ30477” will mean “9XYZ30-த-மி-ழ்” in this keypad.

Fig. 3: A simple realization of keypad mapping in Tamil; e.g. the number “9XYZ30477” could be advertised as ‘9XYZ30-த-மி-ழ்’.

A few things immediately come to our attention:

  1. Entering user input in the keypad is easy; we follow a simple representation suggested by the natural language.
  2. However, we have some issues in realizing this keyboard, namely ambiguity: does ‘111’ in this keypad entry, with the mapping shown, mean ‘அக்கா’ or ‘கட்சி’?
  3. The “obvious” finite ring keypad mapping fails here.


  1. A simple keyboard realization of this scheme shows that typed words of equal length, like ‘அக்கா’ and ‘கட்சி’, can be completely undecidable (un-decodable). So our criterion is really that a good realizable keyboard maximizes word decidability, or minimizes word collisions.
  2. Ease of user input:
    We may also want to keep user entry on this keyboard easy [which the ‘obvious choice’ keyboard achieves] while still maintaining decodability.
  3. We identify the mapping used above with a simple algebraic structure similar to a finite semigroup, with commutativity, closure [an in-group operation], and an identity formed by the ‘ஃ’ ayutha letter. This is an interesting mapping, with potential to adapt the operator to create a full semigroup or group structure for the language.
  4. Finally we discover:

Letters with a high bi-gram frequency should not co-occur on the same keypad square. This is an operational principle that will reduce the ambiguity of the model. We will have to balance this against other decidability criteria, ease of user input, etc.

Operating Principle – we understand this from our failed experiment.
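The ambiguity described above can be reproduced with a toy Python sketch. The `TOY_KEYPAD` table is a hypothetical fragment covering only the letters of the three example words (keys 1, 4 and 7, as in the text); it is not the full layout from the spreadsheet, and the letter splitter is a simple stdlib approximation.

```python
import unicodedata
from collections import defaultdict

def split_letters(word):
    """Group each base character with the vowel signs or pulli that follow it."""
    letters = []
    for ch in word:
        if letters and unicodedata.category(ch) in ("Mn", "Mc"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters

# Hypothetical fragment of a letter-to-key table, for illustration only.
TOY_KEYPAD = {
    "அ": 1, "க்": 1, "கா": 1, "க": 1, "ட்": 1, "சி": 1,
    "த": 4, "மி": 7, "ழ்": 7,
}

def key_sequence(word, keypad):
    """Digits a user would press to enter the word, one press per letter."""
    return "".join(str(keypad[letter]) for letter in split_letters(word))

def collisions(words, keypad):
    """Key sequences shared by more than one word."""
    groups = defaultdict(list)
    for word in words:
        groups[key_sequence(word, keypad)].append(word)
    return {keys: ws for keys, ws in groups.items() if len(ws) > 1}

print(key_sequence("தமிழ்", TOY_KEYPAD))
print(collisions(["அக்கா", "கட்சி", "தமிழ்"], TOY_KEYPAD))
```

Counting such collisions over a word list would be one way to score a candidate layout; moving frequently adjacent letters onto different keys should lower the count.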

This type of keyboard design could equally apply to other abugida languages, which includes most Indian languages.

Not Durian

It is easy to confuse Jackfruit and Durian. Jackfruit is one of the famous ‘muk kani’ [முக்கனி – மா, பலா, வாழை] trio of fruits from Tamilnadu – Mango, Jack and Banana. Durian is not quite native to Tamilnadu [AFAIK], and is more popular in equatorial south east Asia. Not to be outdone, Tamil people have gotten a taste for this fruit as well; globally, however, Durian aficionados remain a minority – the fruit is more widely known for being banned from airlines, airports and public arenas because of its smell, which is somewhat off-putting to people unfamiliar with its taste; those ignorant of its finer qualities have no inclination toward the fruit and continue to cast it in a bad light.

One day last year, during the Thanksgiving holiday here in California, I went out to a grocery store in the Bay Area. Silicon Valley, Lyndon B. Johnson’s opening of America’s gates to Asian immigrants, the Gold Rush, and the Spanish Missions have, in reverse chronological order, settled this area with several immigrant populations – and today we are thankful for the bountiful Pan-Asian, European and Hispanic options in the area.

At this grocery store there was a big sign, “NOT DURIAN”, and 1 lb pieces of fruit were marked $5. Fresh Jackfruit is pretty much unheard of in the USA, except when imported and sliced open by immigrant-run grocery stores in diverse communities. The Bay Area definitely qualifies as such a place. While the sign was written with the intent of inviting Durian-wary folk to try the Jackfruit, it did leave a bad taste even before trying the fruit.

Jackfruit pieces – Not Durian! – https://en.wikipedia.org/wiki/Jackfruit

Maybe, just maybe, our languages and heritage suffer from bad publicity and marketing, and sometimes from misrepresentation and misinformation, which turn away new speakers, learners and teachers, and hold back adoption of the language in newer markets and products. Maybe our languages are not Durian. We are the Jackfruit.

I have often heard people say in Tamil: “Tamil is like a jackfruit; coming in from the outside it looks all thorns, but luscious segments of fruit are waiting once you get past those thorns!” Effort brings success [முயற்சி திருவினையாகும்].

P.S.: Images credit Wikipedia.

Language Transformations

Question of Translation

How can you convert a text like “Mi amor!” to “என் உயிரே!” [from Spanish to தமிழ்]? Let’s assume we have Spanish to English and Tamil to English translators [bidirectional with English]; then we can convert Spanish to English and then to Tamil. Likewise, one can translate between any two languages from a connected group of languages [that is, a group in which a chain of translators links every language to every other].

Development – Theory

Language can exist as text (print/message/document) or speech (audio, conversations) etc. Ideas are represented in any language. Ideas originate from one language and move to another, or sometimes originate in many languages simultaneously. Ideas can cross from one language to another via text or speech.

In mathematical terms, if we write L as the set of languages, L = {L1, L2, .. Ln}, and define each language as a tuple Li = (Ti,Si) of its text and speech forms, then we may further define a mathematical function operating on text and converting it to speech as:

TTSi : Ti -> Si

we may define a function speech recognition as,

ASRi : Si -> Ti

we may also define a translation function as,

TXij : Li -> Lj

Essentially, by representing each language as a node in a graph with two parts to it, text and speech, we may connect these parts to each other via edges (functions) like ASR and TTS, and to the nodes of other languages via translator function edges.

In a graph with only two languages [English, Tamil] with all edges representing functions like TTS, ASR within same language and functions like Translator between two languages (one for each direction) we see a graph like the following:

Fig. 1: Language transformation graph. Nodes represent languages and their components. Edges represent functions like TTS, ASR [for same language] and Translators [directional between languages]. Clearly we may see this is a directed graph with ability to go from a specific language to another language in text or speech or both forms, provided a path exists from source to target language. Using such a graph with no orphan nodes, we may have universal translation powers from language A to language B [so far as bidirectional connectivity is present with at least one neighbor].
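A minimal Python sketch of such a graph is given below, under a few assumptions of mine: nodes are (language, form) pairs, translators act text-to-text, and the language codes and edge labels are made up for illustration. Breadth-first search then finds a chain of functions from a source node to a target node, if one exists.

```python
from collections import deque

def build_graph(languages, translators):
    """Adjacency lists: node -> list of (next_node, function_label)."""
    graph = {}
    for lang in languages:
        graph[(lang, "text")] = [((lang, "speech"), f"TTS_{lang}")]
        graph[(lang, "speech")] = [((lang, "text"), f"ASR_{lang}")]
    for src, dst in translators:  # one directed edge per translator
        graph[(src, "text")].append(((dst, "text"), f"TX_{src}_{dst}"))
    return graph

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of function labels, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for nxt, label in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [label]))
    return None

# Hypothetical setup: Spanish and Tamil each translate to and from English;
# "xx" stands for a language with no translators at all.
graph = build_graph(
    ["es", "en", "ta", "xx"],
    [("es", "en"), ("en", "es"), ("en", "ta"), ("ta", "en")],
)
print(shortest_path(graph, ("es", "speech"), ("ta", "speech")))
print(shortest_path(graph, ("xx", "text"), ("ta", "text")))
```

The first call chains speech recognition, two translators, and speech synthesis; the second returns None because the orphan language has no outgoing translator edge.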

Problems to Ponder

So the curious reader, now having the background to represent the translation problem as a graph problem of reaching node B from node A, can use the rich set of path-finding and shortest-distance algorithms to attempt to answer some of these questions:

  1. What is the graph criterion for a language to have no translations?
  2. What is the graph criterion for a language to not be able to have a virtual assistant? [Siri, Cortana, Alexa etc.]
  3. Conversely to 2, what is the minimum criterion [necessary but not sufficient] to have a virtual assistant [that can speak and listen]?
  4. Given two paths of different lengths for translating from language A -> F, which one would you choose and why? Assume all jumps have a uniform information loss. What if the information loss at each edge is non-uniform; how can you optimize such a problem?
  5. How would you introduce a new language into this graph so that it may be translated to all other languages [unidirectionally]?
  6. How would you introduce a new language into this graph so that it can be bi-directionally translated ?
  7. How can you represent the transliteration function in this graph ?
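As one possible way to frame question 4 (not a definitive answer), the information loss can be modelled as a non-negative cost that adds up along a path, after which Dijkstra's algorithm picks the path of least total loss. The additive model, the node names, and the loss values below are all assumptions for illustration.

```python
import heapq

def least_loss_path(edges, start, goal):
    """Dijkstra's algorithm; edges maps node -> list of (neighbor, loss)."""
    heap = [(0, start, [start])]
    settled = set()
    while heap:
        loss, node, path = heapq.heappop(heap)
        if node == goal:
            return loss, path
        if node in settled:
            continue
        settled.add(node)
        for nxt, step_loss in edges.get(node, []):
            if nxt not in settled:
                heapq.heappush(heap, (loss + step_loss, nxt, path + [nxt]))
    return None

# Two routes from A to F: a short one via B and a longer one via C and D.
uniform = {"A": [("B", 1), ("C", 1)], "B": [("F", 1)],
           "C": [("D", 1)], "D": [("F", 1)]}
# Same shape, but the translator B -> F is assumed to be very lossy.
lossy = {"A": [("B", 1), ("C", 1)], "B": [("F", 5)],
         "C": [("D", 1)], "D": [("F", 1)]}

print(least_loss_path(uniform, "A", "F"))
print(least_loss_path(lossy, "A", "F"))
```

With uniform losses the path with fewer jumps wins, which reduces to plain breadth-first search; with non-uniform losses a longer chain of good translators can beat a short chain containing one poor translator.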

Answers will be posted soon! Feel free to leave your comments in section below.


Native Names in Programming

Greetings, readers!

A friend tweeted that, because the Julia language supports Unicode, he has taken to writing programs using Tamil names. The same can be done in Ezhil, JavaScript, Python, Clojure, and Clisp.

It seems that everyone likes native names. Why? Apache, a popular project named after a Native American tribe, has been running for many years. Within it, the Maven project adopted a word from Yiddish, a Jewish language, as its name.

Maven, a Yiddish word meaning accumulator of knowledge,

The Apache Maven project uses words from two languages: Apache and Maven are names originating from a Native American tribe and from Yiddish, respectively.

In English we transliterate the name as ‘Kanmani’; but if you think in terms of Tamil names, and one fits in that spot, why not just try it out and see!
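As a small illustration in Python 3, which accepts Unicode letters and combining marks in identifiers, here is a sketch with Tamil names. The names themselves are made up for this example: `கண்மணி` (Kanmani) is a variable, `வணக்கம்` (greeting) is a function, and `பெயர்` (name) is its parameter.

```python
# A Tamil variable name holding the transliterated form of the same name.
கண்மணி = "Kanmani"

def வணக்கம்(பெயர்):
    """Return a greeting for the given name."""
    return f"வணக்கம், {பெயர்}!"

print(வணக்கம்(கண்மணி))
```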



Tamil Language – Longest word and Lexicography

Hello traveler and Tamil language aficionado. Today I’m researching Tamil lexicography, and I’m sharing the results of my searches through this blog post. It is more on the research side than a demo or an expository post.

Longest Tamil Word

  1. Senthil Nathan of Arithi.com has blogged about using UTF-8 in Tamil text processing, something we like at the open-tamil project. Check out our Python code if you have not already.
  2. In this article he posits that the longest Tamil word has to be the proper noun “திருவாலவாயுடையார்திருவிலையாடற்புராணம்“. Any comments on that? I think if we looked at verbs or adjectives we may reach the proper answer. Let’s try to answer this question with the open-tamil tools (assuming you have installed them!) by typing this code at the Python shell.
>>> import tamil
>>> len(tamil.utf8.get_letters(u'திருவாலவாயுடையார்திருவிலையாடற்புராணம்'))
20
  3. Now we realize this is only 20 letters long. Comparatively, the English word ‘pneumonoultramicroscopicsilicovolcanoconiosis‘, a lung disease caused by inhaling fine silica dust, measures up to a whopping 45 letters long!
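If open-tamil is not at hand, the letter count can be approximated with the Python standard library alone. This sketch is not the open-tamil implementation: it simply attaches every combining mark (vowel sign or pulli) to the base character before it, and it does not treat conjuncts such as க்ஷ or ஸ்ரீ as single letters.

```python
import unicodedata

def split_letters(word):
    """Split a Tamil word into letters: a base character plus its marks."""
    letters = []
    for ch in word:
        # Mn and Mc are the Unicode categories of the pulli and vowel signs.
        if letters and unicodedata.category(ch) in ("Mn", "Mc"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters

word = "திருவாலவாயுடையார்திருவிலையாடற்புராணம்"
print(len(split_letters(word)))
```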

 Update #2 – Longest Tamil word! (04/28/2014)

Since the original post I have a possible candidate word (not a proper-noun) which is 15 letters long, “புத்திரபௌத்திரபாரம்பரியம்”. Look up புத்திரபௌத்திரபாரம்பரியம். See also words,

  1. புத்திரபௌத்திரபாரம்பரியம்
  2. முப்பொழுதுந்திருமேனிதீண்டுவார்
  3. ஒதுக்குப்பொதுக்குப்பண்ணுதல்

Lexicographic Order – Dictionary Order

In English, ‘AVOCADO’ comes after the word ‘APPLE’ in the dictionary because of the dictionary-order, or lexicographic, convention. It is often perplexing to me that Tamil language sorting is not well defined.

  1. Our 12 vowels, ‘அ, ஆ, இ, ஈ, … ஒ, ஓ, ஔ’, followed by ‘ஃ’, are well ordered.
  2. But the 18 consonants, ‘க, ச, ட, த, ப, ற, … ஞ, ங, ண, ந, ம, ன’, are not, because there is more than one ordering. What is the norm here?
  3. So in combination the 247 Tamil letters don’t have a canonical dictionary order.
  4. This lack of lexicographic ordering convention makes dictionary ordering of Tamil words difficult. Clearly we could make a choice, but what is the norm?
  5. What is your language solution? Share your comments and strategies.

Afterword (Update)

  1. It turns out it is not too hard to implement lexicographic ordering in Python/Open-Tamil.
  2. The code requires you to define a comparison function and use it with the sort() method. The comparison function knows the relative ordering of the Tamil letters, as you can see in this commit, which defines the functions
    1. def compare_words_lexicographic( word_a, word_b ):
    2. def all_tamil( words ):
  3. Turns out all this is pretty neat stuff for the Tamil text processing in open-tamil/utf-8 package!
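For readers on Python 3, where sort() takes a key function rather than a comparison function (a comparison function can still be wrapped with functools.cmp_to_key), here is a rough stdlib sketch of the idea. It is not the open-tamil code: `sort_key` and `letter_rank` are hypothetical names, and the order chosen (vowels, then ஃ, then consonants in the traditional க, ங, ச, … ன sequence, with the pure consonant before its vowel forms) is one possible convention.

```python
import unicodedata

PULLI = "\u0bcd"
VOWELS = ["அ", "ஆ", "இ", "ஈ", "உ", "ஊ", "எ", "ஏ", "ஐ", "ஒ", "ஓ", "ஔ", "ஃ"]
BASES = ["க", "ங", "ச", "ஞ", "ட", "ண", "த", "ந", "ப",
         "ம", "ய", "ர", "ல", "வ", "ழ", "ள", "ற", "ன"]
# Rank of what follows a consonant base: pulli first, then no sign, then signs.
SIGNS = [PULLI, ""] + [chr(c) for c in (0x0BBE, 0x0BBF, 0x0BC0, 0x0BC1, 0x0BC2,
                                        0x0BC6, 0x0BC7, 0x0BC8, 0x0BCA, 0x0BCB, 0x0BCC)]

def split_letters(word):
    """Group each base character with the combining marks that follow it."""
    letters = []
    for ch in unicodedata.normalize("NFC", word):
        if letters and unicodedata.category(ch) in ("Mn", "Mc"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters

def letter_rank(letter):
    """Rank one letter as (row, column) in the assumed alphabet order."""
    if letter in VOWELS:
        return (0, VOWELS.index(letter))
    base, sign = letter[0], letter[1:]
    if base in BASES and sign in SIGNS:
        return (1 + BASES.index(base), SIGNS.index(sign))
    return (len(BASES) + 1, ord(base))  # anything else sorts last

def sort_key(word):
    return [letter_rank(letter) for letter in split_letters(word)]

print(sorted(["மா", "அம்மா", "கண்", "கடல்"], key=sort_key))
```

Swapping in a different `BASES` order is all it takes to try another convention, which makes it easy to compare candidate norms on a real word list.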