Tamil Entry via Keypad – 9XYZ30-த-மி-ழ்

Previously, My initial calculations can be revised in terms of the estimates. I will not go into further detail here; my latest estimate shows the number of realizable keyboards to be 264,250,749,803,040 or 264billion – a bit of an astronomical number.

The money questions are the following:

  1. Given the astronomical size of keyboards possible is there one that is easily decodable than the other ? Yes, or no ?
  2. Is there any decodable keyboard at all?
  3. Is there a ‘1-800-FLOWERS‘ type of representation possible atleast for a few words in Tamil ?

Today, I was toying with some simple designs and made it into software:

Fig 1: Simple 4×3 keypad layout in iOS

One particular realization of the keyboard looks like where 20 Tamil letters are roughly mapped into 1 keypad as shown in the excel sheet below. We also see the canonical 4×3 keypad matrix in the rows 20-23 showing the 12 keypad positions where 20 letters are going to be mapped into.

Fig 2: Mapping first 20 letters of Tamil alphabet set into a 4×3 keypad.

We show how the phone number “9XYZ30477” will mean “9XYZ30-த-மி-ழ்” in this keypad.

Fig. 3: A simple realization of keypad mapping in Tamil; e.g. number “9XYZ30477” would can be advertised as ‘9XYZ30-த-மி-ழ்’.

Immediately few things are coming to our attention:

  1. Entering user input in the keypad is easy; we follow a simple natural language suggested representation
  2. However, we have some issues in realizing this keyboard – ambiguity: Does ‘111’ in this keypad entry, with following mapping shown, mean ‘அக்கா’ or ‘கட்சி’ ?
  3. The “obvious” finite ring keypad mapping fails here.

Realizations:

  1. Whereas a simple keyboard realization of this scheme shows words typed of equal length like ‘அக்கா’ and ‘கட்சி’ are completely undecidable/un-decodeable. So our criteria is really the good realizable keyboard maximizes the word decidability, or minimizes word collision.
  2. Ease of user input:
    Also we may want to make ease of user entry into this keyboard simpler [which the ‘obvious choice’ keyboard contains] while still maintaining the decodability.
  3. We identify the mapping used above with a simple algebraic structure similar to a finite semi-group with operations of commutativity, in-group operation and identity formed by ‘ஃ’ ayutha letter. This is a interesting mapping with potential to adapt the operator for creating a full semigroup or group structure for the language.
  4. Finally we discover:

The letters with the high bi-gram frequency may not be co-occurring in the same keypad square. This is an operational principle that will reduce the ambiguity of the model. We will have to balance this with other decidability criteria of user input etc.

Operating Principle – we understand this from our failed experiment.

This type of keyboard design could also equally apply for other Abugida languages – which is most Indian languages.

Tamil in Morse-code

Can we compose a Tamil Morse-code ? Yes, we can.

315px-International_Morse_Code.svg

International Morse Code – Source: Wikipedia

  1. Start with a frequency count of Tamil letters from various sources
  2. Build a probability distribution from the frequency counts
  3. Build a Huffman code using the above distribution
  4. Each letter of Tamil alphabet gets a Morse code : 0 = ‘.’, 1 – ‘-‘.
    புள்ளி, கோடு.

Tamil Morse Code Table generated from Open-Tamil library. See here for full code and methodology. Full table follows.

Can you decode what this Morse code means in Tamil ? Hint: 2 words (4,5) letters long

...-. --.--.. .---..--.--- .-..-. ...-. ---.-. -----.--.- .--....- ..-..-

Please note table was updated to show letters in most-frequent to least-frequent alphabets and their code-words used. Updated after publishing on Aug 16th, 2018.

Source coding theory

Information theory provides us with tools to calculate the information content of symbols in a language, i.e. alphabets in our case. Average codeword length was 6.45652 bits, which is rounded to 7bits.
According to 230+ symbols of encoded in binary without attention to letter frequency we would be using ceil[ log2[230] ] ~ 8bits per symbol, so the usage of Morse code provides a related data compression of 12.5%!

Previously, I had written about Morse code for Tamil in this blog here, and relationship with Unigram, Bigram and Trigram models and word-structure in Tamil language.

  1. ம் -> --..
  2. த -> -...
  3. க -> ...-.
  4. ல் -> ..---
  5. த் -> ----.
  6. க் -> -.---
  7. ன் -> -.--.
  8. ர -> .....-
  9. ப -> ....--
  10. வ -> ..--.-
  11. தி -> ..-..-
  12. ச -> ..-.-.
  13. கு -> .----.
  14. ம -> .---.-
  15. ப் -> .--..-
  16. ட் -> .--.-.
  17. டு -> .-...-
  18. ர் -> .-..-.
  19. ய -> .-.-.-
  20. அ -> ---..-
  21. ட -> ---.--
  22. ரு -> ---.-.
  23. பு -> -..---
  24. கா -> -..--.
  25. து -> -..-.-
  26. ல -> -.-..-
  27. வி -> .......
  28. டி -> ....-..
  29. ண் -> ....-.-
  30. சி -> ...---.
  31. ன -> ..--...
  32. ரி -> ..-....
  33. ங் -> ..-...-
  34. ந் -> ..-.---
  35. ற் -> .-----.
  36. இ -> .--...-
  37. று -> .-..---
  38. ச் -> .-....-
  39. சு -> .-..--.
  40. பா -> .-.----
  41. கி -> .-.--..
  42. பி -> .-.--.-
  43. வா -> .-.-...
  44. மு -> -----..
  45. ள் -> ---....
  46. லை -> --.--..
  47. உ -> --.--.-
  48. டை -> --.-..-
  49. தா -> --.-.--
  50. ண -> -..-...
  51. கை -> -..-..-
  52. ஆ -> -.-...-
  53. மா -> -.-.---
  54. ய் -> -.-.-.-
  55. ள -> ......-.
  56. சா -> ...--..-
  57. ற -> ...--.--
  58. லி -> ..--..--
  59. வு -> .---...-
  60. கொ -> .---..-.
  61. ந -> .--.....
  62. நி -> .--....-
  63. ஞ் -> .--.----
  64. ரா -> .--.---.
  65. ணி -> .--.--..
  66. ளி -> .--.--.-
  67. யா -> .-......
  68. நா -> .-.-..--
  69. றி -> .-.-..-.
  70. கோ -> -------.
  71. செ -> ------..
  72. ழி -> ------.-
  73. னி -> -----.-.
  74. ழு -> --.-----
  75. மி -> --.----.
  76. யி -> --.-....
  77. பொ -> --.-.-..
  78. ரை -> --.-.-.-
  79. வெ -> -.-.....
  80. எ -> -.-.--..
  81. மை -> -.-.--.-
  82. றை -> -.-.-..-
  83. பூ -> ......--.
  84. ழ -> ...-----.
  85. னை -> ...----..
  86. லா -> ...--.-..
  87. சை -> ..--..-.-
  88. வை -> ..-.--...
  89. போ -> ..-.--..-
  90. கூ -> ..-.--.-.
  91. வே -> .--------
  92. டா -> .-------.
  93. தை -> .------..
  94. பெ -> .---....-
  95. ளை -> .---..---
  96. தே -> .-.---...
  97. ஒ -> .-.---.--
  98. ழ் -> -----.---
  99. லு -> ---...---
  100. நீ -> ---...-..
  101. சீ -> ---...-.-
  102. தீ -> --.---...
  103. மூ -> --.---..-
  104. தொ -> --.---.--
  105. ணை -> --.---.-.
  106. ஏ -> --.-...-.
  107. நெ -> -.-....-.
  108. ளு -> -.-.-....
  109. னா -> ......----
  110. சூ -> ......---.
  111. மே -> ...-------
  112. தோ -> ...------.
  113. தெ -> ...----.-.
  114. சொ -> ...--.....
  115. சே -> ...--....-
  116. தூ -> ...--...--
  117. யு -> ...--...-.
  118. பே -> ...--.-.--
  119. வீ -> ..--..-..-
  120. ஊ -> .------.--
  121. னு -> .---......
  122. யோ -> .---.....-
  123. சோ -> .---..--..
  124. கே -> .-.....---
  125. ழை -> .-.....--.
  126. ணு -> .-.---..--
  127. ஓ -> .-.---.-..
  128. கெ -> ----------
  129. கீ -> --------..
  130. றா -> --------.-
  131. பை -> -----.--..
  132. ணா -> -----.--.-
  133. ரோ -> ---...--.-
  134. மொ -> -.-....--.
  135. மெ -> -.-.-...--
  136. லோ -> ...----.---
  137. பீ -> ...----.--.
  138. ளா -> ...--.-.-.-
  139. ஈ -> ..--..-....
  140. ஞா -> ..--..-...-
  141. மீ -> ..-.--.----
  142. வ் -> ..-.--.--..
  143. மோ -> ..-.--.--.-
  144. நு -> .---..--.-.
  145. ஐ -> .-.....-..-
  146. ரே -> .-.....-.-.
  147. நோ -> .-.---..-.-
  148. நே -> .-.---.-.--
  149. நூ -> ---------..
  150. யெ -> --.-...----
  151. லே -> --.-...--..
  152. ரீ -> -.-....----
  153. நொ -> -.-....---.
  154. யை -> -.-.-...-..
  155. ழா -> ...--.-.-...
  156. ரூ -> ...--.-.-..-
  157. னோ -> .------.-.--
  158. ஞ -> .---..--.---
  159. யூ -> .---..--.--.
  160. வோ -> .-.....-....
  161. யே -> .-.....-.---
  162. லெ -> .-.---..-...
  163. ரெ -> .-.---.-.-.-
  164. ணீ -> ---...--....
  165. டோ -> ---...--..--
  166. டெ -> ---...--...-
  167. கௌ -> ---...--..-.
  168. ணெ -> --.-...---..
  169. சௌ -> --.-...---.-
  170. றெ -> ..-.--.---...
  171. லூ -> ..-.--.---..-
  172. றோ -> .------.-....
  173. னே -> ..-.--.---.--
  174. னீ -> .------.-..-.
  175. நை -> .------.-..--
  176. டூ -> .------.-.-..
  177. னெ -> .-.....-.--..
  178. டே -> .-.....-.--.-
  179. ஞெ -> .-.---..-..--
  180. ளெ -> .-.---.-.-...
  181. டீ -> ---------.---
  182. யொ -> ---------.--.
  183. பௌ -> ---------.-..
  184. ஃ -> --.-...--.---
  185. ஔ -> --.-...--.-..
  186. ஞை -> -.-.-...-.---
  187. யீ -> -.-.-...-.--.
  188. றொ -> -.-.-...-.-.-
  189. வொ -> .------.-...--
  190. வூ -> ..-.--.---.-..
  191. னூ -> .------.-.-.--
  192. ளோ -> .-.....-...---
  193. ணோ -> .------.-.-.-.
  194. றே -> .-.....-...--.
  195. மௌ -> .-.....-...-..
  196. தௌ -> .-.---..-..-..
  197. ளே -> .-.---.-.-..-.
  198. லொ -> .-.---.-.-..--
  199. றூ -> ---------.-.--
  200. ரொ -> --.-...--.--..
  201. டொ -> --.-...--.-.-.
  202. ங -> -.-.-...-.-...
  203. ணே -> ..-.--.---.-.--
  204. ளீ -> .------.-...-..
  205. ழூ -> .-.....-...-.-.
  206. ளொ -> .-.---..-..-.-.
  207. ரௌ -> .-.---..-..-.--
  208. யௌ -> ---------.-.-..
  209. னொ -> ---------.-.-.-
  210. ழோ -> --.-...--.-.--.
  211. ளூ -> --.-...--.-.---
  212. ஞி -> -.-.-...-.-..--
  213. ணொ -> .-.....-...-.---
  214. ணூ -> .------.-...-.--
  215. ழீ -> .-.....-...-.--.
  216. ஸ் -> --.-...--.--.--.
  217. வௌ -> -.-.-...-.-..-..
  218. ஞீ -> --.-...--.--.---
  219. ஷ் -> ..-.--.---.-.-...
  220. ஷி -> ..-.--.---.-.-..-
  221. ழெ -> ..-.--.---.-.-.-.
  222. றீ -> .------.-...-.-.-
  223. நௌ -> ..-.--.---.-.-.--
  224. ஞே -> .------.-...-.-..
  225. லௌ -> --.-...--.--.-..-
  226. ஞொ -> -.-.-...-.-..-.--
  227. ஙு -> --.-...--.--.-...
  228. ஷ -> --.-...--.--.-.---
  229. ழொ -> --.-...--.--.-.--.
  230. ழே -> -.-.-...-.-..-.-.
  231. டௌ -> --.-...--.--.-.-.-
  232. ஞூ -> --.-...--.--.-.-..

Caveats and Closing Comments

Of course 15 of 247 letters are perhaps not received any codeword in this codebook. Further with inclusion of Grantha letters, 323 letters exist in Tamil some of which we don’t have code words.

Further, a large text corpus like Project Madurai’s [PM] unigram frequency distribution maybe useful to develop a widely representative Morse code table. Once you have this PM unigram data, you know how to get this Tamil Morse codebook regenerated!