Language Transformations

Question of Translation

How can you convert a text like “Mi Amor!” to “என் உயிரே!” [from Spanish to Tamil (தமிழ்)]? Let's assume we have Spanish-English and Tamil-English translators [each bidirectional with English]; then we can translate from Spanish to English and then from English to Tamil. Likewise, one can translate between any two languages from a clique of languages [the term is used loosely here: each language can be translated to and from at least one other language in the clique, and the clique as a whole is connected].

Development – Theory

Language can exist as text (print/message/document) or speech (audio, conversations) etc. Ideas are represented in any language. Ideas originate from one language and move to another, or sometimes originate in many languages simultaneously. Ideas can cross from one language to another via text or speech.

In mathematical terms, if we write L as the set of languages, L = { L1, L2, ..., Ln }, and define each language as a tuple Li = (Ti, Si) of its text and speech forms, then we may define a function that operates on text and converts it to speech as:

TTSi : Ti -> Si

We may define a speech recognition function as,

ASRi : Si -> Ti

We may also define a translation function as,

TXij : Li -> Lj

Essentially, we represent each language as a node in a graph with two parts, text and speech. We connect the two parts of a language to each other via edges that are functions like ASR and TTS, and connect them to the nodes of other languages via translator function edges.

In a graph with only two languages [English, Tamil], with edges for functions like TTS and ASR within each language and Translator edges between the two languages (one for each direction), we see a graph like the following:

Fig. 1: Language transformation graph. Nodes represent languages and their components. Edges represent functions like TTS, ASR [within the same language] and Translators [directional, between languages]. Clearly, this is a directed graph in which we can go from one language to another in text, speech, or both forms, provided a path exists from the source to the target language. Using such a graph with no orphan nodes, we may have universal translation powers from language A to language B [so long as each language has bidirectional connectivity with at least one neighbor].
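To make the figure concrete, here is a minimal sketch of the two-language graph in Python. The node names, the edge labels (TTS_en, TX_en_ta and so on) and the find_path helper are all made up for illustration, and the sketch assumes that translators connect text nodes only. A breadth-first search then returns the chain of functions with the fewest hops.

```python
from collections import deque

# Hypothetical graph for English (en) and Tamil (ta).
# Each node is (language, form); each edge is (function label, target node).
# Assumption: translators (TX) operate on text only.
EDGES = {
    ("en", "text"): [("TTS_en", ("en", "speech")), ("TX_en_ta", ("ta", "text"))],
    ("en", "speech"): [("ASR_en", ("en", "text"))],
    ("ta", "text"): [("TTS_ta", ("ta", "speech")), ("TX_ta_en", ("en", "text"))],
    ("ta", "speech"): [("ASR_ta", ("ta", "text"))],
}


def find_path(source, target):
    """Return the edge labels along a fewest-hops path, or None if unreachable."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, labels = queue.popleft()
        if node == target:
            return labels
        for label, neighbor in EDGES.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, labels + [label]))
    return None


# English speech to Tamil speech: recognize, translate, then synthesize.
print(find_path(("en", "speech"), ("ta", "speech")))
# prints ['ASR_en', 'TX_en_ta', 'TTS_ta']
```

Breadth-first search treats every edge as equally costly, which is enough to decide whether a path exists at all.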

Problems to Ponder

So the curious reader, now having the background to represent the translation problem as a graph problem of reaching node B from node A, can use the rich set of path-finding and shortest-distance algorithms to attempt to answer some of these questions:

  1. What is the graph criterion for a language to have no translations?
  2. What is the graph criterion for a language to be unable to have a virtual assistant [Siri, Cortana, Alexa, etc.]?
  3. Conversely to 2, what is the minimum criterion [necessary but not sufficient] for a language to have a virtual assistant [that can speak and listen]?
  4. Given two paths of different lengths for translating from language A -> F, which one would you choose, and why? Assume all jumps have a uniform information loss. What if the information loss at each edge is non-uniform; how can you optimize such a problem?
  5. How would you introduce a new language into this graph so that it may be translated to all other languages [unidirectionally]?
  6. How would you introduce a new language into this graph so that it can be bidirectionally translated?
  7. How can you represent the transliteration function in this graph?

Answers will be posted soon! Feel free to leave your comments in the section below.

-Muthu

Native Names in Programming

Hello, readers!

A friend tweeted that, because the Julia language supports Unicode, he has taken to writing programs with Tamil names. The same can be done in Ezhil, JavaScript, Python, Clojure, and Clisp.
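As a small illustration in Python 3, which accepts Unicode letters and combining marks in identifiers, here is a sketch with Tamil names for a function, its parameter and a variable. The names themselves (வணக்கம் for "greeting", பெயர் for "name", கண்மணி for "Kanmani") are just examples chosen for this post.

```python
def வணக்கம்(பெயர்):
    """Return a greeting for the given name (பெயர்)."""
    return "வணக்கம், " + பெயர் + "!"


# A Tamil variable name holding the transliterated name.
கண்மணி = "Kanmani"

print(வணக்கம்(கண்மணி))
# prints: வணக்கம், Kanmani!
```

str.isidentifier() is a quick way to check whether a given name would be accepted before using it in a program.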

It seems that everyone likes native names. Why? Apache, a popular project named after a Native American tribe, has been running for many years. Within it, the Maven project adopted a word from Yiddish, a Jewish language, as its name.

Maven, a Yiddish word meaning accumulator of knowledge,

The Apache Maven project uses words from both a Native American language and Yiddish: Apache and Maven are names originating from a Native American tribe and the Yiddish language, respectively.

In English we transliterate the name as ‘Kanmani’; but if you think in terms of Tamil names and one fits in that spot, why not just try using it!

With regards,

-Muthu

Tamil Language – Longest word and Lexicography

Hello, traveler and Tamil language aficionado. Today I’m researching Tamil lexicography, and I’m sharing the results of my searches through this blog post. It is more on the research side than a demo or an expository post.

Longest Tamil Word

  1. Senthil Nathan of Arithi.com has blogged about using UTF-8 in Tamil text processing, something we like at the open-tamil project. Check out our Python code if you have not already.
  2. In this article he posits that the longest Tamil word has to be the proper noun “திருவாலவாயுடையார்திருவிலையாடற்புராணம்”. Any comments on that? I think if we looked at verbs or adjectives we may reach the proper answer. Let's try to answer this question with the open-tamil tools (assuming you have installed them!) by typing the code at the Python shell.
>>> import tamil
>>> len(tamil.utf8.get_letters(u'திருவாலவாயுடையார்திருவிலையாடற்புராணம்'))
20

  3. Now we realize this is only 20 letters long. By comparison, the English word ‘pneumonoultramicroscopicsilicovolcanoconiosis’, a lung disease caused by inhaling fine silica particles, is a whopping 45 letters long!

Update #2 – Longest Tamil word! (04/28/2014)

Since the original post I have found a possible candidate word (not a proper noun) which is 15 letters long, “புத்திரபௌத்திரபாரம்பரியம்”. Look up புத்திரபௌத்திரபாரம்பரியம். See also the words:

  1. புத்திரபௌத்திரபாரம்பரியம்
  2. முப்பொழுதுந்திருமேனிதீண்டுவார்
  3. ஒதுக்குப்பொதுக்குப்பண்ணுதல்
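If you do not have open-tamil at hand, the letter counts can be approximated with only the standard library. The sketch below is not how open-tamil's get_letters works; split_letters is a hypothetical helper that simply attaches each combining mark (vowel sign or pulli) to the base character before it, which is an assumption that holds for plain Tamil words like these.

```python
import unicodedata


def split_letters(word):
    """Group each base character with the combining marks that follow it."""
    letters = []
    for ch in word:
        # Unicode categories Mn/Mc cover Tamil vowel signs and the pulli.
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


candidates = [
    "புத்திரபௌத்திரபாரம்பரியம்",
    "முப்பொழுதுந்திருமேனிதீண்டுவார்",
    "ஒதுக்குப்பொதுக்குப்பண்ணுதல்",
]

# Print the approximate letter count next to each candidate word.
for word in candidates:
    print(len(split_letters(word)), word)
```

Counting code points with len(word) alone would overcount, since a letter such as தி is stored as two code points.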

Lexicographic Order – Dictionary Order

In the English language, ‘AVOCADO’ comes after the word ‘APPLE’ in the dictionary because of the dictionary-order, or lexicographic, convention. It is often perplexing to me that Tamil language sorting is not well defined.

  1. Our 12 vowels, ‘அ, ஆ, இ, ஈ, … ஒ, ஓ, ஔ’ (followed by the aytham, ‘ஃ’), are well ordered.
  2. But the 18 consonants, ‘க, ச, ட, த, ப, ற, … ஞ, ங, ண, ந, ம, ன’, are not, because there is more than one ordering. What is the norm here?
  3. So in combination the 247 Tamil letters don’t have a canonical dictionary order.
  4. This lack of lexicographic ordering convention makes dictionary ordering of Tamil words difficult. Clearly we could make a choice, but what is the norm?
  5. What is your language solution? Share your comments and strategies.

Afterword (Update)

  1. Turns out it is not too hard to implement lexicographic ordering in Python/Open-Tamil.
  2. The code requires you to define a comparison function and use it with the sort() method. The comparison function knows the relative ordering of the Tamil letters, as you see in this commit, which defines the functions
    1. def compare_words_lexicographic( word_a, word_b ):
    2. def all_tamil( words ):
  3. Turns out all this is pretty neat stuff for the Tamil text processing in open-tamil/utf-8 package!
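For a feel of what such a comparison function can look like, here is a rough standard-library sketch. It is not the open-tamil code from the commit: compare_words, sort_key and the ALPHABET table are hypothetical, and the ordering is an assumption (vowels, then aytham, then the consonants in their traditional order, with a pure consonant such as க் sorting before க, கா, and so on). In Python 3, sort() and sorted() take only a key, so the comparison function is wrapped with functools.cmp_to_key.

```python
import unicodedata
from functools import cmp_to_key

PULLI = "\u0bcd"  # marks a pure consonant, as in க்
# Vowel signs aa, i, ii, u, uu, e, ee, ai, o, oo, au (in that order).
VOWEL_SIGNS = "\u0bbe\u0bbf\u0bc0\u0bc1\u0bc2\u0bc6\u0bc7\u0bc8\u0bca\u0bcb\u0bcc"
VOWELS = unicodedata.normalize("NFC", "அஆஇஈஉஊஎஏஐஒஓஔ")
CONSONANTS = "கஙசஞடணதநபமயரலவழளறன"

# Assumed ordering, for illustration only; other conventions exist.
ALPHABET = VOWELS + "ஃ" + PULLI + CONSONANTS + VOWEL_SIGNS
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def sort_key(word):
    """Map a word to a list of ranks; unknown characters sort last."""
    normalized = unicodedata.normalize("NFC", word)
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in normalized]


def compare_words(word_a, word_b):
    """Return -1, 0 or 1 in the style of a classic comparison function."""
    key_a, key_b = sort_key(word_a), sort_key(word_b)
    return (key_a > key_b) - (key_a < key_b)


words = ["வலை", "றெக்கை", "காடு", "கடல்", "அம்மா", "அப்பா"]
print(sorted(words, key=cmp_to_key(compare_words)))
# prints ['அப்பா', 'அம்மா', 'கடல்', 'காடு', 'வலை', 'றெக்கை']
```

A plain sorted(words) compares Unicode code points instead, which places றெக்கை before வலை because ற is encoded before வ; the rank table is what restores the traditional order.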

-Muthu