உளியருவி – Tamil tools for AI/ML

Motivation

In 2022 we are reaching a point where more Tamil datasets are available than Tamil tools – arunthamizh அருந்தமிழ். However the accessibility of fully-trained models and capability of providing pre-trained models are much harder and still require domain expertise in hardware and software. Personally I have published some small Jupyter notebooks (see here), and some simple articles, but they still remain inadequate to scale the breadth of Tamil computing needs in AI world among:

  1. NLP – Text Classification, Recommendation, Spell Checking, Correction tasks
  2. TTS – speech synthesis tasks
  3. ASR – speech recognition

While sufficient data exist for 1, the private corpora for speech tasks (அருந்தமிழ் பட்டியல்), the public corpora of a 300hr voice dataset recently published from Mozilla Common Voice (University of Toronto, Scarborough, Canada leading Tamil effort here) have enabled data completion to a large degree for tasks 2 and 3.

Ultimately the tooling provides capability to quickly compose AI services based on open-source tools and existing compute environment to host services and devices in Tamil space.

Proposal

My proposal is the following:

  1. Develop a open-source toolbox for pre-training and task training specialization
  2. Identify good components to base effort
  3. Contribute engineering effort, testing, and validation
    1. R&D – DataScience, Infra, AI framework
    2. Engineering Validation – DataScience, Tamil language expertise
    3. Engineering – packaging, documentation, distribution
    4. Project management
  4. Library to be liberally licensed MIT/BSD
  5. Open-Source license for developed models
  6. Find hardware resources for AI model pre-training etc.
  7. Managed by a steering committee / nominated BDFL
  8. Scope – decade time frame
  9. TBD – மேலும் பல.

Summary

Let’s build a pytorch-lightning like API for Tamil tasks across NLP, TTS, ASR via AI.

Leave your thoughts by email ezhillang -at- gmail -dot- com, or in comments section.

மறுமொழியொன்றை இடுங்கள்

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  மாற்று )

Twitter picture

You are commenting using your Twitter account. Log Out /  மாற்று )

Facebook photo

You are commenting using your Facebook account. Log Out /  மாற்று )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.