
Tamil NLP Processing with Tamil Shallow Parser

I have been searching for NLP (Natural Language Processing) tools to process Tamil text, primarily to find the root of each word and the grammatical features associated with it. Along the way I learned a lot of new NLP buzzwords such as stemming, lemmatization, and NER (Named Entity Recognition).

Unfortunately, there is no developer-friendly tool that is quick to pick up and build apps with. There are a few (very few!) initiatives and frameworks/tools on GitHub, but they did not fit my use cases. I am not familiar with Python, which is THE language that data scientists and researchers use, but even there I was not able to find a good one.

After a few years of searching (since 2014!) and asking friends, professors, and researchers, I finally stumbled upon the Tamil shallow parser by the LTRC lab at IIIT-Hyderabad. It solves one of my primary use cases: finding root words (கர்ணனின் -> root word கர்ணன்). This allows a search for கர்ணன் to also find கர்ணனின், since both are forms of the same word; a computer cannot know that unless we map both to the same root.
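
To make the search idea concrete, here is a minimal sketch in Python. It assumes a lookup table from inflected forms to roots already exists; the `ROOTS` table below is hand-written for illustration only (in practice it would be filled from the parser's output), and the function names are hypothetical.

```python
# Hypothetical, hand-written table of inflected form -> root word.
# In a real setup this would be built from the shallow parser's output.
ROOTS = {"கர்ணனின்": "கர்ணன்"}


def normalize(word, roots):
    """Return the root of a word if known, otherwise the word itself."""
    return roots.get(word, word)


def search(query, documents, roots):
    """Return the documents containing any word that shares the query's root."""
    target = normalize(query, roots)
    return [
        doc
        for doc in documents
        if any(normalize(word, roots) == target for word in doc.split())
    ]


documents = ["கர்ணனின் வில்", "அர்ஜுனன் வந்தான்"]
matches = search("கர்ணன்", documents, ROOTS)
print(matches)
```

Splitting on whitespace is a simplification; punctuation attached to words would need extra handling.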

When I tried to download the parser and set it up on a recent OS, it did not work. After digging through the code and some trial and error, I found and fixed a few issues and packaged it as a Docker image for Tamil enthusiasts to explore further. It is available on GitHub at https://github.com/ithiru/tamil-shallow-parser

Please create an issue on GitHub if you find any problems or want to suggest enhancements.

Tamil Shallow Parser

History

Port

This was developed sometime in 2009 and tested on Fedora 8. I recently tried to install it on a recent OS for some Tamil NLP work, and it did not work. I then grabbed a Fedora 8 ISO, set it up in Hyper-V, and it worked there.

I wanted to run it in Docker on a recent OS so that other Tamil enthusiasts can use it.

During my analysis I found the following.

  • The components/binaries are not compatible with a recent 64-bit OS; I had to use a 32-bit OS.
  • It does not compile with recent build tools because of the older CRF++ library; I had to grab the latest one from https://taku910.github.io/crfpp/
  • I had to find the other binary dependencies the tool uses (dos2unix, for example).

Docker

The parser can be run with the following steps:

  • docker pull ithiru/tamil-shallow-parser
  • docker run --rm -i -t ithiru/tamil-shallow-parser:latest

By default it displays test-case output similar to the one below.

<Sentence id="1">
1       ((      NP      <fs af='குவிப்பு,n,any,sg,any,d,,' case_name="nom"  head="குவிப்பு"  paradigm="n4"  poslcat="NM">
1.1     சொத்து  JJ      <fs af='சொத்து,n,any,sg,any,d,,' paradigm="n4"  poslcat="NM"  case_name="nom">
1.2     குவிப்பு        NN      <fs af='குவிப்பு,n,any,sg,any,d,,' name="குவிப்பு"  case_name="nom"  poslcat="NM"  paradigm="n4">
1.3     :       SYM     <fs af=':,punc,,,,,,' poslcat="NM">
        ))
2       ((      NP      <fs af='விசாரணை,n,any,sg,any,d,க்கு,kku' case_name="dat"  head="விசாரணைக்கு"  paradigm="n2"  poslcat="NM">
2.1     விசாரணைக்கு     NNP     <fs af='விசாரணை,n,any,sg,any,d,க்கு,kku' case_name="dat"  name="விசாரணைக்கு"  paradigm="n2"  poslcat="NM">
        ))
3       ((      NP      <fs af='டெல்லி,n,any,sg,any,d,,' case_name="nom"  head="டெல்லி"  paradigm="n2"  poslcat="NM">
3.1     டெல்லி  NN      <fs af='டெல்லி,n,any,sg,any,d,,' name="டெல்லி"  case_name="nom"  paradigm="n2"  poslcat="NM">
        ))
4       ((      VGNF    <fs af='செல்லம்,n,any,sg,any,d,,a' head="செல்ல"  case_name="nom"  paradigm="n13"  adj="a">
4.1     செல்ல   VM      <fs af='செல்லம்,n,any,sg,any,d,,a' adj="a"  paradigm="n13"  case_name="nom"  name="செல்ல">
        ))
5       ((      NP      <fs af='மதுகோடா,unk,,,,,,' head="மதுகோடா"  poslcat="NM">
5.1     மறுத்த  JJ      <fs af='மறு,v,any,any,any,,த்த்_அ,ww_a' poslcat="NM"  paradigm="v11"  tense="PAST"  rp="Y">
5.2     மதுகோடா NNP     <fs af='மதுகோடா,unk,,,,,,' name="மதுகோடா"  poslcat="NM">
5.3     !       SYM     <fs af='!,punc,,,,,,' poslcat="NM">
        ))
</Sentence>
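
The root word sits inside the `af='...'` attribute of each token line. The sketch below pulls it out with Python. It is based only on the sample output above and assumes that the first comma-separated field of `af` is the root and the second is the lexical category; the function and variable names are my own, and a token that is itself a comma would need special handling.

```python
import re

# Matches the af='...' attribute found on each line of the parser output.
AF_PATTERN = re.compile(r"af='([^']*)'")


def extract_tokens(parser_output):
    """Collect word, POS tag, root and category for every token line."""
    tokens = []
    for line in parser_output.splitlines():
        parts = line.split(None, 3)
        # Skip sentence tags, closing "))" lines and chunk headers "((".
        if len(parts) < 4 or parts[1] == "((":
            continue
        word, pos, features = parts[1], parts[2], parts[3]
        match = AF_PATTERN.search(features)
        if match is None:
            continue
        fields = match.group(1).split(",")
        tokens.append({
            "word": word,
            "pos": pos,
            "root": fields[0],  # assumed: first field is the root
            "category": fields[1] if len(fields) > 1 else "",
        })
    return tokens


# One chunk copied from the sample output above.
SAMPLE = """<Sentence id="1">
2       ((      NP      <fs af='விசாரணை,n,any,sg,any,d,க்கு,kku' case_name="dat"  head="விசாரணைக்கு"  paradigm="n2"  poslcat="NM">
2.1     விசாரணைக்கு     NNP     <fs af='விசாரணை,n,any,sg,any,d,க்கு,kku' case_name="dat"  name="விசாரணைக்கு"  paradigm="n2"  poslcat="NM">
        ))
</Sentence>
"""

for token in extract_tokens(SAMPLE):
    print(token["word"], "->", token["root"])
```

For the sample chunk this maps விசாரணைக்கு to விசாரணை, which is the kind of mapping needed for root-word search.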

Parsing Texts

You can run the parser manually with your own input by following these steps:

  • docker run --rm -i -t ithiru/tamil-shallow-parser:latest /bin/bash
  • shallow_parser_tam < /root/app/nlp/tests/shallowparser_tam_utf.rin

This prints the same test-case output as shown earlier.

Running shallow_parser_tam --help prints the following usage.

shallow-parser-tam version 3.0
usage : shallow_parser_tam  --mode=[debug|fast] --in_encoding=[wx|utf] --out_encoding=[wx|utf] --input=<input_file> --output=<output_file>

  --in_encoding  : Encoding of the Input Text [utf or wx]
  --out_encoding : Encoding of the Output Text [utf or wx]
  --mode         : Debug or Fast mode [debug|fast] *Default fast mode
  --input        : Input file
  --output       : Output file
                  Prepared as a part of SAMPARK (ILMT Consortium Project)
                  Author: Avinesh PVS
                  IIIT Hyderabad {[email protected]}
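
If you want to drive the parser from a script, the flags in the usage text can be assembled as below. This is only a sketch: it assumes it runs inside the container, where `shallow_parser_tam` is on the PATH, and the function names and file paths are hypothetical. I have not verified how each flag behaves beyond what the usage text says.

```python
import subprocess

# Allowed values, taken from the usage text above.
VALID_MODES = {"debug", "fast"}
VALID_ENCODINGS = {"wx", "utf"}


def build_command(input_path, output_path, mode="fast",
                  in_encoding="utf", out_encoding="utf"):
    """Build the shallow_parser_tam argument list from the documented flags."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    for encoding in (in_encoding, out_encoding):
        if encoding not in VALID_ENCODINGS:
            raise ValueError(f"encoding must be one of {sorted(VALID_ENCODINGS)}")
    return [
        "shallow_parser_tam",
        f"--mode={mode}",
        f"--in_encoding={in_encoding}",
        f"--out_encoding={out_encoding}",
        f"--input={input_path}",
        f"--output={output_path}",
    ]


def run_parser(input_path, output_path, **options):
    """Run the parser; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_command(input_path, output_path, **options), check=True)
```

Inside the container, a call such as `run_parser("input.txt", "output.txt")` (both file names are placeholders) would then run the parser in the default fast mode with UTF input and output.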
