Accurate and Fastest NLTK tagger.

Recently, while experimenting with natural language processing, I ran into a serious problem with tagger speed and accuracy.

I found an extremely popular blog post that covers part-of-speech tagging with NLTK’s different taggers in detail. See the link below for more.

http://streamhacker.com/2010/04/12/pos-tag-nltk-brill-classifier/

To deal with a large amount of data (for example, the entire content of a book), you need a tagger that is both accurate and fast. I found the Hunpos tagger to be the fastest option. Hunpos is an open source reimplementation of TnT, the well-known part-of-speech tagger by Thorsten Brants.

But while installing it I ran into a lot of issues, and I couldn’t find any tutorial on installing the Hunpos tagger properly. Follow the procedure below to experiment with the Hunpos tagger.

Installing and testing HunposTagger:

1) Install Subversion (SVN) if you don’t have it, and check out the project source code:

       svn checkout http://hunpos.googlecode.com/svn/trunk/ trunk

2) Then, to build it successfully, you may need ocamlbuild, which automates compiling Objective Caml (OCaml) code. The following should handle this:

       sudo apt-get install ocaml-nox

3) cd to the trunk directory (where you checked out the Hunpos source code) and run:

       ./build.sh build

4) At this point, you should have a binary file tagger.native in your trunk directory. Rename tagger.native to hunpos-tag and put hunpos-tag in /usr/local/bin (you may need to do this as superuser).

5) Download the en_wsj.model.gz file from the page below, unzip it, and put the en_wsj.model binary in /usr/local/bin as well.

       https://code.google.com/p/hunpos/downloads/list

Or do the following:

       wget https://hunpos.googlecode.com/files/en_wsj.model.gz

NOTE:
=====

After extracting the file, check that the model is named english.model; if it is not, rename it to english.model.
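
If you would rather do the extraction and renaming from Python, here is a minimal sketch. It assumes the archive was downloaded to the current directory as en_wsj.model.gz; the helper name extract_model is only illustrative and is not part of Hunpos or NLTK.

```python
import gzip
import shutil
from pathlib import Path


def extract_model(archive, target):
    """Decompress a gzip-compressed model file and write it to target."""
    archive, target = Path(archive), Path(target)
    with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
        # Stream the data so a large model is never fully loaded into memory.
        shutil.copyfileobj(src, dst)
    return target


if __name__ == "__main__":
    # Assumed location: the archive sits in the current directory.
    source = Path("en_wsj.model.gz")
    if source.exists():
        print(extract_model(source, "english.model"))
    else:
        print("en_wsj.model.gz not found in the current directory")
```

The resulting english.model still has to be copied to /usr/local/bin, which may need superuser rights.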

6) Set the environment variables HUNPOS and HUNPOS_HOME. To do this, add the following lines to your ~/.bashrc (or ~/.bash_profile on macOS):

       export HUNPOS=/usr/local/bin
       export HUNPOS_HOME=/usr/local/bin
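
Before moving on, it can help to confirm that steps 4 to 6 actually took effect. The sketch below is a small pre-flight check, not part of NLTK or Hunpos: the function name check_hunpos_setup is made up for illustration, and it assumes the binary is called hunpos-tag, is reachable through PATH, and that the two variables above are the ones you rely on.

```python
import os
import shutil


def check_hunpos_setup(model_path, binary_name="hunpos-tag",
                       env_vars=("HUNPOS", "HUNPOS_HOME")):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    # The binary should be reachable through PATH (assumption: /usr/local/bin is on it).
    if shutil.which(binary_name) is None:
        problems.append("binary %r not found on PATH" % binary_name)
    # The model must exist and must not be an empty file.
    if not os.path.isfile(model_path):
        problems.append("model file %r does not exist" % model_path)
    elif os.path.getsize(model_path) == 0:
        problems.append("model file %r is empty" % model_path)
    # Each environment variable should be set and point at a directory.
    for name in env_vars:
        value = os.environ.get(name)
        if not value:
            problems.append("environment variable %s is not set" % name)
        elif not os.path.isdir(value):
            problems.append("%s=%r is not a directory" % (name, value))
    return problems


if __name__ == "__main__":
    for problem in check_hunpos_setup("/usr/local/bin/english.model"):
        print("WARNING:", problem)
```

Keep in mind that exports added to ~/.bashrc only take effect in a new shell, or after running source ~/.bashrc.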

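7) Finally, in your Python script, create an instance of the HunposTagger class, passing the path to the model file you set up previously (the location of the hunpos-tag binary can be given as the optional second argument, path_to_bin, if NLTK does not find it on its own), something very close to: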

import nltk
from nltk.tag.hunpos import HunposTagger
from nltk.tokenize import word_tokenize

corpus = "so how do i hunpos tag my ntuen ? i can't get the following code to work."
# Load the English model that was placed in /usr/local/bin in step 5
ht = HunposTagger('/usr/local/bin/english.model')
# Tokenize the text and print the list of (token, tag) pairs
print(ht.tag(word_tokenize(corpus)))

That’s it! 😀

3 thoughts on “Accurate and Fastest NLTK tagger.”

  1. Kevin Prichard says:

    Any ideas why the pos would come out as None?

    >>> print ht.tag(word_tokenize(corpus))
    [('so', None), ('how', None), ('do', None), ('i', None), ('hunpos', None), ('tag', None), ('my', None), ('ntuen', None), ('?', None), ('i', None), ('ca', None), ("n't", None), ('get', None), ('the', None), ('following', None), ('code', None), ('to', None), ('work', None), ('.', None)]

    Thanks for this post, Anu, can’t wait to get this working.

    • The output should be as follows:

      [('so', 'RB'), ('how', 'WRB'), ('do', 'VBP'), ('i', 'FW'), ('hunpos', 'NN'), ('tag', 'NN'), ('my', 'PRP$'), ('ntuen', 'NN'), ('?', '.'), ('i', 'FW'), ('cant', 'JJ'), ('get', 'VB'), ('the', 'DT'), ('following', 'JJ'), ('code', 'NN'), ('to', 'TO'), ('work', 'VB'), ('.', '.')]

      The problem is caused by the Hunpos tagger model file: it does not contain any words. Check whether english.model exists at the path '/usr/local/bin/english.model'. If it does not, follow the instructions given above for installing the Hunpos tagger. Installing it is a bit difficult, because there is not much information about it on the internet. Also check whether you have set the environment variables properly!
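
A result in which every tag is None is easy to miss when tagging a large text, so it may be worth checking for it in code. The following is only a sketch: untagged_tokens and describe_tagging are hypothetical helper names, and they work on any list of (token, tag) pairs such as the one returned by ht.tag().

```python
def untagged_tokens(tagged):
    """Return the tokens whose tag is None in a list of (token, tag) pairs."""
    return [token for token, tag in tagged if tag is None]


def describe_tagging(tagged):
    """Summarize how many tokens came back without a tag."""
    missing = untagged_tokens(tagged)
    if tagged and len(missing) == len(tagged):
        # Matches the symptom reported above: nothing was tagged at all.
        return "no token received a tag; check the model path and the hunpos-tag binary"
    if missing:
        return "%d of %d tokens have no tag" % (len(missing), len(tagged))
    return "all tokens tagged"


# Sample shaped like the failing output reported in the comment above.
sample = [("so", None), ("how", None), ("do", None)]
print(describe_tagging(sample))
```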
