Recently, while experimenting with natural language processing, I ran into a serious problem with speed and accuracy.
I found an extremely popular blog post that covers part-of-speech tagging with NLTK’s different taggers in detail. See the link below for more details.
When dealing with a large amount of data (for example, the entire content of a book), you need a tagger that is both highly accurate and fast. I found that the Hunpos tagger solves the speed problem. Hunpos is an open source reimplementation of TnT, the well-known part-of-speech tagger by Thorsten Brants.
But while installing it, I ran into a lot of issues, and I couldn’t find any tutorial for installing the Hunpos tagger properly. Follow the procedure below to experiment with the Hunpos tagger.
Installing and testing HunposTagger:
1) Install Subversion (SVN) if you don’t have it and check out the project source code:
svn checkout http://hunpos.googlecode.com/svn/trunk/ trunk
2) Then, to build it successfully, you might need ocamlbuild, which automates the compilation of Objective Caml (OCaml) projects.
sudo apt-get install ocaml-nox
should handle this.
3) cd to the trunk directory (where you downloaded the Hunpos source code) and build the project.
4) At this point, you should have a binary file tagger.native in your trunk directory. Rename tagger.native to hunpos-tag and put hunpos-tag in /usr/local/bin (you may need to do this as superuser).
5) Download the en_wsj.model.gz file here, unzip it, and put the en_wsj.model binary in /usr/local/bin as well.
Or do the following:
After extracting the file, check that the model is named english.model; if it is not, rename it to english.model.
6) Set an environment variable HUNPOS. To do this, you may add the following line to your ~/.bashrc (or ~/.bash_profile on macOS):
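7) Finally, in your Python script, you may create an instance of the HunposTagger class, passing the paths to both files you created previously, something very close to:
from nltk.tag.hunpos import HunposTagger
from nltk.tokenize import word_tokenize
corpus = "so how do i hunpos tag my ntuen ? i can't get the following code to work."
# Paths assume the model and binary were placed in /usr/local/bin in steps 4 and 5
ht = HunposTagger('/usr/local/bin/english.model', '/usr/local/bin/hunpos-tag')
print(ht.tag(word_tokenize(corpus)))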
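If the tagger fails to start, it is usually because the binary or the model file is not where the script expects it. Here is a minimal sketch of a pre-flight check you could run first. The function name check_hunpos_setup is my own (it is not part of NLTK or Hunpos), and the default names assume you followed the steps above: a binary called hunpos-tag on your PATH and a model file you point to yourself.

```python
import os
import shutil


def check_hunpos_setup(model_path, binary_name="hunpos-tag"):
    """Return a list of problems found; an empty list means the setup looks fine.

    Hypothetical helper for illustration only, not part of NLTK or Hunpos.
    """
    problems = []
    # shutil.which searches the directories on PATH for an executable
    if shutil.which(binary_name) is None:
        problems.append("binary not found on PATH: " + binary_name)
    # The model is an ordinary file, so an existence check is enough
    if not os.path.isfile(model_path):
        problems.append("model file not found: " + model_path)
    return problems


# Example usage (the path is an assumption based on step 5):
# for problem in check_hunpos_setup("/usr/local/bin/english.model"):
#     print(problem)
```

An empty list means both files were found, so creating the HunposTagger instance should at least get past the file lookup. This only checks that the files exist; it does not verify that the binary was built correctly or that the model is valid.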
That’s it! 😀