This document provides a rudimentary introduction to the moot PoS tagging utilities. It should under no circumstances be considered a substitute for the individual program documentation.
The moot utilities are designed for Part-of-Speech Tagging: assigning a single univocal tag to each input token. The runtime tagger (moot) determines which tags to assign by (sequence) maximization of uni-, bi-, and trigram probabilities, as well as lexical probabilities and optional lexical-class probabilities. Probability data is passed to the runtime tagger in a model. Input passed to the tagger at runtime must be tokenized ("cooked"), and may optionally include, for each input token, a set of possible analyses (a lexical class) for that token.
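To make "sequence maximization" concrete, here is a minimal Python sketch of the Viterbi algorithm, one standard way to find the most probable tag sequence. It is not moot's implementation: it is simplified to bigrams, the probability tables are invented by hand for illustration, and the names viterbi, TRANS, EMIT and the "<s>" start marker are hypothetical.

```python
import math

def viterbi(tokens, tags, trans, emit, floor=1e-6):
    """Return the most probable tag sequence under a toy bigram model.

    trans[(prev_tag, tag)] and emit[(tag, token)] hold probabilities;
    missing entries fall back to `floor`, a crude stand-in for smoothing.
    """
    if not tokens:
        return []

    def lp(table, key):
        return math.log(table.get(key, floor))

    # best[tag] = (log-probability, path) of the best path ending in tag
    best = {t: (lp(trans, ("<s>", t)) + lp(emit, (t, tokens[0])), [t])
            for t in tags}
    for token in tokens[1:]:
        step = {}
        for t in tags:
            # pick the best predecessor for tag t
            score, path = max(
                (best[p][0] + lp(trans, (p, t)), best[p][1]) for p in tags
            )
            step[t] = (score + lp(emit, (t, token)), path + [t])
        best = step
    return max(best.values())[1]

# Hand-made (not trained) probabilities, for illustration only.
TAGS = ["AT", "DD", "VBZ", "NN", "$."]
TRANS = {("<s>", "DD"): 0.9, ("<s>", "AT"): 0.1, ("DD", "VBZ"): 0.8,
         ("AT", "VBZ"): 0.05, ("VBZ", "AT"): 0.7, ("AT", "NN"): 0.9,
         ("NN", "$."): 0.9}
EMIT = {("AT", "This"): 0.1, ("DD", "This"): 0.5, ("VBZ", "is"): 0.9,
        ("AT", "a"): 0.8, ("NN", "test"): 0.3, ("VBZ", "test"): 0.1,
        ("$.", "."): 1.0}

best_path = viterbi(["This", "is", "a", "test", "."], TAGS, TRANS, EMIT)
```

A trigram version would track pairs of tags per state rather than single tags; the dynamic-programming idea is the same.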
Before an input file can be tagged, statistical data in the form of a model must first be provided. The easiest way to produce such a model is by using the mootrain utility to gather frequency data from a pre-tagged corpus, thus inducing a maximum-likelihood model.
If you have a text-format corpus in the file "corpus.ttt" which is pre-tagged with the "correct" part-of-speech tags, then the incantation:
mootrain --lex --ngrams corpus.ttt
will produce the model files "corpus.lex" and "corpus.123" required for runtime tagging of tokens cooked "rare".
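As a rough illustration of the kind of frequency data gathered from a pre-tagged corpus, the following Python sketch counts (token, tag) pairs and tag uni-, bi-, and trigrams. It is a sketch only: it assumes tab-separated "token TAB tag" lines, "%%" comment lines, and blank lines between sentences, and it does not reproduce mootrain's actual output formats. The names read_tagged and count_model are hypothetical.

```python
from collections import Counter

CORPUS = """\
%% Sentence 1
This\tDD
is\tVBZ
a\tAT
sentence\tNN
.\t$.

%% Sentence 2
This\tDD
is\tVBZ
another\tDD
sentence\tNN
.\t$.
"""

def read_tagged(text):
    """Yield sentences as lists of (token, tag) pairs."""
    sentence = []
    for line in text.splitlines():
        if line.startswith("%%"):       # comment line
            continue
        if not line.strip():            # blank line: sentence boundary (assumed)
            if sentence:
                yield sentence
                sentence = []
            continue
        token, tag = line.split("\t")[:2]
        sentence.append((token, tag))
    if sentence:
        yield sentence

def count_model(sentences):
    """Count lexical (token, tag) pairs and tag n-grams for n = 1, 2, 3."""
    lex = Counter()
    ngrams = Counter()
    for sent in sentences:
        tags = [tag for _, tag in sent]
        lex.update(sent)
        for n in (1, 2, 3):
            for i in range(len(tags) - n + 1):
                ngrams[tuple(tags[i:i + n])] += 1
    return lex, ngrams

lex, ngrams = count_model(read_tagged(CORPUS))
# Maximum-likelihood estimate of P(VBZ | DD) as a relative frequency
p_vbz_given_dd = ngrams[("DD", "VBZ")] / ngrams[("DD",)]
```

Relative frequencies such as p_vbz_given_dd are what "maximum-likelihood model" refers to; any smoothing applied by the real tools is outside the scope of this sketch.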
If you have a text-format corpus in the file "corpus.wdt" which is pre-tagged with the "correct" part-of-speech tags as well as lexical classes, then the incantation:
mootrain --lex --ngrams --classes corpus.wdt
will produce the model files "corpus.lex", "corpus.123", and "corpus.clx" required for runtime tagging of tokens cooked "medium rare".
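The idea behind lexical-class frequencies can be sketched in Python as follows: for each token, the set of tags found in its analyses is treated as its lexical class, and (class, tag) pairs are counted. This is an illustration under assumptions (tab-separated fields, analyses written as bracketed tags, every token line carrying a tag); it is not the format of "corpus.clx", and class_counts is a hypothetical name.

```python
from collections import Counter

WELL_DONE = """\
This\tDD\t[AT]\t[DD]
is\tVBZ\t[VBZ]
a\tAT\t[AT]
sentence\tNN\t[NN]\t[VBZ]
.\t$.\t[$.]

This\tDD\t[DD]\t[AT]
is\tVBZ\t[VBZ]
another\tDD\t[PP]\t[NN]
sentence\tNN\t[NN]\t[VBZ]
.\t$.\t[$.]
"""

def class_counts(text):
    """Count how often each tag occurs with each lexical class."""
    counts = Counter()
    for line in text.splitlines():
        if not line.strip() or line.startswith("%%"):
            continue                     # skip blank and comment lines
        token, tag, *analyses = line.split("\t")
        # Sort the analyses so that [AT] [DD] and [DD] [AT] form one class.
        lexclass = tuple(sorted(a.strip("[]") for a in analyses))
        counts[(lexclass, tag)] += 1
    return counts

clx = class_counts(WELL_DONE)
```

Sorting the analyses makes the class independent of the order in which the analyzer happens to list them, which is why both occurrences of "This" land in the same class here.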
If you are using mootm(1) to analyze incoming tokens, but your training corpus "corpus.ttt" contains only tags (and not analyses), you may generate an appropriate "well done" corpus file "corpus.wdt" from "corpus.ttt" by calling:
mootm -12 -m morph.gfst -s morph.lab -a -o corpus.wdt corpus.ttt
before calling "mootrain". See the mootm(1) manpage for details.
Note that training a model from a "well-done" corpus in the manner described above only makes sense if you plan to pass "medium-rare" files produced by exactly the same analyzer to "moot" when tagging new texts. In particular, if not all of the analyses to be passed in "medium-rare" files to the runtime tagger "moot" are encoded in the analysis fst "morph.gfst" (e.g. if some analyses are produced by a preprocessing stage), then such "extra" analyses should also be included in the "well-done" training corpus. Such an incompatibility between training and runtime analysis formats may seriously degrade tagger performance. Optionally, you can disregard the analysis information present in the model (if any) at tagger runtime by specifying the --use-classes=0 option to "moot".
Assume you have some raw (unformatted) text to be tagged in the file "test.txt". Before the text can be tagged, it must first be split into individual tokens. The moot utilities contain a rudimentary preprocessor, mootpp, to perform this task. The incantation:
mootpp -o test.t test.txt
will produce a "rare" cooked file "test.t" suitable for passing to the tagger or to an external analysis program.
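For readers who want to see what such a tokenization step amounts to, here is a deliberately naive Python sketch that writes one token per line. It does not reproduce mootpp's rules: the regular expression, the end-of-sentence set, and the use of a blank line as sentence boundary are all assumptions made for illustration, and cook_rare is a hypothetical name.

```python
import re

# A word is a run of word characters; any other non-space character
# becomes a token of its own.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
EOS = {".", "!", "?"}

def cook_rare(text):
    """Return `text` as one token per line, blank line after each sentence."""
    lines = []
    for token in TOKEN_RE.findall(text):
        lines.append(token)
        if token in EOS:
            lines.append("")   # blank line = sentence boundary (assumed)
    return "\n".join(lines)

rare = cook_rare("This is a test. This is ONLY a test.")
```

A real preprocessor has to deal with abbreviations, numbers, and the like, which this sketch ignores.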
If you have an external analysis program such as mootm(1) which assigns (possibly empty) lexical classes to input tokens, and if your model contains lexical class information (i.e. if you trained from a "well done" corpus analyzed by the same program, and if the file "corpus.clx" contains entries for more than one lexical class), then you may at this point wish to filter "test.t" through your analysis program, yielding a "medium-rare" file "test.mr".
For analyzing incoming tokens with the "mootm" program built with libgfsm support, using an analysis transducer "morph.gfst" and analysis labels "morph.lab", the appropriate incantation is:
mootm -m morph.gfst -s morph.lab -a -o test.mr test.t
See the mootm(1) manpage for details.
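If no transducer is at hand, the effect of an analysis step can be imitated with a plain dictionary lookup. The Python sketch below is a toy stand-in, not a description of how mootm works: the LEXICON contents, the case-insensitive lookup, and the analyze function are assumptions made for illustration. Unknown tokens simply receive an empty lexical class.

```python
# Toy lexicon mapping lower-cased word forms to possible tags (invented).
LEXICON = {
    "this": ["AT", "DD"],
    "is": ["VBZ"],
    "a": ["AT"],
    "test": ["NN", "VBZ"],
    ".": ["$."],
}

def analyze(rare_lines):
    """Append bracketed analyses to each token line of a 'rare' file."""
    out = []
    for line in rare_lines:
        token = line.strip()
        if not token or token.startswith("%%"):
            out.append(line)            # keep blank and comment lines as-is
            continue
        tags = LEXICON.get(token.lower(), [])
        out.append("\t".join([token] + ["[%s]" % t for t in tags]))
    return out

medium_rare = analyze(["This", "is", "ONLY", "a", "test", "."])
```

Whatever analyzer is used, the point made above still applies: the same analyzer should produce both the training corpus analyses and the runtime analyses.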
Having trained a model, as well as tokenized (and optionally analyzed) your input file, you are now ready to call the runtime tagger, moot.
If you are not using an external analysis program, and if you have a trained model in the files "corpus.lex" and "corpus.123", as well as a rare cooked file "test.t" to be tagged, then:
moot --model=corpus --use-classes=0 -o test.tt test.t
will produce a "medium cooked" (tagged) output file "test.tt".
If you are using an external analysis program such as mootm(1), and if you have a trained model in the files "corpus.lex", "corpus.123", and "corpus.clx", as well as a medium-rare cooked file "test.mr" to be tagged, then:
moot --model=corpus --use-classes=1 -o test.wd test.mr
will produce a "well done" (+tagged,+analyzed) output file "test.wd".
In the course of model development, it is customary to reserve a small portion of the hand-tagged training corpus for testing. If you have such a medium cooked file "test.ttt" tagged with the "truth", as well as a moot output file "test.tt" for the same tokens, you can check the accuracy of the tagger with the program mooteval:
mooteval -2 test.ttt test.tt
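The basic quantity behind such an evaluation is token-level accuracy: the fraction of tokens whose assigned tag matches the hand-assigned one. The Python sketch below computes it for two tagged texts; it assumes tab-separated "token TAB tag" lines and identical tokenization in both files, and it says nothing about the report mooteval actually prints. The names read_tags and accuracy are hypothetical.

```python
def read_tags(text):
    """Return (token, tag) pairs, skipping blank and '%%' comment lines."""
    pairs = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("%%"):
            continue
        fields = line.split("\t")
        pairs.append((fields[0], fields[1]))
    return pairs

def accuracy(truth_text, output_text):
    """Fraction of tokens whose tag in the output matches the truth."""
    truth = read_tags(truth_text)
    output = read_tags(output_text)
    if not truth or len(truth) != len(output):
        raise ValueError("files must contain the same non-zero number of tokens")
    correct = 0
    for (tok_t, tag_t), (tok_o, tag_o) in zip(truth, output):
        if tok_t != tok_o:
            raise ValueError("token mismatch: %r vs %r" % (tok_t, tok_o))
        correct += (tag_t == tag_o)
    return correct / len(truth)

TRUTH = "This\tDD\nis\tVBZ\na\tAT\ntest\tNN\n.\t$.\n"
OUTPUT = "This\tDD\nis\tVBZ\na\tAT\ntest\tVBZ\n.\t$.\n"
acc = accuracy(TRUTH, OUTPUT)   # 4 of 5 tags agree
```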
%% File : corpus.ttt
%% Description: demonstration corpus for moot tutorial: +tagged,-analyzed
%% Sentence 1
This DD
is VBZ
a AT
sentence NN
. $.
%% Sentence 2
This DD
is VBZ
another DD
sentence NN
. $.
%% Sentence 3
Here RL
is VBZ
a AT
3rd MD
sentence NN
also RR
. $.
%% File : corpus.wdt
%% Description: demonstration corpus for moot tutorial: +tagged,+analyzed
%% Sentence 1
This DD [AT] [DD]
is VBZ [VBZ]
a AT [AT]
sentence NN [NN] [VBZ]
. $. [$.]
%% Sentence 2
This DD [DD] [AT]
is VBZ [VBZ]
another DD [PP] [NN]
sentence NN [NN] [VBZ]
. $. [$.]
%% Sentence 3
Here RL [RL] [ADV]
is VBZ [VBZ]
a AT [AT]
3rd MD
sentence NN [NN] [VBZ]
also RR [RR]
. $. [$.]
This is a test. This is ONLY a test.
This
is
a
test
.
This
is
ONLY
a
test
.
This [AT] [DD]
is [VBZ]
a [AT]
test [NN] [VBZ]
. [$.]
This [AT] [DD]
is [VBZ]
ONLY
a [AT]
test [NN] [VBZ]
. [$.]
Bryan Jurish <jurish@uni-potsdam.de>
mootutils(1)