This document provides a rudimentary introduction to the moot PoS tagging utilities. It should under no circumstances be considered a substitute for the individual program documentation.
The moot utilities are designed for Part-of-Speech Tagging: assigning a single univocal "tag" to each input token. The runtime tagger (moot) determines which tags to assign by (sequence) maximization of uni-, bi-, and trigram probabilities, as well as lexical probabilities and optionally lexical-class probabilities. Probability data is passed to the runtime tagger in a "model". Input passed to the runtime tagger must be tokenized ("cooked"), and may optionally include, for each input token, a set of possible analyses (a "lexical class") for that token.
Before an input file can be tagged, statistical data in the form of a model must first be provided. The easiest way to produce such a model is by using the "mootrain" (the mootrain manpage) utility to gather frequency data from a pre-tagged corpus, thus inducing a maximum-likelihood model.
If you have a text-format corpus in the file "corpus.ttt" which is pre-tagged with the "correct" part-of-speech tags, then the incantation:
mootrain --lex --ngrams corpus.ttt
will produce the model files "corpus.lex" and "corpus.123" required for runtime tagging of tokens cooked "rare".
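To get a feel for the kind of frequency data that training gathers, the following sketch counts token/tag pairs and tag bigrams in a tiny tagged corpus using plain awk. It is an illustration only: it does not reproduce the contents or format of the files written by mootrain, and the function name count_tags, the temporary directory, and the tab-separated, blank-line-delimited input layout are assumptions made for the example.

```shell
#!/bin/sh
# Illustration only: count (token, tag) pairs and tag bigrams with awk.
# This does NOT reproduce the file formats written by mootrain.

# count_tags FILE: print "COUNT<TAB>KIND<TAB>ITEM" lines, sorted.
count_tags() {
  awk -F'\t' '
    /^%%/   { next }                 # skip comment lines
    NF == 0 { prev = ""; next }      # blank line: assumed sentence boundary
    {
      lex[$1 " " $2]++                       # token seen with this tag
      if (prev != "") big[prev " " $2]++     # tag bigram inside a sentence
      prev = $2
    }
    END {
      for (k in lex) printf "%d\tlex\t%s\n", lex[k], k
      for (k in big) printf "%d\tbigram\t%s\n", big[k], k
    }' "$1" | sort
}

# Hypothetical two-sentence corpus: one token and tag per line.
tmp=$(mktemp -d)
printf 'This\tDD\nis\tVBZ\na\tAT\nsentence\tNN\n.\t$.\n\nThis\tDD\nis\tVBZ\nanother\tDD\nsentence\tNN\n.\t$.\n' > "$tmp/corpus.ttt"
count_tags "$tmp/corpus.ttt"
```

Relative frequencies derived from counts like these are what "maximum-likelihood" refers to; the real utility additionally handles details that this sketch ignores.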
If you have a text-format corpus in the file "corpus.wdt" which is pre-tagged with the "correct" part-of-speech tags as well as lexical classes, then the incantation:
mootrain --lex --ngrams --classes corpus.wdt
will produce the model files "corpus.lex", "corpus.123", and "corpus.clx" required for runtime tagging of tokens cooked "medium rare".
If you are using mootm(1) to analyze incoming tokens, but your training corpus "corpus.ttt" contains only tags (and not analyses), you may generate an appropriate "well done" corpus file "corpus.wdt" from "corpus.ttt" by calling:
mootm -12 -m morph.gfst -s morph.lab -a -o corpus.wdt corpus.ttt
before calling "mootrain". See the mootm(1) manpage for details.
Note that training a model from a "well-done" corpus in the manner described above only makes sense if you plan to pass "medium-rare" files produced by exactly the same analyzer to "moot" when tagging new texts.
In particular, if not all of the analyses to be passed in "medium-rare" files to the runtime tagger "moot" are encoded in the analysis fst "morph.gfst" (e.g. if some analyses are produced by a preprocessing stage), then such "extra" analyses should also be included in the "well-done" training corpus. Any incompatibility between the analyses seen in training and those supplied at runtime may seriously degrade tagger performance.
Optionally, you can disregard the analysis information present in the model (if any) at tagger runtime by specifying the --use-classes=0 option to "moot".
Assume you have some raw (unformatted) text to be tagged in the file "test.txt". Before the text can be tagged, it must first be split into individual tokens. The moot utilities contain a rudimentary preprocessor, "mootpp" (the mootpp manpage), to perform this task. The incantation:
mootpp -o test.t test.txt
will produce a "rare" cooked file "test.t" suitable for passing to the tagger or to an external analysis program.
If you have an external analysis program such as mootm(1) which assigns (possibly empty) lexical classes to input tokens, and if your model contains lexical class information (i.e. if you trained from a "well done" corpus analyzed by the same program, and if the file "corpus.clx" contains entries for more than one lexical class), then you may at this point wish to filter "test.t" through your analysis program, yielding a "medium-rare" file "test.mr".
For analyzing incoming tokens with the "mootm" program built with libgfsm support, using an analysis transducer "morph.gfst" and analysis labels "morph.lab", the appropriate incantation is:
mootm -m morph.gfst -s morph.lab -a -o test.mr test.t
See the mootm(1) manpage for details.
Having trained a model, as well as tokenized (and optionally analyzed) your input file, you are now ready to call the runtime tagger, moot (the moot manpage).
If you are not using an external analysis program, and if you have a trained model in the files "corpus.lex" and "corpus.123", as well as a rare cooked file "test.t" to be tagged, then:
moot --model=corpus --use-classes=0 -o test.tt test.t
will produce a "medium cooked" (tagged) output file "test.tt".
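The three steps for the unanalyzed ("rare") case can be chained in a small wrapper script. The sketch below uses only the invocations already shown in this document; the helper names run and tag_rare and the DRY_RUN variable are assumptions introduced for the example. By default it only prints the commands it would execute, so it can be tried without the moot utilities installed.

```shell
#!/bin/sh
# Sketch: chain the training, tokenizing and tagging commands shown above.
# DRY_RUN defaults to 1, so the script only prints what it would execute;
# set DRY_RUN=0 to really run the tools (they must then be installed).
DRY_RUN=${DRY_RUN:-1}

# run CMD ARGS...: print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# tag_rare CORPUS RAW: train on CORPUS.ttt, then tokenize RAW.txt and tag it.
tag_rare() {
  run mootrain --lex --ngrams "$1.ttt" &&
  run mootpp -o "$2.t" "$2.txt" &&
  run moot --model="$1" --use-classes=0 -o "$2.tt" "$2.t"
}

tag_rare corpus test
```

Each step runs only if the previous one succeeded, so a failed training run does not lead to tagging with a stale or missing model.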
If you are using an external analysis program such as mootm(1), and if you have a trained model in the files "corpus.lex", "corpus.123", and "corpus.clx", as well as a medium-rare cooked file "test.mr" to be tagged, then:
moot --model=corpus --use-classes=1 -o test.wd test.mr
will produce a "well done" (+tagged,+analyzed) output file "test.wd".
In the course of model development, it is customary to reserve a small portion of the hand-tagged training corpus for testing. If you have such a medium cooked file "test.ttt" tagged with the "truth", as well as a moot output file "test.tt" for the same tokens, you can check the accuracy of the tagger with the program "mooteval" (the mooteval manpage):
mooteval -2 test.ttt test.tt
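As a rough picture of what "accuracy" means here, the following sketch compares the tag column of two tagged files token by token and reports how many tags agree. It is not mooteval and does not imitate its options or report format; the function name tag_accuracy, the sample files, and the assumption that both files contain the same tokens in the same order, one tab-separated token and tag per line, are made up for the example.

```shell
#!/bin/sh
# Illustration only: token-level tag agreement between two tagged files.
# This is not mooteval and does not imitate its report format.

# tag_accuracy GOLD TEST: print "CORRECT/TOTAL" over aligned token lines.
tag_accuracy() {
  dir=$(mktemp -d)
  # Keep token lines only (drop comments and blank lines), and keep just
  # the first two columns: token and tag.
  grep -v -e '^%%' -e '^$' "$1" | cut -f1,2 > "$dir/gold"
  grep -v -e '^%%' -e '^$' "$2" | cut -f1,2 > "$dir/test"
  # After paste, column 2 is the gold tag and column 4 the assigned tag.
  paste "$dir/gold" "$dir/test" | awk -F'\t' '
    { total++; if ($2 == $4) correct++ }
    END { printf "%d/%d\n", correct, total }'
  rm -rf "$dir"
}

# Hypothetical sample data: the tagger output gets one of five tags wrong.
tmp=$(mktemp -d)
printf 'This\tDD\nis\tVBZ\na\tAT\ntest\tNN\n.\t$.\n' > "$tmp/test.ttt"
printf 'This\tDD\nis\tVBZ\na\tAT\ntest\tVBZ\n.\t$.\n' > "$tmp/test.tt"
tag_accuracy "$tmp/test.ttt" "$tmp/test.tt"
```

With the sample data above the function prints 4/5, since only the tag for "test" differs.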
%% File : corpus.ttt
%% Description: demonstration corpus for moot tutorial: +tagged,-analyzed
%% Sentence 1
This	DD
is	VBZ
a	AT
sentence	NN
.	$.

%% Sentence 2
This	DD
is	VBZ
another	DD
sentence	NN
.	$.

%% Sentence 3
Here	RL
is	VBZ
a	AT
3rd	MD
sentence	NN
also	RR
.	$.
%% File : corpus.mttt
%% Description: demonstration corpus for moot tutorial: +tagged,+analyzed
%% Sentence 1
This	DD	[AT]	[DD]
is	VBZ	[VBZ]
a	AT	[AT]
sentence	NN	[NN]	[VBZ]
.	$.	[$.]

%% Sentence 2
This	DD	[DD]	[AT]
is	VBZ	[VBZ]
another	DD	[PP]	[NN]
sentence	NN	[NN]	[VBZ]
.	$.	[$.]

%% Sentence 3
Here	RL	[RL]	[ADV]
is	VBZ	[VBZ]
a	AT	[AT]
3rd	MD
sentence	NN	[NN]	[VBZ]
also	RR	[RR]
.	$.	[$.]
This is a test. This is ONLY a test.
This
is
a
test
.

This
is
ONLY
a
test
.
This	[AT]	[DD]
is	[VBZ]
a	[AT]
test	[NN]	[VBZ]
.	[$.]

This	[AT]	[DD]
is	[VBZ]
ONLY
a	[AT]
test	[NN]	[VBZ]
.	[$.]
Bryan Jurish <jurish@uni-potsdam.de>
mootutils(1)