moot TUTORIAL

This document provides a rudimentary introduction to the moot PoS tagging utilities. It should under no circumstances be considered a substitute for the individual program documentation.

The Big Idea

The moot utilities are designed for part-of-speech tagging: assigning a single univocal tag to each input token. The runtime tagger (moot) determines which tags to assign by maximizing, over the whole tag sequence, uni-, bi-, and trigram probabilities, as well as lexical probabilities and optional lexical-class probabilities. Probability data is passed to the runtime tagger in a model. Input to the runtime tagger must be tokenized ("cooked"), and may optionally include for each input token a set of possible analyses (a lexical class) for that token.
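
To make the idea of sequence maximization concrete, here is a minimal Viterbi sketch in Python. It is not moot's implementation: it uses bigrams only (moot also uses trigrams), replaces smoothing with a fixed probability floor, and all names (viterbi, lex_prob, bigram_prob, the "<s>" start marker) are hypothetical and chosen for illustration.

```python
import math

def viterbi(tokens, tags, lex_prob, bigram_prob, start="<s>"):
    """Return the tag sequence with the highest bigram * lexical score.

    lex_prob maps (token, tag) to a probability; bigram_prob maps
    (previous_tag, tag) to a probability.  Both are toy stand-ins
    for the data a real model would supply.
    """
    if not tokens:
        return []
    floor = 1e-12  # assumption: crude substitute for real smoothing

    def log_p(table, key):
        return math.log(table.get(key, floor))

    # best[i][tag] = (log score of the best path ending in tag, previous tag)
    best = [{
        tag: (log_p(bigram_prob, (start, tag)) + log_p(lex_prob, (tokens[0], tag)), None)
        for tag in tags
    }]
    for i in range(1, len(tokens)):
        column = {}
        for tag in tags:
            emit = log_p(lex_prob, (tokens[i], tag))
            column[tag] = max(
                (best[i - 1][prev][0] + log_p(bigram_prob, (prev, tag)) + emit, prev)
                for prev in tags
            )
        best.append(column)

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag][0])]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]
```

Working in log space avoids numeric underflow on long sentences; the choice of floor value is arbitrary and only serves to keep unseen events from producing log(0).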

Training

Before an input file can be tagged, statistical data in the form of a model must first be provided. The easiest way to produce such a model is by using the mootrain utility to gather frequency data from a pre-tagged corpus, thus inducing a maximum-likelihood model.

Training from a Tagged Corpus

If you have a text-format corpus in the file "corpus.ttt" which is pre-tagged with the "correct" part-of-speech tags, then the incantation:

 mootrain --lex --ngrams corpus.ttt

will produce the model files "corpus.lex" and "corpus.123" required for runtime tagging of tokens cooked "rare".
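
As a rough illustration of what gathering frequency data from a pre-tagged corpus involves, the following Python sketch counts (token, tag) pairs and tag uni-, bi-, and trigrams, and derives one maximum-likelihood estimate from them. It assumes whitespace-separated columns, "%%" comment lines, and blank lines between sentences, as in the example files of this tutorial. It ignores sentence-boundary handling and smoothing, and the function names are hypothetical; it is not a description of what mootrain actually writes to its output files.

```python
from collections import Counter

def read_tagged_sentences(lines):
    """Yield sentences as lists of (token, tag) pairs from medium cooked lines."""
    sentence = []
    for line in lines:
        line = line.strip()
        if line.startswith("%%"):       # comment line
            continue
        if not line:                    # blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split()
        sentence.append((fields[0], fields[1]))
    if sentence:
        yield sentence

def count_frequencies(sentences):
    """Return (lexical, ngrams): counts of (token, tag) pairs and of tag n-grams."""
    lexical = Counter()
    ngrams = Counter()
    for sentence in sentences:
        lexical.update(sentence)
        tags = [tag for _, tag in sentence]
        for n in (1, 2, 3):
            for i in range(len(tags) - n + 1):
                ngrams[tuple(tags[i:i + n])] += 1
    return lexical, ngrams

def ml_bigram_probability(ngrams, prev_tag, tag):
    """Maximum-likelihood estimate f(prev_tag, tag) / f(prev_tag), or 0.0 if unseen."""
    denominator = ngrams[(prev_tag,)]
    return ngrams[(prev_tag, tag)] / denominator if denominator else 0.0
```

To try it, pass the lines of a corpus such as "corpus.ttt" to read_tagged_sentences and hand the result to count_frequencies.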

Training from a Tagged and Analyzed Corpus

If you have a text-format corpus in the file "corpus.wdt" which is pre-tagged with the "correct" part-of-speech tags as well as lexical classes, then the incantation:

 mootrain --lex --ngrams --classes corpus.wdt

will produce the model files "corpus.lex", "corpus.123", and "corpus.clx" required for runtime tagging of tokens cooked "medium rare".

If you are using mootm(1) to analyze incoming tokens, but your training corpus "corpus.ttt" contains only tags (and not analyses), you may generate an appropriate "well done" corpus file "corpus.wdt" from "corpus.ttt" by calling:

 mootm -12 -m morph.gfst -s morph.lab -a -o corpus.wdt corpus.ttt

before calling "mootrain". See the mootm(1) manpage for details.

Caveat Praeceptor

Note that training a model from a "well-done" corpus in the manner described above only makes sense if you plan to pass "medium-rare" files produced by exactly the same analyzer to "moot" when tagging new texts. In particular, if not all of the analyses to be passed in "medium-rare" files to the runtime tagger "moot" are encoded in the analysis fst "morph.gfst" (e.g. if some analyses are produced by a preprocessing stage), then such "extra" analyses should also be included in the "well-done" training corpus. Any mismatch between the analyses seen in training and those supplied at runtime may seriously degrade tagger performance. Alternatively, you can disregard the analysis information present in the model (if any) at tagger runtime by specifying the --use-classes=0 option to "moot".
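
One way to spot such a mismatch before tagging is to compare the lexical classes found in the training corpus with those found in a runtime input file. The Python sketch below does this under simplifying assumptions: columns are separated by whitespace, analyses contain no internal whitespace, and a lexical class is treated as the set of analysis strings on a line. The function names are hypothetical, and the check is only a heuristic, not something the moot utilities provide.

```python
def lexical_classes(lines, tagged):
    """Collect the lexical classes (sets of analyses) occurring in cooked lines.

    tagged=True  : "well done" lines, i.e. token, tag, analyses...
    tagged=False : "medium-rare" lines, i.e. token, analyses...
    """
    first_analysis = 2 if tagged else 1
    classes = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%%"):   # skip blanks and comments
            continue
        fields = line.split()
        classes.add(frozenset(fields[first_analysis:]))
    return classes

def unseen_classes(train_lines, runtime_lines):
    """Return the runtime lexical classes that never occur in the training corpus."""
    return (lexical_classes(runtime_lines, tagged=False)
            - lexical_classes(train_lines, tagged=True))
```

A non-empty result suggests that the runtime analyzer and the one used for training differ, or that the training corpus simply does not cover those classes.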

Tokenization

Assume you have some raw (unformatted) text to be tagged in the file "test.txt". Before the text can be tagged, it must first be split into individual tokens. The moot utilities include a rudimentary preprocessor, mootpp, to perform this task. The incantation:

 mootpp -o test.t test.txt

will produce a "rare" cooked file "test.t" suitable for passing to the tagger or to an external analysis program.
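
For readers who want to see the shape of the "rare" format without running mootpp, here is a small Python sketch that produces one token per line with a blank line between sentences. The splitting rules (a regular expression for words and punctuation, sentence breaks after ".", "!" and "?") are assumptions made for this illustration and are not mootpp's actual rules.

```python
import re

# Assumption: a token is a run of word characters or one punctuation character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
SENTENCE_END = {".", "!", "?"}

def tokenize_rare(text):
    """Return text in "rare" cooked form: one token per line, blank line between sentences."""
    lines = []
    for token in TOKEN_RE.findall(text):
        lines.append(token)
        if token in SENTENCE_END:
            lines.append("")            # blank line marks a sentence boundary
    while lines and lines[-1] == "":    # no trailing blank line after the last sentence
        lines.pop()
    if not lines:
        return ""
    return "\n".join(lines) + "\n"
```

Applied to the contents of the example file "test.txt", this sketch yields the same tokens and sentence breaks as the example file "test.t" (without the leading indentation used in this tutorial's listings).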

Analysis (Optional)

If you have an external analysis program such as mootm(1) which assigns (possibly empty) lexical classes to input tokens, and if your model contains lexical class information (i.e. if you trained from a "well done" corpus analyzed by the same program, and if the file "corpus.clx" contains entries for more than one lexical class), then you may at this point wish to filter "test.t" through your analysis program, yielding a "medium-rare" file "test.mr".

For analyzing incoming tokens with the "mootm" program built with libgfsm support, using an analysis transducer "morph.gfst" and analysis labels "morph.lab", the appropriate incantation is:

 mootm -m morph.gfst -s morph.lab -a -o test.mr test.t

See the mootm(1) manpage for details.

Tagging

Having trained a model, as well as tokenized (and optionally analyzed) your input file, you are now ready to call the runtime tagger, moot.

Tagging: Rare Cooked Input

If you are not using an external analysis program, and if you have a trained model in the files "corpus.lex" and "corpus.123", as well as a rare cooked file "test.t" to be tagged, then:

 moot --model=corpus --use-classes=0 -o test.tt test.t

will produce a "medium cooked" (tagged) output file "test.tt".

Tagging: Medium Rare Cooked Input

If you are using an external analysis program such as mootm(1), and if you have a trained model in the files "corpus.lex", "corpus.123", and "corpus.clx", as well as a medium-rare cooked file "test.mr" to be tagged, then:

 moot --model=corpus --use-classes=1 -o test.wd test.mr

will produce a "well done" (+tagged,+analyzed) output file "test.wd".

Evaluation

In the course of model development, it is customary to reserve a small portion of the hand-tagged training corpus for testing. If you have such a medium cooked file "test.ttt" tagged with the "truth", as well as a moot output file "test.tt" for the same tokens, you can check the accuracy of the tagger with the program mooteval.

 mooteval -2 test.ttt test.tt
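
The core quantity of such an evaluation is per-token tag accuracy. As a sketch of that calculation only (mooteval itself may report more than this), the Python code below compares two medium cooked files token by token. It assumes whitespace-separated columns and identical tokenization in both files; the function names are hypothetical.

```python
def read_tags(lines):
    """Return the (token, tag) pairs of medium cooked lines, skipping blanks and comments."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%%"):
            continue
        fields = line.split()
        pairs.append((fields[0], fields[1]))
    return pairs

def tag_accuracy(truth_lines, output_lines):
    """Return the fraction of tokens whose tag in the output matches the truth."""
    truth = read_tags(truth_lines)
    output = read_tags(output_lines)
    if len(truth) != len(output):
        raise ValueError("files contain different numbers of tokens")
    for (truth_token, _), (output_token, _) in zip(truth, output):
        if truth_token != output_token:
            raise ValueError("token mismatch: %r vs %r" % (truth_token, output_token))
    if not truth:
        return 0.0
    correct = sum(1 for (_, t), (_, o) in zip(truth, output) if t == o)
    return correct / len(truth)
```

The token check guards against comparing files that were tokenized differently, in which case an accuracy figure would be meaningless.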

Example Files

corpus.ttt : medium cooked file

 %% File       : corpus.ttt
 %% Description: demonstration corpus for moot tutorial: +tagged,-analyzed
 
 %% Sentence 1
 This           DD
 is             VBZ
 a              AT
 sentence       NN
 .              $.
 
 %% Sentence 2
 This           DD
 is             VBZ
 another        DD
 sentence       NN
 .              $.
 
 %% Sentence 3
 Here           RL
 is             VBZ
 a              AT
 3rd            MD
 sentence       NN
 also           RR
 .              $.

corpus.wdt : well done cooked file

 %% File       : corpus.wdt
 %% Description: demonstration corpus for moot tutorial: +tagged,+analyzed
 
 %% Sentence 1
 This           DD      [AT]    [DD]
 is             VBZ     [VBZ]
 a              AT      [AT]
 sentence       NN      [NN]    [VBZ]
 .              $.      [$.]
 
 %% Sentence 2
 This           DD      [DD]    [AT]
 is             VBZ     [VBZ]
 another        DD      [PP]    [NN]
 sentence       NN      [NN]    [VBZ]
 .              $.      [$.]
 
 %% Sentence 3
 Here           RL      [RL]    [ADV]
 is             VBZ     [VBZ]
 a              AT      [AT]
 3rd            MD
 sentence       NN      [NN]    [VBZ]
 also           RR      [RR]
 .              $.      [$.]

test.txt : raw text file

 This is a test.  This is ONLY a test.

test.t : rare cooked file

 This
 is
 a
 test
 .
 
 This
 is
 ONLY
 a
 test
 .

test.mr : medium-rare cooked file

 This   [AT]    [DD]
 is     [VBZ]
 a      [AT]
 test   [NN]    [VBZ]
 .      [$.]
 
 This   [AT]    [DD]
 is     [VBZ]
 ONLY
 a      [AT]
 test   [NN]    [VBZ]
 .      [$.]

AUTHOR

Bryan Jurish <jurish@uni-potsdam.de>