NAME
SYNOPSIS
DESCRIPTION
ARGUMENTS
OPTIONS
- Format Options
- HMM Options
CONFIGURATION FILES
ADDENDA
- About this Document
BUGS AND LIMITATIONS
ACKNOWLEDGEMENTS
AUTHOR
SEE ALSO

NAME

moot - moocow's HMM part-of-speech tagger/disambiguator.

SYNOPSIS

moot [OPTIONS] INPUT(s)

 Arguments:
    INPUT(s)  Input files / file-lists.

 Options
    -h          --help                       Print help and exit.
    -V          --version                    Print version and exit.
    -cFILE      --rcfile=FILE                Read an alternate configuration file.
    -vLEVEL     --verbose=LEVEL              Verbosity level.
    -B          --no-banner                  Suppress initial banner message (implied at verbosity levels <= 2)
    -H          --no-header                  Suppres leading comments in destination file.
    -dNTOKS     --dots=NTOKS                 Print a dot for every NTOKS tokens processed.
    -l          --list                       INPUTs are file-lists, not filenames.
    -r          --recover                    Attempt to recover from minor errors.
    -S          --stream                     Use stream-wise I/O routines instead of sentence buffers.
    -oFILE      --output=FILE                Specify output file (default=stdout).

 Format Options
    -IFORMAT    --input-format=FORMAT        Specify input file(s) format(s).
    -OFORMAT    --output-format=FORMAT       Specify output file format.
                --input-encoding=ENCODING    Override XML document input encoding.
                --output-encoding=ENCODING   Set default XML output encoding.

 HMM Options
    -MMODEL     --model=MODEL                Use HMM model file(s) MODEL.
    -gBOOL      --hash-ngrams=BOOL           Whether to hash stored n-grams (default=no)
    -aLEN       --trie-depth=LEN             Maximum depth of suffix trie.
    -AFREQ      --trie-threshhold=FREQ       Frequency upper bound for trie inclusion.
                --trie-theta=FLOAT           Suffix backoff coefficient.
    -LBOOL      --use-classes=BOOL           Whether to use lexical class-probabilities.
    -FBOOL      --use-flavors=BOOL           Whether to use token 'flavor' heuristics (default=1 (true))
    -RBOOL      --relax=BOOL                 Whether to relax token-tag associability (default=1 (true))
    -NFLOATS    --nlambdas=FLOATS            N-Gram smoothing constants (default=estimate)
    -WFLOATS    --wlambdas=FLOATS            Lexical smoothing constants (default=estimate)
    -CFLOATS    --clambdas=FLOATS            Lexical-class smoothing constants (default=estimate)
    -tDOUBLE    --unknown-threshhold=DOUBLE  Freq. threshhold for 'unknown' lexical probabilities
    -TDOUBLE    --class-threshhold=DOUBLE    Freq. threshhold for 'unknown' class probabilities
    -uNAME      --unknown-token=NAME         Symbolic name of the 'unknown' token
    -UNAME      --unknown-tag=NAME           Symbolic name of the 'unknown' tag
    -eTAG       --eos-tag=TAG                Specify boundary tag (default=__$)
    -ZDOUBLE    --beam-width=DOUBLE          Specify cutoff factor for beam pruning
                --save-ambiguities           Annotate tagged tokens with lexical ambiguities
    -m          --mark-unknown               Mark unknown tokens.

DESCRIPTION

moocow's HMM part-of-speech tagger/disambiguator.

'moot' is a Hidden Markov Model (HMM) Part-of-Spech (PoS) tagger / disambiguator program based on the 'libmoot' library.

It takes as its input one or more 'rare' (-tagged,-analyzed) or 'medium rare' (-tagged,+analyzed) files and produces a 'medium' (+tagged,-analyzed) or 'well done' (+tagged,+analyzed) file, respectively. See mootfiles for details on moot file formats.

ARGUMENTS

INPUT(s)

Input files / file-lists.

Input files should be 'cooked' text files of either the 'rare' (-tagged,-analyzed) or 'medium rare' (-tagged,+analyzed) variety.

OPTIONS

--help , -h

Print help and exit.

Default: '0'

--version , -V

Print version and exit.

Default: '0'

--rcfile=FILE , -cFILE

Read an alternate configuration file.

Default: 'NULL'

Format Options

--input-format=FORMAT , -IFORMAT

Specify input file(s) format(s).

Default: 'NULL'

Value should be a comma-separated list of format flag names, optionally prefixed with an exclamation point (!) to indicate negation.

Default='MediumRare'

See 'I/O Format Flags' in mootfiles for details.

--output-format=FORMAT , -OFORMAT

Specify output file format.

Default: 'NULL'

Value should be a comma-separated list of format flag names, optionally prefixed with an exclamation point (!) to indicate negation.

Default='WellDone'

See 'I/O Format Flags' in mootfiles for details.

--input-encoding=ENCODING

Override XML document input encoding.

Default: 'NULL'

Potentially useful for XML documents without encoding declarations.

--output-encoding=ENCODING

Set default XML output encoding.

Default: 'NULL'

Slower, but potentially useful for human-readable XML output.

HMM Options

--model=MODEL , -MMODEL

Use HMM model file(s) MODEL.

Default: 'moothmm'

See 'HMM MODEL FILE FORMATS' in mootfiles for details on model file formats.

--hash-ngrams=BOOL , -gBOOL

Whether to hash stored n-grams (default=no)

Default: '0'

If specified and true, tag n-grams will be stored in a slow but memory-friendly hash. Otherwise, a fast but large array will be used.

--trie-depth=LEN , -aLEN

Maximum depth of suffix trie.

Default: '0'

Use suffixes of up to LEN characters to estimate probabilities of unknown words.

Warning: this feature is EXPERIMENTAL! Use at your own risk.

--trie-threshhold=FREQ , -AFREQ

Frequency upper bound for trie inclusion.

Default: '10'

Use words of at most frequency FREQ to construct the suffix trie.

--trie-theta=FLOAT

Suffix backoff coefficient.

Default: '0'

Specify suffix-trie backoff coefficient for smoothing. Specifying a value of zero (the default) causes the smoothing coefficient to be estimated.

--use-classes=BOOL , -LBOOL

Whether to use lexical class-probabilities.

Default: '1'

Only useful if your file contains a priori analyses. Default behavior is to try and use classes if you specify a non-empty class-frequency file.

--use-flavors=BOOL , -FBOOL

Whether to use token 'flavor' heuristics (default=1 (true))

Default: '1'

Default behavior is to use flavor heuristics as specified by the model (usually built-in).

--relax=BOOL , -RBOOL

Whether to relax token-tag associability (default=1 (true))

Default: '1'

If nonzero, 'tag' fields of token analyses will be used only as a potential estimator of lexical probability, if at all. Otherwise (regardless of whether lexical classes are are being used as a probability estimator), 'tag' fields of token analyses will be interpreted as imposing 'hard' restrictions on which tags may occur with the token in question.

See the --use-classes=BOOL option and/or mootfiles for more details on the use of lexical classes.

--nlambdas=FLOATS , -NFLOATS

N-Gram smoothing constants (default=estimate)

Default: 'NULL'

FLOATS should be a string of the form "LAMBDA1,LAMBDA2,LAMBDA3" (without the quotes), where each LAMBDA$i is a floating-point constant.

LAMBDA_1

is the constant smoothing coefficient for unigram probabilities,

LAMBDA_2

is the constant smoothing coefficient for bigram probabilities,

LAMBDA_3

is the constant smoothing coefficient for trigram probabilities (only meaningful if libmoot was built with '--enable-trigrams=yes'. See the output of

 mootconfig --options

for details.

If you override the default values, you should choose values such that LAMBDA_1 + LAMBDA_2 + LAMBDA_3 == 1.0.

--wlambdas=FLOATS , -WFLOATS

Lexical smoothing constants (default=estimate)

Default: 'NULL'

FLOATS should be a string of the form "LAMBDA_W0,LAMBDA_W1" (without the quotes), where each LAMBDA_W$i is a floating-point constant.

LAMBDA_W0: is the constant minimum lexical probability,
LAMBDA_W1: is the constant smoothing coefficient for lexical probabilities.

If you override the default values, you should choose values such that LAMBDA_W0 + LAMBDA_W1 == 1.0.

--clambdas=FLOATS , -CFLOATS

Lexical-class smoothing constants (default=estimate)

Default: 'NULL'

LAMBDAS should be a string of the form "LAMBDA_C0,LAMBDA_C1" (without the quotes), where each LAMBDA_C$i is a floating-point constant.

LAMBDA_C0: is the constant minimum lexical-class probability,
LAMBDA_C1: is the constant smoothing coefficient for lexical-class probabilities.

If you override the default values, you should choose values such that LAMBDA_C0 + LAMBDA_C1 == 1.0.

--unknown-threshhold=DOUBLE , -tDOUBLE

Freq. threshhold for 'unknown' lexical probabilities

Default: '1.0'

Lexical probabilities for unknown tokens in the input are estimated from tokens which occur at most FLOAT times in the model.

--class-threshhold=DOUBLE , -TDOUBLE

Freq. threshhold for 'unknown' class probabilities

Default: '1.0'

Class probabilities for unrecognized tokens in the input are estimated from classes which occur at most FLOAT times in the model and/or from the empty class.

--unknown-token=NAME , -uNAME

Symbolic name of the 'unknown' token

Default: '@UNKNOWN'

You can use this value to include lexical frequency information for unknown input tokens in the lexical model file.

--unknown-tag=NAME , -UNAME

Symbolic name of the 'unknown' tag

Default: 'UNKNOWN'

You should never see or need this tag.

--eos-tag=TAG , -eTAG

Specify boundary tag (default=__$)

Default: '__$'

This is the pseudo-tag used in the n-gram model file to represent sentence boundaries, both beginning- and end-of-sentence. It should not be an element of the actual tag-set -- that is, it should not be a valid analysis for any token.

--beam-width=DOUBLE , -ZDOUBLE

Specify cutoff factor for beam pruning

Default: '1000'

During Viterbi search, paths will be ignored if their probabilities are less than 1/NUM*p_best , where p_best is the probability of the current best path. Setting this option to zero disables beam pruning.

--save-ambiguities

Annotate tagged tokens with lexical ambiguities

Default: '0'

Save tag-wise ambiguity probabilities as analyses following a '@@'. Useful for debugging together with the 'Cost' output flag.

--mark-unknown , -m

Mark unknown tokens.

Default: '0'

Mark tokens whose literal text is not known to the lexicon by appending a '*' analysis. Useful for model debugging.

CONFIGURATION FILES

Configuration files are expected to contain lines of the form:

    LONG_OPTION_NAME    OPTION_VALUE

where LONG_OPTION_NAME is the long name of some option, without the leading '--', and OPTION_VALUE is the value for that option, if any. Fields are whitespace-separated. Blank lines and comments (lines beginning with '#') are ignored.

The following configuration files are read by default:

/etc/mootrc
~/.mootrc

ADDENDA

About this Document

Documentation file auto-generated by optgen.perl version 0.15 using Getopt::Gen version 0.15. Translation was initiated as:

   optgen.perl -l --nocfile --nohfile --notimestamp -F moot moot.gog

BUGS AND LIMITATIONS

None known.

ACKNOWLEDGEMENTS

Initial development of the this was supported by the project 'Kollokationen im Wörterbuch' ( "collocations in the dictionary", http://www.bbaw.de/forschung/kollokationen ) in association with the project 'Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS)' ( "digital dictionary of the German language of the 20th century", http://www.dwds.de ) at the Berlin-Brandenburgische Akademie der Wissenschaften ( http://www.bbaw.de ) with funding from the Alexander von Humboldt Stiftung ( http://www.avh.de ) and from the Zukunftsinvestitionsprogramm of the German federal government. Development of the DynHMM and WASTE extensions was supported by the DFG-funded projects 'Deutsches Textarchiv' ( "German text archive", http://www.deutschestextarchiv.de ) and 'DLEX' at the Berlin-Brandenburgische Akademie der Wissenschaften.

The authors are grateful to Christiane Fellbaum, Alexander Geyken, Gerald Neumann, Edmund Pohl, Alexey Sokirko, and others for offering useful insights in the course of development of this package. Thomas Hanneforth wrote and maintains the libFSM C++ library for finite-state device operations used by the class-based HMM tagger / disambiguator, without which moot could not have been built. Alexander Geyken and Thomas Hanneforth developed the rule-based morphological analysis system for German which was used in the development and testing of the class-based HMM tagger / disambiguator.

AUTHOR

Bryan Jurish <moocow@cpan.org>

NAME

SYNOPSIS

DESCRIPTION

ARGUMENTS

OPTIONS

Format Options

HMM Options

CONFIGURATION FILES

ADDENDA

About this Document

BUGS AND LIMITATIONS

ACKNOWLEDGEMENTS

AUTHOR

SEE ALSO