moot - moocow's HMM part-of-speech tagger/disambiguator.
moot [OPTIONS] INPUT(s)
Arguments: INPUT(s) Input files / file-lists.
Options
  -h       --help            Print help and exit.
  -V       --version         Print version and exit.
  -cFILE   --rcfile=FILE     Read an alternate configuration file.
  -vLEVEL  --verbose=LEVEL   Verbosity level.
  -H       --no-header       Suppress leading comments in destination file.
  -dNTOKS  --dots=NTOKS      Print a dot for every NTOKS tokens processed.
  -l       --list            INPUTs are file-lists, not filenames.
  -oFILE   --output=FILE     Specify output file (default=stdout).
Format Options
  -IFORMAT  --input-format=FORMAT       Specify input file(s) format(s).
  -OFORMAT  --output-format=FORMAT      Specify output file format.
            --input-encoding=ENCODING   Override XML document input encoding.
            --output-encoding=ENCODING  Set default XML output encoding.
HMM Options
  -MMODEL   --model=MODEL                Use HMM model file(s) MODEL.
  -aLEN     --trie-depth=LEN             Maximum depth of suffix trie.
  -AFREQ    --trie-threshhold=FREQ       Frequency upper bound for trie inclusion.
            --trie-theta=FLOAT           Suffix backoff coefficient.
  -LBOOL    --use-classes=BOOL           Whether to use lexical class-probabilities.
  -NFLOATS  --nlambdas=FLOATS            N-Gram smoothing constants (default=estimate).
  -WFLOATS  --wlambdas=FLOATS            Lexical smoothing constants (default=estimate).
  -CFLOATS  --clambdas=FLOATS            Lexical-class smoothing constants (default=estimate).
  -tDOUBLE  --unknown-threshhold=DOUBLE  Frequency threshold for 'unknown' lexical probabilities.
  -TDOUBLE  --class-threshhold=DOUBLE    Frequency threshold for 'unknown' class probabilities.
  -uNAME    --unknown-token=NAME         Symbolic name of the 'unknown' token.
  -UNAME    --unknown-tag=NAME           Symbolic name of the 'unknown' tag.
  -eTAG     --eos-tag=TAG                Specify boundary tag (default=__$).
  -ZDOUBLE  --beam-width=DOUBLE          Specify cutoff factor for beam pruning.
  -S        --save-ambiguities           Annotate tagged tokens with lexical ambiguities.
  -m        --mark-unknown               Mark unknown tokens.
'moot' is a Hidden Markov Model (HMM) Part-of-Speech (PoS) tagger / disambiguator program based on the 'libmoot' library.
It takes as its input one or more 'rare' (-tagged,-analyzed) or 'medium rare' (-tagged,+analyzed) files and produces a 'medium' (+tagged,-analyzed) or 'well done' (+tagged,+analyzed) file, respectively. See the mootfiles manpage for details on moot file formats.
INPUT(s)
Input files should be 'cooked' text files of either the 'rare' (-tagged,-analyzed) or 'medium rare' (-tagged,+analyzed) variety.
See also the '--list' option.
For details on moot file formats, see the mootfiles manpage.
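As an illustration of how the options documented on this page combine on one command line, here is a minimal Python sketch that only assembles an argument list; it does not run moot. The helper name build_moot_command and the file names are hypothetical; the flags themselves (--model, --output, --list) are the ones listed above.

```python
import shlex


def build_moot_command(inputs, model="moothmm", output="-", use_list=False):
    """Assemble an argv list for moot from options documented on this page.

    The defaults mirror the documented defaults: model 'moothmm', output '-'.
    """
    argv = ["moot", f"--model={model}", f"--output={output}"]
    if use_list:
        # With --list, the INPUT arguments name file-lists, not input files.
        argv.append("--list")
    argv.extend(inputs)
    return argv


# Hypothetical file names, for illustration only.
cmd = build_moot_command(["corpus.mr.txt"], output="corpus.wd.txt")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run() on a system where moot is installed.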
--help, -h
Print help and exit.
Default: '0'
--version, -V
Print version and exit.
Default: '0'
--rcfile=FILE, -cFILE
Read an alternate configuration file.
Default: 'NULL'
See also: CONFIGURATION FILES.
--verbose=LEVEL, -vLEVEL
Verbosity level.
Default: '3'
Be more or less verbose. Recognized values are in the range 0..5.
--no-header, -H
Suppress leading comments in destination file.
Default: '0'
Primarily useful as a workaround for nonconformant conservative XML output.
--dots=NTOKS, -dNTOKS
Print a dot for every NTOKS tokens processed.
Default: '0'
Zero (the default) means that no dots will be printed.
--list, -l
INPUTs are file-lists, not filenames.
Default: '0'
Useful for large batch-processing jobs.
--output=FILE, -oFILE
Specify output file (default=stdout).
Default: '-'
--input-format=FORMAT, -IFORMAT
Specify input file(s) format(s).
Default: 'NULL'
Value should be a comma-separated list of format flag names, optionally prefixed with an exclamation point (!) to indicate negation.
Default='MediumRare'
See 'I/O Format Flags' in the mootfiles manpage for details.
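The FORMAT value syntax described above (comma-separated flag names, each optionally negated with '!') can be sketched in Python as follows. This only illustrates the syntax; how moot itself resolves the flags is documented in the mootfiles manpage. The function name parse_format_flags and the flag name 'ExampleFlag' are hypothetical; 'MediumRare' is the default named above.

```python
def parse_format_flags(spec):
    """Split a FORMAT value into (enabled, negated) lists of flag names."""
    enabled, negated = [], []
    for item in spec.split(","):
        name = item.strip()
        if not name:
            continue  # tolerate empty fields such as a trailing comma
        if name.startswith("!"):
            negated.append(name[1:])  # '!' prefix marks negation
        else:
            enabled.append(name)
    return enabled, negated


# 'ExampleFlag' is a made-up flag name used only to show negation.
print(parse_format_flags("MediumRare,!ExampleFlag"))
```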
--output-format=FORMAT, -OFORMAT
Specify output file format.
Default: 'NULL'
Value should be a comma-separated list of format flag names, optionally prefixed with an exclamation point (!) to indicate negation.
Default='WellDone'
See 'I/O Format Flags' in the mootfiles manpage for details.
--input-encoding=ENCODING
Override XML document input encoding.
Default: 'NULL'
Potentially useful for XML documents without encoding declarations.
--output-encoding=ENCODING
Set default XML output encoding.
Default: 'NULL'
Slower, but potentially useful for human-readable XML output.
--model=MODEL, -MMODEL
Use HMM model file(s) MODEL.
Default: 'moothmm'
See 'HMM MODEL FILE FORMATS' in the mootfiles manpage for details on model file formats.
--trie-depth=LEN, -aLEN
Maximum depth of suffix trie.
Default: '0'
Use suffixes of up to LEN characters to estimate probabilities of unknown words.
Warning: this feature is EXPERIMENTAL! Use at your own risk.
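To make "suffixes of up to LEN characters" concrete, the following Python sketch enumerates the suffixes that a depth limit of LEN would admit for a single word. It is an illustration only and says nothing about how moot stores or weights these suffixes; the function name candidate_suffixes is hypothetical.

```python
def candidate_suffixes(word, max_len):
    """Return the suffixes of word with 1..max_len characters, longest first."""
    longest = min(max_len, len(word))
    return [word[-n:] for n in range(longest, 0, -1)]


print(candidate_suffixes("tagging", 3))  # ['ing', 'ng', 'g']
print(candidate_suffixes("tagging", 0))  # [] (a depth of 0 admits no suffixes)
```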
--trie-threshhold=FREQ, -AFREQ
Frequency upper bound for trie inclusion.
Default: '10'
Use only words whose frequency is at most FREQ to construct the suffix trie.
--trie-theta=FLOAT
Suffix backoff coefficient.
Default: '0'
Specify suffix-trie backoff coefficient for smoothing. Specifying a value of zero (the default) causes the smoothing coefficient to be estimated.
--use-classes=BOOL, -LBOOL
Whether to use lexical class-probabilities.
Default: '1'
Only useful if your file contains a priori analyses. The default behavior is to try to use classes if you specify a non-empty class-frequency file.
--nlambdas=FLOATS, -NFLOATS
N-Gram smoothing constants (default=estimate).
Default: 'NULL'
FLOATS should be a string of the form "LAMBDA_1,LAMBDA_2,LAMBDA_3" (without the quotes), where each LAMBDA_i is a floating-point constant.
See 'mootconfig --options' for details.
If you override the default values, you should choose values such that LAMBDA_1 + LAMBDA_2 + LAMBDA_3 == 1.0.
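The sum-to-one recommendation can be checked before a FLOATS value is handed to moot. The Python sketch below validates a list of smoothing constants and joins them into the comma-separated form; the helper name format_lambdas and the example values are hypothetical, and the same check applies to the two-element values of --wlambdas and --clambdas.

```python
import math


def format_lambdas(lambdas):
    """Join smoothing constants into a FLOATS string, checking they sum to 1.0."""
    if not math.isclose(sum(lambdas), 1.0, abs_tol=1e-9):
        raise ValueError(f"lambdas sum to {sum(lambdas)}, expected 1.0")
    return ",".join(repr(float(x)) for x in lambdas)


# Example values chosen for illustration only.
print("--nlambdas=" + format_lambdas([0.1, 0.3, 0.6]))
```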
--wlambdas=FLOATS, -WFLOATS
Lexical smoothing constants (default=estimate).
Default: 'NULL'
FLOATS should be a string of the form "LAMBDA_W0,LAMBDA_W1" (without the quotes), where each LAMBDA_Wi is a floating-point constant.
If you override the default values, you should choose values such that LAMBDA_W0 + LAMBDA_W1 == 1.0.
--clambdas=FLOATS, -CFLOATS
Lexical-class smoothing constants (default=estimate).
Default: 'NULL'
FLOATS should be a string of the form "LAMBDA_C0,LAMBDA_C1" (without the quotes), where each LAMBDA_Ci is a floating-point constant.
If you override the default values, you should choose values such that LAMBDA_C0 + LAMBDA_C1 == 1.0.
--unknown-threshhold=DOUBLE, -tDOUBLE
Frequency threshold for 'unknown' lexical probabilities.
Default: '1.0'
Lexical probabilities for unknown tokens in the input are estimated from tokens which occur at most DOUBLE times in the model.
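One common way to realize such a threshold is to pool the tag counts of all rare tokens into a single distribution for the 'unknown' token. The Python sketch below shows that idea under stated assumptions: the data layout (a dict of per-token tag counts), the function name unknown_tag_distribution, and the toy counts are all hypothetical, and moot's actual estimation procedure may differ.

```python
from collections import Counter


def unknown_tag_distribution(lexical_counts, threshold=1.0):
    """Pool tag counts of tokens whose total frequency is at most threshold.

    lexical_counts maps token -> {tag: count}. Returns {tag: probability}.
    """
    pooled = Counter()
    for tag_counts in lexical_counts.values():
        if sum(tag_counts.values()) <= threshold:
            pooled.update(tag_counts)  # rare token: contributes to 'unknown'
    total = sum(pooled.values())
    return {tag: n / total for tag, n in pooled.items()} if total else {}


# Toy counts, invented for illustration.
counts = {
    "the": {"DET": 120},
    "zyzzyva": {"NN": 1},
    "frob": {"NN": 1},
    "blorp": {"VB": 1},
}
print(unknown_tag_distribution(counts, threshold=1.0))
```

With these toy counts only the three tokens seen once contribute, so the frequent token "the" has no influence on the pooled distribution.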
--class-threshhold=DOUBLE, -TDOUBLE
Frequency threshold for 'unknown' class probabilities.
Default: '1.0'
Class probabilities for unrecognized tokens in the input are estimated from classes which occur at most DOUBLE times in the model and/or from the empty class.
--unknown-token=NAME, -uNAME
Symbolic name of the 'unknown' token.
Default: '@UNKNOWN'
You can use this value to include lexical frequency information for unknown input tokens in the lexical model file.
--unknown-tag=NAME, -UNAME
Symbolic name of the 'unknown' tag.
Default: 'UNKNOWN'
You should never see or need this tag.
--eos-tag=TAG, -eTAG
Specify boundary tag (default=__$).
Default: '__$'
This is the pseudo-tag used in the n-gram model file to represent sentence boundaries, both beginning- and end-of-sentence. It should not be an element of the actual tag-set -- that is, it should not be a valid analysis for any token.
--beam-width=DOUBLE, -ZDOUBLE
Specify cutoff factor for beam pruning.
Default: '1000'
During Viterbi search, paths will be ignored if their probabilities are less than (1/DOUBLE) * p_best, where p_best is the probability of the current best path. Setting this option to zero disables beam pruning.
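The pruning rule can be sketched in a few lines of Python. This is an illustration of the cutoff described above, not moot's implementation; the function name prune_paths, the dict-based representation of paths, and the example probabilities are all hypothetical.

```python
def prune_paths(path_probs, beam_width):
    """Keep paths whose probability is at least p_best / beam_width.

    path_probs maps a path label to its probability.
    A beam_width of zero disables pruning, as documented for --beam-width.
    """
    if beam_width == 0 or not path_probs:
        return dict(path_probs)
    p_best = max(path_probs.values())
    cutoff = p_best / beam_width
    return {path: p for path, p in path_probs.items() if p >= cutoff}


# Invented probabilities: with a width of 1000 the cutoff is 0.02 / 1000.
paths = {"A": 0.02, "B": 0.00005, "C": 0.000001}
print(sorted(prune_paths(paths, 1000)))  # ['A', 'B']
print(sorted(prune_paths(paths, 0)))     # ['A', 'B', 'C']
```

A larger width keeps more low-probability paths alive, trading speed for a lower risk of discarding the path that would eventually win.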
--save-ambiguities, -S
Annotate tagged tokens with lexical ambiguities.
Default: '0'
Useful for debugging.
--mark-unknown, -m
Mark unknown tokens.
Default: '0'
Useful for debugging.
Configuration files are expected to contain lines of the form:
LONG_OPTION_NAME OPTION_VALUE
where LONG_OPTION_NAME is the long name of some option, without the leading '--', and OPTION_VALUE is the value for that option, if any. Fields are whitespace-separated. Blank lines and comments (lines beginning with '#') are ignored.
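The line format above is simple enough to sketch as a small Python parser. This is an illustration of the described syntax, not the parser moot uses; the function name parse_rcfile_text and the example file contents are hypothetical, and tolerating leading whitespace before '#' is an assumption.

```python
def parse_rcfile_text(text):
    """Parse configuration text into {LONG_OPTION_NAME: OPTION_VALUE or None}."""
    options = {}
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comments (assumption: leading spaces allowed).
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split(None, 1)  # split on the first whitespace run
        name = fields[0]
        value = fields[1] if len(fields) > 1 else None  # flags carry no value
        options[name] = value
    return options


# Hypothetical configuration using long option names from this page.
example = """\
# example moot configuration
model moothmm
beam-width 1000

mark-unknown
"""
print(parse_rcfile_text(example))
```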
The following configuration files are read by default:
Documentation file auto-generated by optgen.perl version 0.04. Translation was initiated on Wed Jul 6 12:52:23 CEST 2005 as:
/usr/bin/optgen.perl -l --nocfile --nohfile -F moot moot.gog
Bugs and limitations: none known.
Development of this package was supported by the project 'Kollokationen im Wörterbuch' ("collocations in the dictionary", http://www.bbaw.de/forschung/kollokationen) in association with the project 'Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS)' ("digital dictionary of the German language of the 20th century", http://www.dwds.de) at the Berlin-Brandenburgische Akademie der Wissenschaften (http://www.bbaw.de) with funding from the Alexander von Humboldt Stiftung (http://www.avh.de) and from the Zukunftsinvestitionsprogramm of the German federal government.
I am grateful to Christiane Fellbaum, Alexander Geyken, Gerald Neumann, Edmund Pohl, Alexey Sokirko, and others for offering useful insights in the course of development of this package.
Thomas Hanneforth wrote and maintains the libFSM C++ library for finite-state device operations used by the class-based HMM tagger / disambiguator, without which this package could not have been built.
Alexander Geyken and Thomas Hanneforth developed the rule-based morphological analysis system for German which was used in the development and testing of the class-based HMM tagger / disambiguator.
Bryan Jurish <moocow@ling.uni-potsdam.de>
the mootfiles manpage, the mootpp manpage, mootm(1), the mootrain manpage, the mootcompile manpage, the mootdump manpage, the mooteval manpage