mootdyn - moocow's dynamic HMM part-of-speech tagger/disambiguator.
mootdyn [OPTIONS] INPUT(s)
Arguments:
INPUT(s) Input files / file-lists.
Options
-h --help Print help and exit.
-V --version Print version and exit.
-cFILE --rcfile=FILE Read an alternate configuration file.
-vLEVEL --verbose=LEVEL Verbosity level.
--no-banner Suppress initial banner message (implied at verbosity levels <= 2)
-H --no-header Suppress leading comments in destination file.
-dNTOKS --dots=NTOKS Print a dot for every NTOKS tokens processed.
-l --list INPUTs are file-lists, not filenames.
-r --recover Attempt to recover from minor errors.
-oFILE --output=FILE Specify output file (default=stdout).
Format Options
-IFORMAT --input-format=FORMAT Specify input file(s) format(s).
-OFORMAT --output-format=FORMAT Specify output file format.
--input-encoding=ENCODING Override XML document input encoding.
--output-encoding=ENCODING Set default XML output encoding.
HMM Options
-MMODEL --model=MODEL Use HMM model file(s) MODEL.
-gBOOL --hash-ngrams=BOOL Whether to hash stored n-grams (default=yes)
-aLEN --trie-depth=LEN Maximum depth of suffix trie.
-AFREQ --trie-threshhold=FREQ Frequency upper bound for trie inclusion.
--trie-theta=FLOAT Suffix backoff coefficient.
-LBOOL --use-classes=BOOL Whether to use lexical class-probabilities.
-RBOOL --relax=BOOL Whether to relax token-tag associability (default=1 (true))
-NFLOATS --nlambdas=FLOATS N-Gram smoothing constants (default=estimate)
-WFLOATS --wlambdas=FLOATS Lexical smoothing constants (default=estimate)
-CFLOATS --clambdas=FLOATS Lexical-class smoothing constants (default=estimate)
-tDOUBLE --unknown-threshhold=DOUBLE Frequency threshold for 'unknown' lexical probabilities
-TDOUBLE --class-threshhold=DOUBLE Frequency threshold for 'unknown' class probabilities
-uNAME --unknown-token=NAME Symbolic name of the 'unknown' token
-UNAME --unknown-tag=NAME Symbolic name of the 'unknown' tag
-eTAG --eos-tag=TAG Specify boundary tag (default=__$)
-ZDOUBLE --beam-width=DOUBLE Specify cutoff factor for beam pruning
-S --save-ambiguities Annotate tagged tokens with lexical ambiguities
-m --mark-unknown Mark unknown tokens.
Dynamic HMM Options
-DCLASS --dyn-class=CLASS Specify built-in dynamic estimator (default='Freq')
-iBOOL --dyn-invert=BOOL Estimate p(w|t)~=p(t|w)? (default=1)
-bFLOAT --dyn-base=FLOAT Base for Maxwell-Boltzmann estimator (default=2)
-BFLOAT --dyn-beta=FLOAT Temperature coefficient for Maxwell-Boltzmann estimator (default=1)
-wTAG --dyn-new-tag=TAG Specify pseudo-tag for new analyses (default='@NEW')
-EFLOAT --dyn-freq-eps=FLOAT Specify dynamic lexical pseudo-frequency smoothing constant (default=0.1)
-x --dyn-text-tags Use token text field as n-gram source for MIParser
moocow's dynamic HMM part-of-speech tagger/disambiguator.
'mootdyn' is a dynamic Hidden Markov Model (HMM) tagger / disambiguator program based on the 'libmoot' library.
Calling conventions are largely the same as for the moot program.
INPUT(s)
Input files / file-lists.
Input files should be 'cooked' text files of either the 'rare' (-tagged,-analyzed) or 'medium rare' (-tagged,+analyzed) variety.
See also the '--list' option.
For details on moot file formats, see mootfiles.
--help
, -h
Print help and exit.
Default: '0'
--version
, -V
Print version and exit.
Default: '0'
--rcfile=FILE
, -cFILE
Read an alternate configuration file.
Default: 'NULL'
See also: "CONFIGURATION FILES".
--verbose=LEVEL
, -vLEVEL
Verbosity level.
Default: '4'
Be more or less verbose. Recognized values are in the range 0..5:
0: Be silent.
1: Print error messages to stderr.
2: Print warnings to stderr.
3: Print summary information to stderr.
4: Print progress information to stderr.
5: Print everything.
--no-banner
Suppress initial banner message (implied at verbosity levels <= 2)
Default: '0'
--no-header
, -H
Suppress leading comments in destination file.
Default: '0'
Primarily useful as a workaround for nonconformant conservative XML output.
--dots=NTOKS
, -dNTOKS
Print a dot for every NTOKS tokens processed.
Default: '0'
Zero (the default) means that no dots will be printed.
--list
, -l
INPUTs are file-lists, not filenames.
Default: '0'
If this flag is given, the INPUT(s) arguments should be lists of input filenames, one filename per line, which should be processed. Otherwise, the INPUT(s) arguments are interpreted as filenames of the input files themselves.
Potentially useful for large batch-processing jobs.
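The file-list convention can be illustrated with a short Python sketch. This is not mootdyn's own implementation: the function name expand_file_lists is hypothetical, and skipping blank lines is an assumption that the text above does not confirm.

```python
from pathlib import Path
from typing import Iterable, List


def expand_file_lists(list_files: Iterable[str]) -> List[str]:
    """Collect the input filenames named in each file-list, one per line."""
    inputs: List[str] = []
    for list_file in list_files:
        text = Path(list_file).read_text(encoding="utf-8")
        for line in text.splitlines():
            name = line.strip()
            if name:  # assumption: blank lines are skipped
                inputs.append(name)
    return inputs
```

A batch job could build such a list once and then pass it with '--list' instead of naming every input on the command line.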
--recover
, -r
Attempt to recover from minor errors.
Default: '0'
Minor errors such as missing files, etc. cause an error message to be emitted but do not cause the program to abort if this option is specified.
Potentially useful for large automated batch-processing jobs.
--output=FILE
, -oFILE
Specify output file (default=stdout).
Default: '-'
--input-format=FORMAT
, -IFORMAT
Specify input file(s) format(s).
Default: 'NULL'
Value should be a comma-separated list of format flag names, optionally prefixed with an exclamation point (!) to indicate negation.
When this option is not given, the format defaults to 'MediumRare'.
See 'I/O Format Flags' in mootfiles for details.
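The FORMAT syntax (a comma-separated list of flag names, each optionally negated with '!') can be sketched as follows. The function name parse_format_flags is hypothetical, and the sketch only splits the string; it does not know which flag names are valid, which is documented in mootfiles.

```python
from typing import List, Tuple


def parse_format_flags(value: str) -> List[Tuple[str, bool]]:
    """Split a FORMAT value into (flag_name, enabled) pairs."""
    flags: List[Tuple[str, bool]] = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # assumption: empty items are ignored
        if item.startswith("!"):
            flags.append((item[1:], False))  # '!' prefix negates the flag
        else:
            flags.append((item, True))
    return flags
```

The same syntax applies to both '--input-format' and '--output-format'.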
--output-format=FORMAT
, -OFORMAT
Specify output file format.
Default: 'NULL'
Value should be a comma-separated list of format flag names, optionally prefixed with an exclamation point (!) to indicate negation.
When this option is not given, the format defaults to 'WellDone'.
See 'I/O Format Flags' in mootfiles for details.
--input-encoding=ENCODING
Override XML document input encoding.
Default: 'NULL'
Potentially useful for XML documents without encoding declarations.
--output-encoding=ENCODING
Set default XML output encoding.
Default: 'NULL'
Slower, but potentially useful for human-readable XML output.
--model=MODEL
, -MMODEL
Use HMM model file(s) MODEL.
Default: 'moothmm'
See 'HMM MODEL FILE FORMATS' in mootfiles for details on model file formats.
--hash-ngrams=BOOL
, -gBOOL
Whether to hash stored n-grams (default=yes)
Default: '1'
--trie-depth=LEN
, -aLEN
Maximum depth of suffix trie.
Default: '0'
Use suffixes of up to LEN characters to estimate probabilities of unknown words.
Warning: this feature is EXPERIMENTAL! Use at your own risk.
--trie-threshhold=FREQ
, -AFREQ
Frequency upper bound for trie inclusion.
Default: '10'
Use words of at most frequency FREQ to construct the suffix trie.
--trie-theta=FLOAT
Suffix backoff coefficient.
Default: '0'
Specify suffix-trie backoff coefficient for smoothing. Specifying a value of zero (the default) causes the smoothing coefficient to be estimated.
--use-classes=BOOL
, -LBOOL
Whether to use lexical class-probabilities.
Default: '1'
Only useful if your file contains a priori analyses. Default behavior is to try and use classes if you specify a non-empty class-frequency file.
--relax=BOOL
, -RBOOL
Whether to relax token-tag associability (default=1 (true))
Default: '1'
If nonzero, 'tag' fields of token analyses will be used only as a potential estimator of lexical probability, if at all. Otherwise (regardless of whether lexical classes are being used as a probability estimator), 'tag' fields of token analyses will be interpreted as imposing 'hard' restrictions on which tags may occur with the token in question.
See the --use-classes=BOOL option and/or mootfiles for more details on the use of lexical classes.
--nlambdas=FLOATS
, -NFLOATS
N-Gram smoothing constants (default=estimate)
Default: 'NULL'
FLOATS should be a string of the form "LAMBDA1,LAMBDA2,LAMBDA3" (without the quotes), where each LAMBDA$i is a floating-point constant:
LAMBDA1 is the constant smoothing coefficient for unigram probabilities,
LAMBDA2 is the constant smoothing coefficient for bigram probabilities,
LAMBDA3 is the constant smoothing coefficient for trigram probabilities (only meaningful if libmoot was built with '--enable-trigrams=yes'; see the output of 'mootconfig --options' for details).
If you override the default values, you should choose values such that LAMBDA1 + LAMBDA2 + LAMBDA3 == 1.0.
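As a rough illustration of why the three constants should sum to 1.0, the following Python sketch parses a FLOATS string and combines unigram, bigram and trigram estimates by linear interpolation. The interpolation formula is an assumption based on common practice for HMM taggers, not something this page states, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def parse_nlambdas(floats: str) -> Tuple[float, float, float]:
    """Parse "LAMBDA1,LAMBDA2,LAMBDA3" and check that the values sum to 1."""
    l1, l2, l3 = (float(x) for x in floats.split(","))
    if abs((l1 + l2 + l3) - 1.0) > 1e-9:
        raise ValueError("n-gram lambdas should sum to 1.0")
    return l1, l2, l3


def interpolate(p_uni: float, p_bi: float, p_tri: float,
                lambdas: Sequence[float]) -> float:
    """Assumed linear interpolation of unigram, bigram and trigram estimates."""
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri
```

With weights that sum to 1.0, the interpolated value stays a valid probability whenever the three inputs are.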
--wlambdas=FLOATS
, -WFLOATS
Lexical smoothing constants (default=estimate)
Default: 'NULL'
FLOATS should be a string of the form "LAMBDA_W0,LAMBDA_W1" (without the quotes), where each LAMBDA_W$i is a floating-point constant:
LAMBDA_W0 is the constant minimum lexical probability,
LAMBDA_W1 is the constant smoothing coefficient for lexical probabilities.
If you override the default values, you should choose values such that LAMBDA_W0 + LAMBDA_W1 == 1.0.
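One plausible reading of the two lexical constants is sketched below: the minimum probability is added to the scaled lexical estimate. The exact formula used by libmoot is not given on this page, so treat both the formula and the function name smooth_lexical as assumptions.

```python
def smooth_lexical(p_w_given_t: float, lambda_w0: float, lambda_w1: float) -> float:
    """Assumed smoothing: a constant floor plus a scaled lexical estimate."""
    if abs((lambda_w0 + lambda_w1) - 1.0) > 1e-9:
        raise ValueError("lexical lambdas should sum to 1.0")
    return lambda_w0 + lambda_w1 * p_w_given_t
```

Under this reading, an unseen (word, tag) pair with an estimate of zero still receives probability LAMBDA_W0, and an estimate of 1.0 stays at 1.0. The '--clambdas' constants would follow the same pattern for lexical-class probabilities.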
--clambdas=FLOATS
, -CFLOATS
Lexical-class smoothing constants (default=estimate)
Default: 'NULL'
FLOATS should be a string of the form "LAMBDA_C0,LAMBDA_C1" (without the quotes), where each LAMBDA_C$i is a floating-point constant:
LAMBDA_C0 is the constant minimum lexical-class probability,
LAMBDA_C1 is the constant smoothing coefficient for lexical-class probabilities.
If you override the default values, you should choose values such that LAMBDA_C0 + LAMBDA_C1 == 1.0.
--unknown-threshhold=DOUBLE
, -tDOUBLE
Frequency threshold for 'unknown' lexical probabilities
Default: '1.0'
Lexical probabilities for unknown tokens in the input are estimated from tokens which occur at most DOUBLE times in the model.
--class-threshhold=DOUBLE
, -TDOUBLE
Frequency threshold for 'unknown' class probabilities
Default: '1.0'
Class probabilities for unrecognized tokens in the input are estimated from classes which occur at most DOUBLE times in the model and/or from the empty class.
--unknown-token=NAME
, -uNAME
Symbolic name of the 'unknown' token
Default: '@UNKNOWN'
You can use this value to include lexical frequency information for unknown input tokens in the lexical model file.
--unknown-tag=NAME
, -UNAME
Symbolic name of the 'unknown' tag
Default: 'UNKNOWN'
You should never see or need this tag.
--eos-tag=TAG
, -eTAG
Specify boundary tag (default=__$)
Default: '__$'
This is the pseudo-tag used in the n-gram model file to represent sentence boundaries, both beginning- and end-of-sentence. It should not be an element of the actual tag-set -- that is, it should not be a valid analysis for any token.
--beam-width=DOUBLE
, -ZDOUBLE
Specify cutoff factor for beam pruning
Default: '1000'
During Viterbi search, paths will be ignored if their probabilities are less than (1/DOUBLE) * p_best, where p_best is the probability of the current best path. Setting this option to zero disables beam pruning.
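The pruning rule can be sketched in Python as a filter over the paths that are active at one step of the search. This is only an illustration of the cutoff described above, not the libmoot implementation; the function name prune_beam and the dictionary representation of paths are hypothetical.

```python
from typing import Dict, Hashable


def prune_beam(paths: Dict[Hashable, float], beam_width: float) -> Dict[Hashable, float]:
    """Keep paths whose probability is at least p_best / beam_width."""
    if beam_width == 0 or not paths:
        return dict(paths)  # a width of zero disables pruning
    p_best = max(paths.values())
    cutoff = p_best / beam_width
    return {state: p for state, p in paths.items() if p >= cutoff}
```

A larger width keeps more low-probability paths alive, which costs time but lowers the risk of discarding the path that would eventually have won.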
--save-ambiguities
, -S
Annotate tagged tokens with lexical ambiguities
Default: '0'
Useful for debugging.
--mark-unknown
, -m
Mark unknown tokens.
Default: '0'
Useful for debugging.
--dyn-class=CLASS
, -DCLASS
Specify built-in dynamic estimator (default='Freq')
Default: 'Freq'
Known values for CLASS are:
Freq: Analysis 'costs' are interpreted as pseudo-frequencies f(w,t), and lexical probabilities are instantiated as p(w|t)~=f(w,t)/Z(w,t). See --dyn-invert for details on how Z(w,t) is estimated.
Boltzmann: Analysis 'costs' are interpreted as 'distances' d(w,t), and lexical probabilities are instantiated as a Maxwell-Boltzmann distribution:
f(w,t) ~= BASE ^ (-BETA * d(w,t)) # Maxwell-Boltzmann estimator
p(w|t) ~= f(w,t) / Z(w,t) # ... as for the 'Freq' class
The Maxwell-Boltzmann estimator constants BASE and BETA are given by the --dyn-base and --dyn-beta options.
MIParser: Uses n-gram model data to break input sentences into binary-branching trees. If the --dyn-text-tags flag is given, the n-gram model is assumed to be for token text; otherwise, the n-gram model should be for token tags.
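The Maxwell-Boltzmann pseudo-frequency is easy to check by hand. The Python sketch below evaluates f(w,t) ~= BASE ^ (-BETA * d(w,t)) with the documented defaults BASE=2 and BETA=1; the function name is hypothetical and the sketch covers only this one formula, not the subsequent normalization.

```python
def boltzmann_pseudo_frequency(distance: float, base: float = 2.0, beta: float = 1.0) -> float:
    """Turn an analysis 'distance' d(w,t) into a pseudo-frequency f(w,t)."""
    return base ** (-beta * distance)
```

With the defaults, a distance of 0 gives a pseudo-frequency of 1, and each additional unit of distance halves it. Raising BETA makes the estimator favor low-distance analyses more sharply.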
--dyn-invert=BOOL
, -iBOOL
Estimate p(w|t)~=p(t|w)? (default=1)
Default: '1'
Determines how the normalization factor Z(w,t) is estimated for dynamic lexical probabilities p(w|t)~=f(w,t)/Z(w,t). If true (the default), Z(w,t) := f(w) = Sum_t f(w,t). Otherwise, Z(w,t) := f(t) = Sum_w f(w,t).
Note that a true value here causes a theoretically incorrect estimator to be used, since f(w,t)/f(w) = p(t|w) != p(w|t). Nonetheless, empirical tests have shown the inverted estimator to be more effective in many cases, and it should not be too harmful if the input analyses are a function of input token text.
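The two choices of Z(w,t) can be compared on a toy table of pseudo-frequencies. The Python sketch below is an illustration under simplifying assumptions: all frequencies are positive, the smoothing constant from --dyn-freq-eps is ignored, and the function name and data layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Tuple


def dynamic_lexical_probs(freqs: Dict[Tuple[str, str], float],
                          invert: bool = True) -> Dict[Tuple[str, str], float]:
    """Normalize f(w,t) by f(w) if invert is true, otherwise by f(t)."""
    totals: Dict[str, float] = defaultdict(float)
    for (word, tag), f in freqs.items():
        totals[word if invert else tag] += f  # Z(w,t) = f(w) or f(t)
    return {
        (word, tag): f / totals[word if invert else tag]
        for (word, tag), f in freqs.items()
    }
```

With invert true, the values for each word sum to 1 over its tags, which is p(t|w). With invert false, the values for each tag sum to 1 over the words seen with it, which is p(w|t) restricted to the given table.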
--dyn-base=FLOAT
, -bFLOAT
Base for Maxwell-Boltzmann estimator (default=2)
Default: '2.0'
See the 'Boltzmann' estimator under --dyn-class for details.
--dyn-beta=FLOAT
, -BFLOAT
Temperature coefficient for Maxwell-Boltzmann estimator (default=1)
Default: '1.0'
See the 'Boltzmann' estimator under --dyn-class for details.
--dyn-new-tag=TAG
, -wTAG
Specify pseudo-tag for new analyses (default='@NEW')
Default: '@NEW'
This is the pseudo-tag used in the n-gram model file to represent previously unseen tags (if any).
--dyn-freq-eps=FLOAT
, -EFLOAT
Specify dynamic lexical pseudo-frequency smoothing constant (default=0.1)
Default: '0.1'
--dyn-text-tags
, -x
Use token text field as n-gram source for MIParser
Default: '0'
See the 'MIParser' class under --dyn-class for details.
Configuration files are expected to contain lines of the form:
LONG_OPTION_NAME OPTION_VALUE
where LONG_OPTION_NAME is the long name of some option, without the leading '--', and OPTION_VALUE is the value for that option, if any. Fields are whitespace-separated. Blank lines and comments (lines beginning with '#') are ignored.
The following configuration files are read by default:
/etc/mootdynrc
~/.mootdynrc
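The configuration line format described above can be read with a few lines of Python. This is a sketch of the stated rules only, not the parser generated by Getopt::Gen; the function name parse_rcfile_lines is hypothetical, and tolerating leading whitespace before '#' is an assumption.

```python
from typing import Dict, Iterable


def parse_rcfile_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Map LONG_OPTION_NAME to OPTION_VALUE, skipping blanks and comments."""
    options: Dict[str, str] = {}
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and '#' comments are ignored
        fields = stripped.split(None, 1)  # fields are whitespace-separated
        name = fields[0]
        value = fields[1].strip() if len(fields) > 1 else ""  # flags have no value
        options[name] = value
    return options
```

For example, a file containing the lines "verbose 2" and "mark-unknown" would correspond to the command-line options '--verbose=2' and '--mark-unknown'.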
Documentation file auto-generated by optgen.perl version 0.15 using Getopt::Gen version 0.15. Translation was initiated as:
optgen.perl -l --nocfile --nohfile --notimestamp -F mootdyn mootdyn.gog
Bugs and limitations: none known.
Initial development of this package was supported by the project 'Kollokationen im Wörterbuch' ( "collocations in the dictionary", http://www.bbaw.de/forschung/kollokationen ) in association with the project 'Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS)' ( "digital dictionary of the German language of the 20th century", http://www.dwds.de ) at the Berlin-Brandenburgische Akademie der Wissenschaften ( http://www.bbaw.de ) with funding from the Alexander von Humboldt Stiftung ( http://www.avh.de ) and from the Zukunftsinvestitionsprogramm of the German federal government. Development of the DynHMM and WASTE extensions was supported by the DFG-funded projects 'Deutsches Textarchiv' ( "German text archive", http://www.deutschestextarchiv.de ) and 'DLEX' at the Berlin-Brandenburgische Akademie der Wissenschaften.
The authors are grateful to Christiane Fellbaum, Alexander Geyken, Gerald Neumann, Edmund Pohl, Alexey Sokirko, and others for offering useful insights in the course of development of this package. Thomas Hanneforth wrote and maintains the libFSM C++ library for finite-state device operations used by the class-based HMM tagger / disambiguator, without which moot could not have been built. Alexander Geyken and Thomas Hanneforth developed the rule-based morphological analysis system for German which was used in the development and testing of the class-based HMM tagger / disambiguator.
Bryan Jurish <moocow@cpan.org>
mootfiles, mootpp, mootm(1), mootrain, mootcompile, mootdump, mooteval,