mootrain - moocow's part-of-speech tagger : HMM trainer
mootrain [OPTIONS] INPUT(s)
Arguments:
   INPUT(s)  Tagged input corpus file(s).
Options
   -h          --help                     Print help and exit.
   -V          --version                  Print version and exit.
   -cFILE      --rcfile=FILE              Read an alternate configuration file.
Basic Options
   -vLEVEL     --verbose=LEVEL            Verbosity level.
   -oSTRING    --output=STRING            Specify basename for output files (default=INPUT)
   -IFORMAT    --input-format=FORMAT      Specify input file(s) format(s).
               --input-encoding=ENCODING  Override document encoding for XML input.
Model Format Options
   -l          --lex                      Generate only lexical frequency file.
   -n          --ngrams                   Generate only n-gram frequency file.
   -C          --classes                  Generate only lexical-class frequency file.
   -eTAG       --eos-tag=TAG              Specify boundary tag (default=__$)
   -N          --verbose-ngrams           Generate long-form ngrams (default=no)
moocow's part-of-speech tagger : HMM trainer
'mootrain' gathers training data for the HMM part-of-speech tagger used by the 'moot' program from a tagged training corpus. The training corpus should be in 'medium' (+tagged,-analyzed) or 'well done' (+tagged,+analyzed) format. The output file(s) are text-format raw frequency models.
See the mootfiles manpage for details on moot file formats.
INPUT(s)
Tagged input corpus file(s).
Input files should be 'medium' (+tagged,-analyzed) or 'well done' (+tagged,+analyzed).
See the mootfiles manpage for details on moot file formats.
--help, -h
Print help and exit.
Default: '0'
--version, -V
Print version and exit.
Default: '0'
--rcfile=FILE, -cFILE
Read an alternate configuration file.
Default: 'NULL'
See also: CONFIGURATION FILES.
--verbose=LEVEL, -vLEVEL
Verbosity level.
Default: '2'
Be more or less verbose. Recognized values are in the range 0..3.
--output=STRING, -oSTRING
Specify basename for output files (default=INPUT)
Default: 'NULL'
--input-format=FORMAT, -IFORMAT
Specify input file(s) format(s).
Default: 'NULL'
Value should be a comma-separated list of format flag names, optionally prefixed with an exclamation point (!) to indicate negation.
Default format flags: 'WellDone'
Implied format flags: 'Tagged'
See 'I/O Format Flags' in the mootfiles manpage for details.
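The value syntax described above (comma-separated flag names, each optionally negated with '!') can be sketched as follows. This is only an illustration of the syntax, not mootrain's own parser: the function name parse_format_flags is hypothetical, and the flag name 'Analyzed' is assumed for the example; see the mootfiles manpage for the real flag names.

```python
def parse_format_flags(value):
    """Split a comma-separated flag list into (name, enabled) pairs.

    A leading '!' marks a negated flag. Flag names are not validated here.
    """
    flags = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty items such as a trailing comma
        if item.startswith("!"):
            flags.append((item[1:], False))  # '!' prefix negates the flag
        else:
            flags.append((item, True))
    return flags


# 'Analyzed' is an assumed flag name used only for illustration.
print(parse_format_flags("Tagged,!Analyzed"))
# prints [('Tagged', True), ('Analyzed', False)]
```

On the command line the exclamation point usually needs quoting to keep the shell from interpreting it.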
--input-encoding=ENCODING
Override document encoding for XML input.
Default: 'NULL'
Potentially useful for XML documents without encoding declarations.
--lex, -l
Generate only lexical frequency file.
Default: '0'
--ngrams, -n
Generate only n-gram frequency file.
Default: '0'
--classes, -C
Generate only lexical-class frequency file.
Default: '0'
--eos-tag=TAG, -eTAG
Specify boundary tag (default=__$)
Default: '__$'
This is the pseudo-tag used in the n-gram model file to represent sentence boundaries, both beginning- and end-of-sentence. It should not be an element of the actual tag-set -- that is, it should not be a valid analysis for any token.
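To illustrate why the boundary tag must stay outside the tag-set, here is a minimal sketch that counts tag bigrams with a boundary pseudo-tag at both ends of each sentence. It is not mootrain's counting code and does not produce mootrain's file format: the function name count_tag_bigrams, the tags DET, NN and VB, and the choice of a single boundary tag per sentence edge are all assumptions made for the example.

```python
from collections import Counter

EOS = "__$"  # default boundary pseudo-tag, as given for --eos-tag


def count_tag_bigrams(sentences, eos=EOS):
    """Count tag bigrams, padding each sentence with the boundary tag.

    `sentences` is an iterable of tag sequences (hypothetical input shape).
    """
    counts = Counter()
    for tags in sentences:
        if eos in tags:
            # A real tag equal to the boundary tag would be indistinguishable
            # from a sentence boundary in the counts.
            raise ValueError(f"boundary tag {eos!r} occurs in the tag-set")
        padded = [eos] + list(tags) + [eos]
        counts.update(zip(padded, padded[1:]))
    return counts


counts = count_tag_bigrams([["DET", "NN", "VB"], ["NN", "VB"]])
print(counts[("__$", "DET")], counts[("NN", "VB")], counts[("VB", "__$")])
# prints 1 2 2
```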
--verbose-ngrams, -N
Generate long-form ngrams (default=no)
Default: '0'
Configuration files are expected to contain lines of the form:
LONG_OPTION_NAME OPTION_VALUE
where LONG_OPTION_NAME is the long name of some option, without the leading '--', and OPTION_VALUE is the value for that option, if any. Fields are whitespace-separated. Blank lines and comments (lines beginning with '#') are ignored.
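The line format just described can be read with a few lines of code. The sketch below is an illustration under stated assumptions, not the parser mootrain actually uses: the function name parse_rcfile and the sample file contents are hypothetical, and leading whitespace before '#' is tolerated here although the text above only promises that lines beginning with '#' are ignored.

```python
def parse_rcfile(text):
    """Parse 'LONG_OPTION_NAME OPTION_VALUE' lines into a dict.

    Options without a value (flags) map to None.
    """
    options = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        fields = stripped.split(None, 1)  # split on the first whitespace run
        name = fields[0]
        value = fields[1] if len(fields) > 1 else None
        options[name] = value
    return options


# Hypothetical configuration using long option names from this manpage.
sample = """\
# hypothetical mootrain configuration
verbose 1
output mymodel

lex
"""
print(parse_rcfile(sample))
# prints {'verbose': '1', 'output': 'mymodel', 'lex': None}
```

A file like the sample could then be passed with --rcfile=FILE.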
The following configuration files are read by default:
Documentation file auto-generated by optgen.perl version 0.06 using Getopt::Gen version 0.13. Translation was initiated as:
optgen.perl -l --nocfile --nohfile --notimestamp -F mootrain mootrain.gog
Only ca. 99.998% compatible with tnt-para(1), due to token-typification strangeness.
Development of this package was supported by the project 'Kollokationen im Wörterbuch' ( "collocations in the dictionary", http://www.bbaw.de/forschung/kollokationen ) in association with the project 'Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS)' ( "digital dictionary of the German language of the 20th century", http://www.dwds.de ) at the Berlin-Brandenburgische Akademie der Wissenschaften ( http://www.bbaw.de ) with funding from the Alexander von Humboldt Stiftung ( http://www.avh.de ) and from the Zukunftsinvestitionsprogramm of the German federal government.
I am grateful to Christiane Fellbaum, Alexander Geyken, Gerald Neumann, Edmund Pohl, Alexey Sokirko, and others for offering useful insights in the course of development of this package.
Thomas Hanneforth wrote and maintains the libFSM C++ library for finite-state device operations used by the class-based HMM tagger / disambiguator, without which this package could not have been built.
Alexander Geyken and Thomas Hanneforth developed the rule-based morphological analysis system for German which was used in the development and testing of the class-based HMM tagger / disambiguator.
Bryan Jurish <moocow@ling.uni-potsdam.de>
the mootfiles manpage, mootm(1), the mootcompile manpage, the moot manpage