NAME

mootrain - moocow's part-of-speech tagger : HMM trainer

SYNOPSIS

mootrain [OPTIONS] INPUT(s)

 Arguments:
    INPUT(s)  Tagged input corpus file(s).

 Options
    -h          --help                       Print help and exit.
    -V          --version                    Print version and exit.
    -cFILE      --rcfile=FILE                Read an alternate configuration file.
    -vLEVEL     --verbose=LEVEL              Verbosity level.
    -B          --no-banner                  Suppress initial banner message (implied at verbosity levels <= 1)
    -oSTRING    --output=STRING              Specify basename for output files (default=INPUT)
    -IFORMAT    --input-format=FORMAT        Specify input file(s) format(s).
                --input-encoding=ENCODING    Override document encoding for XML input.

 Model Format Options
    -l          --lex                        Generate only lexical frequency file.
    -n          --ngrams                     Generate only n-gram frequency file.
    -C          --classes                    Generate only lexical-class frequency file.
    -F          --flavors                    Generate only flavor heuristic file.
    -eTAG       --eos-tag=TAG                Specify boundary tag (default=__$)
    -N          --verbose-ngrams             Generate long-form ngrams (default=no)
    -fFILE      --flavors-from=FILE          Use flavor heuristics from FILE (default=built-in).
    -tDOUBLE    --unknown-threshhold=DOUBLE  Freq. threshold for 'unknown' lexical probabilities

DESCRIPTION

moocow's part-of-speech tagger : HMM trainer

'mootrain' gathers training data for the HMM part-of-speech tagger used by the 'moot' program from a tagged training corpus. The training corpus should be in 'medium' (+tagged,-analyzed) or 'well done' (+tagged,+analyzed) format. The output file(s) are text-format raw frequency models.

See mootfiles for details on moot file formats.
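
As an illustration of how the options listed in the synopsis combine, the following Python sketch assembles an argument vector for 'mootrain' without running it. Only long options documented on this page are used. The helper name build_mootrain_argv and the file names 'corpus.tt' and 'model' are hypothetical and are not part of moot.

```python
import shlex


def build_mootrain_argv(inputs, output=None, verbose=None, eos_tag=None, only=None):
    """Assemble a mootrain argument vector from documented long options.

    Hypothetical helper for illustration; it does not execute anything.
    """
    # The four documented "generate only ..." switches.
    only_flags = {
        "lex": "--lex",
        "ngrams": "--ngrams",
        "classes": "--classes",
        "flavors": "--flavors",
    }
    argv = ["mootrain"]
    if verbose is not None:
        argv.append(f"--verbose={verbose}")
    if output is not None:
        argv.append(f"--output={output}")
    if eos_tag is not None:
        argv.append(f"--eos-tag={eos_tag}")
    if only is not None:
        argv.append(only_flags[only])
    # Input corpus files come last, as in the synopsis.
    argv.extend(inputs)
    return argv


argv = build_mootrain_argv(["corpus.tt"], output="model", verbose=2, only="ngrams")
print(shlex.join(argv))
# prints: mootrain --verbose=2 --output=model --ngrams corpus.tt
```

The resulting list could be handed to a process launcher of your choice; whether the run succeeds depends on the local moot installation and on the corpus format.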

ARGUMENTS

INPUT(s)

Tagged input corpus file(s).

Input files should be 'medium' (+tagged,-analyzed) or 'well done' (+tagged,+analyzed).

See mootfiles for details on moot file formats.

OPTIONS

--help , -h

Print help and exit.

Default: '0'

--version , -V

Print version and exit.

Default: '0'

--rcfile=FILE , -cFILE

Read an alternate configuration file.

Default: 'NULL'

See also: "CONFIGURATION FILES".

--verbose=LEVEL , -vLEVEL

Verbosity level.

Default: '3'

Be more or less verbose. Recognized values are in the range 0..6:

0 (silent)

Disable all diagnostic messages.

1 (errors)

Print error messages to stderr.

2 (warnings)

Print warnings to stderr.

3 (info)

Print general diagnostic information to stderr.

4 (progress)

Print progress information to stderr.

5 (debug)

Print debugging information to stderr (if applicable).

6 (trace)

Print execution trace information to stderr (if applicable).
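
The level table above can be summarized in a small Python sketch. It assumes the levels are cumulative, that is, that each level also enables the message classes of all lower levels; this page does not state that explicitly. The names VERBOSITY_LEVELS and enabled_messages are hypothetical.

```python
# Level numbers and names as listed for --verbose.
VERBOSITY_LEVELS = {
    0: "silent",
    1: "errors",
    2: "warnings",
    3: "info",
    4: "progress",
    5: "debug",
    6: "trace",
}


def enabled_messages(level):
    """Return the message classes printed at `level`.

    Assumption: levels are cumulative, so level 3 also prints
    errors and warnings. Level 0 prints nothing.
    """
    if level not in VERBOSITY_LEVELS:
        raise ValueError(f"verbosity level out of range 0..6: {level}")
    return [name for n, name in sorted(VERBOSITY_LEVELS.items()) if 0 < n <= level]


print(enabled_messages(3))
```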

--no-banner , -B

Suppress initial banner message (implied at verbosity levels <= 1)

Default: '0'

--output=STRING , -oSTRING

Specify basename for output files (default=INPUT)

Default: 'NULL'

--input-format=FORMAT , -IFORMAT

Specify input file(s) format(s).

Default: 'NULL'

Value should be a comma-separated list of format flag names, optionally prefixed with an exclamation point (!) to indicate negation.

If no format is specified, 'WellDone' is used. The 'Tagged' flag is always implied.

See 'I/O Format Flags' in mootfiles for details.
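
The syntax of the FORMAT value (comma-separated flag names, each optionally negated with '!') can be sketched in Python as follows. This is only an illustration of the syntax, not moot's own parser. The function name parse_format_flags is hypothetical, and the flag name 'Analyzed' in the example is an assumption made by analogy with the (+tagged,-analyzed) notation; consult mootfiles for the actual flag names.

```python
def parse_format_flags(spec):
    """Split a comma-separated flag list into (name, enabled) pairs.

    A leading '!' marks a negated flag. Empty items are skipped.
    """
    flags = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if item.startswith("!"):
            flags.append((item[1:], False))
        else:
            flags.append((item, True))
    return flags


# 'Analyzed' is an assumed flag name used only for illustration.
print(parse_format_flags("Tagged,!Analyzed"))
# prints: [('Tagged', True), ('Analyzed', False)]
```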

--input-encoding=ENCODING

Override document encoding for XML input.

Default: 'NULL'

Potentially useful for XML documents without encoding declarations.

Model Format Options

--lex , -l

Generate only lexical frequency file.

Default: '0'

--ngrams , -n

Generate only n-gram frequency file.

Default: '0'

--classes , -C

Generate only lexical-class frequency file.

Default: '0'

--flavors , -F

Generate only flavor heuristic file.

Default: '0'

--eos-tag=TAG , -eTAG

Specify boundary tag (default=__$)

Default: '__$'

This is the pseudo-tag used in the n-gram model file to represent sentence boundaries, both beginning- and end-of-sentence. It should not be an element of the actual tag-set -- that is, it should not be a valid analysis for any token.
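
To illustrate the role of the boundary pseudo-tag, the following Python sketch counts tag n-grams over a toy corpus, padding each sentence with the boundary tag at both ends. It is a simplified illustration and not mootrain's implementation: padding with a single boundary tag per side and counting up to trigrams are assumptions, and the tags 'DET', 'NN' and 'VB' are hypothetical.

```python
from collections import Counter


def count_ngrams(sentences, eos_tag="__$", max_n=3):
    """Count tag n-grams of length 1..max_n.

    Each sentence (a list of tags) is padded with `eos_tag` at the
    beginning and at the end, so the same pseudo-tag marks both
    boundaries. Assumption: one boundary tag per side.
    """
    counts = Counter()
    for tags in sentences:
        padded = [eos_tag] + list(tags) + [eos_tag]
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[tuple(padded[i:i + n])] += 1
    return counts


counts = count_ngrams([["DET", "NN", "VB"], ["NN", "VB"]])
print(counts[("VB", "__$")])   # sentence-final VB, prints 2
print(counts[("__$", "DET")])  # sentence-initial DET, prints 1
```

Because the boundary tag takes part in the counts like any other tag, a real tag with the same name would be indistinguishable from a sentence boundary, which is why it must not belong to the actual tag-set.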

--verbose-ngrams , -N

Generate long-form ngrams (default=no)

Default: '0'

--flavors-from=FILE , -fFILE

Use flavor heuristics from FILE (default=built-in).

Default: 'NULL'

If specified, FILE should be a flavor definition file containing a list of regular-expression based token classification rules to be used in computing special entries for the lexical frequency file. See mootfiles(5) for a full specification of the moot flavor definition file format. If unspecified, the default behavior is to use a built-in set of classification heuristics. If FILE is an empty string, no flavor heuristics at all will be applied.
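
The idea of regular-expression based token classification can be sketched in Python as below. The rules are held in a plain Python list rather than in moot's flavor definition file format, which is specified in mootfiles(5) and not reproduced here. The flavor names 'NUMERIC' and 'ALPHA', the patterns, whole-token matching, and first-match-wins ordering are all assumptions made for illustration.

```python
import re

# Hypothetical rules: (flavor name, pattern), tried in order.
HYPOTHETICAL_RULES = [
    ("NUMERIC", re.compile(r"[0-9]+")),
    ("ALPHA", re.compile(r"[A-Za-z]+")),
]


def classify(token, rules):
    """Return the name of the first rule whose pattern matches the
    whole token, or None if no rule matches."""
    for name, pattern in rules:
        if pattern.fullmatch(token):
            return name
    return None


print(classify("42", HYPOTHETICAL_RULES))    # prints NUMERIC
print(classify("moot", HYPOTHETICAL_RULES))  # prints ALPHA
print(classify("42", []))                    # no rules at all, prints None
```

An empty rule list corresponds loosely to the case where no flavor heuristics are applied: every token is left unclassified.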

--unknown-threshhold=DOUBLE , -tDOUBLE

Frequency threshold for 'unknown' lexical probabilities.

Default: '1.0'

Setting this option to a non-zero value will cause a special @UNKNOWN entry to be added to the lexical frequency model file. Note that such an entry will be overridden during model compilation if you specify a non-zero unknown lexical threshold to moot(1) or mootcompile(1).
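
One plausible reading of the threshold is sketched below in Python: tokens whose total frequency does not exceed the threshold contribute their tag counts to the special unknown entry. This selection rule is an assumption, not a statement of what mootrain computes, and the function name add_unknown_entry and the toy lexicon are hypothetical.

```python
from collections import Counter


def add_unknown_entry(lexicon, threshold=1.0, unknown_key="@UNKNOWN"):
    """Return a copy of `lexicon` (token -> Counter of tag frequencies)
    with an added unknown-word entry.

    Assumption: a token contributes to the unknown entry when its total
    frequency is <= threshold. A threshold of zero adds no entry.
    """
    result = {token: Counter(tags) for token, tags in lexicon.items()}
    if threshold <= 0:
        return result
    unknown = Counter()
    for token, tags in lexicon.items():
        if sum(tags.values()) <= threshold:
            unknown.update(tags)
    result[unknown_key] = unknown
    return result


lexicon = {
    "the": Counter({"DET": 5}),
    "zyx": Counter({"NN": 1}),
    "qq": Counter({"VB": 1}),
}
print(add_unknown_entry(lexicon)["@UNKNOWN"])
```

With the default threshold of 1.0, only the two tokens seen once feed the unknown entry in this toy example; the frequent token 'the' does not.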

CONFIGURATION FILES

Configuration files are expected to contain lines of the form:

    LONG_OPTION_NAME    OPTION_VALUE

where LONG_OPTION_NAME is the long name of some option, without the leading '--', and OPTION_VALUE is the value for that option, if any. Fields are whitespace-separated. Blank lines and comments (lines beginning with '#') are ignored.
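
The line format described above can be read with a few lines of Python. The sketch below is an illustration, not moot's own parser. It assumes that leading whitespace before '#' or before an option name is tolerated and that everything after the first run of whitespace is the option value; the function name parse_rcfile and the sample values are hypothetical.

```python
def parse_rcfile(text):
    """Parse configuration text into a dict of option name -> value.

    Blank lines and lines beginning with '#' are ignored. Options
    given without a value are stored with the value None.
    """
    options = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        # Split on the first run of whitespace only.
        parts = stripped.split(None, 1)
        name = parts[0]
        value = parts[1] if len(parts) > 1 else None
        options[name] = value
    return options


sample = """\
# hypothetical mootrain configuration
verbose     2
output      model

no-banner
"""
print(parse_rcfile(sample))
# prints: {'verbose': '2', 'output': 'model', 'no-banner': None}
```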

The following configuration files are read by default:

ADDENDA

About this Document

Documentation file auto-generated by optgen.perl version 0.15 using Getopt::Gen version 0.15. Translation was initiated as:

   optgen.perl -l --nocfile --nohfile --notimestamp -F mootrain mootrain.gog

BUGS AND LIMITATIONS

Only ca. 99.998% compatible with tnt-para(1), due to token-typification strangeness.

ACKNOWLEDGEMENTS

Initial development of this package was supported by the project 'Kollokationen im Wörterbuch' ( "collocations in the dictionary", http://www.bbaw.de/forschung/kollokationen ) in association with the project 'Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS)' ( "digital dictionary of the German language of the 20th century", http://www.dwds.de ) at the Berlin-Brandenburgische Akademie der Wissenschaften ( http://www.bbaw.de ) with funding from the Alexander von Humboldt Stiftung ( http://www.avh.de ) and from the Zukunftsinvestitionsprogramm of the German federal government. Development of the DynHMM and WASTE extensions was supported by the DFG-funded projects 'Deutsches Textarchiv' ( "German text archive", http://www.deutschestextarchiv.de ) and 'DLEX' at the Berlin-Brandenburgische Akademie der Wissenschaften.

The authors are grateful to Christiane Fellbaum, Alexander Geyken, Gerald Neumann, Edmund Pohl, Alexey Sokirko, and others for offering useful insights in the course of development of this package. Thomas Hanneforth wrote and maintains the libFSM C++ library for finite-state device operations used by the class-based HMM tagger / disambiguator, without which moot could not have been built. Alexander Geyken and Thomas Hanneforth developed the rule-based morphological analysis system for German which was used in the development and testing of the class-based HMM tagger / disambiguator.

AUTHOR

Bryan Jurish <moocow@cpan.org>

SEE ALSO

mootfiles(5), mootm(1), mootcompile(1), moot(1)