NAME

moottaste - moocow's HMM part-of-speech tagger: heuristic token classifier.

SYNOPSIS

moottaste [OPTIONS] INPUT(s)

 Arguments:
    INPUT(s)  Input files / file-lists.

 Options:
    -h          --help                      Print help and exit.
    -V          --version                   Print version and exit.
    -cFILE      --rcfile=FILE               Read an alternate configuration file.
    -vLEVEL     --verbose=LEVEL             Verbosity level.
    -B          --no-banner                 Suppress initial banner message (implied at verbosity levels <= 2).
    -fFILE      --flavors=FILE              Use flavor heuristics from FILE (default=built-in).
    -FLABEL     --default-flavor=LABEL      Use LABEL as the default flavor (default=empty string or from flavor-file).
    -l          --list                      INPUTs are file-lists, not filenames.
    -oFILE      --output=FILE               Specify output file (default=stdout).

 Format Options:
    -IFORMAT    --input-format=FORMAT       Specify input file(s) format(s).
    -OFORMAT    --output-format=FORMAT      Specify output file format.
                --input-encoding=ENCODING   Override XML document input encoding.
                --output-encoding=ENCODING  Set default XML output encoding.

DESCRIPTION

moocow's HMM part-of-speech tagger: heuristic token classifier.

'moottaste' shows the 'flavors' of its input tokens, as determined by heuristic regular-expression-based rules. Mainly useful for debugging.

It takes as its input one or more 'rare' (-tagged,-analyzed) or 'medium rare' (-tagged,+analyzed) files and produces a 'medium' (+tagged,-analyzed) or 'well done' (+tagged,+analyzed) file, respectively. See mootfiles for details on moot file formats.

ARGUMENTS

INPUT(s)

Input files / file-lists.

Input files should be 'cooked' text files of either the 'rare' (-tagged,-analyzed) or 'medium rare' (-tagged,+analyzed) variety.

See also the '--list' option.

For details on moot file formats, see mootfiles.

OPTIONS

--help , -h

Print help and exit.

Default: '0'

--version , -V

Print version and exit.

Default: '0'

--rcfile=FILE , -cFILE

Read an alternate configuration file.

Default: 'NULL'

See also: "CONFIGURATION FILES".

--verbose=LEVEL , -vLEVEL

Verbosity level.

Default: '3'

Be more or less verbose. Recognized values are in the range 0..6:

0 (silent)

Disable all diagnostic messages.

1 (errors)

Print error messages to stderr.

2 (warnings)

Print warnings to stderr.

3 (info)

Print general diagnostic information to stderr.

4 (progress)

Print progress information to stderr.

5 (debug)

Print debugging information to stderr (if applicable).

6 (trace)

Print execution trace information to stderr (if applicable).

--no-banner , -B

Suppress initial banner message (implied at verbosity levels <= 2).

Default: '0'

--flavors=FILE , -fFILE

Use flavor heuristics from FILE (default=built-in).

Default: 'NULL'

If specified, FILE should be a list of token classification rules, one rule per line, in order of decreasing precedence. Each line is a TAB-separated list whose first field is a symbolic flavor label (conventionally beginning with the character '@'), and whose second field is a regular expression.
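The rule-file layout described above can be illustrated with a minimal Python sketch. It is not moottaste's implementation: the labels and patterns in SAMPLE_RULES are hypothetical, Python's 're' dialect stands in for whatever regular-expression syntax moottaste actually accepts, and matching against the whole token (rather than a substring) is an assumption. The only behavior taken from the description is that each line holds a TAB-separated label and regular expression, and that earlier rules take precedence over later ones.

```python
import re

def load_flavor_rules(lines):
    """Parse 'LABEL<TAB>REGEX' lines into (label, compiled regex) pairs, in file order."""
    rules = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # assumption: blank lines are skipped
        fields = line.split("\t")
        label, pattern = fields[0], fields[1]
        rules.append((label, re.compile(pattern)))
    return rules

def classify(token, rules, default=""):
    """Return the label of the first matching rule, else the default flavor."""
    for label, regex in rules:
        # assumption: a rule must match the entire token
        if regex.fullmatch(token):
            return label
    return default

# Hypothetical rule-file contents, highest precedence first.
SAMPLE_RULES = [
    "@CARD\t[0-9]+\n",
    "@ALPHA\t[A-Za-z]+\n",
    "@ALNUM\t[A-Za-z0-9]+\n",
]

rules = load_flavor_rules(SAMPLE_RULES)
for tok in ["42", "moot", "x86", "!?"]:
    print(tok, repr(classify(tok, rules)))
```

In this sketch the token "42" matches both the @CARD and the @ALNUM patterns; because @CARD is listed first, it wins. A token that matches no rule receives the default flavor, here the empty string.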

--default-flavor=LABEL , -FLABEL

Use LABEL as the default flavor (default=empty string or from flavor-file).

Default: 'NULL'

--list , -l

INPUTs are file-lists, not filenames.

Default: '0'

Useful for large batch-processing jobs.

--output=FILE , -oFILE

Specify output file (default=stdout).

Default: '-'

Format Options

--input-format=FORMAT , -IFORMAT

Specify input file(s) format(s).

Default: 'NULL'

Value should be a comma-separated list of format flag names, optionally prefixed with an exclamation point (!) to indicate negation.

If no format is specified, 'MediumRare' is used.

See 'I/O Format Flags' in mootfiles for details.

--output-format=FORMAT , -OFORMAT

Specify output file format.

Default: 'NULL'

Value should be a comma-separated list of format flag names, optionally prefixed with an exclamation point (!) to indicate negation.

If no format is specified, 'WellDone' is used.

See 'I/O Format Flags' in mootfiles for details.
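As a rough illustration of the value syntax shared by '--input-format' and '--output-format', the following Python sketch splits a comma-separated flag list and separates negated names from plain ones. The names 'FlagA' and 'FlagB' are hypothetical placeholders, not real moot format flags, and stripping whitespace around names is an assumption; see mootfiles for the names that are actually recognized.

```python
def parse_format_flags(value):
    """Split a value like 'A,!B' into (requested flags, negated flags)."""
    requested, negated = set(), set()
    for item in value.split(","):
        item = item.strip()  # assumption: surrounding whitespace is ignored
        if not item:
            continue
        if item.startswith("!"):
            negated.add(item[1:])  # a leading '!' marks negation
        else:
            requested.add(item)
    return requested, negated

# 'MediumRare' appears in this manual; 'FlagA' and 'FlagB' are placeholders.
print(parse_format_flags("MediumRare"))
print(parse_format_flags("FlagA,!FlagB"))
```

The sketch only separates the names; it does not validate them or resolve them to any concrete format.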

--input-encoding=ENCODING

Override XML document input encoding.

Default: 'NULL'

Potentially useful for XML documents without encoding declarations.

--output-encoding=ENCODING

Set default XML output encoding.

Default: 'NULL'

Slower, but potentially useful for human-readable XML output.

CONFIGURATION FILES

Configuration files are expected to contain lines of the form:

    LONG_OPTION_NAME    OPTION_VALUE

where LONG_OPTION_NAME is the long name of some option, without the leading '--', and OPTION_VALUE is the value for that option, if any. Fields are whitespace-separated. Blank lines and comments (lines beginning with '#') are ignored.
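The line format above can be sketched in a few lines of Python. This is an illustration of the described syntax only, not the parser used by moottaste. The option names in SAMPLE are long option names from this manual, the file name 'tokens.out' is hypothetical, and treating an indented '#' as a comment is an assumption.

```python
def parse_rcfile(text):
    """Map each LONG_OPTION_NAME to its OPTION_VALUE (None if no value is given)."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and comment lines are ignored.
        if not line or line.startswith("#"):
            continue
        # Fields are whitespace-separated: the name first, then the optional value.
        parts = line.split(None, 1)
        name = parts[0]
        value = parts[1] if len(parts) > 1 else None
        options[name] = value
    return options

# Hypothetical configuration file contents.
SAMPLE = """\
# sample configuration
verbose 2
no-banner

output tokens.out
"""

print(parse_rcfile(SAMPLE))
```

Options that take no argument, such as 'no-banner', appear here with a value of None.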

The following configuration files are read by default:

ADDENDA

About this Document

Documentation file auto-generated by optgen.perl version 0.15 using Getopt::Gen version 0.15. Translation was initiated as:

   optgen.perl -l --nocfile --nohfile --notimestamp -F moottaste moottaste.gog

BUGS AND LIMITATIONS

None known.

ACKNOWLEDGEMENTS

Initial development of this package was supported by the project 'Kollokationen im Wörterbuch' ( "collocations in the dictionary", http://www.bbaw.de/forschung/kollokationen ) in association with the project 'Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS)' ( "digital dictionary of the German language of the 20th century", http://www.dwds.de ) at the Berlin-Brandenburgische Akademie der Wissenschaften ( http://www.bbaw.de ) with funding from the Alexander von Humboldt Stiftung ( http://www.avh.de ) and from the Zukunftsinvestitionsprogramm of the German federal government. Development of the DynHMM and WASTE extensions was supported by the DFG-funded projects 'Deutsches Textarchiv' ( "German text archive", http://www.deutschestextarchiv.de ) and 'DLEX' at the Berlin-Brandenburgische Akademie der Wissenschaften.

The authors are grateful to Christiane Fellbaum, Alexander Geyken, Gerald Neumann, Edmund Pohl, Alexey Sokirko, and others for offering useful insights in the course of development of this package. Thomas Hanneforth wrote and maintains the libFSM C++ library for finite-state device operations used by the class-based HMM tagger / disambiguator, without which moot could not have been built. Alexander Geyken and Thomas Hanneforth developed the rule-based morphological analysis system for German which was used in the development and testing of the class-based HMM tagger / disambiguator.

AUTHOR

Bryan Jurish <moocow@cpan.org>

SEE ALSO

moot, mootfiles, mootpp, mootm(1), mootrain, mootcompile, mootdump, mooteval