NAME

mootpp - Rudimentary tokenizer for moocow's part-of-speech tagger.

SYNOPSIS

mootpp [OPTIONS] FILE(s)

 Arguments:
    FILE(s)  Input files

 Options:
    -h        --help                  Print help and exit.
    -V        --version               Print version and exit.
    -cFILE    --rcfile=FILE           Read an alternate configuration file.
    -vLEVEL   --verbose=LEVEL         Verbosity level.
    -B        --no-banner             Suppress initial banner message (implied at verbosity levels <= 2).
    -oFILE    --output=FILE           Write output to FILE.
    -l        --list                  Arguments are input-file lists.
    -r        --recover               Attempt to recover from minor errors.
    -OFORMAT  --output-format=FORMAT  Specify output file format.

DESCRIPTION

mootpp is a rudimentary pre-processor for raw text, intended for use with the 'moot' part-of-speech tagging tools. It takes one or more 'raw' files as input and produces a 'rare' output file. Most SGML markup in the input should be eliminated in the process. See mootfiles for details on moot file formats.

ARGUMENTS

FILE(s)

Input files

See also the --list option.

OPTIONS

--help , -h

Print help and exit.

Default: '0'

--version , -V

Print version and exit.

Default: '0'

--rcfile=FILE , -cFILE

Read an alternate configuration file.

Default: 'NULL'

See also: "CONFIGURATION FILES".

--verbose=LEVEL , -vLEVEL

Verbosity level.

Default: '3'

Be more or less verbose. Recognized values are in the range 0..6:

0 (silent)

Disable all diagnostic messages.

1 (errors)

Print error messages to stderr.

2 (warnings)

Print warnings to stderr.

3 (info)

Print general diagnostic information to stderr.

4 (progress)

Print progress information to stderr.

5 (debug)

Print debugging information to stderr (if applicable).

6 (trace)

Print execution trace information to stderr (if applicable).

--no-banner , -B

Suppress initial banner message (implied at verbosity levels <= 2).

Default: '0'

--output=FILE , -oFILE

Write output to FILE.

Default: '-'

Output files are in 'rare' format: one token per line, with a blank line indicating a sentence boundary.
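As an illustration of the 'rare' layout described above, the following Python sketch groups lines into sentences. It is not part of moot: the function name read_rare and the sample text are assumptions made for this example, and the sketch ignores comments or any other conventions that mootfiles may define.

```python
from typing import Iterable, Iterator, List


def read_rare(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of tokens per sentence from 'rare'-format lines.

    Hypothetical helper, not part of moot.
    """
    sentence: List[str] = []
    for line in lines:
        token = line.rstrip("\r\n")
        if token.strip() == "":
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
        else:
            sentence.append(token)
    # Emit a final sentence that is not followed by a blank line.
    if sentence:
        yield sentence


# Made-up sample: two sentences separated by one blank line.
sample = "Hello\nworld\n.\n\nSecond\nsentence\n"
sentences = list(read_rare(sample.splitlines(keepends=True)))
print(sentences)
```

Consecutive blank lines are treated here as a single boundary, which is a choice of this sketch rather than documented behavior.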

--list , -l

Arguments are input-file lists.

Default: '0'

If this flag is given, each of the FILE(s) arguments is treated as a list of input filenames, one filename per line, and every file named in those lists is processed. Otherwise, the FILE(s) arguments are interpreted as the names of the input files themselves.
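The effect of --list can be pictured with a short Python sketch that expands list files into the filenames they contain. The helper expand_file_lists and the filenames are hypothetical, and skipping blank lines is an assumption, since the text above does not say how mootpp treats them.

```python
import os
import tempfile
from typing import Iterable, List


def expand_file_lists(list_paths: Iterable[str]) -> List[str]:
    """Return the input filenames named in each list file, in order.

    Hypothetical helper, not part of moot.
    """
    filenames: List[str] = []
    for list_path in list_paths:
        with open(list_path, encoding="utf-8") as handle:
            for line in handle:
                name = line.strip()
                # Assumption: blank lines are skipped.
                if name:
                    filenames.append(name)
    return filenames


# Demonstration with a temporary list file; the filenames are made up.
with tempfile.TemporaryDirectory() as tmpdir:
    list_path = os.path.join(tmpdir, "inputs.list")
    with open(list_path, "w", encoding="utf-8") as handle:
        handle.write("chapter1.txt\nchapter2.txt\n")
    expanded = expand_file_lists([list_path])

print(expanded)
```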

--recover , -r

Attempt to recover from minor errors.

Default: '0'

If this option is specified, minor errors such as missing files cause an error message to be emitted but do not cause the program to abort. This is useful for large automated batch-processing jobs.

--output-format=FORMAT , -OFORMAT

Specify output file format.

Default: 'NULL'

Value should be a comma-separated list of format flag names, optionally prefixed with an exclamation point (!) to indicate negation.

If no format is specified, 'Rare' is used.

See 'I/O Format Flags' in mootfiles for details.
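To make the FORMAT syntax concrete, here is a minimal Python sketch that splits such a value into flag names and their on/off state. It only illustrates the comma-and-exclamation-point syntax described above: parse_format_flags is a hypothetical helper, 'SomeFlag' is a placeholder rather than a real flag name, and the valid names are those listed in mootfiles.

```python
from typing import Dict


def parse_format_flags(spec: str) -> Dict[str, bool]:
    """Map each flag name in a comma-separated list to True or False.

    A leading '!' marks the flag as negated. Hypothetical helper,
    not part of moot; flag names are not validated here.
    """
    flags: Dict[str, bool] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # Assumption: empty list items are ignored.
        if item.startswith("!"):
            flags[item[1:]] = False
        else:
            flags[item] = True
    return flags


# 'SomeFlag' is a placeholder, not a documented moot format flag.
parsed = parse_format_flags("Rare,!SomeFlag")
print(parsed)
```

When passing such a value on a command line, the exclamation point may need quoting, since some shells treat it specially.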

CONFIGURATION FILES

Configuration files are expected to contain lines of the form:

    LONG_OPTION_NAME    OPTION_VALUE

where LONG_OPTION_NAME is the long name of some option, without the leading '--', and OPTION_VALUE is the value for that option, if any. Fields are whitespace-separated. Blank lines and comments (lines beginning with '#') are ignored.

No configuration files are read by default.
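The configuration file syntax described above can be sketched in a few lines of Python. This is an illustration only, not the parser mootpp uses: parse_rcfile is a hypothetical name, the sample values are made up, and treating indented '#' lines as comments is an assumption.

```python
from typing import Dict, Iterable, Optional


def parse_rcfile(lines: Iterable[str]) -> Dict[str, Optional[str]]:
    """Parse 'LONG_OPTION_NAME OPTION_VALUE' lines into a dictionary.

    Options given without a value map to None. Hypothetical helper,
    not part of moot.
    """
    options: Dict[str, Optional[str]] = {}
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comments.
        # Assumption: a '#' preceded only by whitespace also starts a comment.
        if not line or line.startswith("#"):
            continue
        # Split on the first run of whitespace: name, then optional value.
        fields = line.split(None, 1)
        name = fields[0]
        value = fields[1] if len(fields) > 1 else None
        options[name] = value
    return options


# Made-up sample using long option names documented above.
sample = """\
# hypothetical configuration
verbose   2
output    out.rare

recover
"""
config = parse_rcfile(sample.splitlines())
print(config)
```

A file like the sample would be passed to mootpp with --rcfile=FILE, since no configuration file is read otherwise.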

ADDENDA

Caveats

When writing in XML format, you should first ensure that your input data is properly encoded in UTF-8.
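One way to check the encoding beforehand is to attempt a strict UTF-8 decode of the raw bytes, as in the Python sketch below. The helper is_valid_utf8 is an assumption made for this example and is not part of moot.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same word encoded two ways: only the first is valid UTF-8.
good = "Wörterbuch".encode("utf-8")
bad = "Wörterbuch".encode("latin-1")
print(is_valid_utf8(good), is_valid_utf8(bad))
```

For a file on disk, the bytes can be read with open(path, "rb").read() before calling the check.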

About this Document

Documentation file auto-generated by optgen.perl version 0.15 using Getopt::Gen version 0.15. Translation was initiated as:

   optgen.perl -l --nocfile --nohfile --notimestamp -F mootpp mootpp.gog

BUGS AND LIMITATIONS

Unknown.

ACKNOWLEDGEMENTS

Initial development of this package was supported by the project 'Kollokationen im Wörterbuch' ( "collocations in the dictionary", http://www.bbaw.de/forschung/kollokationen ) in association with the project 'Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS)' ( "digital dictionary of the German language of the 20th century", http://www.dwds.de ) at the Berlin-Brandenburgische Akademie der Wissenschaften ( http://www.bbaw.de ) with funding from the Alexander von Humboldt Stiftung ( http://www.avh.de ) and from the Zukunftsinvestitionsprogramm of the German federal government. Development of the DynHMM and WASTE extensions was supported by the DFG-funded projects 'Deutsches Textarchiv' ( "German text archive", http://www.deutschestextarchiv.de ) and 'DLEX' at the Berlin-Brandenburgische Akademie der Wissenschaften.

The authors are grateful to Christiane Fellbaum, Alexander Geyken, Gerald Neumann, Edmund Pohl, Alexey Sokirko, and others for offering useful insights in the course of development of this package. Thomas Hanneforth wrote and maintains the libFSM C++ library for finite-state device operations used by the class-based HMM tagger / disambiguator, without which moot could not have been built. Alexander Geyken and Thomas Hanneforth developed the rule-based morphological analysis system for German which was used in the development and testing of the class-based HMM tagger / disambiguator.

AUTHOR

Bryan Jurish <moocow@cpan.org>

SEE ALSO

mootfiles, mootm(1), moot, mootchurn