mootpp - Tokenizer for moocow's part-of-speech tagger.
mootpp [OPTIONS] FILE(s)
Arguments: FILE(s) Input files
Options:
   -h        --help                   Print help and exit.
   -V        --version                Print version and exit.
   -cFILE    --rcfile=FILE            Read an alternate configuration file.
   -vLEVEL   --verbose=LEVEL          Verbosity level.
   -oFILE    --output=FILE            Write output to FILE.
   -l        --list                   Arguments are input-file lists.
   -OFORMAT  --output-format=FORMAT   Specify output file format.
Tokenizer for moocow's part-of-speech tagger.
mootpp is a rudimentary pre-processor for raw text intended for use with the 'moot' part-of-speech tagging tools. It takes as its input one or more 'raw' files, and produces a 'rare' output file. Most SGML markup should be eliminated by mootpp. See the mootfiles manpage for details on moot file formats.
FILE(s)
See also the --list option.
--help, -h
Default: '0'
--version, -V
Default: '0'
--rcfile=FILE, -cFILE
Default: 'NULL'
See also: CONFIGURATION FILES.
--verbose=LEVEL, -vLEVEL
Default: '1'
Range: 0..1
--output=FILE, -oFILE
Default: '-'
Output files are in 'rare' format: one token per line, a blank line indicates a sentence boundary.
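The 'rare' layout described above is simple enough to read back with a few lines of code. The following is a minimal sketch, not part of mootpp: the function name read_rare_sentences is hypothetical, and the sketch assumes that every non-blank line is a token, ignoring any further conventions that the mootfiles manpage may define.

```python
from typing import Iterable, Iterator, List


def read_rare_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each sentence as a list of tokens (hypothetical helper)."""
    sentence: List[str] = []
    for line in lines:
        token = line.rstrip("\r\n")
        if token.strip() == "":
            # A blank line marks a sentence boundary.
            if sentence:
                yield sentence
                sentence = []
        else:
            # Assumption: any non-blank line is exactly one token.
            sentence.append(token)
    # Emit a trailing sentence that is not followed by a blank line.
    if sentence:
        yield sentence


sample = "This\nis\na\ntest\n.\n\nSecond\nsentence\n.\n"
sentences = list(read_rare_sentences(sample.splitlines()))
```

To read a file written with --output=FILE, the open file handle can be passed in place of sample.splitlines().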
--list, -l
Default: '0'
If this flag is given, the FILE(s) arguments should be lists of input filenames to be processed, one filename per line. Otherwise, the FILE(s) arguments are interpreted as the names of the input files themselves.
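The two interpretations of FILE(s) can be illustrated with a short sketch. It is not mootpp's own code: the names resolve_inputs and read_lines_from_disk are hypothetical, and skipping blank lines in a list file is an assumption, since the text above only states "one filename per line".

```python
from typing import Callable, List, Sequence


def read_lines_from_disk(path: str) -> List[str]:
    """Read a list file from disk and return its lines."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()


def resolve_inputs(
    args: Sequence[str],
    list_mode: bool,
    read_lines: Callable[[str], List[str]] = read_lines_from_disk,
) -> List[str]:
    """Return the input filenames implied by the arguments (hypothetical helper)."""
    if not list_mode:
        # Without --list, the arguments are the input files themselves.
        return list(args)
    inputs: List[str] = []
    for list_file in args:
        for line in read_lines(list_file):
            name = line.strip()
            if name:  # Assumption: blank lines in a list file are skipped.
                inputs.append(name)
    return inputs


# In-memory stand-in for a list file, so the example needs no files on disk.
fake_files = {"corpus.lst": ["one.txt", "", "two.txt"]}
expanded = resolve_inputs(["corpus.lst"], True, fake_files.__getitem__)
literal = resolve_inputs(["corpus.lst"], False)
```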
--output-format=FORMAT, -OFORMAT
Default: 'NULL'
Value should be a comma-separated list of format flag names, optionally prefixed with an exclamation point (!) to indicate negation.
Effective default when no FORMAT is given: 'Rare'
See 'I/O Format Flags' in the mootfiles manpage for details.
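The FORMAT syntax described above (comma-separated names, with '!' marking negation) can be split apart as in the sketch below. This is only an illustration of the syntax, not mootpp's parser: parse_format_flags is a hypothetical name, and 'Foo' is a made-up flag used purely as an example. The valid flag names are those listed in the mootfiles manpage.

```python
from typing import List, Tuple


def parse_format_flags(spec: str) -> List[Tuple[str, bool]]:
    """Split a FORMAT string into (flag_name, enabled) pairs (hypothetical helper)."""
    flags: List[Tuple[str, bool]] = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # Assumption: empty list items are ignored.
        # A leading '!' negates the flag.
        negated = item.startswith("!")
        name = item[1:] if negated else item
        flags.append((name, not negated))
    return flags


# 'Foo' is a hypothetical flag name, used only to show negation.
flags = parse_format_flags("Rare,!Foo")
```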
Configuration files are expected to contain lines of the form:
LONG_OPTION_NAME OPTION_VALUE
where LONG_OPTION_NAME is the long name of some option, without the leading '--', and OPTION_VALUE is the value for that option, if any. Fields are whitespace-separated. Blank lines and comments (lines beginning with '#') are ignored.
No configuration files are read by default.
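As an illustration of the configuration file syntax described above, the following sketch parses such a file into (name, value) pairs. It is not the parser used by mootpp: parse_rcfile is a hypothetical name, and treating an indented '#' as the start of a comment is an assumption. The sample uses the long option names verbose, output, and list from this manpage; out.t is an arbitrary example filename.

```python
from typing import List, Optional, Tuple


def parse_rcfile(text: str) -> List[Tuple[str, Optional[str]]]:
    """Parse 'LONG_OPTION_NAME OPTION_VALUE' lines (hypothetical helper)."""
    options: List[Tuple[str, Optional[str]]] = []
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and comment lines are ignored.
        # Assumption: leading whitespace before '#' is tolerated.
        if not line or line.startswith("#"):
            continue
        # Fields are whitespace-separated; the value is optional.
        parts = line.split(None, 1)
        name = parts[0]
        value = parts[1] if len(parts) > 1 else None
        options.append((name, value))
    return options


sample_rc = """\
# sample configuration
verbose 0
output out.t

list
"""
options = parse_rcfile(sample_rc)
```

A file with these contents would be passed to mootpp with --rcfile=FILE.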
When writing in XML format, you should first ensure that your input data is properly encoded in UTF-8.
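One way to check the encoding before running mootpp is to attempt a strict UTF-8 decode of the raw bytes. The sketch below is an independent check, not a feature of mootpp, and the name is_valid_utf8 is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # Strict decoding raises on invalid sequences.
    except UnicodeDecodeError:
        return False
    return True


good = is_valid_utf8("Wörterbuch".encode("utf-8"))
bad = is_valid_utf8("Wörterbuch".encode("latin-1"))
```

For a real input file, read it in binary mode (open(path, "rb")) and pass the resulting bytes to the function.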
Documentation file auto-generated by optgen.perl version 0.04. Translation was initiated on Mon Jun 27 13:02:43 CEST 2005 as:
/usr/bin/optgen.perl -l --nocfile --nohfile -F mootpp mootpp.gog
Bugs: Unknown.
Development of this package was supported by the project 'Kollokationen im Wörterbuch' ("collocations in the dictionary", http://www.bbaw.de/forschung/kollokationen) in association with the project 'Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS)' ("digital dictionary of the German language of the 20th century", http://www.dwds.de) at the Berlin-Brandenburgische Akademie der Wissenschaften (http://www.bbaw.de) with funding from the Alexander von Humboldt Stiftung (http://www.avh.de) and from the Zukunftsinvestitionsprogramm of the German federal government.
I am grateful to Christiane Fellbaum, Alexander Geyken, Gerald Neumann, Edmund Pohl, Alexey Sokirko, and others for offering useful insights in the course of development of this package.
Thomas Hanneforth wrote and maintains the libFSM C++ library for finite-state device operations used by the class-based HMM tagger / disambiguator, without which this package could not have been built.
Alexander Geyken and Thomas Hanneforth developed the rule-based morphological analysis system for German which was used in the development and testing of the class-based HMM tagger / disambiguator.
Bryan Jurish <moocow@ling.uni-potsdam.de>
the mootfiles manpage, mootm(1), the moot manpage, the mootchurn manpage