waste - Word- and Sentence-Token Extractor using a Hidden Markov Model
waste [OPTIONS] FILE(s)
Arguments:
FILE(s) Input files
Options
-h --help Print help and exit.
-V --version Print version and exit.
-cFILE --rcfile=FILE Read an alternate configuration file.
-vLEVEL --verbose=LEVEL Verbosity level.
-B --no-banner Suppress initial banner message (implied at verbosity levels <= 2)
-l --list Arguments are input-file lists.
-r --recover Attempt to recover from minor errors.
-oFILE --output=FILE Write output to FILE.
Mode Options
-f --full Alias for --scan --lex --tag --decode --annotate (default)
-R --train Training mode (similar to --lex)
-s --scan Enable raw text scanning stage.
-S --no-scan Disable raw text scanning stage.
-x --lex Enable lexical classification stage.
-X --no-lex Disable lexical classification stage.
-t --tag Enable HMM Viterbi tagging stage.
-T --no-tag Disable HMM Viterbi tagging stage.
-d --decode Enable post-Viterbi decoding stage.
-D --no-decode Disable post-Viterbi decoding stage.
-n --annotate Enable text-based annotation stage.
-N --no-annotate Disable text-based annotation stage.
Lexer Options
-aFILE --abbrevs=FILE Load abbreviation lexicon from FILE (1 word/line)
-jFILE --conjunctions=FILE Load conjunction lexicon from FILE (1 word/line)
-wFILE --stopwords=FILE Load stopword lexicon from FILE (1 word/line)
-y --dehyphenate Enable automatic dehyphenation in lexer (default)
-Y --no-dehyphenate Disable automatic dehyphenation in lexer.
HMM Options
-MMODEL --model=MODEL Use HMM tokenizer model MODEL.
Format Options
-IFORMAT --input-format=FORMAT Specify input or --scan mode format
-OFORMAT --output-format=FORMAT Specify output file format.
waste is the top-level command-line interface to the moot/WASTE HMM tokenizer system. It can be used as a complete tokenization pipeline (--full, the default), as an annotator for pre-tokenized training corpora (--train), or as a standalone scanner (--scan), lexical encoder (--lex), HMM disambiguator (--tag), lexical decoder (--decode), lexical annotator (--annotate), or as (almost) any coherent combination of the above components. Input and output formats depend on the chosen mode of operation; in the default (--full) mode, it takes as input one or more 'raw' files, and produces a 'medium-rare' output file whose analyses correspond to those returned by the dwds_tomasotath v0.4.x series of tokenizers. See mootfiles for details on moot file formats.
FILE(s)
Input files
See also the --list option.
--help, -h
Print help and exit.
Default: '0'
--version, -V
Print version and exit.
Default: '0'
--rcfile=FILE, -cFILE
Read an alternate configuration file.
Default: 'NULL'
See also: "CONFIGURATION FILES".
--verbose=LEVEL, -vLEVEL
Verbosity level.
Default: '3'
Be more or less verbose. Recognized values are in the range 0..6:
0: Disable all diagnostic messages.
1: Print error messages to stderr.
2: Print warnings to stderr.
3: Print general diagnostic information to stderr.
4: Print progress information to stderr.
5: Print debugging information to stderr (if applicable).
6: Print execution trace information to stderr (if applicable).
--no-banner, -B
Suppress initial banner message (implied at verbosity levels <= 2)
Default: '0'
--list, -l
Arguments are input-file lists.
Default: '0'
If this flag is given, the FILE(s) arguments should be lists of input filenames, one filename per line, which should be processed. Otherwise, the FILE(s) arguments are interpreted as filenames of the input files themselves.
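The list-file convention described above can be sketched as follows. The filenames (doc1.txt, doc2.txt, inputs.list, out.mr) are hypothetical, model and lexicon options are omitted for brevity, and the waste call is guarded so that the snippet does nothing where waste or the input files are missing.

```shell
# Build a list file: one input filename per line (hypothetical names).
printf '%s\n' doc1.txt doc2.txt > inputs.list

# Pass the list, rather than the input files themselves, using --list (-l).
# Guarded: runs only if waste is on PATH and the listed files exist.
if command -v waste >/dev/null 2>&1 && [ -f doc1.txt ] && [ -f doc2.txt ]; then
  waste --list --output=out.mr inputs.list
fi
```

Without --list, the same call would treat inputs.list itself as a raw input file.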
--recover, -r
Attempt to recover from minor errors.
Default: '0'
Minor errors such as missing files, etc. cause an error message to be emitted but do not cause the program to abort if this option is specified. Useful for large automated batch-processing jobs.
--output=FILE, -oFILE
Write output to FILE.
Default: '-'
Output files are in 'rare' format: one token per line, a blank line indicates a sentence boundary.
--full, -f
Alias for --scan --lex --tag --decode --annotate (default)
Default: '0'
--train, -R
Training mode (similar to --lex)
Default: '0'
Runs the WASTE scanner and lexer item-wise on pre-tokenized input, which must contain token text with leading whitespace where appropriate. Embedded special characters can be escaped with backslashes (e.g. \n, \r, \t, \f, \v, and \\), and any input tokens are truncated at a $= substring if present. Output is in 'well-done' format suitable for passing to mootrain. Overrides any other runtime mode options.
--scan, -s
Enable raw text scanning stage.
Default: '0'
--no-scan, -S
Disable raw text scanning stage.
Default: '0'
--lex, -x
Enable lexical classification stage.
Default: '0'
If lexer stage is enabled, you should also specify --abbrevs, --conjunctions, and/or --stopwords as appropriate for your model.
--no-lex, -X
Disable lexical classification stage.
Default: '0'
--tag, -t
Enable HMM Viterbi tagging stage.
Default: '0'
Requires --model option.
--no-tag, -T
Disable HMM Viterbi tagging stage.
Default: '0'
--decode, -d
Enable post-Viterbi decoding stage.
Default: '0'
--no-decode, -D
Disable post-Viterbi decoding stage.
Default: '0'
--annotate, -n
Enable text-based annotation stage.
Default: '0'
--no-annotate, -N
Disable text-based annotation stage.
Default: '0'
--abbrevs=FILE, -aFILE
Load abbreviation lexicon from FILE (1 word/line)
Default: 'NULL'
Only meaningful if --lex is enabled.
--conjunctions=FILE, -jFILE
Load conjunction lexicon from FILE (1 word/line)
Default: 'NULL'
Only meaningful if --lex is enabled.
--stopwords=FILE, -wFILE
Load stopword lexicon from FILE (1 word/line)
Default: 'NULL'
Only meaningful if --lex is enabled.
--dehyphenate, -y
Enable automatic dehyphenation in lexer (default)
Default: '1'
Only meaningful if --lex is enabled.
--no-dehyphenate, -Y
Disable automatic dehyphenation in lexer.
Default: '0'
Only meaningful if --lex is enabled.
--model=MODEL, -MMODEL
Use HMM tokenizer model MODEL.
Default: 'waste.hmm'
See 'HMM MODEL FILE FORMATS' in mootfiles for details on model file formats. This option is intended to be used with a pre-compiled binary model. If you need to set additional runtime options, you should call moot directly in a pipeline, e.g.
waste --scan -Or,loc INFILE.txt \
| waste --lex -aabbr.lex -jconj.lex -wstop.lex -Ir,loc -Omr,loc - \
| moot --stream --model=MODEL --beam-width=100 -Imr,loc -Owd,loc - \
| waste --decode -Iwd,loc -Om,loc \
| waste --annotate -Im,loc -Omr,loc -o OUTFILE.mr
--input-format=FORMAT, -IFORMAT
Specify input or --scan mode format
Default: 'NULL'
Value should be a comma-separated list of format flag names, each optionally prefixed with an exclamation point (!) to indicate negation. Only meaningful if the scanner stage has been disabled with the --no-scan (-S) option.
Default value depends on the first enabled processing module:
--scan : 'None'
--lex : 'Text'
--tag : 'Text,Analyzed'
--decode : 'Text,Analyzed,Tagged'
--annotate : 'Text'
--train : 'Text'
See 'I/O Format Flags' in mootfiles for details.
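As a rough illustration of the flag-list syntax, the sketch below splits a format value into its enabled and negated flags. The helper describe_flags is hypothetical and not part of waste. The flag names Text and Analyzed are taken from the default table above, but the combination shown is only an example. Single quotes are used because an unquoted or double-quoted '!' may trigger history expansion in interactive shells such as bash.

```shell
# Hypothetical helper (not part of waste): print how each entry in a
# comma-separated format flag list would be read.
describe_flags() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r flag; do
    case $flag in
      '!'*) printf 'off: %s\n' "${flag#?}" ;;  # leading '!' negates the flag
      *)    printf 'on: %s\n' "$flag" ;;
    esac
  done
}

# Illustrative value only; single-quoted so the shell leaves '!' alone.
INPUT_FORMAT='Text,!Analyzed'
describe_flags "$INPUT_FORMAT"

# Guarded, hypothetical invocation: tokens.txt is an assumed pre-scanned input.
if command -v waste >/dev/null 2>&1 && [ -f tokens.txt ]; then
  waste --lex --input-format="$INPUT_FORMAT" tokens.txt
fi
```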
--output-format=FORMAT, -OFORMAT
Specify output file format.
Default: 'NULL'
Value should be a comma-separated list of format flag names, optionally prefixed with an exclamation point (!) to indicate negation.
Default value depends on the last enabled processing module:
--scan : 'Text'
--lex : 'Text,Analyzed'
--tag : 'Text,Tagged'
--decode : 'Text'
--annotate : 'Text,Analyzed'
--train : 'Text,Analyzed,Tagged'
See 'I/O Format Flags' in mootfiles for details.
CONFIGURATION FILES
Configuration files are expected to contain lines of the form:
LONG_OPTION_NAME OPTION_VALUE
where LONG_OPTION_NAME is the long name of some option, without the leading '--', and OPTION_VALUE is the value for that option, if any. Fields are whitespace-separated. Blank lines and comments (lines beginning with '#') are ignored.
No configuration files are read by default.
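A configuration file in the format described above might look like the following sketch. The file name waste.rc and the chosen values are assumptions for illustration; abbr.lex and waste.hmm are only example paths. The waste call is guarded so that the snippet does nothing where waste or the input file is missing.

```shell
# Write a hypothetical configuration file: long option name (without the
# leading '--'), then whitespace, then the option value.
cat > waste.rc <<'EOF'
# Comment lines and blank lines are ignored.

model    waste.hmm
abbrevs  abbr.lex
verbose  4
EOF

# No configuration file is read by default, so name it explicitly.
if command -v waste >/dev/null 2>&1 && [ -f input.txt ]; then
  waste --rcfile=waste.rc input.txt
fi
```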
The --scan and --lex modules require that text data is encoded in UTF-8.
Documentation file auto-generated by optgen.perl version 0.15 using Getopt::Gen version 0.15. Translation was initiated as:
optgen.perl -l --nocfile --nohfile --notimestamp -F waste waste.gog
Unknown.
Initial development of this package was supported by the project 'Kollokationen im Wörterbuch' ("collocations in the dictionary", http://www.bbaw.de/forschung/kollokationen) in association with the project 'Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS)' ("digital dictionary of the German language of the 20th century", http://www.dwds.de) at the Berlin-Brandenburgische Akademie der Wissenschaften (http://www.bbaw.de) with funding from the Alexander von Humboldt Stiftung (http://www.avh.de) and from the Zukunftsinvestitionsprogramm of the German federal government. Development of the DynHMM and WASTE extensions was supported by the DFG-funded projects 'Deutsches Textarchiv' ("German text archive", http://www.deutschestextarchiv.de) and 'DLEX' at the Berlin-Brandenburgische Akademie der Wissenschaften.
The authors are grateful to Christiane Fellbaum, Alexander Geyken, Gerald Neumann, Edmund Pohl, Alexey Sokirko, and others for offering useful insights in the course of development of this package. Thomas Hanneforth wrote and maintains the libFSM C++ library for finite-state device operations used by the class-based HMM tagger / disambiguator, without which moot could not have been built. Alexander Geyken and Thomas Hanneforth developed the rule-based morphological analysis system for German which was used in the development and testing of the class-based HMM tagger / disambiguator.
Bryan Jurish <moocow@cpan.org> and Kay-Michael Würzner <wuerzner@bbaw.de>
moot(1), mootrain(1), mootcompile(1), mootfiles, moot, mootchurn