NAME
SYNOPSIS
DESCRIPTION
ARGUMENTS
OPTIONS
CONFIGURATION FILES
ADDENDA
- Caveats
- About this Document
BUGS AND LIMITATIONS
ACKNOWLEDGEMENTS
AUTHOR
SEE ALSO

NAME

waste - Word- and Sentence-Token Extractor using a Hidden Markov Model

SYNOPSIS

waste [OPTIONS] FILE(s)

 Arguments:
    FILE(s)  Input files

 Options
    -h        --help                  Print help and exit.
    -V        --version               Print version and exit.
    -cFILE    --rcfile=FILE           Read an alternate configuration file.
    -vLEVEL   --verbose=LEVEL         Verbosity level.
    -B        --no-banner             Suppress initial banner message (implied at verbosity levels <= 2)
    -l        --list                  Arguments are input-file lists.
    -r        --recover               Attempt to recover from minor errors.
    -oFILE    --output=FILE           Write output to FILE.

 Mode Options
    -f        --full                  Alias for --scan --lex --tag --decode --annotate (default)
    -R        --train                 Training mode (similar to --lex)
    -s        --scan                  Enable raw text scanning stage.
    -S        --no-scan               Disable raw text scanning stage.
    -x        --lex                   Enable lexical classification stage.
    -X        --no-lex                Disable lexical classification stage.
    -t        --tag                   Enable HMM Viterbi tagging stage.
    -T        --no-tag                Disable HMM Viterbi tagging stage.
    -d        --decode                Enable post-Viterbi decoding stage.
    -D        --no-decode             Disable post-Viterbi decoding stage.
    -n        --annotate              Enable text-based annotation stage.
    -N        --no-annotate           Disable text-based annotation stage.

 Lexer Options
    -aFILE    --abbrevs=FILE          Load abbreviation lexicon from FILE (1 word/line)
    -jFILE    --conjunctions=FILE     Load conjunction lexicon from FILE (1 word/line)
    -wFILE    --stopwords=FILE        Load stopword lexicon from FILE (1 word/line)
    -y        --dehyphenate           Enable automatic dehyphenation in lexer (default)
    -Y        --no-dehyphenate        Disable automatic dehyphenation in lexer.

 HMM Options
    -MMODEL   --model=MODEL           Use HMM tokenizer model MODEL.

 Format Options
    -IFORMAT  --input-format=FORMAT   Specify input or --scan mode format
    -OFORMAT  --output-format=FORMAT  Specify output file format.

DESCRIPTION

Word- and Sentence-Token Extractor using a Hidden Markov Model

waste is the top-level command-line interface to the moot/WASTE HMM tokenizer system. It can be used as a complete tokenization pipeline (--full, the default), as an annotator for pre-tokenized training corpora (--train), or as any single stage: standalone scanner (--scan), lexical encoder (--lex), HMM disambiguator (--tag), lexical decoder (--decode), or lexical annotator (--annotate). (Almost) any coherent combination of these stages is also possible. Input and output formats depend on the chosen mode of operation. In the default (--full) mode, waste takes one or more 'raw' files as input and produces a 'medium-rare' output file whose analyses correspond to those returned by the dwds_tomasotath v0.4.x series of tokenizers. See mootfiles for details on moot file formats.
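
As a rough illustration of how the stage options combine, the following Python sketch assembles an argument list for waste from a set of enabled stages. The helper name build_waste_argv and the file names are hypothetical; only the option names are taken from the synopsis above. The sketch builds the command line without executing it, and it does not check whether a given combination is one that waste considers coherent.

```python
# Sketch only: assembles a waste command line; it does not run waste.
STAGES = ("scan", "lex", "tag", "decode", "annotate")

def build_waste_argv(inputs, stages=STAGES, model=None, output=None):
    """Return an argv list that enables exactly the given stages."""
    unknown = set(stages) - set(STAGES)
    if unknown:
        raise ValueError(f"unknown stage(s): {sorted(unknown)}")
    argv = ["waste"]
    for stage in STAGES:
        # Each stage has an enabling option and a --no- counterpart.
        argv.append(f"--{stage}" if stage in stages else f"--no-{stage}")
    if model is not None:
        argv.append(f"--model={model}")
    if output is not None:
        argv.append(f"--output={output}")
    argv.extend(inputs)
    return argv

# Hypothetical file names, for illustration only.
print(build_waste_argv(["input.txt"], stages=("scan", "lex"), output="out.txt"))
```

The resulting list could be handed to subprocess.run on a system where waste is installed.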

ARGUMENTS

FILE(s)

Input files

OPTIONS

--help , -h

Print help and exit.

Default: '0'

--version , -V

Print version and exit.

Default: '0'

--rcfile=FILE , -cFILE

Read an alternate configuration file.

Default: 'NULL'

Mode Options

--full , -f

Alias for --scan --lex --tag --decode --annotate (default)

Default: '0'

--train , -R

Training mode (similar to --lex)

Default: '0'

Runs the WASTE scanner and lexer item-wise on pre-tokenized input, which must contain token text with leading whitespace where appropriate. Embedded special characters can be escaped with backslashes (e.g. \n, \r, \t, \f, \v, and \\), and any input token is truncated at a $= substring if present. Output is in 'well-done' format suitable for passing to mootrain. Overrides any other runtime mode options.
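
The escaping convention described above can be sketched in Python as follows. This is a minimal illustration, not part of waste itself: the function names are hypothetical, only the escapes named in the text are covered, and visible_text assumes that truncation discards the $= marker together with everything after it. How tokens are laid out in the input file is not shown here; see mootfiles for that.

```python
# Backslash escapes named in the --train description.
ESCAPES = {
    "\\": "\\\\",
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
    "\f": "\\f",
    "\v": "\\v",
}

def escape_token(text):
    """Escape special characters in one token's text with backslashes."""
    # Character-wise lookup, so an existing backslash is escaped only once.
    return "".join(ESCAPES.get(ch, ch) for ch in text)

def visible_text(token):
    """Assumed truncation rule: keep only the text before the first '$='."""
    return token.split("$=", 1)[0]

# Leading whitespace belongs to the token and is left untouched.
print(escape_token(" end.\nNext"))
print(visible_text("word$=extra"))
```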

--scan , -s

Enable raw text scanning stage.

Default: '0'

--no-scan , -S

Disable raw text scanning stage.

Default: '0'

--lex , -x

Enable lexical classification stage.

Default: '0'

If the lexer stage is enabled, you should also specify --abbrevs, --conjunctions, and/or --stopwords as appropriate for your model.

--no-lex , -X

Disable lexical classification stage.

Default: '0'

--tag , -t

Enable HMM Viterbi tagging stage.

Default: '0'

Requires the --model option.
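
A wrapper script might check this requirement before launching waste. The Python sketch below is one possible pre-flight check, not something waste provides: the function name missing_model is hypothetical, and as a simplification it recognizes only the spellings --tag, -t, --model=MODEL and -MMODEL, ignoring bundled short options and other spellings.

```python
def missing_model(argv):
    """Return True if argv explicitly enables tagging without naming a model."""
    wants_tag = any(arg in ("--tag", "-t") for arg in argv)
    has_model = any(
        arg.startswith("--model=") or (arg.startswith("-M") and len(arg) > 2)
        for arg in argv
    )
    return wants_tag and not has_model

# Hypothetical file names, for illustration only.
print(missing_model(["waste", "--tag", "input.txt"]))
print(missing_model(["waste", "--tag", "--model=model.hmm", "input.txt"]))
```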

--no-tag , -T

Disable HMM Viterbi tagging stage.

Default: '0'

--decode , -d

Enable post-Viterbi decoding stage.

Default: '0'

--no-decode , -D

Disable post-Viterbi decoding stage.

Default: '0'

--annotate , -n

Enable text-based annotation stage.

Default: '0'

--no-annotate , -N

Disable text-based annotation stage.

Default: '0'