dta-tokwrap.perl - top-level tokenizer wrapper for DTA XML documents
dta-tokwrap.perl [OPTIONS] XMLFILE(s)...
General Options:
-help # show this help message
-man # show complete manpage
-verbose LEVEL # set verbosity level (0<=level<=7; default=1)
Make Emulation Options:
-list-targets # just list known targets
-targets TARGETS # set build targets (default='all')
-make , -nomake # do/don't emulate make-style dependency tracking (default=don't)
-remake # force rebuilding of all targets (implies -make)
-force-target TARGET # for -make mode, force rebuilding of TARGET
-force # alias for -force-target=all
-noforce # overrides all preceding -force and -force-target flags
Subprocessor Options:
-rcdir RCDIR # resource directory (default=$ENV{TOKWRAP_RCDIR} or /usr/local/share/dta-resources)
-inplace , -noinplace # do/don't use locally built programs if available (default=do)
-sb-xpath XPATH # add sentence-break hints on XPATH (element) open and close
-wb-xpath XPATH # add word-break hints on XPATH (element) open and close
-hints, -nohints # do/don't generate "hints" for the tokenizer (default=do)
-weak-hints # use whitespace-only hints rather than defaults ($WB$,$SB$)
-strong-hints # opposite of -weak-hints
-abbrev-lex=FILE # abbreviation lexicon for dwds_tomasotath or waste tokenizer
-mwe-lex=FILE # multiword-expression lexicon for dwds_tomasotath tokenizer
-stop-lex=FILE # stopword lexicon for waste tokenizer
-conj-lex=FILE # conjunction lexicon for waste tokenizer
-waste-model=FILE # HMM file for waste tokenizer
-waste-dir=DIR # waste base directory (defaults for -abbrev-lex, -stop-lex, -conj-lex, -waste-model)
-procopt OPT=VALUE # set arbitrary subprocessor options
I/O Options:
-outdir OUTDIR # set output directory (default=.)
-tmpdir TMPDIR # set temporary directory (default=$ENV{DTATW_TMP} or OUTDIR)
-keep , -nokeep # do/don't keep temporary files (default=don't)
-format , -noformat # do/don't pretty-print XML output (default=do)
-docopt OPT=VALUE # set arbitrary document options (e.g. filenames)
Logging Options:
-log-config RCFILE # use Log::Log4perl configuration file RCFILE (default=internal)
-log-level LEVEL # set minimum log level
-log-file LOGFILE # log to file LOGFILE (default=none)
-stderr , -nostderr # do/don't log to console (default=do)
-profile , -noprofile # do/don't log profiling information (default=do)
-silent , -quiet # alias for -verbose=0 -log-level=FATAL -notrace
Trace and Debugging Options:
-dump-xsl PREFIX # dump generated XSL stylesheets to PREFIX*.xsl and exit
-dummy , -nodummy # don't/do actually run any subprocessors (default=do)
-tokenizer-class CLASS # specify tokenizer subclass (e.g. http, waste, dummy, tomasotath_04x, ...)
-dummy-tokenizer # alias for -tokenizer-class=dummy
-http-tokenizer # alias for -tokenizer-class=http
-trace , -notrace # do/don't log trace messages (default: depends on -verbose)
-traceAll # enable logging of all possible trace messages
-notraceAll # disable logging of all possible trace messages
-traceLevel LEVEL # set trace logging level (default='trace')
-traceX, -notraceX # do/don't trace "X" (X={Open|Load|Save|Make|...})
-traceXLevel LEVEL # set log level for "X" traces (X={Open|...})
Display a short help message and exit.
Display the complete program manpage and exit.
Set verbosity level (0<=level<=7; default=0)
Set build targets (default="all"). Multiple TARGETS may be separated by whitespace, commas, or by passing multiple -targets options. See "Known Targets" for a list of currently defined targets.
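The separator rule above can be illustrated with a small shell sketch. The helper name split_targets is hypothetical and is not part of dta-tokwrap; it only shows how a list given with whitespace, commas, or several arguments collapses to one target per line.

```shell
# split_targets LIST...  (hypothetical helper, for illustration only)
# Prints one target name per line, treating commas, spaces and tabs
# as separators and dropping empty fields.
split_targets() {
  printf '%s\n' "$@" | tr ',\t ' '\n\n\n' | sed '/^$/d'
}

# Mimics "-targets 'cx, bx0' -targets tok"
split_targets 'cx, bx0' 'tok'
```

Under these assumptions, the three spellings "-targets cx,bx0,tok", "-targets 'cx bx0 tok'", and three separate -targets options would all name the same set of targets.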
Do/don't emulate experimental make-style dependency tracking (default=don't). Use of -make mode may be faster (because it requires less file I/O).
Force rebuilding of all targets (implies -make).
For -make mode, force rebuilding of TARGET.
Alias for -force-target=all
Overrides all preceding "-force" and -force-target flags.
Do/don't use locally built programs if available (default=do). This is useful if you want to test a development version (-inplace) and an installed system version (-noinplace) of this package on the same machine.
Tells the mkbx0 subprocessor to add sentence-break hints on XPATH (which should resolve only to element nodes) open and close. XPATH is included in the generated hint.xsl XSL stylesheet as a match item, so it can include e.g. top-level unions, but no nested unions.
This option may be specified more than once.
Tells the mkbx0 subprocessor to add word-break hints on XPATH (which should resolve only to element nodes) open and close. Same caveats as for "-sb-xpath XPATH".
This option may be specified more than once.
Do/don't generate explicit sentence- and/or token-break "hints" for the tokenizer in the temporary .txt file (default=do). Explicit hint strings can be set with -procopt wbStr=WORDBREAK_HINT_STRING and/or -procopt sbStr=SENTBREAK_HINT_STRING; see -procopt below for details.
If generating tokenizer "hints", use whitespace-only hints rather than defaults "\n$WB$\n", "\n$SB$\n". This can be useful if your low-level tokenizer doesn't understand the explicit hints, but might be predisposed to break tokens and/or sentences on whitespace.
Opposite of -weak-hints.
Abbreviation lexicon for dwds_tomasotath tokenizer. Default is (usually) /usr/local/share/dta-resources/dta_abbrevs.lex.
FILE may be specified as the empty string to avoid use of an abbreviation lexicon altogether, although this is likely to wreak havoc with dwds_tomasotath's sentence-boundary recognition.
Multiword-expression lexicon for dwds_tomasotath tokenizer. Default is (usually) /usr/local/share/dta-resources/dta_mwe.lex.
FILE may be specified as the empty string to avoid use of a multiword-expression lexicon altogether, although this might cause problems with dwds_tomasotath.
Set a literal arbitrary subprocessor option OPT to VALUE. See subprocessor module documentation for available options.
Set output directory (default=.)
Set directory for storing temporary files. Default value is taken from the environment variable $DTATW_TMP if it is set, otherwise the default is the value of OUTDIR (see -outdir).
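The fallback order described above can be sketched in shell. This is an illustration only: resolve_tmpdir is a hypothetical helper, not something the program provides, and it assumes that an empty $DTATW_TMP is treated the same as an unset one.

```shell
# resolve_tmpdir OUTDIR [TMPDIR]  (hypothetical helper, for illustration only)
# Prints the temporary directory that would be used:
#   1. an explicit -tmpdir value, if given
#   2. otherwise $DTATW_TMP, if set and non-empty (assumption)
#   3. otherwise OUTDIR, which itself defaults to "."
resolve_tmpdir() {
  outdir=${1:-.}
  explicit=${2:-}
  if [ -n "$explicit" ]; then
    printf '%s\n' "$explicit"
  else
    printf '%s\n' "${DTATW_TMP:-$outdir}"
  fi
}

resolve_tmpdir out
```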
Do/don't keep temporary files, rather than deleting them when they are no longer needed (default=don't).
Do/don't pretty-print XML output when possible (default=do).
Set arbitrary DTA::TokWrap::Document options (e.g. filenames). See DTA::TokWrap::Document(3pm) for details.
Use Log::Log4perl configuration file RCFILE, rather than the default internal configuration. See Log::Log4perl(3pm) for details on the syntax of RCFILE.
Set minimum log level. Only effective if the default (internal) log configuration is being used.
Send log output to file LOGFILE (default=none). Only effective if the default (internal) log configuration is being used.
Do/don't log to console (default=do). Only effective if the default (internal) log configuration is being used.
Do/don't log profiling information (default=do).
Alias for -verbose=0 -log-level=FATAL -notrace.
Dumps generated XSL stylesheets to PREFIX*.xsl and exits. Useful for debugging. Causes the following files to be written:
${PREFIX}mkbx0_hint.xsl # hint insertion
${PREFIX}mkbx0_sort.xsl # serialization sort-key generation
${PREFIX}standoff_t2s.xsl # master XML to sentence standoff
${PREFIX}standoff_t2w.xsl # master XML to token standoff
${PREFIX}standoff_t2a.xsl # master XML to analysis standoff
Don't/do actually run any subprocessors (default=do)
Do/don't use locally built dummy tokenizer instead of tomata2.
Do/don't log trace messages (default: depends on the current -verbose level; see -verbose).
Enable logging of all possible trace messages. Warning: this generates a lot of log output.
Disable logging of all possible trace messages.
Set log level to use for trace messages (default='trace'). LEVEL is one of the following: trace, debug, info, warn, error, fatal. Any other value for LEVEL causes trace messages not to be logged.
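A wrapper script might want to check a LEVEL value before passing it on, since an unrecognized value silently disables trace logging. The following is a minimal sketch; is_trace_level is a hypothetical helper, and matching the names in lowercase only is an assumption, since the text above does not say whether LEVEL is case-sensitive.

```shell
# is_trace_level LEVEL  (hypothetical helper, for illustration only)
# Succeeds if LEVEL is one of the six documented level names.
# Lowercase-only matching is an assumption.
is_trace_level() {
  case $1 in
    trace|debug|info|warn|error|fatal) return 0 ;;
    *) return 1 ;;
  esac
}

if is_trace_level "debug"; then
  echo "debug: trace messages would be logged at this level"
fi
```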
Do/don't log trace messages for the trace flavor X, where X is one of the following:
Open # document object open() method
Close # document object close() method
Proc # document processing method calls
Load # load document data file
Save # save document data file
Make # document target (re-)making (including status-check)
Gen # document target (re-)generation
Subproc # low-level subprocessor calls
Run # external system command
Set log level for X-type traces to LEVEL. X is a trace message flavor as described in -traceX, and LEVEL is as described in -traceLevel.
All other command-line arguments are assumed to be filenames of DTA "base-format" XML files, which are simply (TEI-conformant) UTF-8 encoded XML files with one (optional as of dta-tokwrap v0.38) <c> element per character:
the document MUST be encoded in UTF-8,
all text nodes to be tokenized should be descendants of a <text> element, and may optionally be immediate daughters of a <c> element (XPath //text//text()|//text//c/text()). <c> elements may not be nested.
Prior to dta-tokwrap v0.38, <c> elements were required.
This program is intended to provide a flexible high-level command-line interface to the tokenization of DTA "base-format" XML documents, generating e.g. sentence-, token-, and analysis-level standoff XML annotations for each input document.
The program can be run in one of two main modes; see "Modes of Operation" for details on these. In either mode, it can be used either as a standalone batch-processor for one or more input documents, or called by a superordinate build system, e.g. GNU make (see make(1)). Program operation is controlled primarily by the specification of one or more "targets" to build for each input document; see "Known Targets" for details.
The program can be run in one of two modes of operation, "-make Mode" and "-nomake Mode".
(DEPRECATED)
In this (deprecated) mode, the program attempts to emulate the dependency tracking features of make by (re-)building only those targets which either do not yet exist, or which are older than one or more of their dependencies. Since some dependencies are ephemeral, existing only in RAM during a single program run, this can mean a lot of pain for comparatively little gain.
-make mode is enabled by specifying the -make option on the command-line.
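The rebuild rule described above (target missing, or older than a dependency) can be sketched as a shell function. This is not the program's implementation: needs_rebuild is a hypothetical helper that only looks at files on disk, so it cannot model the RAM-only dependencies mentioned above.

```shell
# needs_rebuild TARGET DEP...  (hypothetical helper, for illustration only)
# Succeeds (exit status 0) if TARGET does not exist, or if any DEP
# has a newer modification time than TARGET.
needs_rebuild() {
  target=$1
  shift
  [ -e "$target" ] || return 0
  for dep in "$@"; do
    # -nt is a widely supported test extension (bash, dash, ksh),
    # not strictly POSIX
    if [ "$dep" -nt "$target" ]; then
      return 0
    fi
  done
  return 1
}

if needs_rebuild doc.t0 doc.txt; then
  echo "doc.t0 is missing or out of date"
fi
```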
In this (experimental) mode, no implicit dependency tracking is attempted, and all required data files (input, "temporary", and/or output) must exist when the requested target is built; otherwise an error results. -nomake mode can be somewhat slower than -make mode, since "temporary" data (which in -make mode are RAM-only ephemera) may need to be bounced off the filesystem.
-nomake mode is the default mode, and may be (re-)enabled (overriding any preceding -make option) by specifying the -nomake option on the command-line.
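Because -nomake mode expects every input file of a target to exist already, a stage-by-stage run has to request targets in dependency order. The sketch below only prints one candidate command line per stage rather than running anything. The target aliases are taken from "Known Targets"; the file name doc.xml is a placeholder, and the use of -keep rests on the assumption that intermediate files must survive between separate invocations.

```shell
# stagewise_commands XMLFILE  (hypothetical helper, for illustration only)
# Prints, in dependency order, one dta-tokwrap.perl command per stage.
# Nothing is executed; inspect the output before running it for real.
stagewise_commands() {
  doc=$1
  for target in cx bx0 bx tokenize tok2xml cws; do
    # -keep: assumed necessary so that temporary files are still
    # present when the next stage runs
    echo "dta-tokwrap.perl -nomake -keep -targets $target $doc"
  done
}

stagewise_commands doc.xml
```

In ordinary use a single invocation with the default target 'all' covers the same ground; the stepwise form is mainly of interest when a superordinate build system drives the individual stages.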
The following targets are known values for the -targets option in "-make Mode":
The following targets are known values for the -targets option in "-nomake Mode":
Alias(es): cx sx tx xx
Input(s): FILE.xml
Output(s): FILE.cx, FILE.sx, FILE.tx
Creates temporary "character index" FILE.cx (CSV), "structure index" FILE.sx (XML without <c> elements), and "text index" FILE.tx (raw text, unserialized) for each input document FILE.xml.
Alias(es): bx0
Input(s): FILE.sx
Output(s): FILE.bx0
Creates temporary hint- and serialization index FILE.bx0 for each input document FILE.xml
Alias(es): mktxt bx txt
Input(s): FILE.bx0, FILE.tx
Output(s): FILE.bx, FILE.txt
Creates temporary serialized block-index file FILE.bx and serialized text file FILE.txt for each input document FILE.xml.
Alias(es): tokenize0 tok0 t0 tt0
Input(s): FILE.txt
Output(s): FILE.t0
Creates temporary CSV-format raw tokenizer output file FILE.t0 for each input document FILE.xml
Alias(es): tokenize1 tok1 t1 tt1
Input(s): FILE.t0
Output(s): FILE.t1
Creates temporary CSV-format post-processed tokenizer output file FILE.t1 for each input document FILE.xml
Alias(es): tokenize tok t tt
Input(s): FILE.txt
Output(s): FILE.t0 FILE.t1
Wrapper for "mktok0 mktok1".
Alias(es): tok2xml xtok txml ttxml tokxml
Input(s): FILE.t, FILE.bx, FILE.cx
Output(s): FILE.t.xml
Creates master tokenized XML output file FILE.t.xml for each input document FILE.xml
Alias(es): mkcws cwsxml cws
Input(s): FILE.xml FILE.t.xml
Output(s): FILE.cws.xml
Creates "spliced" XML output "Frankenfile" FILE.cws.xml for each input document FILE.xml ; see also dtatw-splice.perl(1).
Alias(es): mksos sosxml sosfile sxml
Input(s): FILE.t.xml
Output(s): FILE.s.xml
DEPRECATED
Creates sentence-level stand-off XML file FILE.s.xml for each input document FILE.xml
Alias(es): mksow sowxml sowfile wxml
Input(s): FILE.t.xml
Output(s): FILE.w.xml
DEPRECATED
Creates token-level stand-off XML file FILE.w.xml for each input document FILE.xml
Alias(es): mksoa soaxml soafile axml
Input(s): FILE.t.xml
Output(s): FILE.a.xml
DEPRECATED
Creates token-analysis-level stand-off XML file FILE.a.xml for each input document FILE.xml
Alias(es): standoff so mkso
DEPRECATED
Alias(es): (none)
Input(s): FILE.xml
Output(s): FILE.t.xml, FILE.cws.xml
Alias for all targets required to generate the target's output files (master tokenized file and spliced output) from the input document, run in the proper order.
Alias(es): (none)
Input(s): FILE.xml
Output(s): FILE.t
Alias for all targets required to generate fixed tokenizer output FILE.t from a TEI-XML file FILE.xml, run in the proper order.
Alias(es): (none)
Input(s): FILE.xml
Output(s): FILE.t.xml
Alias for all targets required to generate a flat tokenized XML file FILE.t.xml from a TEI-XML file FILE.xml, run in the proper order.
DTA::TokWrap::Intro(3pm), dtatw-add-c.perl(1), dtatw-add-w.perl(1), dtatw-add-s.perl(1), dtatw-rm-c.perl(1), dtatw-splice.perl(1), ...
Bryan Jurish <jurish@bbaw.de>