DTA::TokWrap::Intro - a gentle introduction to the DTA::TokWrap distribution
The DTA::TokWrap perl distribution contains various modules for operations associated with the tokenization of DTA "base-format" XML documents. The distribution is divided into 1 program and 3 main modules:
dta-tokwrap.perl: Top-level command-line interface. Use this if you can. See dta-tokwrap.perl(1) for details.
DTA::TokWrap: Top-level wrappers for persistent data associated with document tokenization. Encapsulates all default DTA::TokWrap::Processor sub-processor objects. See the DTA::TokWrap section for more details.
DTA::TokWrap::Document: Top-level wrappers for per-document data, including temporary files, indices, and in-memory data structures. See the DTA::TokWrap::Document section for more details.
DTA::TokWrap::Processor: Abstraction level for processing operations on document data. See the DTA::TokWrap::Processor section for more details.
These and other included modules are briefly described in the "MODULES" section, below.
The following sections are intended to give a brief overview of the modules included with this distribution.
The DTA::TokWrap module provides top-level object-oriented wrappers for (batch) tokenization of DTA "base-format" XML documents. DTA::TokWrap objects encapsulate all default DTA::TokWrap::Processor objects under a single object. Most document processing should proceed via a DTA::TokWrap object.
See DTA::TokWrap(3pm) for more details.
DTA::TokWrap::Document provides a perl class for representing a single DTA base-format XML file and associated indices, temporary files, and stand-off files. Together with the DTA::TokWrap module, this class comprises the top-level API of the DTA::TokWrap distribution.
See DTA::TokWrap::Document(3pm) for more details.
***DEPRECATED***
DTA::TokWrap::Document::Maker provides an experimental DTA::TokWrap::Document subclass which attempts to perform make-like dependency tracking on document data keys.
See DTA::TokWrap::Document::Maker(3pm) for more details.
The DTA::TokWrap::Processor package provides an abstract base class which subsumes document-processing modules included in the DTA::TokWrap distribution.
See DTA::TokWrap::Processor(3pm) for details on the API.
DTA::TokWrap::Processor::mkindex provides an object-oriented DTA::TokWrap::Processor wrapper around the dtatw-mkindex C program for DTA::TokWrap::Document objects.
See DTA::TokWrap::Processor::mkindex(3pm) for details.
DTA::TokWrap::Processor::mkbx0 provides an object-oriented DTA::TokWrap::Processor wrapper for hint insertion and serialization sort-key generation on a text-free "structure index" (.sx) XML file.
See DTA::TokWrap::Processor::mkbx0(3pm) for details.
DTA::TokWrap::Processor::mkbx provides an object-oriented DTA::TokWrap::Processor wrapper for the creation of in-memory serialized text-block-indices.
See DTA::TokWrap::Processor::mkbx(3pm) for details.
This class is just an abstract placeholder for a low-level tokenizer. By default, it attempts to automatically detect a supported tokenizer on your system (preferably moot/WASTE). Depending on your needs, you may wish to use e.g. DTA::TokWrap::Processor::tokenize::waste or DTA::TokWrap::Processor::tokenize::http directly, or to set the package variable DTA::TokWrap::Processor::tokenize to the default tokenizer subclass name for your system.
DTA::TokWrap::Processor::tokenize provides an object-oriented DTA::TokWrap::Processor wrapper for the tokenization of serialized text files for DTA::TokWrap::Document objects.
See DTA::TokWrap::Processor::tokenize(3pm) for details.
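The automatic detection described above can be pictured as a walk over a preference-ordered list of backends. The following is only a minimal sketch of that idea in plain Perl: the function choose_backend(), the probe callbacks, and the fallback name are hypothetical and are not part of the DTA::TokWrap API.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of DTA::TokWrap): return the name of the
# first candidate whose probe callback reports success, else the fallback.
sub choose_backend {
    my ($fallback, @candidates) = @_;
    for my $candidate (@candidates) {
        my ($name, $probe) = @$candidate;
        return $name if $probe->();
    }
    return $fallback;
}

# The probes below are stand-ins; a real probe might check for an
# installed library or a reachable server.
my $backend = choose_backend(
    'dummy',
    [ waste => sub { 0 } ],   # pretend the moot/WASTE backend is missing
    [ http  => sub { 1 } ],   # pretend a tokenizer web service is reachable
);
print "selected backend: $backend\n";
```

Listing the preferred backend first mirrors the "preferably moot/WASTE" ordering mentioned above; see DTA::TokWrap::Processor::tokenize(3pm) for how the module actually selects its subclass.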
DTA::TokWrap::Processor::tokenize::dummy provides a package-local alternative to the "official" low-level tokenizer class DTA::TokWrap::Processor::tokenize.
See DTA::TokWrap::Processor::tokenize::dummy(3pm) for details.
DTA::TokWrap::Processor::tokenize1 provides an object-oriented DTA::TokWrap::Processor wrapper for some required and/or optional post-processing of tokenized files used by DTA::TokWrap::Document objects.
See DTA::TokWrap::Processor::tokenize1(3pm) for details.
DTA::TokWrap::Processor::tok2xml provides an object-oriented DTA::TokWrap::Processor wrapper for converting "raw" CSV-format (.t) low-level tokenizer output to a "master" tokenized XML (.t.xml) format, for use with DTA::TokWrap::Document objects.
See DTA::TokWrap::Processor::tok2xml(3pm) for details.
***OBSOLETE***
DTA::TokWrap::Processor::standoff provides an object-oriented DTA::TokWrap::Processor wrapper for generation of various standoff XML formats for DTA::TokWrap::Document objects.
See DTA::TokWrap::Processor::standoff(3pm) for details.
DTA::TokWrap::Processor::addws provides an object-oriented DTA::TokWrap::Processor wrapper for splicing tokenization data (word- and sentence-boundaries) back into a source TEI-XML file, potentially fragmenting words and/or sentences in the process. Each segment is assigned a unique id, and fragmented segments are associated using the TEI prev and next attributes.
See DTA::TokWrap::Processor::addws(3pm) for details.
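To illustrate how fragmented segments can be chained with prev and next, here is a minimal sketch in plain Perl. It is not the addws implementation: the helper link_fragments(), the id scheme (base id plus a numeric suffix), and the use of a plain id attribute are assumptions made for the example, and the fragment text is not XML-escaped.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of DTA::TokWrap): build one <w> element per
# text fragment of a single logical word. Each fragment gets a unique id;
# neighbouring fragments are linked with TEI-style prev/next pointers.
sub link_fragments {
    my ($base_id, @texts) = @_;

    # Unfragmented words keep the base id; fragments get a numeric suffix.
    my @ids = map { @texts > 1 ? $base_id . '_' . $_ : $base_id } 0 .. $#texts;

    my @elements;
    for my $i (0 .. $#texts) {
        my @attrs = ('id="' . $ids[$i] . '"');
        push @attrs, 'prev="#' . $ids[$i - 1] . '"' if $i > 0;
        push @attrs, 'next="#' . $ids[$i + 1] . '"' if $i < $#texts;
        push @elements, '<w ' . join(' ', @attrs) . '>' . $texts[$i] . '</w>';
    }
    return @elements;
}

# A word split in two by intervening markup, e.g. a line break.
print "$_\n" for link_fragments('w42', 'Zu', 'sammen');
```

Following the prev and next pointers from any fragment lets a consumer reassemble the original word even though its parts sit in different places in the TEI tree.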
DTA::TokWrap::Processor::idsplice provides an object-oriented DTA::TokWrap::Processor wrapper for splicing stand-off data into a base XML file by matching ids.
See DTA::TokWrap::Processor::idsplice(3pm) for details.
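The idea of splicing by matching ids can be sketched on plain Perl data structures rather than on XML. This is a simplified illustration only: splice_by_id(), the record layout, and the attribute names are hypothetical and do not reflect the idsplice API or file formats.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of DTA::TokWrap): copy each base record and
# merge in the stand-off attributes stored under the same id, if any.
# The input structures are left unmodified.
sub splice_by_id {
    my ($base, $standoff) = @_;
    return [
        map {
            my $extra = $standoff->{ $_->{id} } || {};
            +{ %$_, %$extra };
        } @$base
    ];
}

my $base = [
    { id => 'w1', text => 'Hallo' },
    { id => 'w2', text => 'Welt' },
];
my $standoff = { w2 => { lemma => 'Welt', pos => 'NN' } };

my $merged = splice_by_id($base, $standoff);
for my $w (@$merged) {
    print join(' ', map { "$_=$w->{$_}" } sort keys %$w), "\n";
}
```

Keying the stand-off data by id makes each lookup a single hash access, so the base document only needs to be traversed once.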
DTA::TokWrap::Base provides an abstract base class for all object classes in the DTA::TokWrap distribution.
See DTA::TokWrap::Base(3pm) for details.
DTA::TokWrap::CxData provides utilities for binary I/O on dta-tokwrap *.cx files.
See DTA::TokWrap::CxData(3pm) for details.
DTA::TokWrap::Logger provides an abstract base class for object-oriented access to the Log::Log4perl logging facility.
See DTA::TokWrap::Logger(3pm) for details.
DTA::TokWrap::Utils provides miscellaneous utilities which don't fit well anywhere else and which don't on their own justify the creation of a new package.
See DTA::TokWrap::Utils(3pm) for details.
Version constants for DTA::TokWrap. Intended for (direct) use only by DTA::TokWrap sub-modules.
dta-tokwrap.perl(1), DTA::TokWrap(3pm), DTA::TokWrap::Document(3pm), DTA::TokWrap::Processor(3pm), ...
Bryan Jurish <jurish@bbaw.de>