NAME

DTA::TokWrap::Processor::tcfdecode0 - DTA tokenizer wrappers: TCF[tei,text,tokens,sentences]->TEI,text extraction

SYNOPSIS

 use DTA::TokWrap::Processor::tcfdecode0;
 
 $dec = DTA::TokWrap::Processor::tcfdecode0->new(%opts);
 $doc_or_undef = $dec->tcfdecode0($doc);

DESCRIPTION

DTA::TokWrap::Processor::tcfdecode0 provides an object-oriented DTA::TokWrap::Processor wrapper for extracting the tei, text, tokens, and sentences layers from a tokenized TCF ("Text Corpus Format", cf. http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/The_TCF_Format) document, as originally encoded by a DTA::TokWrap::Processor::tcfencode ("tcfencoder") object. The encoded TCF document should have the following layers:

textSource[@type="application/tei+xml"]

Source TEI-XML encoded as an XML text node; should be identical to the source XML {xmlfile} or {xmldata} passed to the tcfencoder. Also accepts type "text/tei+xml".

text

Serialized text encoded as an XML text node; should be identical to the serialized text {txtfile} or {txtdata} passed to the tcfencoder.

tokens

Tokens returned by the tokenizer for the text layer. Document order of tokens should correspond exactly to the serial order of the associated text in the text layer.

sentences

Sentences returned by the tokenizer for the tokens in the tokens layer. Document order of sentences must correspond exactly to the serial order of the associated text in the text layer.
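
The layer inventory described above can be checked with a few XPath queries. The following is a minimal sketch only, not the decoder's own implementation. It assumes that XML::LibXML is installed; the sample document and the helper tcf_layers() are invented for illustration, and the namespace URIs are assumed from TCF version 0.4.

```perl
use strict;
use warnings;
use XML::LibXML;

##-- hypothetical hand-written TCF sample; namespace URIs are an assumption (TCF 0.4)
my $tcf = <<'EOT';
<?xml version="1.0" encoding="UTF-8"?>
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
  <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
    <textSource type="application/tei+xml">&lt;TEI&gt;&lt;text&gt;Ein Test.&lt;/text&gt;&lt;/TEI&gt;</textSource>
    <text>Ein Test.</text>
    <tokens>
      <token ID="w1">Ein</token>
      <token ID="w2">Test</token>
      <token ID="w3">.</token>
    </tokens>
    <sentences>
      <sentence ID="s1" tokenIDs="w1 w2 w3"/>
    </sentences>
  </TextCorpus>
</D-Spin>
EOT

##-- parse the sample and bind the (assumed) TextCorpus namespace to the prefix "tc"
my $tcfdoc = XML::LibXML->load_xml(string => $tcf);
my $xpc    = XML::LibXML::XPathContext->new($tcfdoc);
$xpc->registerNs(tc => 'http://www.dspin.de/data/textcorpus');

## %counts = tcf_layers($xpc)
##  + hypothetical helper, not part of DTA::TokWrap
##  + returns the number of matching nodes for each expected layer
sub tcf_layers {
  my $ctx = shift;
  my %xpaths = (
    textSource => '//tc:textSource[@type="application/tei+xml" or @type="text/tei+xml"]',
    text       => '//tc:text',
    tokens     => '//tc:tokens/tc:token',
    sentences  => '//tc:sentences/tc:sentence',
  );
  my %counts;
  foreach my $layer (keys %xpaths) {
    my @nodes = $ctx->findnodes($xpaths{$layer});
    $counts{$layer} = scalar(@nodes);
  }
  return %counts;
}

##-- report one line per layer: name and node count
my %counts = tcf_layers($xpc);
printf("%-10s %d\n", $_, $counts{$_}) foreach (sort keys %counts);
```

A count of zero for any layer would indicate that the document lacks a layer the decoder expects.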

Constants

@ISA

DTA::TokWrap::Processor::tcfdecode0 inherits from DTA::TokWrap::Processor.

Constructors etc.

new
 $obj = $CLASS_OR_OBJECT->new(%args);

Constructor.

defaults
 %defaults = $CLASS->defaults();

Static class-dependent defaults.

Methods

tcfdecode0
 $doc_or_undef = $CLASS_OR_OBJECT->tcfdecode0($doc);

Decodes the TCF document stored in the {tcfdoc} key of the DTA::TokWrap::Document object $doc, storing the results in $doc->{tcfxdata}, $doc->{tcftdata}, and $doc->{tcfwdata}.

Relevant %$doc keys:

 tcfdoc   => $tcfdoc,   ##-- (input) TCF input document
 ##
 tcfxdata => $tcfxdata, ##-- (output) TEI-XML decoded from TCF
 tcftdata => $tcftdata, ##-- (output) text data decoded from TCF
 tcfwdata => $tcfwdata, ##-- (output) tokenized data decoded from TCF, without byte-offsets, with "SID/WID" attributes
 ##
 tcfdecode0_stamp0 => $f, ##-- (output) timestamp of operation begin
 tcfdecode0_stamp  => $f, ##-- (output) timestamp of operation end
 tcfxdata_stamp   => $f, ##-- (output) timestamp of operation end
 tcftdata_stamp   => $f, ##-- (output) timestamp of operation end
 tcfwdata_stamp   => $f, ##-- (output) timestamp of operation end
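
The return value and output keys listed above suggest a simple calling pattern. The sketch below is an illustration under stated assumptions, not part of the DTA::TokWrap distribution: decoded_layers() is a hypothetical helper, $dec is assumed to be a tcfdecode0 object, and $doc is assumed to be a DTA::TokWrap::Document whose {tcfdoc} key has already been populated.

```perl
use strict;
use warnings;

## $layers = decoded_layers($dec, $doc)
##  + hypothetical helper, not part of DTA::TokWrap
##  + $dec : assumed to be a DTA::TokWrap::Processor::tcfdecode0 object
##  + $doc : assumed to be a DTA::TokWrap::Document with {tcfdoc} already set
##  + returns a hash-ref of the three decoded layers, dies on failure
sub decoded_layers {
  my ($dec, $doc) = @_;
  die "no {tcfdoc} key to decode\n" if !defined($doc->{tcfdoc});

  ##-- tcfdecode0() is documented to return the document, or undef
  defined($dec->tcfdecode0($doc))
    or die "tcfdecode0() returned undef\n";

  ##-- collect the documented output keys, insisting that each one was set
  my %layers;
  foreach my $key (qw(tcfxdata tcftdata tcfwdata)) {
    die "missing output key '$key'\n" if !defined($doc->{$key});
    $layers{$key} = $doc->{$key};
  }
  return \%layers;
}
```

With the module loaded, the helper might be called as follows:

 $dec    = DTA::TokWrap::Processor::tcfdecode0->new();
 $layers = decoded_layers($dec, $doc);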

SEE ALSO

DTA::TokWrap::Intro(3pm), dta-tokwrap.perl(1), ...

AUTHOR

Bryan Jurish <jurish@bbaw.de>

COPYRIGHT AND LICENSE

Copyright (C) 2014 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.