NAME

DTA::TokWrap::Processor::tcftokenize - DTA tokenizer wrappers: TCF text layer tokenization

SYNOPSIS

 use DTA::TokWrap::Processor::tcftokenize;
 
 $ttok = DTA::TokWrap::Processor::tcftokenize->new(%opts);
 $doc_or_undef = $ttok->tcftokenize($doc);

DESCRIPTION

DTA::TokWrap::Processor::tcftokenize provides an object-oriented DTA::TokWrap::Processor wrapper for tokenizing the TCF text layer with the selected tokenizer and encoding the result in the TCF tokens and sentences layers.

Constants

@ISA

DTA::TokWrap::Processor::tcftokenize inherits from DTA::TokWrap::Processor.

Constructors etc.

new
 $obj = $CLASS_OR_OBJECT->new(%args);

Constructor.

defaults
 %defaults = $CLASS->defaults();

Static class-dependent defaults.

Methods

tcftokenize
 $doc_or_undef = $CLASS_OR_OBJECT->tcftokenize($doc);

Tokenizes the text layer extracted from a TCF document and encodes the result in new TCF tokens and sentences layers.

Relevant %$doc keys:

 tcfdoc    => $tcfdoc,       ##-- (input,output) TCF input document with <text> layer
 ##
 txtfile   => $txtfile,      ##-- (temp,output) text file used for TCF extraction
 tokdata0  => $tokdata0,     ##-- (temp,output) raw tokenization data
 tokdata1  => $tokdata1,     ##-- (temp,output) tweaked tokenization data
 ##
 tcftokdoc => $tcftokdoc,    ##-- (output) output TCF file with <sentences>,<tokens> layers (==$tcfdoc)
 tcftokenize_stamp0 => $f,   ##-- (output) timestamp of operation begin
 tcftokenize_stamp  => $f,   ##-- (output) timestamp of operation end
 tcftokdoc_stamp    => $f,   ##-- (output) timestamp of operation end
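
The calling convention and the stamp keys listed above can be sketched as follows. This is a minimal illustration, not the module's implementation: My::MockTcfTokenizer is a hypothetical stand-in so that the sketch runs without DTA::TokWrap installed, the helper elapsed_or_undef() is likewise hypothetical, and the mock's failure condition (a missing tcfdoc key) is an assumption made only to show the undef check.

```perl
use strict;
use warnings;
use Time::HiRes ();

##-- HYPOTHETICAL stand-in for DTA::TokWrap::Processor::tcftokenize;
##   it only mimics the documented %$doc keys and return convention
package My::MockTcfTokenizer;

sub new {
  my ($that,%args) = @_;
  return bless({%args}, ref($that)||$that);
}

sub tcftokenize {
  my ($ttok,$doc) = @_;
  return undef if (!defined($doc->{tcfdoc}));   ##-- assumed failure case
  $doc->{tcftokenize_stamp0} = Time::HiRes::time();
  $doc->{tcftokdoc} = $doc->{tcfdoc};           ##-- output document is the input document
  $doc->{tcftokenize_stamp} = $doc->{tcftokdoc_stamp} = Time::HiRes::time();
  return $doc;
}

package main;

##-- HYPOTHETICAL helper: run the processor, check for failure,
##   and return the elapsed time computed from the stamp keys
sub elapsed_or_undef {
  my ($ttok,$doc) = @_;
  my $out = $ttok->tcftokenize($doc);
  return undef if (!defined($out));
  return $out->{tcftokenize_stamp} - $out->{tcftokenize_stamp0};
}

my $ttok    = My::MockTcfTokenizer->new();
my $doc     = { tcfdoc => '<placeholder TCF document>' };
my $elapsed = elapsed_or_undef($ttok,$doc);
printf("tcftokenize took %.6f seconds\n", $elapsed) if (defined($elapsed));
```

With the real processor, the same pattern would presumably apply to an object returned by DTA::TokWrap::Processor::tcftokenize->new(%opts) and a document carrying a tcfdoc key: test the return value for undef before reading any of the output keys.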

SEE ALSO

DTA::TokWrap::Intro(3pm), dta-tokwrap.perl(1), ...

AUTHOR

Bryan Jurish <jurish@bbaw.de>

COPYRIGHT AND LICENSE

Copyright (C) 2014-2018 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.