NAME

DTA::TokWrap::Processor::tokenize1 - DTA tokenizer wrappers: tokenizer post-processing

SYNOPSIS

 use DTA::TokWrap::Processor::tokenize1;
 
 $tp = DTA::TokWrap::Processor::tokenize1->new(%args);
 $doc_or_undef = $tp->tokenize1($doc);

DESCRIPTION

DTA::TokWrap::Processor::tokenize1 provides an object-oriented DTA::TokWrap::Processor wrapper for post-processing of raw tokenizer output for DTA::TokWrap::Document objects.

Most users should use the high-level DTA::TokWrap wrapper class instead of using this module directly.

Constants

@ISA

DTA::TokWrap::Processor::tokenize1 inherits from DTA::TokWrap::Processor.

Constructors etc.

new
 $tp = $CLASS_OR_OBJ->new(%args);

%args, %$tp:

 fixtok => $bool,  ##-- attempt to fix common tokenizer errors? (default=true)
 fixold => $bool,  ##-- attempt to fix unexpected and/or obsolete (tomata2) errors? (default=false)

defaults
 %defaults = CLASS->defaults();

Static class-dependent defaults.

Methods

tokenize1
 $doc_or_undef = $CLASS_OR_OBJECT->tokenize1($doc);

Post-processes the raw tokenizer output stored in the DTA::TokWrap::Document object $doc. Returns $doc on success, undef on failure.

Relevant %$doc keys:

  tokdata0 => $tokdata0,  ##-- (input)  raw tokenizer output (string)
  tokdata1 => $tokdata1,  ##-- (output) post-processed tokenizer output (string)
  tokenize1_stamp => $f,  ##-- (output) timestamp of operation end
  tokdata1_stamp  => $f,  ##-- (output) timestamp of operation end

May implicitly call $doc->tokenize() if the raw tokenizer output is not yet available (but shouldn't).
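
The key contract listed above can be illustrated with a small sketch. The helper tokenize1_status() below is hypothetical and not part of DTA::TokWrap; it only inspects the documented %$doc keys on a plain hash reference, so it runs without the module installed. The hashes $before and $after are stand-ins for a document, and their contents are invented for illustration. The commented-out calls at the end assume that $doc is a DTA::TokWrap::Document whose tokdata0 key has already been populated.

```perl
use strict;
use warnings;

## Hypothetical helper (not part of DTA::TokWrap): reports which of the
## documented tokenize1() keys are defined in a document-like hash ref.
sub tokenize1_status {
  my ($doc) = @_;
  my $stamped = defined($doc->{tokenize1_stamp}) && defined($doc->{tokdata1_stamp});
  return {
    has_input  => (defined($doc->{tokdata0}) ? 1 : 0),  ##-- raw tokenizer output present?
    has_output => (defined($doc->{tokdata1}) ? 1 : 0),  ##-- post-processed output present?
    stamped    => ($stamped ? 1 : 0),                   ##-- both timestamps present?
  };
}

## Plain hashes standing in for a document before and after post-processing
## (illustrative values only).
my $before = { tokdata0 => "raw tokenizer output\n" };
my $after  = {
  %$before,
  tokdata1        => "post-processed tokenizer output\n",
  tokenize1_stamp => time(),
  tokdata1_stamp  => time(),
};

foreach my $pair (['before', $before], ['after', $after]) {
  my ($label, $doc) = @$pair;
  my $st = tokenize1_status($doc);
  printf "%-6s: input=%d output=%d stamped=%d\n",
    $label, @$st{qw(has_input has_output stamped)};
}

## With the real module (assumption: $doc is a DTA::TokWrap::Document
## with tokdata0 already set):
##   my $tp = DTA::TokWrap::Processor::tokenize1->new(fixtok => 1);
##   $tp->tokenize1($doc) or die "tokenize1() failed";
```

Since tokenize1() is documented to return either the document or undef, checking the return value as in the commented-out lines is enough to detect failure; the helper is only useful for inspecting which keys were filled in.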

SEE ALSO

DTA::TokWrap::Intro(3pm), dta-tokwrap.perl(1), ...

AUTHOR

Bryan Jurish <jurish@bbaw.de>

COPYRIGHT AND LICENSE

Copyright (C) 2009-2018 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.