NAME

DTA::TokWrap::Processor::tok2xml - DTA tokenizer wrappers: t -> t.xml

SYNOPSIS

 use DTA::TokWrap::Processor::tok2xml;
 
 $t2x = DTA::TokWrap::Processor::tok2xml->new(%opts);
 $doc_or_undef = $t2x->tok2xml($doc);

DESCRIPTION

DTA::TokWrap::Processor::tok2xml provides an object-oriented DTA::TokWrap::Processor wrapper for converting "raw" CSV-format (.t) low-level tokenizer output to a "master" tokenized XML (.t.xml) format, for use with DTA::TokWrap::Document objects.

Most users should use the high-level DTA::TokWrap wrapper class instead of using this module directly.

Constants

@ISA

DTA::TokWrap::Processor::tok2xml inherits from DTA::TokWrap::Processor.

$NOC

Integer indicating a missing or implicit 'c' record; should be equivalent in value to the C code:

 unsigned int NOC = ((unsigned int)-1)

for 32-bit "unsigned int"s, i.e. 4294967295 (0xffffffff).
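
As a rough illustration of the value described above, the following sketch derives the same number in Perl by reinterpreting a signed 32-bit -1 as unsigned. The variable $noc and the helper is_noc() are hypothetical names used only for this example; they are not part of the module's interface.

```perl
use strict;
use warnings;

# Reinterpret a signed 32-bit -1 as an unsigned 32-bit integer,
# mirroring the C cast ((unsigned int)-1) for 32-bit ints.
my $noc = unpack('L', pack('l', -1));

# Hypothetical helper (not part of the module):
# true if $c carries the "missing or implicit 'c' record" marker.
sub is_noc {
  my ($c) = @_;
  return $c == $noc;
}

printf "NOC = %u (0x%x)\n", $noc, $noc;
```

Comparing against a value computed this way avoids hard-coding the literal, though the module's own $NOC remains the authoritative constant.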

Constructors etc.

new
 $t2x = $CLASS_OR_OBJECT->new(%args);

Constructor.

%args, %$t2x:

  txmlsort => $bool,             ##-- if true (default), sort output .t.xml data as close to input document-order as __paragraph__ boundaries will allow
  txmlsort_bysentence => $bool,  ##-- use old sentence-level sort (default: false)
  txmlextids => $bool,           ##-- if true, attempt to parse "<a>$SID/$WID</a>" pseudo-analyses as IDs (default:true; uses regex hack)
  t2x => $path_to_dtatw_tok2xml, ##-- default: search
  b2xb => $path_to_dtatw_b2xb,   ##-- default: search; 'off' to disable
  inplace => $bool,              ##-- prefer in-place programs for search?

You probably should NOT change any of the default output document structure options (unless this is the final module in your processing pipeline), since their values have ramifications beyond this module.

defaults
 %defaults = CLASS->defaults();

Static class-dependent defaults.

Methods: tok2xml (bxdata, tokdata1, cxdata) => xtokdata

tok2xml
 $doc_or_undef = $CLASS_OR_OBJECT->tok2xml($doc);
 $doc_or_undef = $CLASS_OR_OBJECT->tok2xml($doc,%opts);

Converts "raw" CSV-format (.t) low-level tokenizer output to a "master" tokenized XML (.t.xml) format in the DTA::TokWrap::Document object $doc. If specified, %opts override $CLASS_OR_OBJECT sorting and parsing defaults.
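
The override behaviour described above can be pictured as a plain hash merge in which per-call options win over the settings stored in the object. This is only a sketch of the idea under that assumption; effective_opts() is a hypothetical helper, and the module's actual implementation may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: later keys win, so per-call %opts
# override whatever is stored in the object hash.
sub effective_opts {
  my ($obj, %opts) = @_;
  return (%$obj, %opts);
}

# Stand-in for a processor object, using the documented option names
# and their documented default values.
my $t2x_like = { txmlsort => 1, txmlsort_bysentence => 0, txmlextids => 1 };

# Disable external-ID parsing for this one call only.
my %eff = effective_opts($t2x_like, txmlextids => 0);
```

With this model the stored object settings are left unchanged; only the merged copy used for the call differs.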

Relevant %$doc keys:

 bxdata        => \@bxdata,   ##-- (input) block index data
 $tokfile_key  => $tokfile,   ##-- (input) tokenizer output filename (default='tokfile1')
 cxdata        => \@cxchrs,   ##-- (input) character index data (array of arrays)
 cxfile        => $cxfile,    ##-- (input) character index file
 $xtokdata_key => $xtokdata,  ##-- (output) tokenizer output as XML (default='xtokdata')
 nchrs         => $nchrs,     ##-- (output) number of character index records
 ntoks         => $ntoks,     ##-- (output) number of tokens parsed
 ##
 tok2xml_stamp0 => $f,   ##-- (output) timestamp of operation begin
 tok2xml_stamp  => $f,   ##-- (output) timestamp of operation end
 xtokdata_stamp => $f,   ##-- (output) timestamp of operation end
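
A minimal sketch of how a caller might use the keys listed above after a conversion. It assumes that an undefined return value signals failure (as the name $doc_or_undef suggests) and that the timestamps are numeric seconds; run_tok2xml() is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: run the conversion and summarize the documented
# output keys. Returns nothing if tok2xml() returned undef.
sub run_tok2xml {
  my ($t2x, $doc, %opts) = @_;
  my $rc = $t2x->tok2xml($doc, %opts);
  return if !defined $rc;
  return {
    ntoks   => $doc->{ntoks},   # number of tokens parsed
    nchrs   => $doc->{nchrs},   # number of character index records
    # assumes both stamps are numeric seconds
    elapsed => $doc->{tok2xml_stamp} - $doc->{tok2xml_stamp0},
  };
}

# Usage, given an existing processor $t2x and document $doc:
#   my $summary = run_tok2xml($t2x, $doc)
#     or die "tok2xml failed";
#   printf "%d tokens, %d chars, %.3fs\n", @$summary{qw(ntoks nchrs elapsed)};
```

The helper takes the processor as an argument, so it works with any object that provides a tok2xml() method.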

SEE ALSO

DTA::TokWrap::Intro(3pm), dta-tokwrap.perl(1), ...

AUTHOR

Bryan Jurish <jurish@bbaw.de>

COPYRIGHT AND LICENSE

Copyright (C) 2009-2018 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.