DTA::TokWrap::Processor::mkbx - DTA tokenizer wrappers: (bx0doc,tx) -> bxdata
use DTA::TokWrap::Processor::mkbx;
$mbx = DTA::TokWrap::Processor::mkbx->new(%opts);
$doc_or_undef = $mbx->mkbx($doc);
DTA::TokWrap::Processor::mkbx provides an object-oriented DTA::TokWrap::Processor wrapper for the creation of in-memory serialized text-block-indices.
Most users should use the high-level DTA::TokWrap wrapper class instead of using this module directly.
DTA::TokWrap::Processor::mkbx inherits from DTA::TokWrap::Processor.
$obj = $CLASS_OR_OBJECT->new(%args);
Constructor.
%args, %$obj:
##-- Block-sorting: hints
wbStr => $wbStr, ##-- word-break hint text
sbStr => $sbStr, ##-- sentence-break hint text
sortkey_attr => $attr, ##-- sort-key attribute (default='dta.tw.key'; should agree with mkbx0)
##-- Block-sorting: low-level data
xp => $xml_parser, ##-- XML::Parser object for parsing $doc->{bx0doc}
%defaults = CLASS->defaults();
Static class-dependent defaults.
$mbx = $mbx->init();
Dynamic object-dependent defaults.
$xp = $mbx->initXmlParser();
Create & initialize $mbx->{xp}, an XML::Parser object used to parse $doc->{bx0doc}.
$doc_or_undef = $CLASS_OR_OBJECT->mkbx($doc);
Creates the serialized text-block-index $doc->{bxdata} for the DTA::TokWrap::Document object $doc.
Relevant %$doc keys:
bx0doc => $bx0doc, ##-- (input) preliminary block-index data (XML::LibXML::Document)
txfile => $txfile, ##-- (input) raw text index filename
bxdata => \@blocks, ##-- (output) serialized block index
##
mkbx_stamp0 => $f, ##-- (output) timestamp of operation begin
mkbx_stamp => $f, ##-- (output) timestamp of operation end
bxdata_stamp => $f, ##-- (output) timestamp of operation end
Block data: @{$doc->{bxdata}} = @blocks = ($blk0, ..., $blkN); %$blk =
key => $sortkey, ##-- (inherited) sort key
elt => $eltname, ##-- element name which created this block
xoff => $xoff, ##-- XML byte offset where this block run begins
xlen => $xlen, ##-- XML byte length of this block (0 for hints)
toff => $toff, ##-- raw-text (.tx) byte offset where this block run begins
tlen => $tlen, ##-- raw-text (.tx) byte length of this block (0 for hints)
otext => $otext, ##-- output text (.txt) for this block
otoff => $otoff, ##-- output text (.txt) byte offset where this block run begins
otlen => $otlen, ##-- output text (.txt) length (bytes)
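As an illustration of the record layout above, a single block can be pictured as a plain hash. Only the key names come from the list above; every value here is invented for the example and does not come from a real document.

```perl
use strict;
use warnings;

## Illustrative values only; key names follow the %$blk description above.
my $blk = {
  key   => 'k1',      ##-- sort key (hypothetical value)
  elt   => 'c',       ##-- element name which created this block
  xoff  => 120,       ##-- XML byte offset
  xlen  => 18,        ##-- XML byte length
  toff  => 42,        ##-- raw-text (.tx) byte offset
  tlen  => 5,         ##-- raw-text (.tx) byte length
  otext => 'Hello',   ##-- output text (.txt)
  otoff => 0,         ##-- output text (.txt) byte offset
  otlen => 5,         ##-- output text (.txt) byte length
};

## A block list as it might be stored in $doc->{bxdata}
my @blocks = ($blk);
printf "%s block, %d output byte(s)\n", $blocks[0]{elt}, $blocks[0]{otlen};
```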
\@blocks = $mbx->prune_empty_blocks(\@blocks);
\@blocks = $mbx->prune_empty_blocks();
Low-level utility.
Removes empty 'c'-type blocks from @blocks (default=$mbx->{blocks}).
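The pruning step can be sketched without the module as a simple filter. This is not the module's implementation: the function name prune_empty_blocks_sketch is hypothetical, and treating a block as "empty" when its tlen is 0 is an assumption.

```perl
use strict;
use warnings;

## Hypothetical stand-in for prune_empty_blocks(): drops 'c' blocks whose
## raw-text length is 0. The emptiness test (tlen == 0) is an assumption.
sub prune_empty_blocks_sketch {
  my ($blocks) = @_;
  @$blocks = grep { !($_->{elt} eq 'c' && !$_->{tlen}) } @$blocks;
  return $blocks;
}

my $blocks = [
  { elt=>'c', toff=>0, tlen=>5 },
  { elt=>'c', toff=>5, tlen=>0 },   ##-- empty 'c' block: removed
  { elt=>'s', toff=>5, tlen=>0 },   ##-- zero-length non-'c' block ('s' is a made-up name): kept
];
prune_empty_blocks_sketch($blocks);
print scalar(@$blocks), " blocks remain\n";
```

The list is modified in place and the same array reference is returned, mirroring the \@blocks return value documented above.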
\@blocks = $mbx->sort_blocks(\@blocks);
Low-level utility.
Sorts \@blocks (default=$mbx->{blocks}) using $mbx->{key2i}.
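A rough sketch of such a sort follows. It assumes that key2i maps each sort key to an integer rank and that ties are broken by XML offset; neither detail is stated above. The function name and the key values are hypothetical.

```perl
use strict;
use warnings;

## Assumption: key2i maps each sort key to an integer rank.
my %key2i = ('main' => 0, 'note' => 1);

## Hypothetical stand-in for sort_blocks().
sub sort_blocks_sketch {
  my ($blocks, $key2i) = @_;
  @$blocks = sort {
    $key2i->{$a->{key}} <=> $key2i->{$b->{key}}   ##-- primary: key rank
      || $a->{xoff} <=> $b->{xoff}                ##-- secondary (assumed): XML offset
  } @$blocks;
  return $blocks;
}

my $blocks = [
  { key=>'note', xoff=>10 },
  { key=>'main', xoff=>30 },
  { key=>'main', xoff=>0  },
];
sort_blocks_sketch($blocks, \%key2i);
print join(' ', map { "$_->{key}\@$_->{xoff}" } @$blocks), "\n";
```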
\@blocks = $mbx->compute_block_text(\@blocks, \$txbuf);
\@blocks = $mbx->compute_block_text(\@blocks);
\@blocks = $mbx->compute_block_text();
Low-level utility.
Sets $blk->{otoff}, $blk->{otlen}, $blk->{otext} for each block $blk in @blocks (default=$mbx->{blocks}) by extracting raw-text (.tx) substrings from \$txbuf (default=$mbx->{txbufr}).
\@blocks should already have been sorted before this method is called.
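The extraction described above can be sketched as follows. The sketch assumes the text buffer is an undecoded byte string, so that substr() and length() count bytes, and it ignores any hint text (wbStr, sbStr) that the real method may insert. The function name compute_block_text_sketch is hypothetical.

```perl
use strict;
use warnings;

## Hypothetical stand-in for compute_block_text(): copies each block's
## .tx substring into otext and assigns running output offsets.
sub compute_block_text_sketch {
  my ($blocks, $txbufr) = @_;
  my $otoff = 0;
  foreach my $blk (@$blocks) {
    $blk->{otext} = substr($$txbufr, $blk->{toff}, $blk->{tlen});
    $blk->{otoff} = $otoff;
    $blk->{otlen} = length($blk->{otext});
    $otoff += $blk->{otlen};
  }
  return $blocks;
}

my $txbuf  = "worldhello";   ##-- raw bytes, as if read from the .tx file
my $blocks = [               ##-- already sorted into output order
  { toff=>5, tlen=>5 },
  { toff=>0, tlen=>5 },
];
compute_block_text_sketch($blocks, \$txbuf);
print join('', map { $_->{otext} } @$blocks), "\n";
```

Because the output offsets are assigned cumulatively in list order, calling this on an unsorted list would yield offsets that do not correspond to the intended output order, which is why sorting has to come first.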
DTA::TokWrap::Intro(3pm), dta-tokwrap.perl(1), ...
Bryan Jurish <jurish@bbaw.de>
Copyright (C) 2009-2018 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.