NAME

DTA::TokWrap::Processor::mkbx - DTA tokenizer wrappers: (bx0doc,tx) -> bxdata

SYNOPSIS

 use DTA::TokWrap::Processor::mkbx;
 
 $mbx = DTA::TokWrap::Processor::mkbx->new(%opts);
 $doc_or_undef = $mbx->mkbx($doc);

DESCRIPTION

DTA::TokWrap::Processor::mkbx provides an object-oriented DTA::TokWrap::Processor wrapper for the creation of in-memory serialized text-block-indices.

Most users should use the high-level DTA::TokWrap wrapper class instead of using this module directly.

Constants

@ISA

DTA::TokWrap::Processor::mkbx inherits from DTA::TokWrap::Processor.

Constructors etc.

new
 $obj = $CLASS_OR_OBJECT->new(%args);

Constructor.

%args, %$obj:

 ##-- Block-sorting: hints
 wbStr => $wbStr,                   ##-- word-break hint text
 sbStr => $sbStr,                   ##-- sentence-break hint text
 sortkey_attr => $attr,             ##-- sort-key attribute (default='dta.tw.key'; should agree with mkbx0)
 
 ##-- Block-sorting: low-level data
 xp    => $xml_parser,              ##-- XML::Parser object for parsing $doc->{bx0doc}

defaults
 %defaults = CLASS->defaults();

Static class-dependent defaults.

init
 $mbx = $mbx->init();

Dynamic object-dependent defaults.

initXmlParser
 $xp = $mbx->initXmlParser();

Creates and initializes $mbx->{xp}, an XML::Parser object used to parse $doc->{bx0doc}.

Methods: mkbx (bx0doc, txfile) => bxdata

mkbx
 $doc_or_undef = $CLASS_OR_OBJECT->mkbx($doc);

Creates the serialized text-block-index $doc->{bxdata} for the DTA::TokWrap::Document object $doc.

Relevant %$doc keys:

 bx0doc  => $bx0doc,  ##-- (input) preliminary block-index data (XML::LibXML::Document)
 txfile  => $txfile,  ##-- (input) raw text index filename
 bxdata  => \@blocks, ##-- (output) serialized block index
 ##
 mkbx_stamp0 => $f,   ##-- (output) timestamp of operation begin
 mkbx_stamp  => $f,   ##-- (output) timestamp of operation end
 bxdata_stamp => $f,  ##-- (output) timestamp of operation end

Block data: @{$doc->{bxdata}} = @blocks = ($blk0, ..., $blkN); %$blk =

 key    => $sortkey, ##-- (inherited) sort key
 elt    => $eltname, ##-- element name which created this block
 xoff   => $xoff,    ##-- XML byte offset where this block run begins
 xlen   => $xlen,    ##-- XML byte length of this block (0 for hints)
 toff   => $toff,    ##-- raw-text (.tx) byte offset where this block run begins
 tlen   => $tlen,    ##-- raw-text (.tx) byte length of this block (0 for hints)
 otext  => $otext,   ##-- output text (.txt) for this block
 otoff  => $otoff,   ##-- output text (.txt) byte offset where this block run begins
 otlen  => $otlen,   ##-- output text (.txt) length (bytes)

prune_empty_blocks
 \@blocks = $mbx->prune_empty_blocks(\@blocks);
 \@blocks = $mbx->prune_empty_blocks();

Low-level utility.

Removes empty 'c'-type blocks from @blocks (default=$mbx->{blocks}).
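
As a rough illustration, the pruning step can be pictured as a grep over the block list. The sketch below is standalone and does not call DTA::TokWrap. The helper name prune_empty_c and the test for "empty" (a tlen of 0) are assumptions made for this example, not taken from the module source.

```perl
use strict;
use warnings;

## Hypothetical stand-in for prune_empty_blocks(); assumes "empty" means tlen == 0.
sub prune_empty_c {
  my ($blocks) = @_;
  ## Keep every block except 'c'-type blocks with no raw-text content.
  @$blocks = grep { !($_->{elt} eq 'c' && !$_->{tlen}) } @$blocks;
  return $blocks;
}

my @blocks = (
  { key=>'k0', elt=>'p', toff=>0, tlen=>5 },
  { key=>'k0', elt=>'c', toff=>5, tlen=>0 },  ##-- empty 'c' block: dropped
  { key=>'k0', elt=>'c', toff=>5, tlen=>3 },  ##-- non-empty 'c' block: kept
);
my $pruned = prune_empty_c(\@blocks);
print join(' ', map { "$_->{elt}:$_->{tlen}" } @$pruned), "\n";
```

Like the documented method, the sketch modifies the list in place and returns a reference to it, so the call can be chained with later steps.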

sort_blocks
 \@blocks = $mbx->sort_blocks(\@blocks);

Low-level utility.

Sorts \@blocks (default=$mbx->{blocks}) using $mbx->{key2i}.
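
The following standalone sketch shows one way a key-to-index map could drive such a sort. It assumes that key2i maps each sort key to an integer rank and that ties are broken by XML byte offset. Both the layout of %key2i and the tie-break rule are assumptions for illustration and may differ from the actual implementation.

```perl
use strict;
use warnings;

## Hypothetical key-to-rank map; the real $mbx->{key2i} layout is assumed here.
my %key2i = ('front' => 0, 'body' => 1, 'note' => 2);

my @blocks = (
  { key=>'note',  elt=>'note', xoff=>40 },
  { key=>'body',  elt=>'p',    xoff=>90 },
  { key=>'front', elt=>'head', xoff=>10 },
  { key=>'body',  elt=>'p',    xoff=>60 },
);

## Sort by key rank first; fall back to XML byte offset (an assumed tie-break).
my @sorted = sort {
  $key2i{$a->{key}} <=> $key2i{$b->{key}} || $a->{xoff} <=> $b->{xoff}
} @blocks;

print join(' ', map { "$_->{key}\@$_->{xoff}" } @sorted), "\n";
```

An explicit tie-break keeps the result deterministic without relying on the stability of Perl's built-in sort.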

compute_block_text
 \@blocks = $mbx->compute_block_text(\@blocks, \$txbuf);
 \@blocks = $mbx->compute_block_text(\@blocks);
 \@blocks = $mbx->compute_block_text();

Low-level utility.

Sets $blk->{otoff}, $blk->{otlen}, $blk->{otext} for each block $blk in @blocks (default=$mbx->{blocks}) by extracting raw-text (.tx) substrings from \$txbuf (default=$mbx->{txbufr}).

\@blocks should already have been sorted before this method is called.
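
A minimal standalone sketch of this bookkeeping follows. It assumes that hint blocks (tlen of 0) contribute the configured hint text and that all other blocks contribute the corresponding .tx substring. The element name 'w' for a word-break hint, the sample buffer, and the value of $wbStr are hypothetical.

```perl
use strict;
use warnings;

my $txbuf = "Hello world";          ##-- stand-in for the raw-text (.tx) buffer
my $wbStr = "\n";                   ##-- hypothetical word-break hint text

## Already-sorted blocks; the 'w' hint block has tlen == 0.
my @blocks = (
  { elt=>'p', toff=>6, tlen=>5 },   ##-- "world"
  { elt=>'w', toff=>6, tlen=>0 },   ##-- hint
  { elt=>'p', toff=>0, tlen=>5 },   ##-- "Hello"
);

my $otoff = 0;
foreach my $blk (@blocks) {
  ## Assumed rule: hints contribute the hint text, other blocks a .tx substring.
  $blk->{otext} = $blk->{elt} eq 'w'
    ? $wbStr
    : substr($txbuf, $blk->{toff}, $blk->{tlen});
  $blk->{otoff} = $otoff;                  ##-- running offset into the output text
  $blk->{otlen} = length($blk->{otext});   ##-- byte length, given a byte-string buffer
  $otoff += $blk->{otlen};
}

print join('', map { $_->{otext} } @blocks), "\n";
```

Because otoff is a running total over the blocks in list order, the output offsets only make sense once the blocks are in their final sorted order.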

SEE ALSO

DTA::TokWrap::Intro(3pm), dta-tokwrap.perl(1), ...

AUTHOR

Bryan Jurish <jurish@bbaw.de>

COPYRIGHT AND LICENSE

Copyright (C) 2009-2018 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.