DTA::TokWrap::Processor::mkbx0 - DTA tokenizer wrappers: sxfile -> bx0doc
use DTA::TokWrap::Processor::mkbx0;
$mbx0 = DTA::TokWrap::Processor::mkbx0->new(%opts);
$doc_or_undef = $mbx0->mkbx0($doc);
##-- debugging
$mbx0_or_undef = $mbx0->ensure_stylesheets();
$mbx0->dump_chain_stylesheet($filename_or_fh);
$mbx0->dump_hint_stylesheet($filename_or_fh);
$mbx0->dump_sort_stylesheet($filename_or_fh);
DTA::TokWrap::Processor::mkbx0 provides an object-oriented DTA::TokWrap::Processor wrapper for hint insertion and serialization sort-key generation on a text-free "structure index" (.sx) XML file.
Most users should use the high-level DTA::TokWrap wrapper class instead of using this module directly.
DTA::TokWrap::Processor::mkbx0 inherits from DTA::TokWrap::Processor.
$mbx0 = $CLASS_OR_OBJ->new(%opts)
Constructor.
%opts, %$mbx0:
##-- Programs
rmns => $path_to_xml_rm_namespaces, ##-- default: search
inplace => $bool, ##-- prefer in-place programs for search?
auto_xmlid => $bool, ##-- if true (default), @id attributes will be mapped to @xml:id
auto_prevnext => $bool, ##-- if true (default), @prev|@next chains will be auto-sanitized
##
##-- Stylesheet: chain-serialization
chain_stylestr => $stylestr, ##-- xsl stylesheet string for chain-serialization
chain_stylesheet => $stylesheet, ##-- compiled xsl stylesheet for chain-serialization
##
##-- Stylesheet: insert-hints (<seg> elements and their children are handled implicitly)
hint_sb_xpaths => \@xpaths, ##-- add sentence-break hint (<s/>) for @xpath element open & close
hint_wb_xpaths => \@xpaths, ##-- add word-break hint (<w/>) for @xpath element open & close
##
hint_stylestr => $stylestr, ##-- xsl stylesheet string
hint_stylesheet => $stylesheet, ##-- compiled xsl stylesheet
##
##-- Stylesheet: mark-sortkeys (<seg> elements and their children are handled implicitly)
sortkey_attr => $attr, ##-- sort-key attribute (default: 'dta.tw.key')
sort_ignore_xpaths => \@xpaths, ##-- ignore these xpaths
sort_addkey_xpaths => \@xpaths, ##-- add new sort key for @xpaths
##
sort_stylestr => $stylestr, ##-- xsl stylesheet string
sort_stylesheet => $stylesheet, ##-- compiled xsl stylesheet
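The options above might be collected and passed to the constructor roughly as follows. This is only a sketch: the helper name my_mkbx0_opts and the XPath values are hypothetical placeholders chosen for illustration (they are not the module's shipped defaults), and construction is attempted only if the module can actually be loaded.

```perl
use strict;
use warnings;

## Hypothetical helper: collects constructor options in one place.
## The XPath values are illustrative assumptions, not the module's defaults.
sub my_mkbx0_opts {
  return (
    auto_xmlid     => 1,              ##-- map @id to @xml:id (documented default)
    auto_prevnext  => 1,              ##-- sanitize @prev|@next chains (documented default)
    hint_sb_xpaths => ['head', 'p'],  ##-- assumed: sentence-break hints around these elements
    hint_wb_xpaths => ['hi'],         ##-- assumed: word-break hints around these elements
    sortkey_attr   => 'dta.tw.key',   ##-- documented default sort-key attribute
  );
}

my %opts = my_mkbx0_opts();

## Construct the processor only if the module is installed;
## $mbx0 stays undef otherwise.
my $mbx0 = eval {
  require DTA::TokWrap::Processor::mkbx0;
  DTA::TokWrap::Processor::mkbx0->new(%opts);
};
```

Keeping the options in a single helper makes it easy to reuse the same configuration across several processor objects.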
%defaults = CLASS->defaults();
Returns static class-dependent defaults.
$mbx0 = $mbx0->init();
Initializes dynamic object-dependent defaults and returns the object.
$mbx0_or_undef = $mbx0->ensure_stylesheets();
Ensures that required XSL stylesheets have been compiled.
$xsl_str = $mbx0->hint_stylestr();
Returns XSL stylesheet string for the 'insert-hints' transformation, which is responsible for inserting sentence- and token-break hints into the input *.sx document.
$xsl_str = $mbx0->sort_stylestr();
Returns XSL stylesheet string for the 'generate-sort-keys' transformation, which is responsible for inserting top-level serialization-segment keys into the input *.sx document.
$mbx0->dump_chain_stylesheet($filename_or_fh);
Dumps the generated 'serialize-chains' stylesheet to $filename_or_fh.
$mbx0->dump_hint_stylesheet($filename_or_fh);
Dumps the generated 'insert-hints' stylesheet to $filename_or_fh.
$mbx0->dump_sort_stylesheet($filename_or_fh);
Dumps the generated 'generate-sortkeys' stylesheet to $filename_or_fh.
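For debugging, the three dump methods could be combined into one call, along the lines of the sketch below. The helper name dump_mkbx0_stylesheets and the output file names are assumptions made for this example; only ensure_stylesheets() and the dump_*_stylesheet() methods come from the interface described above.

```perl
use strict;
use warnings;
use File::Spec;

## Hypothetical helper: write all three generated stylesheets into $dir.
## The output file names are assumptions chosen for this example.
sub dump_mkbx0_stylesheets {
  my ($mbx0, $dir) = @_;

  ## ensure_stylesheets() returns the object or undef (see signature above)
  $mbx0->ensure_stylesheets()
    or die "could not compile XSL stylesheets\n";

  my %files = (
    chain => File::Spec->catfile($dir, 'mkbx0_chain.xsl'),
    hint  => File::Spec->catfile($dir, 'mkbx0_hint.xsl'),
    sort  => File::Spec->catfile($dir, 'mkbx0_sort.xsl'),
  );

  $mbx0->dump_chain_stylesheet($files{chain});
  $mbx0->dump_hint_stylesheet($files{hint});
  $mbx0->dump_sort_stylesheet($files{sort});

  return \%files;  ##-- map: stylesheet kind => file written
}
```

A caller would pass an existing processor object and a writable directory, for example dump_mkbx0_stylesheets($mbx0, '.').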
$doc_or_undef = $CLASS_OR_OBJECT->mkbx0($doc);
Applies the XSL pipeline for hint insertion and sort-key generation to the "structure index" (*.sx) document of the DTA::TokWrap::Document object $doc.
Relevant %$doc keys:
sxfile => $sxfile, ##-- (input) structure index filename
bx0doc => $bx0doc, ##-- (output) preliminary block-index data (XML::LibXML::Document)
##
mkbx0_stamp0 => $f, ##-- (output) timestamp of operation begin
mkbx0_stamp => $f, ##-- (output) timestamp of operation end
bx0doc_stamp => $f, ##-- (output) timestamp of operation end
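A caller might check the result of mkbx0() and inspect the output keys listed above as in the following sketch. The helper name run_mkbx0 is hypothetical, and the elapsed-time computation assumes that the two stamps are numeric timestamps in the same unit, which the key descriptions suggest but do not state explicitly.

```perl
use strict;
use warnings;

## Hypothetical helper: run mkbx0() on $doc and summarize the documented
## output keys. $mbx0 is a processor object; $doc is a
## DTA::TokWrap::Document object (a hash reference with the keys above).
sub run_mkbx0 {
  my ($mbx0, $doc) = @_;

  ## mkbx0() returns the document on success, undef on failure
  my $rc = $mbx0->mkbx0($doc);
  return undef if !defined $rc;

  ## Assumption: both stamps are numeric timestamps in the same unit
  my $elapsed;
  if (defined $doc->{mkbx0_stamp0} && defined $doc->{mkbx0_stamp}) {
    $elapsed = $doc->{mkbx0_stamp} - $doc->{mkbx0_stamp0};
  }

  return {
    have_bx0doc => (defined $doc->{bx0doc} ? 1 : 0),
    elapsed     => $elapsed,
  };
}
```

Returning undef on failure mirrors the $doc_or_undef convention of mkbx0() itself, so the helper can be used in the same "or die" style.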
DTA::TokWrap::Intro(3pm), dta-tokwrap.perl(1), ...
Bryan Jurish <jurish@bbaw.de>
Copyright (C) 2009-2018 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.