NAME

DTA::CAB::Chain::DTA - Deutsches Textarchiv canonicalization chain class

SYNOPSIS

 use DTA::CAB::Chain::DTA;
 
 ##========================================================================
 ## Methods
 
 $obj = CLASS_OR_OBJ->new(%args);
 $ach = $ach->setupChains();
 $bool = $ach->ensureLoaded();
 $bool = $anl->doAnalyze(\%opts, $name);
 $doc = $ach->analyzeClean($doc,\%opts);
 

DESCRIPTION

DTA::CAB::Chain::DTA is the DTA::CAB::Analyzer subclass implementing the robust orthographic canonicalization cascade used in the Deutsches Textarchiv project. This class inherits from DTA::CAB::Chain::Multi. See the "setupChains" method for a list of supported sub-chains and the corresponding analyzers.

Methods

new
 $obj = CLASS_OR_OBJ->new(%args);

%$obj, %args:

 ##-- paranoia
 autoClean => 0,  ##-- always run 'clean' analyzer regardless of options; checked in both doAnalyze(), analyzeClean()
 defaultChain => 'default',
 ##
 ##-- overrides
 chains => undef, ##-- see setupChains() method
 chain => undef, ##-- see setupChains() method

Additionally, the following sub-analyzers are defined as fields of %$obj:

tokpp

Token preprocessor, a DTA::CAB::Analyzer::TokPP object.

xlit

Transliterator, a DTA::CAB::Analyzer::Unicruft object.

lts

Phonetizer (Letter-to-Sound mapper), a DTA::CAB::Analyzer::LTS object.

morph

Morphological analyzer (TAGH), a DTA::CAB::Analyzer::Morph object.

mlatin

Latin pseudo-morphology, a DTA::CAB::Analyzer::Morph::Latin object.

msafe

Morphological security heuristics, a DTA::CAB::Analyzer::MorphSafe object.

rw

Weighted finite-state rewrite cascade, a DTA::CAB::Analyzer::Rewrite object.

Date-optimized variants rw.1600-1700, rw.1700-1800, and rw.1800-1900 may also be included.

rwsub

Post-processing for rewrite cascade, a DTA::CAB::Analyzer::RewriteSub object.

eqphox

Intensional (TAGH-based) phonetic equivalence expander, a DTA::CAB::Analyzer::EqPhoX object.

eqpho

Extensional (corpus-based) phonetic equivalence expander, a DTA::CAB::Analyzer::EqPho object.

eqrw

Extensional rewrite-equivalence expander, a DTA::CAB::Analyzer::EqRW object.

dmoot

Token-level dynamic HMM conflation disambiguator, a DTA::CAB::Analyzer::Moot::DynLex object.

dmootsub

Post-processing for "dmoot" analyzer, a DTA::CAB::Analyzer::DmootSub object.

moot

HMM part-of-speech tagger, a DTA::CAB::Analyzer::Moot object.

mootsub

Post-processing for "moot" tagger, a DTA::CAB::Analyzer::MootSub object.

eqlemma

Extensional (corpus-based) lemma-equivalence class expander, a DTA::CAB::Analyzer::EqLemma object.

clean

Janitor (paranoid removal of internal temporary data), a DTA::CAB::Analyzer::DTAClean object.

setupChains
 $ach = $ach->setupChains();

Sets up the default named sub-chains in $ach->{chains}. Currently defines a singleton chain sub.NAME for each analyzer key in keys(%$ach), as well as the following non-trivial chains:

 'sub.expand'     =>[@$ach{qw(eqpho eqrw eqlemma)}],
 'sub.sent'       =>[@$ach{qw(dmoot  dmootsub moot  mootsub)}],
 'sub.sent1'      =>[@$ach{qw(dmoot1 dmootsub moot1 mootsub)}],
 'sub.gn'         =>[@$ach{qw(gn-syn gn-isa gn-asi)}],
 'sub.ot'         =>[@$ach{qw(ot-syn ot-isa ot-asi)}],
 ##
 'default.static' =>[@$ach{qw(static)}],
 'default.exlex'  =>[@$ach{qw(exlex)}],
 'default.tokpp'  =>[@$ach{qw(tokpp)}],
 'default.xlit'   =>[@$ach{qw(xlit)}],
 'default.lts'    =>[@$ach{qw(xlit lts)}],
 'default.eqphox' =>[@$ach{qw(tokpp xlit lts eqphox)}],
 'default.morph'  =>[@$ach{qw(tokpp xlit morph)}],
 'default.mlatin' =>[@$ach{qw(tokpp xlit       mlatin)}],
 'default.msafe'  =>[@$ach{qw(tokpp xlit morph mlatin msafe)}],
 'default.langid' =>[@$ach{qw(tokpp xlit morph mlatin msafe langid)}],
 'default.rw'     =>[@$ach{qw(tokpp xlit rw)}],
 'default.rw.safe'=>[@$ach{qw(tokpp xlit                         morph mlatin msafe langid rw)}],
 'default.dmoot'  =>[@$ach{qw(tokpp xlit              lts eqphox morph mlatin msafe langid rw        dmoot)}],
 'default.dmoot1' =>[@$ach{qw(tokpp xlit              lts eqphox morph mlatin msafe langid rw        dmoot1)}],
 'default.moot'   =>[@$ach{qw(tokpp xlit              lts eqphox morph mlatin msafe langid rw        dmoot  dmootsub moot)}],
 'default.moot1'  =>[@$ach{qw(tokpp xlit              lts eqphox morph mlatin msafe langid rw        dmoot1 dmootsub moot1)}],
 'default.lemma'  =>[@$ach{qw(tokpp xlit lts eqphox morph mlatin msafe langid rw        dmoot1 dmootsub moot  mootsub)}],
 'default.lemma1' =>[@$ach{qw(tokpp xlit lts eqphox morph mlatin msafe langid rw        dmoot1 dmootsub moot1 mootsub)}],
 'default.ner'    =>[@$ach{qw(tokpp xlit              lts eqphox morph mlatin msafe langid rw        dmoot  dmootsub moot mootsub ner)}],
 'default.base'   =>[@$ach{qw(static exlex tokpp xlit lts        morph mlatin msafe langid)}],
 'default.type'   =>[@$ach{qw(static exlex tokpp xlit lts        morph mlatin msafe langid rw rwsub)}],
 ##
 'expand.old'     =>[@$ach{qw(static exlex       xlit lts morph mlatin msafe rw       eqpho eqrw)}],
 'expand.ext'     =>[@$ach{qw(static exlex       xlit lts morph mlatin msafe rw       eqpho eqrw eqphox)}],
 'expand.all'     =>[@$ach{qw(static exlex       xlit lts morph mlatin msafe rw       eqpho eqrw eqphox dmoot1 dmootsub moot1 mootsub eqlemma)}],
 'expand.eqpho'   =>[@$ach{qw(static exlex       xlit lts                             eqpho)}],
 'expand.eqrw'    =>[@$ach{qw(static exlex       xlit lts morph mlatin msafe rw             eqrw)}],
 'expand.eqlemma' =>[@$ach{qw(static exlex       xlit lts morph mlatin msafe rw                  eqphox dmoot1 dmootsub moot1 mootsub eqlemma)}],
 'expand.gn-syn'  =>[@$ach{qw(static exlex       xlit lts morph mlatin msafe rw                  eqphox dmoot1 dmootsub moot1 mootsub gn-syn)}],
 'expand.gn-isa'  =>[@$ach{qw(static exlex       xlit lts morph mlatin msafe rw                  eqphox dmoot1 dmootsub moot1 mootsub gn-isa)}],
 'expand.gn-asi'  =>[@$ach{qw(static exlex       xlit lts morph mlatin msafe rw                  eqphox dmoot1 dmootsub moot1 mootsub gn-asi)}],
 'expand.gn'      =>[@$ach{qw(static exlex       xlit lts morph mlatin msafe rw                  eqphox dmoot1 dmootsub moot1 mootsub gn-syn gn-isa gn-asi)}],
 'expand.ot-syn'  =>[@$ach{qw(static exlex       xlit lts morph mlatin msafe rw                  eqphox dmoot1 dmootsub moot1 mootsub ot-syn)}],
 'expand.ot-isa'  =>[@$ach{qw(static exlex       xlit lts morph mlatin msafe rw                  eqphox dmoot1 dmootsub moot1 mootsub ot-isa)}],
 'expand.ot-asi'  =>[@$ach{qw(static exlex       xlit lts morph mlatin msafe rw                  eqphox dmoot1 dmootsub moot1 mootsub ot-asi)}],
 'expand.ot'      =>[@$ach{qw(static exlex       xlit lts morph mlatin msafe rw                  eqphox dmoot1 dmootsub moot1 mootsub ot-syn ot-isa ot-asi)}],
 ##
 'norm'           =>[@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw                  eqphox dmoot  dmootsub moot  mootsub)}],
 'norm1'          =>[@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw                  eqphox dmoot1 dmootsub moot1 mootsub)}],
 'ner'            =>[@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw                  eqphox dmoot  dmootsub moot  mootsub ner)}],
 'caberr'         =>[@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw                  eqphox dmoot  dmootsub moot  mootsub mapclass)}],
 'caberr1'        =>[@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw                  eqphox dmoot1 dmootsub moot1 mootsub mapclass)}],
 'all'            =>[@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw rwsub eqpho eqrw eqphox dmoot  dmootsub moot  mootsub eqlemma)}],
 'clean'          =>[@$ach{qw(clean)}],
 ##
 'null'           =>[$ach->{null}],

High-level date-optimized chains norm.RNG, norm1.RNG, lemma.RNG, lemma1.RNG, default.RNG, and expand.RNG are also defined using the date-optimized rewrite cascade rw.RNG in place of the default "generic" cascade rw for each range RNG in 1600-1700, 1700-1800, and 1800-1900.
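The chain definitions above are built from hash slices over the chain object. The following is a minimal, self-contained sketch of that idiom. It uses a plain hash of strings as a hypothetical stand-in for the real analyzer objects, and it assumes that a date-optimized chain is simply the generic key list with rw swapped for rw.RNG; the actual implementation may construct these chains differently.

```perl
use strict;
use warnings;

## Hypothetical stand-in for the chain object: each analyzer key maps to a
## plain string; in the real class the values would be analyzer objects.
my @norm_keys = qw(static exlex tokpp xlit lts morph mlatin msafe langid rw
                   eqphox dmoot dmootsub moot mootsub);
my @ranges    = qw(1600-1700 1700-1800 1800-1900);
my $ach       = { map { ($_ => "<$_>") } @norm_keys, map("rw.$_", @ranges) };

my %chains;

## Singleton chains: one 'sub.NAME' entry per analyzer key.
$chains{"sub.$_"} = [ $ach->{$_} ] foreach keys %$ach;

## A hash slice yields one value per key, in the order the keys are listed.
$chains{'norm'} = [ @$ach{@norm_keys} ];

## Assumed construction of date-optimized variants: the same key list with
## 'rw' swapped for 'rw.RNG'.
foreach my $rng (@ranges) {
  my @keys = map { $_ eq 'rw' ? "rw.$rng" : $_ } @norm_keys;
  $chains{"norm.$rng"} = [ @$ach{@keys} ];
}

## Slicing a key that is not present (here 'ner') yields undef, so a chain
## built this way may need filtering before use.
my @ner = grep { defined $_ } @$ach{ (@norm_keys, 'ner') };

print join(' ', @{ $chains{'norm.1700-1800'} }), "\n";
```

Because a slice silently returns undef for missing keys, a chain that names an analyzer which was never configured still gets defined; it just carries an undefined slot at that position.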

ensureLoaded
 $bool = $ach->ensureLoaded();

Ensures analysis data is loaded from default files. The override inherited from DTA::CAB::Chain::Multi calls ensureChain() before the inherited method. As a workaround, the chain-level sub-analyzers (rwsub, dmootsub) are copied only AFTER their own sub-analyzers have been loaded, and 'enabled' is set only at that point, if appropriate.

doAnalyze
 $bool = $anl->doAnalyze(\%opts, $name);

Alias for $anl->can("analyze${name}") && (!exists($opts{"doAnalyze${name}"}) || $opts{"doAnalyze${name}"}). The override in this class additionally checks the $anl->{autoClean} flag.
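The predicate can be illustrated with a small mock class. This is only a sketch: MockAnalyzer and doAnalyzeGeneric are hypothetical names, and the way autoClean forces the 'Clean' step on is an assumption based on the option comment in "new", not the class's actual code. The result is normalized to 1 or 0 here for readability.

```perl
use strict;
use warnings;

## Hypothetical mock analyzer; only the pieces the predicate needs.
package MockAnalyzer;
sub new { my ($class, %args) = @_; return bless { autoClean => 0, %args }, $class; }
sub analyzeTypes { }   ## present, so can('analyzeTypes') is true
sub analyzeClean { }   ## present, so can('analyzeClean') is true

## Generic predicate as described: the method must exist, and the option
## "doAnalyzeNAME" must be either absent or true.
sub doAnalyzeGeneric {
  my ($anl, $opts, $name) = @_;
  return ($anl->can("analyze${name}")
          && (!exists($opts->{"doAnalyze${name}"}) || $opts->{"doAnalyze${name}"})) ? 1 : 0;
}

## Assumed shape of the override: autoClean forces the 'Clean' step on,
## regardless of what the caller's options say.
sub doAnalyze {
  my ($anl, $opts, $name) = @_;
  return 1 if ($anl->{autoClean} && $name eq 'Clean');
  return $anl->doAnalyzeGeneric($opts, $name);
}

package main;

my $plain    = MockAnalyzer->new();
my $paranoid = MockAnalyzer->new(autoClean => 1);
my %opts     = (doAnalyzeClean => 0);

print "plain/Types:    ", $plain->doAnalyze(\%opts, 'Types'), "\n";
print "plain/Clean:    ", $plain->doAnalyze(\%opts, 'Clean'), "\n";
print "paranoid/Clean: ", $paranoid->doAnalyze(\%opts, 'Clean'), "\n";
```

In this sketch the caller's doAnalyzeClean => 0 disables cleanup for the plain object but is overridden for the object constructed with autoClean => 1.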

analyzeClean
 $doc = $ach->analyzeClean($doc,\%opts);

Cleans up any temporary data associated with $doc. The chain default calls $a->analyzeClean for each analyzer $a in the chain, then the superclass Analyzer->analyzeClean. The local override checks $ach->{autoClean}.

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2010-2019 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.24.1 or, at your option, any later version of Perl 5 you may have available.

SEE ALSO

dta-cab-analyze.perl(1), DTA::CAB::Chain::Multi(3pm), DTA::CAB::Chain(3pm), DTA::CAB::Analyzer(3pm), DTA::CAB(3pm), perl(1), ...
