DTA::CAB::Chain::DTA - Deutsches Textarchiv canonicalization chain class
use DTA::CAB::Chain::DTA;
##========================================================================
## Methods
$obj = CLASS_OR_OBJ->new(%args);
$ach = $ach->setupChains();
$bool = $ach->ensureLoaded();
$bool = $anl->doAnalyze(\%opts, $name);
$doc = $ach->analyzeClean($doc,\%opts);
DTA::CAB::Chain::DTA is the DTA::CAB::Analyzer subclass implementing the robust orthographic canonicalization cascade used in the Deutsches Textarchiv project. This class inherits from DTA::CAB::Chain::Multi. See the "setupChains" method for a list of supported sub-chains and the corresponding analyzers.
$obj = CLASS_OR_OBJ->new(%args);
%$obj, %args:
##-- paranoia
autoClean => 0, ##-- always run 'clean' analyzer regardless of options; checked in both doAnalyze(), analyzeClean()
defaultChain => 'default',
##
##-- overrides
chains => undef, ##-- see setupChains() method
chain => undef, ##-- see setupChains() method
Additionally, the following sub-analyzers are defined as fields of %$obj:
Token preprocessor, a DTA::CAB::Analyzer::TokPP object.
Transliterator, a DTA::CAB::Analyzer::Unicruft object.
Phonetizer (Letter-to-Sound mapper), a DTA::CAB::Analyzer::LTS object.
Morphological analyzer (TAGH), a DTA::CAB::Analyzer::Morph object.
Latin pseudo-morphology, a DTA::CAB::Analyzer::Morph::Latin object.
Morphological security heuristics, a DTA::CAB::Analyzer::MorphSafe object.
Weighted finite-state rewrite cascade, a DTA::CAB::Analyzer::Rewrite object.
Date-optimized variants rw.1600-1700, rw.1700-1800, and rw.1800-1900 may also be included.
Post-processing for rewrite cascade, a DTA::CAB::Analyzer::RewriteSub object.
Intensional (TAGH-based) phonetic equivalence expander, a DTA::CAB::Analyzer::EqPhoX object.
Extensional (corpus-based) phonetic equivalence expander, a DTA::CAB::Analyzer::EqPho object.
Extensional rewrite-equivalence expander, a DTA::CAB::Analyzer::EqRW object.
Token-level dynamic HMM conflation disambiguator, a DTA::CAB::Analyzer::Moot::DynLex object.
Post-processing for "dmoot" analyzer, a DTA::CAB::Analyzer::DmootSub object.
HMM part-of-speech tagger, a DTA::CAB::Analyzer::Moot object.
Post-processing for "moot" tagger, a DTA::CAB::Analyzer::MootSub object.
Extensional (corpus-based) lemma-equivalence class expander, a DTA::CAB::Analyzer::EqLemma object.
Janitor (paranoid removal of internal temporary data), a DTA::CAB::Analyzer::DTAClean object.
$ach = $ach->setupChains();
Sets up the default named sub-chains in $ach->{chains}. Currently defines a singleton chain sub.NAME for each analyzer key in keys(%$ach), as well as the following non-trivial chains:
'sub.expand' =>[@$ach{qw(eqpho eqrw eqlemma)}],
'sub.sent' =>[@$ach{qw(dmoot dmootsub moot mootsub)}],
'sub.sent1' =>[@$ach{qw(dmoot1 dmootsub moot1 mootsub)}],
'sub.gn' =>[@$ach{qw(gn-syn gn-isa gn-asi)}],
'sub.ot' =>[@$ach{qw(ot-syn ot-isa ot-asi)}],
##
'default.static' =>[@$ach{qw(static)}],
'default.exlex' =>[@$ach{qw(exlex)}],
'default.tokpp' =>[@$ach{qw(tokpp)}],
'default.xlit' =>[@$ach{qw(xlit)}],
'default.lts' =>[@$ach{qw(xlit lts)}],
'default.eqphox' =>[@$ach{qw(tokpp xlit lts eqphox)}],
'default.morph' =>[@$ach{qw(tokpp xlit morph)}],
'default.mlatin' =>[@$ach{qw(tokpp xlit mlatin)}],
'default.msafe' =>[@$ach{qw(tokpp xlit morph mlatin msafe)}],
'default.langid' =>[@$ach{qw(tokpp xlit morph mlatin msafe langid)}],
'default.rw' =>[@$ach{qw(tokpp xlit rw)}],
'default.rw.safe'=>[@$ach{qw(tokpp xlit morph mlatin msafe langid rw)}],
'default.dmoot' =>[@$ach{qw(tokpp xlit lts eqphox morph mlatin msafe langid rw dmoot)}],
'default.dmoot1' =>[@$ach{qw(tokpp xlit lts eqphox morph mlatin msafe langid rw dmoot1)}],
'default.moot' =>[@$ach{qw(tokpp xlit lts eqphox morph mlatin msafe langid rw dmoot dmootsub moot)}],
'default.moot1' =>[@$ach{qw(tokpp xlit lts eqphox morph mlatin msafe langid rw dmoot1 dmootsub moot1)}],
'default.lemma' =>[@$ach{qw(tokpp xlit lts eqphox morph mlatin msafe langid rw dmoot1 dmootsub moot mootsub)}],
'default.lemma1' =>[@$ach{qw(tokpp xlit lts eqphox morph mlatin msafe langid rw dmoot1 dmootsub moot1 mootsub)}],
'default.ner' =>[@$ach{qw(tokpp xlit lts eqphox morph mlatin msafe langid rw dmoot dmootsub moot mootsub ner)}],
'default.base' =>[@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid)}],
'default.type' =>[@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw rwsub)}],
##
'expand.old' =>[@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqpho eqrw)}],
'expand.ext' =>[@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqpho eqrw eqphox)}],
'expand.all' =>[@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqpho eqrw eqphox dmoot1 dmootsub moot1 mootsub eqlemma)}],
'expand.eqpho' =>[@$ach{qw(static exlex xlit lts eqpho)}],
'expand.eqrw' =>[@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqrw)}],
'expand.eqlemma' =>[@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqphox dmoot1 dmootsub moot1 mootsub eqlemma)}],
'expand.gn-syn' =>[@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqphox dmoot1 dmootsub moot1 mootsub gn-syn)}],
'expand.gn-isa' =>[@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqphox dmoot1 dmootsub moot1 mootsub gn-isa)}],
'expand.gn-asi' =>[@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqphox dmoot1 dmootsub moot1 mootsub gn-asi)}],
'expand.gn' =>[@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqphox dmoot1 dmootsub moot1 mootsub gn-syn gn-isa gn-asi)}],
'expand.ot-syn' =>[@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqphox dmoot1 dmootsub moot1 mootsub ot-syn)}],
'expand.ot-isa' =>[@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqphox dmoot1 dmootsub moot1 mootsub ot-isa)}],
'expand.ot-asi' =>[@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqphox dmoot1 dmootsub moot1 mootsub ot-asi)}],
'expand.ot' =>[@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqphox dmoot1 dmootsub moot1 mootsub ot-syn ot-isa ot-asi)}],
##
'norm' =>[@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw eqphox dmoot dmootsub moot mootsub)}],
'norm1' =>[@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw eqphox dmoot1 dmootsub moot1 mootsub)}],
'ner' =>[@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw eqphox dmoot dmootsub moot mootsub ner)}],
'caberr' =>[@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw eqphox dmoot dmootsub moot mootsub mapclass)}],
'caberr1' =>[@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw eqphox dmoot1 dmootsub moot1 mootsub mapclass)}],
'all' =>[@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw rwsub eqpho eqrw eqphox dmoot dmootsub moot mootsub eqlemma)}],
'clean' =>[@$ach{qw(clean)}],
##
'null' =>[$ach->{null}],
High-level date-optimized chains norm.RNG, norm1.RNG, lemma.RNG, lemma1.RNG, default.RNG, and expand.RNG are also defined using the date-optimized rewrite cascade rw.RNG in place of the default "generic" cascade rw for each range RNG in 1600-1700, 1700-1800, and 1800-1900.
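The chain table above is built from hash slices over the analyzer fields of $ach. The following is a minimal, self-contained sketch of that idiom, not the module's actual code: the analyzer "objects" are plain placeholder strings, and the way a date-optimized chain is derived here (substituting the key rw.RNG for rw in the key list of 'norm') is an assumption made for illustration.

```perl
use strict;
use warnings;

## Placeholder analyzers: plain strings stand in for DTA::CAB::Analyzer objects.
my @norm_keys = qw(static exlex tokpp xlit lts morph mlatin msafe langid
                   rw eqphox dmoot dmootsub moot mootsub);
my @ranges    = qw(1600-1700 1700-1800 1800-1900);

my $ach = { map { ($_ => "analyzer:$_") } @norm_keys };
$ach->{"rw.$_"} = "analyzer:rw.$_" for @ranges;

my %chains;

## Singleton chain sub.NAME for each analyzer key.
$chains{"sub.$_"} = [ $ach->{$_} ] for keys %$ach;

## A hash slice @$ach{LIST} returns the values for the listed keys, in order.
$chains{norm} = [ @$ach{@norm_keys} ];

## Assumed construction of norm.RNG: same key list, with rw.RNG replacing rw.
for my $rng (@ranges) {
  my @keys = map { $_ eq 'rw' ? "rw.$rng" : $_ } @norm_keys;
  $chains{"norm.$rng"} = [ @$ach{@keys} ];
}

print join(' ', @{ $chains{'norm.1700-1800'} }), "\n";
```

Because a slice preserves the order of its key list, the order of names inside qw(...) is the order in which the analyzers run in the chain.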
$bool = $ach->ensureLoaded();
Ensures that analysis data is loaded from default files. The override inherited from DTA::CAB::Chain::Multi calls ensureChain() before the inherited method. As a hack, the chain sub-analyzers (rwsub, dmootsub) are copied only AFTER their own sub-analyzers have been loaded, and 'enabled' is set only then, if appropriate.
$bool = $anl->doAnalyze(\%opts, $name);
Alias for $anl->can("analyze${name}") && (!exists($opts{"doAnalyze${name}"}) || $opts{"doAnalyze${name}"}). The override in this class additionally checks the $anl->{autoClean} flag.
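The predicate can be sketched as a small stand-alone class. The package name My::Sketch::Analyzer is hypothetical, and the exact effect of autoClean (forcing the 'Clean' phase on even when the options disable it) is an assumption based on the field's comment in new(), not the module's verified behavior.

```perl
use strict;
use warnings;

package My::Sketch::Analyzer;   ## hypothetical class, not part of DTA::CAB

sub new {
  my ($class, %args) = @_;
  return bless { autoClean => 0, %args }, $class;
}

## Placeholder analysis phase, present so that can('analyzeClean') succeeds.
sub analyzeClean { return $_[1]; }

sub doAnalyze {
  my ($anl, $opts, $name) = @_;
  ## Assumed override: autoClean forces the 'Clean' phase regardless of options.
  return 1 if $anl->{autoClean} && $name eq 'Clean';
  ## Documented default: the method must exist and must not be disabled in %$opts.
  return 0 if !$anl->can("analyze${name}");
  return (!exists($opts->{"doAnalyze${name}"}) || $opts->{"doAnalyze${name}"}) ? 1 : 0;
}

package main;

my $plain    = My::Sketch::Analyzer->new();
my $paranoid = My::Sketch::Analyzer->new(autoClean => 1);
my %opts     = (doAnalyzeClean => 0);   ## caller asks to skip the clean phase

printf "plain: %d, paranoid: %d\n",
  $plain->doAnalyze(\%opts, 'Clean'), $paranoid->doAnalyze(\%opts, 'Clean');
```

In this sketch a missing doAnalyzeNAME option counts as enabled, so a phase is skipped only if it is explicitly set to a false value or the analyzer has no analyzeNAME method.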
$doc = $ach->analyzeClean($doc,\%opts);
Cleans up any temporary data associated with $doc. The Chain default calls $a->analyzeClean for each analyzer $a in the chain, then the superclass Analyzer->analyzeClean. The local override checks $ach->{autoClean}.
Bryan Jurish <moocow@cpan.org>
Copyright (C) 2010-2019 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.24.1 or, at your option, any later version of Perl 5 you may have available.
dta-cab-analyze.perl(1), DTA::CAB::Chain::Multi(3pm), DTA::CAB::Chain(3pm), DTA::CAB::Analyzer(3pm), DTA::CAB(3pm), perl(1), ...