DTA::CAB::Analyzer::Dict::EqClass - canonical-form-dictionary-based equivalence-class expander
 use DTA::CAB::Analyzer::Dict::EqClass;

 ##========================================================================
 ## Constructors etc.
 $eqc = DTA::CAB::Analyzer::Dict::EqClass->new(%args);

 ##========================================================================
 ## Methods: I/O
 $bool = $eqc->ensureLoaded();
 $bool = $eqc->dictOk();
 $eqc = $eqc->loadDict($dictfile);

 ##========================================================================
 ## Methods: Analysis
 $coderef = $anl->getAnalyzeTokenSub();
WORK IN PROGRESS
Dictionary-based equivalence-class expander. Reads a full-form dictionary mapping words to equivalence class identifiers (ECIDs aka "canonical forms"; each dictionary word should have at most 1 ECID), builds some internal indices, and at runtime maps input words to a disjunction of all known dictionary words mapped to the same ECID.
Concrete test case: ECIDs are just phonetic forms as returned by (some instance of) DTA::CAB::Analyzer::LTS.
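The expansion described above can be illustrated with plain Perl hashes. This is only a minimal sketch of the idea, not the module's implementation: the dictionary contents, the made-up phonetic ECIDs, and the helper name expand() are all assumptions chosen for illustration.

```perl
use strict;
use warnings;

## Hypothetical full-form dictionary: word => ECID.
## The ECIDs are invented stand-ins for phonetic forms.
my %word2ecid = (
  'Theil' => 'tail',
  'Teil'  => 'tail',
  'teil'  => 'tail',
  'Haus'  => 'haus',
);

## Back-index: ECID => [ all dictionary words mapped to that ECID ]
my %ecid2words;
foreach my $w (sort keys %word2ecid) {
  push @{ $ecid2words{ $word2ecid{$w} } }, $w;
}

## Map an input word to every dictionary word sharing its ECID.
## Returns the empty list for words not in the dictionary.
sub expand {
  my ($word) = @_;
  my $ecid = $word2ecid{$word};
  return () if !defined $ecid;
  return @{ $ecid2words{$ecid} };
}

print join(' ', expand('Theil')), "\n";   ## prints "Teil Theil teil"
```

Because each word has at most one ECID, the forward map can be a simple hash, and only the back-index needs to hold lists.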
DTA::CAB::Analyzer::Dict::EqClass inherits from DTA::CAB::Analyzer::Dict.
$eqc = CLASS_OR_OBJ->new(%args);
Constructor.
%args, %$eqc:
 ##-- Analysis I/O
 analysisKey => $key,    ##-- token analysis key (default='eqpho')
 allowRegex  => $re,     ##-- if defined, only tokens with matching text will be analyzed
                         ##   : default=/(?:^[[:alpha:]\-]*[[:alpha:]]+$)|(?:^[[:alpha:]]+[[:alpha:]\-]+$)/
 ##
 ##-- Files
 dictClass => $class,    ##-- class of underlying ECID (LTS) dictionary
 dictOpts  => \%opts,    ##-- if defined, options for (temporary) ECID dict (default: 'DTA::CAB::Analyzer::Dict')
 dictFile  => $filename, ##-- dictionary filename (loaded with DTA::CAB::Analyzer::Dict->loadDict())
 ##
 ##-- Analysis Objects
 txt2tid  => \%txt2tid,  ##-- map (known) token text to numeric text-ID (1:1)
 tid2pho  => \@tid2pho,  ##-- map text-IDs to phonetic strings (n:1)
 tid2fc   => $tid2f,     ##-- map text-IDs to raw frequencies (n:1)
 #                       ##   : access with $f=vec($id2f, $id, $FREQ_VEC_BITS)
 #id2fc   => $tid2fc,    ##-- map text-IDs to frequency classes; access with $fc=vec($id2f, $id, 8)
 ##                      ##   : $fc = int(log2($f))
 pho2tids => \%pho2tids, ##-- back-map phonetic strings to text IDs (1:n)
 ##                      ##   : access with @txtids = unpack('L*',$phoStr)
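The access patterns noted for the analysis objects rely on Perl's built-in pack(), unpack() and vec(). The following sketch shows how such indices might be filled and read. It assumes that each pho2tids value is a string of packed 32-bit text-IDs and that $FREQ_VEC_BITS is 32; the source states neither, and the sample entries are invented.

```perl
use strict;
use warnings;

my $FREQ_VEC_BITS = 32;   ## assumed width; not given in the documentation above

my (%txt2tid, @tid2pho, %pho2tids);
my $tid2f = '';           ## bit vector of raw frequencies, indexed by text-ID

## Hypothetical (text, phonetic form, raw frequency) entries
my @entries = (
  ['Teil',  'tail', 120],
  ['Theil', 'tail',  15],
  ['Haus',  'haus', 300],
);

foreach my $e (@entries) {
  my ($txt, $pho, $f) = @$e;
  my $tid = scalar(@tid2pho);              ## next free text-ID
  $txt2tid{$txt} = $tid;                   ## text -> text-ID (1:1)
  $tid2pho[$tid] = $pho;                   ## text-ID -> phonetic string (n:1)
  vec($tid2f, $tid, $FREQ_VEC_BITS) = $f;  ## text-ID -> raw frequency
  $pho2tids{$pho} = '' if !defined $pho2tids{$pho};
  $pho2tids{$pho} .= pack('L', $tid);      ## phonetic string -> packed text-IDs (1:n)
}

## Read back: all text-IDs for a phonetic form, and their frequencies
my @tids  = unpack('L*', $pho2tids{'tail'});
my @freqs = map { vec($tid2f, $_, $FREQ_VEC_BITS) } @tids;
print "tids=@tids freqs=@freqs\n";         ## prints "tids=0 1 freqs=120 15"
```

Packed strings and bit vectors keep one scalar per index entry instead of one Perl array per phonetic form, which is presumably the motivation for this layout in a large dictionary.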
$bool = $eqc->ensureLoaded();
Override: ensures analysis data is loaded.
$bool = $eqc->dictOk();
Override: should return false iff the dictionary is undefined or "empty".
$eqc = $eqc->loadDict($dictfile);
Override: load dictionary from $dictfile.
$coderef = $anl->getAnalyzeTokenSub();
returned sub is callable as:
$tok = $coderef->($tok,\%analyzeOptions)
analyzes phonetic source $opts{phoSrc}, defaults to $tok->{ $eqc->{inputKey} }[0]{hi}
falls back to analysis of the text in $opts{src}, or else $tok->{text}
sets (for $key=$anl->{analysisKey}): $tok->{$key} = [ $eqTxt1, $eqTxt2, ... ]
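The calling convention listed above can be mimicked with a small stand-alone mock. It does not use DTA::CAB at all. The helper make_analyze_token_sub(), the lookup tables, the inputKey value 'lts', and the way the text fallback is resolved are all assumptions made for illustration; only the option names, the token keys, and the order of fallbacks come from the description above.

```perl
use strict;
use warnings;

## Stand-in for the analyzer object; the inputKey value is an assumption
my $eqc = { analysisKey => 'eqpho', inputKey => 'lts' };

## Hypothetical lookup tables
my %txt2pho  = ( 'Teil' => 'tail', 'Theil' => 'tail' );
my %pho2txts = ( 'tail' => ['Teil', 'Theil'] );

sub make_analyze_token_sub {
  my ($anl) = @_;
  my ($key, $inkey) = @$anl{qw(analysisKey inputKey)};
  return sub {
    my ($tok, $opts) = @_;
    $opts = {} if !defined $opts;

    ## 1. explicit phonetic source, 2. the token's existing input analysis
    my $pho = $opts->{phoSrc};
    if (!defined($pho) && ref($tok->{$inkey}) eq 'ARRAY' && @{ $tok->{$inkey} }) {
      $pho = $tok->{$inkey}[0]{hi};
    }

    ## 3. fall back to the text: $opts->{src}, or else $tok->{text}
    if (!defined($pho)) {
      my $txt = defined($opts->{src}) ? $opts->{src} : $tok->{text};
      $pho = $txt2pho{$txt} if defined($txt);
    }

    ## store the equivalent texts under the analysis key
    my $eq = defined($pho) ? $pho2txts{$pho} : undef;
    $tok->{$key} = [ $eq ? @$eq : () ];
    return $tok;
  };
}

my $sub = make_analyze_token_sub($eqc);
my $tok = $sub->({ text => 'Theil', lts => [ { hi => 'tail' } ] });
print join(' ', @{ $tok->{eqpho} }), "\n";   ## prints "Teil Theil"
```

Returning a closure lets the analyzer resolve its keys once, so the per-token call only does hash lookups.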
Bryan Jurish <jurish@bbaw.de>
Copyright (C) 2009 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.