DTA::CAB::Analyzer::Dict::EqClass - canonical-form-dictionary-based equivalence-class expander


NAME

DTA::CAB::Analyzer::Dict::EqClass - canonical-form-dictionary-based equivalence-class expander


SYNOPSIS

 use DTA::CAB::Analyzer::Dict::EqClass;
 
 ##========================================================================
 ## Constructors etc.
 
 $eqc = DTA::CAB::Analyzer::Dict::EqClass->new(%args);
 
 ##========================================================================
 ## Methods: I/O
 
 $bool = $eqc->ensureLoaded();
 $bool = $eqc->dictOk();
 $eqc = $eqc->loadDict($dictfile);
 
 ##========================================================================
 ## Methods: Analysis
 
 $coderef = $eqc->getAnalyzeTokenSub();


DESCRIPTION

WORK IN PROGRESS

Dictionary-based equivalence-class expander. It reads a full-form dictionary that maps words to equivalence-class identifiers (ECIDs, also called "canonical forms"); each dictionary word should have at most one ECID. From this dictionary it builds internal indices, and at runtime it maps each input word to the disjunction of all known dictionary words that share the same ECID.

Concrete test case: ECIDs are just phonetic forms as returned by (some instance of) DTA::CAB::Analyzer::LTS.
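
The expansion described above can be sketched with two plain hashes. This is only an illustration of the idea, not the module's implementation: the words, the phonetic forms used as ECIDs, and the names %word2ecid, %ecid2words and expand() are all made up for the example.

```perl
use strict;
use warnings;

## Hypothetical full-form dictionary: word => ECID
## (the ECIDs here are invented phonetic forms, not real LTS output)
my %word2ecid = (
  'Theil' => 'tail',
  'Teil'  => 'tail',
  'theil' => 'tail',
  'Haus'  => 'haus',
);

## Inverse index: ECID => list of all words mapped to it (in sorted order)
my %ecid2words;
push @{ $ecid2words{ $word2ecid{$_} } }, $_ for sort keys %word2ecid;

## expand($word): all known words in the same equivalence class as $word;
## unknown words yield the empty list
sub expand {
  my $word = shift;
  my $ecid = $word2ecid{$word};
  return defined($ecid) ? @{ $ecid2words{$ecid} } : ();
}

print join(' | ', expand('Theil')), "\n";   ## prints "Teil | Theil | theil"
```

Because each word has at most one ECID, the forward map is a simple hash, and only the inverse map needs list values.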

Globals

Variable: @ISA

DTA::CAB::Analyzer::Dict::EqClass inherits from DTA::CAB::Analyzer::Dict.

Constructors etc.

new
 $eqc = CLASS_OR_OBJ->new(%args);

Constructor.

%args, %$eqc:

 ##-- Analysis I/O
 analysisKey => $key,     ##-- token analysis key (default='eqpho')
 allowRegex  => $re,      ##-- if defined, only tokens with matching text will be analyzed
                          ##   : default=/(?:^[[:alpha:]\-]*[[:alpha:]]+$)|(?:^[[:alpha:]]+[[:alpha:]\-]+$)/
 ##
 ##-- Files
 dictClass => $class,      ##-- class of underlying ECID (LTS) dictionary (default: 'DTA::CAB::Analyzer::Dict')
 dictOpts  => \%opts,      ##-- if defined, options for (temporary) ECID dict
 dictFile  => $filename,   ##-- dictionary filename (loaded with DTA::CAB::Analyzer::Dict->loadDict())
 ##
 ##-- Analysis Objects
 txt2tid  => \%txt2tid,    ##-- map (known) token text to numeric text-ID (1:1)
 tid2pho  => \@tid2pho,    ##-- map text-IDs to phonetic strings (n:1)
 tid2fc   => $tid2f,       ##-- map text-IDs to raw frequencies (n:1)
 #                         ##   : access with $f=vec($tid2f, $tid, $FREQ_VEC_BITS)
 #id2fc   => $tid2fc,      ##-- map text-IDs to frequency classes; access with $fc=vec($tid2fc, $tid, 8)
 ##                        ##   : $fc = int(log2($f))
 pho2tids => \%pho2tids,   ##-- back-map phonetic strings to packed text-ID lists (1:n)
 ##                        ##   : access with @tids = unpack('L*', $pho2tids->{$pho})
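
As a rough sketch of how the analysis objects listed above fit together, the following builds them from a small hand-written list and then performs one lookup. The input entries, the value chosen for $FREQ_VEC_BITS, and the build loop itself are assumptions for illustration; only the access patterns (vec() for frequencies, unpack('L*', ...) for ID lists) are taken from the option comments.

```perl
use strict;
use warnings;

my $FREQ_VEC_BITS = 32;   ## assumed bit width; the module's actual value is not documented here

## Hypothetical input: [ text, phonetic form, raw frequency ]
my @entries = (
  ['Teil',  'tail', 120],
  ['Theil', 'tail',  15],
  ['Haus',  'haus',  80],
);

my (%txt2tid, @tid2pho, %pho2tids);
my $tid2f = '';
foreach my $e (@entries) {
  my ($txt, $pho, $f) = @$e;
  my $tid = scalar(@tid2pho);               ## next free numeric text-ID
  $txt2tid{$txt} = $tid;                    ## text -> ID (1:1)
  $tid2pho[$tid] = $pho;                    ## ID -> phonetic form (n:1)
  vec($tid2f, $tid, $FREQ_VEC_BITS) = $f;   ## ID -> raw frequency, stored in a bit vector
  $pho2tids{$pho} .= pack('L', $tid);       ## phonetic form -> packed list of IDs (1:n)
}

## Lookup: text -> ID -> phonetic form -> all IDs in the same class
my $tid  = $txt2tid{'Theil'};
my @tids = unpack('L*', $pho2tids{ $tid2pho[$tid] });
my $f    = vec($tid2f, $tid, $FREQ_VEC_BITS);
my $fc   = int(log($f) / log(2));           ## frequency class: int(log2($f))

print "tids=@tids f=$f fc=$fc\n";
```

Packing the ID lists into strings keeps one scalar per phonetic form rather than one array per form, which is presumably the motivation for the packed representation.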

Methods: I/O

ensureLoaded
 $bool = $eqc->ensureLoaded();

Override: ensures analysis data is loaded.

dictOk
 $bool = $eqc->dictOk();

Override: should return false iff the dictionary is undefined or "empty".

loadDict
 $eqc = $eqc->loadDict($dictfile);

Override: load dictionary from $dictfile.

Methods: Analysis

getAnalyzeTokenSub
 $coderef = $eqc->getAnalyzeTokenSub();

Override: returns a code reference (closure) used to analyze a single token.
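
A minimal sketch of what such a closure might look like, assuming that a token is a HASH-ref with a 'text' field and that the expansion is stored as an ARRAY-ref under analysisKey. The token layout, the result format, the stand-in indices and the helper name make_analyze_token_sub() are all hypothetical; the regex is the allowRegex default documented above.

```perl
use strict;
use warnings;

## Hypothetical stand-ins for the analyzer's loaded indices
my %word2ecid  = ('Teil' => 'tail', 'Theil' => 'tail');
my %ecid2words = ('tail' => ['Teil', 'Theil']);

## make_analyze_token_sub(%opts): returns a closure over the options and indices
sub make_analyze_token_sub {
  my %opts = (analysisKey => 'eqpho', allowRegex => undef, @_);
  my ($key, $re) = @opts{qw(analysisKey allowRegex)};
  return sub {
    my $tok = shift;                                      ## assumed layout: { text => $string, ... }
    return $tok if defined($re) && $tok->{text} !~ $re;   ## text not allowed: leave token untouched
    my $ecid = $word2ecid{ $tok->{text} };
    $tok->{$key} = defined($ecid) ? [ @{ $ecid2words{$ecid} } ] : [];
    return $tok;
  };
}

my $analyze = make_analyze_token_sub(
  allowRegex => qr/(?:^[[:alpha:]\-]*[[:alpha:]]+$)|(?:^[[:alpha:]]+[[:alpha:]\-]+$)/,
);
my $tok  = $analyze->({ text => 'Theil' });   ## $tok->{eqpho} is now ['Teil', 'Theil']
my $skip = $analyze->({ text => '1848' });    ## rejected by allowRegex: no 'eqpho' key added
```

Returning a closure lets the option lookups happen once, at construction time, instead of once per token.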


AUTHOR

Bryan Jurish <jurish@bbaw.de>


COPYRIGHT AND LICENSE

Copyright (C) 2009 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.
