Lingua::LangId::Signature - language guesser: n-gram language "signatures"


NAME

Lingua::LangId::Signature - language guesser: n-gram language "signatures"


SYNOPSIS

 ##========================================================================
 ## PRELIMINARIES
 
 use Lingua::LangId::Signature;
 
 ##========================================================================
 ## Constructors etc.
 
 $sig = $CLASS_OR_OBJECT->new(%opts);
 @noShadowKeys = $obj->noShadowKeys();
 
 ##========================================================================
 ## Methods: compilation
 
 $bool = $sig->compiled();
 $sig = $sig->compile(%opts);
 @keys = $sig->compiledKeys();
 $sig = $sig->uncompile();
 
 ##========================================================================
 ## Methods: access
 
 $f = $sig->f();
 $p = $sig->p();
 $N = $sig->N();
 $H = $sig->H();
 
 ##========================================================================
 ## Methods: training
 
 undef = $sig->sanitizeStringRef(\$ref);
 $sig = $sig->train(%opts);
 $charPdl = $sig->charPdl(%opts);
 $sig = $sig->addPdl($char_value_pdl);
 $sig = $sig->add($ccs_nd_freqs);
 
 ##========================================================================
 ## Methods: sampling & expectation
 
 $p_cumu = $sig->sampleDistPdl();
 $sampleSig = $sig->sampleSig($N);
 $sig = $sig->trainExpect(%opts);  ##-- scalar context
 
 ##========================================================================
 ## Methods: low-level comparison
 
 ($wnd,$f1nz,$f2nz) = $sig->fpdls($sig2,%opts);
 $p = $sig->smoothp($fnz);
 $kld = $sig->kld($sig2,%opts);
 $klde = $sig->klde($N);
 $kldp = $sig->kldp($sig2,%opts);
 $nid = $sig->nid($sig2,%opts);
 $alpha = $sig1->hoeffding($sig2,%opts);
 
 ##========================================================================
 ## Methods: I/O
 
 \%ngramHash = $sig->asHash();
 $sig = $sig->fromHash(\%hash);
 $bool = $sig->saveTextFile($filename_or_fh);
 $bool = $CLASS_OR_OBJECT->loadTextFile($filename_or_fh);


DESCRIPTION

Globals & Constants

Variable: @ISA

Lingua::LangId::Signature inherits from Lingua::LangId::Object.

Variable: $LOG2

Constant for log(2), used for computing binary logarithms.
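
As a minimal sketch of how such a constant is typically used, a binary logarithm can be obtained by dividing the natural logarithm by log(2). The lexical $LOG2 and the sub log2() below are illustrative stand-ins, not the module's internals:

```perl
use strict;
use warnings;

## hypothetical stand-in for the module's $LOG2 constant
my $LOG2 = log(2);

## binary logarithm via change of base: log2(x) = ln(x) / ln(2)
sub log2 {
  my $x = shift;
  return log($x) / $LOG2;
}

my $bits = log2(8);   ##-- number of bits needed to index 8 equiprobable events
```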

Constructors etc.

new
 $sig = $CLASS_OR_OBJECT->new(%opts);

%opts, %$sig:

 ##-- Modelling Options
 n   => $ngram_window_length,  ##-- signature n-gram length (int, >0, default=2)
 na  => $alphabet_size,        ##-- number of characters in alphabet (default=65536)
 ws  => $ws_str,               ##-- string to use to replace whitespace on train() (default='_')
 #mf  => $f_missing,            ##-- missing pseudo-frequency (distributed on compile, default=1)
 encoding => $encoding,        ##-- assumed encoding for non-utf8 training data (default='UTF-8')
 keepNonAlphaChars => $bool,   ##-- keep non-alphabetic chars on train() ? (default=0: bash to whitespace)
 keepNonAlphaWords => $bool,   ##-- keep non-alphabetic tokens on train() ? (default=0)
 tolower => $bool,             ##-- bash all training input to lower-case? (default=1)
 cutoff  => $plevel,           ##-- cutoff p-level for boolean checks (default=0.01 ~ confidence=0.99)
 ##
 ##-- Initialization Options
 str => $string,               ##-- train initial signature from string
 ##
 ##-- Low-level data
 f   => $f,                    ##-- PDL::CCS::Nd: [@char_ord_ngram] => $f
 ##
 ##-- Compilation Options:
 nsamples=>$n,       ##-- default=100
 lmin    =>$Nmin,    ##-- minimum subsample size (default=16)
 lmax    =>$Nmax,    ##-- maximum subsample size (default=8192)
 lpow    =>$pow,     ##-- exponent for sample size sequence (default=5)

noShadowKeys
 @noShadowKeys = $obj->noShadowKeys();

Returns the list of keys which should not be passed to $CLASS->new() on shadow(). This override returns:

 qw(f e_kld_sd e_kld_mu e_kld_c)

Methods: compilation

compiled
 $bool = $sig->compiled();

Returns true iff signature has been compiled.

compile
 $sig = $sig->compile(%opts);

Does nothing if already compiled, else calls trainExpect(%opts).

compiledKeys
 @keys = $sig->compiledKeys();

Returns list of keys present if object is compiled.

uncompile
 $sig = $sig->uncompile();

Clears compile cache, if present.

Methods: access

f
 $f = $sig->f();

Returns the raw stored frequency pdl.

p
 $p = $sig->p();

Returns the maximum-likelihood estimate (MLE) probability pdl, without smoothing.

N
 $N = $sig->N();

Raw frequency sum (scalar).

H
 $H = $sig->H();
 $H = $sig->H($p)

Returns the binary entropy as a scalar. Uses $sig->p() if $p is not defined.
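
For orientation, binary entropy is H(p) = -\sum p * log2(p). The following pure-Perl sketch computes it over a plain list of probabilities rather than a pdl; the sub name entropy2() is hypothetical and not part of this module:

```perl
use strict;
use warnings;

## Binary (base-2) entropy of a probability distribution given as a list.
## Zero-probability events contribute nothing (0 * log 0 is taken as 0).
sub entropy2 {
  my @p = @_;
  my $h = 0;
  foreach my $px (grep { $_ > 0 } @p) {
    $h -= $px * log($px) / log(2);
  }
  return $h;
}

my $h_uniform = entropy2(0.25, 0.25, 0.25, 0.25);  ##-- 4 equiprobable events
my $h_certain = entropy2(1, 0);                    ##-- a certain outcome
```

A uniform distribution over four events yields two bits, while a certain outcome yields zero.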

Methods: training

sanitizeStringRef
 undef = $sig->sanitizeStringRef(\$ref);

Decodes $$ref according to $sig->{encoding}, if that is defined and the utf8 flag is not set for $$ref. Then performs non-alphabetic bashing and whitespace normalization, depending on $sig->{keepNonAlpha*}.
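
The kind of normalization described here might look roughly like the sketch below. It is not the module's implementation: the helper name sanitize_sketch(), the order of the steps, and the trimming of leading and trailing whitespace are all assumptions made for illustration. The option names and defaults follow those listed under new().

```perl
use strict;
use warnings;
use Encode qw(decode);

## Hypothetical helper, NOT the module's implementation: normalizes $$ref in place.
sub sanitize_sketch {
  my ($ref, %opts) = @_;
  my $encoding = exists $opts{encoding} ? $opts{encoding} : 'UTF-8';
  my $ws       = exists $opts{ws}       ? $opts{ws}       : '_';

  ## decode raw bytes only if the string is not already flagged as characters
  $$ref = decode($encoding, $$ref) if defined($encoding) && !utf8::is_utf8($$ref);

  $$ref = lc($$ref);            ##-- tolower => 1
  $$ref =~ s/[^\p{L}\s]+/ /g;   ##-- keepNonAlphaChars => 0: non-letters become whitespace
  $$ref =~ s/^\s+|\s+$//g;      ##-- trim both ends (an assumption of this sketch)
  $$ref =~ s/\s+/$ws/g;         ##-- each whitespace run becomes the 'ws' string
  return;
}

my $text = "Hello,  World! 42";
sanitize_sketch(\$text);        ##-- $text is modified in place
```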

train
 $sig = $sig->train(%opts);

%opts:

 str  => $string_or_stringref,
 file => $file_or_fh,

Train signature from specified source (string, string-ref, named file, or filehandle).

charPdl
 $charPdl = $sig->charPdl(%opts);

%opts:

 str  => $string_or_stringref,
 file => $file_or_fh,

Get character-value vector PDL from specified source. Used by train().

addPdl
 $sig = $sig->addPdl($char_value_pdl);

Low-level training sub: computes character n-grams & adds them to the signature.
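
To illustrate what "computing character n-grams" means, the sketch below counts overlapping n-grams of a string into a plain hash. The module itself works on character-value pdls and a sparse PDL::CCS::Nd store; the hash and the sub name count_ngrams() are simplifications introduced here for illustration only.

```perl
use strict;
use warnings;

## Count overlapping character n-grams of a (character) string into a hash.
## A plain hash stands in for the sparse n-dimensional frequency store.
sub count_ngrams {
  my ($str, $n) = @_;
  my %f;
  for my $i (0 .. length($str) - $n) {
    $f{ substr($str, $i, $n) }++;
  }
  return \%f;
}

my $f = count_ngrams('banana', 2);   ##-- bigrams: ba, an, na, an, na
```

Strings shorter than $n simply produce an empty hash, since the index range is empty.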

add
 $sig = $sig->add($ccs_nd_freqs);
 $sig = $sig->add($sig)

Low-level training sub: add character n-grams to signature.

Methods: sampling & expectation

sampleDistPdl
 $p_cumu = $sig->sampleDistPdl();

Returns $p_cumu, a cumulative probability pdl for use with vsearch() for random sampling of events from $sig.

Use returned pdl as:

 $sample_events = $sig->_whichND->dice_axis(1,random($SampleSize)->vsearch($p_cumu))

sampleSig
 $sampleSig = $sig->sampleSig($N);
 $sampleSig = $sig->sampleSig($N, $p_cumu)

Gets a random sample of total frequency $N from $sig. $p_cumu may be optionally specified to avoid repeated calls of sampleDistPdl().
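
The idea behind sampling via a cumulative distribution can be sketched without PDL: draw a uniform random number and binary-search for the first cumulative value that reaches it. The code below uses plain hashes and arrays; all sub names are hypothetical, and the binary search is only analogous to, not a reimplementation of, vsearch().

```perl
use strict;
use warnings;

## Build cumulative probabilities from a frequency hash (events in sorted key order).
sub cumulative_dist {
  my ($f) = @_;
  my @events = sort keys %$f;
  my $N = 0;
  $N += $f->{$_} for @events;
  my ($acc, @cumu) = (0);
  foreach my $e (@events) {
    $acc += $f->{$e} / $N;
    push @cumu, $acc;
  }
  $cumu[-1] = 1 if @cumu;   ##-- guard against floating-point round-off
  return (\@events, \@cumu);
}

## Binary search: index of the first cumulative value >= $r
sub search_cumu {
  my ($cumu, $r) = @_;
  my ($lo, $hi) = (0, $#$cumu);
  while ($lo < $hi) {
    my $mid = int(($lo + $hi) / 2);
    if ($cumu->[$mid] < $r) { $lo = $mid + 1 } else { $hi = $mid }
  }
  return $lo;
}

## Draw a random sample of total frequency $n from the frequency hash $f
sub sample_freqs {
  my ($f, $n) = @_;
  my ($events, $cumu) = cumulative_dist($f);
  my %sample;
  $sample{ $events->[ search_cumu($cumu, rand()) ] }++ for 1 .. $n;
  return \%sample;
}

my $sample = sample_freqs({ th => 5, he => 3, er => 2 }, 100);
```

Computing the cumulative distribution once and reusing it across many samples is the same economy that passing $p_cumu to sampleSig() provides.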

trainExpect
 $sig  = $sig->trainExpect(%opts);  ##-- scalar context
 @vals = $sig->trainExpect(%opts);  ##-- array context

Trains expectation curve for kld by random sampling.

In array context, returns @vals:

 ($len2,$kld2,$fit2,$coefs2,$err_mu,$err_sd)

%opts:

 from    =>$sigsrc,  ##-- sample from $sigsrc (default=$sig)
 # also nsamples,lmin,lmax,lpow (see new())
 # others passed to $sigsrc->sample(), $sig->kld()

Sets %$sig keys:

 e_kld_c  => $coefs, ##-- [$a,$b] s.t. E(kld($sig,$sig2)) == $b*($sig2->N**$a)
 e_kld_mu => $mu,    ##-- average error vs. E(kld(...))
 e_kld_sd => $sd,    ##-- stddev error vs. E(kld(...))
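
The coefficients above describe a power law in the sample size. This documentation does not say how the fit is computed; one common approach, shown here purely as an assumption, is ordinary least squares on the log-log scale. The sub name fit_power_law() and the toy data are hypothetical.

```perl
use strict;
use warnings;

## Fit y = coef * x**pow by ordinary least squares on (log x, log y).
## Returns ($pow, $coef), mirroring the [$a,$b] order of e_kld_c.
sub fit_power_law {
  my ($xs, $ys) = @_;
  my $n = scalar @$xs;
  my ($sx, $sy, $sxx, $sxy) = (0, 0, 0, 0);
  for my $i (0 .. $n - 1) {
    my ($lx, $ly) = (log($xs->[$i]), log($ys->[$i]));
    $sx  += $lx;
    $sy  += $ly;
    $sxx += $lx * $lx;
    $sxy += $lx * $ly;
  }
  my $pow  = ($n * $sxy - $sx * $sy) / ($n * $sxx - $sx * $sx);  ##-- slope = exponent
  my $coef = exp(($sy - $pow * $sx) / $n);                       ##-- exp(intercept)
  return ($pow, $coef);
}

## toy data: divergences that follow 3 * N**-0.5 exactly
my @len = (16, 64, 256, 1024);
my @kld = map { 3 * $_ ** -0.5 } @len;
my ($pow, $coef) = fit_power_law(\@len, \@kld);
my $expected = $coef * 4096 ** $pow;   ##-- extrapolated expectation at N=4096
```

Once such coefficients are known, evaluating the curve for a given N is a single multiplication and exponentiation, as in the last line.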

Methods: low-level comparison

fpdls
 ($wnd,$f1nz,$f2nz) = $sig->fpdls($sig2,%opts);

Gets (flat) pseudo-probability pdls for any event with f>0 indexed by $sig or $sig2.

%opts:

 how=>$how,  ##-- one of 'union', 'intersect', '1', '2': default: 'union'
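
As a rough illustration of the 'how' option, the sketch below aligns two sparse frequency hashes over a shared event list. Plain hashes replace the pdls, the sub name align_freqs() is hypothetical, and reading '1' and '2' as "events of the first (resp. second) signature only" is an assumption.

```perl
use strict;
use warnings;

## Align two sparse frequency hashes over a shared, sorted event list.
## $how is one of 'union', 'intersect', '1', '2' (default: 'union').
sub align_freqs {
  my ($f1, $f2, $how) = @_;
  $how = 'union' if !defined $how;
  my @events;
  if    ($how eq 'union')     { my %u = (%$f1, %$f2); @events = sort keys %u; }
  elsif ($how eq 'intersect') { @events = grep { exists $f2->{$_} } sort keys %$f1; }
  elsif ($how eq '1')         { @events = sort keys %$f1; }
  elsif ($how eq '2')         { @events = sort keys %$f2; }
  else                        { die "unknown how='$how'"; }
  ## events missing from one side get frequency 0
  my @v1 = map { exists $f1->{$_} ? $f1->{$_} : 0 } @events;
  my @v2 = map { exists $f2->{$_} ? $f2->{$_} : 0 } @events;
  return (\@events, \@v1, \@v2);
}

my ($ev, $v1, $v2) = align_freqs({ a => 1, b => 2 }, { b => 3, c => 4 }, 'union');
```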

smoothp
 $p = $sig->smoothp();
 $p = $sig->smoothp($fnz);

Returns smoothed probability pdl for flat dense frequency values $fnz, which defaults to $sig->{f}->_nzvals().

kld
 $kld = $sig->kld($sig2,%opts);

Gets binary scalar Kullback-Leibler divergence D($sig||$sig2), i.e. $sig is the "real" distribution, $sig2 the encoding distribution.

Computes:

 D(p||q) = \sum p * log(p/q) = \sum p * (log(p)-log(q))

%opts are passed to e.g. fpdls().
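
The formula above can be sketched in plain Perl for two aligned probability lists, with logarithms taken to base 2. The sub name kld_bits() is hypothetical, and the sketch assumes that q is non-zero wherever p is non-zero, which is what smoothing is meant to guarantee:

```perl
use strict;
use warnings;

## D(p||q) in bits for two aligned probability lists.
## Assumes q > 0 wherever p > 0; terms with p == 0 are skipped.
sub kld_bits {
  my ($p, $q) = @_;
  my $d = 0;
  for my $i (0 .. $#$p) {
    next if $p->[$i] <= 0;
    $d += $p->[$i] * (log($p->[$i]) - log($q->[$i])) / log(2);
  }
  return $d;
}

my $d_same = kld_bits([0.5, 0.5], [0.5, 0.5]);     ##-- identical distributions
my $d_diff = kld_bits([0.5, 0.5], [0.25, 0.75]);   ##-- mismatched encoding distribution
```

The divergence is zero for identical distributions and is not symmetric in its arguments, which is why the order of $sig and $sig2 matters.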

klde
 $klde = $sig->klde($N);
 $klde = $sig->klde($sig2)

Gets the expected kld (scalar) for a signature of size $N (resp. $sig2->N).

kldp
 $kldp              = $sig->kldp($sig2,%opts);
 ($kld,$klde,$kldp) = $sig->kldp($sig2,%opts)

%opts: passed to e.g. kld(), fpdls(); also

 kld => $raw_kld,

The returned $kldp is the cumulative probability of D($sig||$sig2)-E(D($sig||sample($sig,$sig2->N))), i.e. 0 <= $kldp <= 1, where greater values indicate a *worse* fit.
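
This documentation does not state which distribution the cumulative probability is taken from. Assuming, purely for illustration, that the error around the expected divergence is modelled as Gaussian with the stored mean and standard deviation, the computation could be sketched as follows. All sub names are hypothetical, and erf() is approximated with Abramowitz & Stegun formula 7.1.26.

```perl
use strict;
use warnings;

## erf(x) via Abramowitz & Stegun formula 7.1.26 (absolute error below 1.5e-7)
sub erf_approx {
  my $x = shift;
  my $sign = $x < 0 ? -1 : 1;
  $x = abs($x);
  my $t = 1 / (1 + 0.3275911 * $x);
  my $poly = ((((1.061405429 * $t - 1.453152027) * $t + 1.421413741) * $t
               - 0.284496736) * $t + 0.254829592) * $t;
  return $sign * (1 - $poly * exp(-$x * $x));
}

## Gaussian cumulative probability of $x given mean $mu and standard deviation $sd
sub normal_cdf {
  my ($x, $mu, $sd) = @_;
  return 0.5 * (1 + erf_approx(($x - $mu) / ($sd * sqrt(2))));
}

## Hypothetical score: $kld is the raw divergence, $klde the expected one,
## ($mu,$sd) the error statistics (cf. e_kld_mu, e_kld_sd); Gaussian error is ASSUMED.
sub kldp_sketch {
  my ($kld, $klde, $mu, $sd) = @_;
  return normal_cdf($kld - $klde, $mu, $sd);
}

my $p_typical = kldp_sketch(0.10, 0.10, 0, 0.05);  ##-- observed == expected
my $p_poor    = kldp_sketch(0.50, 0.10, 0, 0.05);  ##-- far above expectation
```

Under this assumption, an observed divergence equal to its expectation scores about 0.5, and values approaching 1 indicate a divergence well above what random sampling from $sig would produce.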

nid
 $nid = $sig->nid($sig2,%opts);

Gets (base2) scalar NID($sig,$sig2). Computes

 NID(p,q) = (H(p+q)-min{H(p),H(q)}) / max{H(p),H(q)}

hoeffding
 $alpha = $sig1->hoeffding($sig2,%opts);

Returns the expected minimum alpha for which the Hoeffding bound does not hold. The $sig1 distribution is taken as the "true" distribution for testing & expectation.

Methods: I/O

asHash
 \%ngramHash = $sig->asHash();

Returns raw frequency hash.

fromHash
 $sig = $sig->fromHash(\%hash);

Assigns $sig signature from raw frequency hash-ref \%hash.

saveTextFile
 $bool = $sig->saveTextFile($filename_or_fh);

Save raw frequency data to a text file or handle.

loadTextFile
 $bool = $CLASS_OR_OBJECT->loadTextFile($filename_or_fh);

Load raw frequency data from a text file or handle.


SEE ALSO

Lingua::LangId(3pm)


AUTHOR

Bryan Jurish <jurish@uni-potsdam.de>


COPYRIGHT AND LICENSE

Copyright (C) 2009 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.
