Lingua::LangId::Signature - language guesser: n-gram language "signatures"
##========================================================================
## PRELIMINARIES

use Lingua::LangId::Signature;

##========================================================================
## Constructors etc.

$sig = $CLASS_OR_OBJECT->new(%opts);
@noShadowKeys = $obj->noShadowKeys();

##========================================================================
## Methods: compilation

$bool = $sig->compiled();
$sig = $sig->compile(%opts);
@keys = $sig->compiledKeys();
$sig = $sig->uncompile();

##========================================================================
## Methods: access

$f = $sig->f();
$p = $sig->p();
$N = $sig->N();
$H = $sig->H();

##========================================================================
## Methods: training

undef = $sig->sanitizeStringRef(\$ref);
$sig = $sig->train(%opts);
$charPdl = $sig->charPdl(%opts);
$sig = $sig->addPdl($char_value_pdl);
$sig = $sig->add($ccs_nd_freqs);

##========================================================================
## Methods: sampling & expectation

$p_cumu = $sig->sampleDistPdl();
$sampleSig = $sig->sampleSig($N);
$sig = $sig->trainExpect(%opts); ##-- scalar context

##========================================================================
## Methods: low-level comparison

($wnd,$f1nz,$f2nz) = $sig->fpdls($sig2,%opts);
$p = $sig->smoothp($fnz);
$kld = $sig->kld($sig2,%opts);
$klde = $sig->klde($N);
$kldp = $sig->kldp($sig2,%opts);
$nid = $sig->nid($sig2,%opts);
$alpha = $sig1->hoeffding($sig2,%opts);

##========================================================================
## Methods: I/O

\%ngramHash = $sig->asHash();
$sig = $sig->fromHash(\%hash);
$bool = $sig->saveTextFile($filename_or_fh);
$bool = $CLASS_OR_OBJECT->loadTextFile($filename_or_fh);
Lingua::LangId::Signature inherits from Lingua::LangId::Object.
Constant for log(2), used for computing binary logarithms.
$sig = $CLASS_OR_OBJECT->new(%opts);
%opts, %$sig:
##-- Modelling Options
n => $ngram_window_length,  ##-- signature n-gram length (int, >0, default=2)
na => $alphabet_size,       ##-- number of characters in alphabet (default=65536)
ws => $ws_str,              ##-- string to use to replace whitespace on train() (default='_')
#mf => $f_missing,          ##-- missing pseudo-frequency (distributed on compile, default=1)
encoding => $encoding,      ##-- assumed encoding for non-utf8 training data (default='UTF-8')
keepNonAlphaChars => $bool, ##-- keep non-alphabetic chars on train() ? (default=0: bash to whitespace)
keepNonAlphaWords => $bool, ##-- keep non-alphabetic tokens on train() ? (default=0)
tolower => $bool,           ##-- bash all training input to lower-case? (default=1)
cutoff => $plevel,          ##-- cutoff p-level for boolean checks (default=0.01 ~ confidence=0.99)
##
##-- Initialization Options
str => $string,             ##-- train initial signature from string
##
##-- Low-level data
f => $f,                    ##-- PDL::CCS::Nd: [@char_ord_ngram] => $f
##
##-- Compilation Options:
nsamples => $n,             ##-- default=100
lmin => $Nmin,              ##-- minimum subsample size (default=16)
lmax => $Nmax,              ##-- maximum subsample size (default=8192)
lpow => $pow,               ##-- exponent for sample size sequence (default=5)
@noShadowKeys = $obj->noShadowKeys();
Returns the list of keys not to be passed to $CLASS->new() on shadow().
Override returns:
qw(f e_kld_sd e_kld_mu e_kld_c)
$bool = $sig->compiled();
Returns true iff signature has been compiled.
$sig = $sig->compile(%opts);
Does nothing if already compiled; otherwise calls trainExpect(%opts).
@keys = $sig->compiledKeys();
Returns list of keys present if object is compiled.
$sig = $sig->uncompile();
Clears compile cache, if present.
$f = $sig->f();
Returns the raw stored frequency pdl.
$p = $sig->p();
Returns the maximum-likelihood estimate (MLE) probability pdl, without smoothing.
$N = $sig->N();
Raw frequency sum (scalar).
$H = $sig->H();
$H = $sig->H($p);
Returns the binary entropy as a scalar. Uses $sig->p() if $p is not defined.
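To make the relation between f(), p(), N() and H() concrete, here is a minimal pure-Perl sketch of binary entropy over a raw frequency hash. The helper name binary_entropy() is hypothetical and the code uses plain hashes rather than PDL, so it illustrates the arithmetic only, not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): binary entropy of a raw
# frequency hash, using unsmoothed MLE probabilities p(x) = f(x) / N.
sub binary_entropy {
    my ($freq) = @_;
    my $N = 0;
    $N += $_ for values %$freq;        # raw frequency sum, cf. N()
    return 0 unless $N > 0;
    my $H = 0;
    for my $f (values %$freq) {
        next unless $f > 0;            # 0 * log(0) is taken as 0
        my $p = $f / $N;               # MLE probability, cf. p()
        $H -= $p * log($p) / log(2);   # divide by log(2) for base 2
    }
    return $H;
}

my %f = (th => 2, he => 1, e_ => 1);
printf "H = %.3f bits\n", binary_entropy(\%f);   # prints "H = 1.500 bits"
```

The division by log(2) is the same conversion that the module's log(2) constant is documented to serve.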
undef = $sig->sanitizeStringRef(\$ref);
Decodes $$ref using $sig->{encoding}, if that is defined and the utf8 flag is not set for $$ref. Performs non-alphabetic bashing and whitespace normalization depending on $sig->{wantNonAlpha*}.
$sig = $sig->train(%opts);
%opts:
str => $string_or_stringref,
file => $file_or_fh,
Train signature from specified source (string, string-ref, named file, or filehandle).
$charPdl = $sig->charPdl(%opts);
%opts:
str => $string_or_stringref,
file => $file_or_fh,
Gets a character-value vector PDL from the specified source. Used by train().
$sig = $sig->addPdl($char_value_pdl);
Low-level training sub: computes character n-grams & adds them to the signature.
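As a rough illustration of what counting character n-grams involves, the following pure-Perl sketch builds a frequency hash from a string. The helper char_ngrams() is hypothetical; it assumes lower-casing, bashing of non-letters to whitespace, replacement of whitespace runs by the ws string, and no padding at the string boundaries. The module itself works on PDLs and may differ in all of these details.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): count character n-grams.
# Assumed normalization: lower-case, non-letters become whitespace,
# whitespace runs are replaced by $ws, no boundary padding.
sub char_ngrams {
    my ($str, %opts) = @_;
    my $n  = defined $opts{n}  ? $opts{n}  : 2;
    my $ws = defined $opts{ws} ? $opts{ws} : '_';
    $str = lc $str;
    $str =~ s/[^\p{L}\s]+/ /g;     # bash non-alphabetic characters
    $str =~ s/^\s+|\s+$//g;        # trim leading/trailing whitespace
    $str =~ s/\s+/$ws/g;           # normalize whitespace runs
    my %f;
    for my $i (0 .. length($str) - $n) {
        $f{ substr($str, $i, $n) }++;
    }
    return \%f;
}

my $f = char_ngrams("The cat!", n => 2);
print join(' ', map { "$_=$f->{$_}" } sort keys %$f), "\n";
```

A hash of this shape corresponds to the raw frequency hash that asHash() is documented to return, although the exact key format used by the module is not specified here.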
$sig = $sig->add($ccs_nd_freqs);
$sig = $sig->add($sig);
Low-level training sub: add character n-grams to signature.
$p_cumu = $sig->sampleDistPdl();
Returns $p_cumu, a cumulative probability pdl for use with vsearch() for random sampling of events from $sig.
Use returned pdl as:
$sample_events = $sig->_whichND->dice_axis(1,random($SampleSize)->vsearch($p_cumu))
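The idea behind this idiom is inverse-transform sampling: draw a uniform random number and locate it in the cumulative distribution. A minimal pure-Perl sketch under stated assumptions follows; cumulative_dist(), search_cumu() and sample_freqs() are hypothetical helpers over plain hashes, and the binary search only approximates what vsearch() does on a pdl.

```perl
use strict;
use warnings;

# Hypothetical helper: cumulative probabilities over events in sorted order.
sub cumulative_dist {
    my ($freq) = @_;
    my @events = sort keys %$freq;
    my $N = 0;
    $N += $freq->{$_} for @events;
    my $acc = 0;
    my @cumu;
    for my $e (@events) {
        $acc += $freq->{$e} / $N;
        push @cumu, $acc;
    }
    $cumu[-1] = 1 if @cumu;        # guard against rounding error
    return (\@events, \@cumu);
}

# Hypothetical helper: first index i with $cumu->[i] >= $r (binary search).
sub search_cumu {
    my ($cumu, $r) = @_;
    my ($lo, $hi) = (0, $#$cumu);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($cumu->[$mid] < $r) { $lo = $mid + 1 } else { $hi = $mid }
    }
    return $lo;
}

# Hypothetical helper: draw $size events, return their frequency hash.
sub sample_freqs {
    my ($freq, $size, $events, $cumu) = @_;
    ($events, $cumu) = cumulative_dist($freq) unless $cumu;
    my %sample;
    $sample{ $events->[ search_cumu($cumu, rand()) ] }++ for 1 .. $size;
    return \%sample;
}
```

Passing a precomputed cumulative distribution to sample_freqs() mirrors the optional $p_cumu argument: the distribution is built once and reused across many draws.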
$sampleSig = $sig->sampleSig($N);
$sampleSig = $sig->sampleSig($N, $p_cumu);
Gets a random sample of total frequency $N from $sig.
$p_cumu may optionally be specified to avoid repeated calls to sampleDistPdl().
$sig = $sig->trainExpect(%opts);  ##-- scalar context
@vals = $sig->trainExpect(%opts); ##-- array context
Trains expectation curve for kld by random sampling.
Array context return @vals:
($len2,$kld2,$fit2,$coefs2,$err_mu,$err_sd)
%opts:
from => $sigsrc, ##-- sample from $sigsrc (default=$sig)
# also nsamples,lmin,lmax,lpow (see new())
# others passed to $sigsrc->sample(), $sig->kld()
sets %$sig keys:
e_kld_c => $coefs, ##-- [$a,$b] s.t. E(kld($sig,$sig2)) == $b*($sig2->N**$a)
e_kld_mu => $mu,   ##-- average error vs. E(kld(...))
e_kld_sd => $sd,   ##-- stddev error vs. E(kld(...))
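The coefficients in e_kld_c describe a power law, E(kld) == $b * N**$a. One common way to estimate such coefficients from (sample size, observed kld) pairs is ordinary least squares in log-log space, since log(E) = log($b) + $a*log(N). The sketch below shows that technique in pure Perl; fit_power_law() and expected_kld() are hypothetical names, and whether the module fits its curve this way is an assumption, not something the documentation states.

```perl
use strict;
use warnings;

# Hypothetical helper: fit y = coef * x**expo by least squares on
# (log x, log y). Requires at least two distinct positive x values
# and strictly positive y values.
sub fit_power_law {
    my ($xs, $ys) = @_;
    my $n = scalar @$xs;
    my ($sx, $sy, $sxx, $sxy) = (0, 0, 0, 0);
    for my $i (0 .. $n - 1) {
        my $lx = log($xs->[$i]);
        my $ly = log($ys->[$i]);
        $sx  += $lx;
        $sy  += $ly;
        $sxx += $lx * $lx;
        $sxy += $lx * $ly;
    }
    my $expo = ($n * $sxy - $sx * $sy) / ($n * $sxx - $sx * $sx);
    my $coef = exp(($sy - $expo * $sx) / $n);
    return ($expo, $coef);         # same order as [$a,$b] in e_kld_c
}

# Hypothetical helper: evaluate the fitted curve at sample size $N.
sub expected_kld {
    my ($expo, $coef, $N) = @_;
    return $coef * $N ** $expo;
}

my @sizes = (16, 64, 256, 1024);
my @klds  = map { 3 * $_ ** -0.5 } @sizes;   # synthetic, noise-free data
my ($expo, $coef) = fit_power_law(\@sizes, \@klds);
printf "a = %.3f, b = %.3f\n", $expo, $coef;
```

The residuals between observed values and expected_kld() would then give the kind of mean and standard deviation that e_kld_mu and e_kld_sd are documented to hold.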
($wnd,$f1nz,$f2nz) = $sig->fpdls($sig2,%opts);
Gets (flat) pseudo-probability pdls for any f>0 event indexed by $sig1 or $sig2.
%opts:
how=>$how, ##-- one of 'union', 'intersect', '1', '2': default: 'union'
$p = $sig->smoothp();
$p = $sig->smoothp($fnz);
Returns the smoothed probability pdl for flat dense frequency values $fnz, which defaults to $sig->{f}->_nzvals().
$kld = $sig->kld($sig2,%opts);
Gets the binary scalar Kullback-Leibler divergence D($sig||$sig2), i.e. $sig is the "real" distribution and $sig2 the encoding distribution.
Computes:
D(p||q) = \sum p * log(p/q) = \sum p * (log(p)-log(q))
%opts are passed to e.g. fpdls().
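The formula above can be sketched in pure Perl over two raw frequency hashes. The helper kld_bits() is hypothetical, and so is its smoothing: it adds a constant pseudo-frequency $eps to every event in the union of both hashes so that q is never zero. The module's own smoothing (see smoothp()) is not specified here and may well differ.

```perl
use strict;
use warnings;

# Hypothetical helper: D(p||q) in bits over the union of events in two
# frequency hashes. Assumption: additive smoothing with pseudo-frequency
# $eps (default 0.5); this is NOT taken from the module.
sub kld_bits {
    my ($f1, $f2, $eps) = @_;
    $eps = 0.5 unless defined $eps;
    my %union = map { ($_ => 1) } (keys(%$f1), keys(%$f2));
    my @events = keys %union;
    my ($N1, $N2) = (0, 0);
    for my $e (@events) {
        $N1 += ($f1->{$e} || 0) + $eps;
        $N2 += ($f2->{$e} || 0) + $eps;
    }
    my $d = 0;
    for my $e (@events) {
        my $p = (($f1->{$e} || 0) + $eps) / $N1;
        my $q = (($f2->{$e} || 0) + $eps) / $N2;
        next unless $p > 0;                        # 0 * log(0/q) taken as 0
        $d += $p * (log($p) - log($q)) / log(2);   # base-2 logarithm
    }
    return $d;
}

my %en = (th => 5, he => 4, e_ => 3);
my %de = (ch => 5, en => 4, e_ => 3);
printf "D(en||de) = %.3f bits\n", kld_bits(\%en, \%de);
printf "D(de||en) = %.3f bits\n", kld_bits(\%de, \%en);
```

Iterating over the union of events corresponds to the documented default how=>'union' of fpdls(). Passing $eps = 0 is only safe when both hashes cover exactly the same events.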
$klde = $sig->klde($N);
$klde = $sig->klde($sig2);
Gets the expected kld (scalar) for a signature of size $N (resp. $sig2->N).
$kldp = $sig->kldp($sig2,%opts);
($kld,$klde,$kldp) = $sig->kldp($sig2,%opts);
%opts: passed to e.g. kld(), fpdls(); also
kld => $raw_kld,
The returned $kldp is the cumulative probability of D($sig1||$sig2) - E(D($sig1||sample($sig1,$sig2->N))), i.e. 0 <= $kldp <= 1, where greater values indicate a *worse* fit.
$nid = $sig->nid($sig2,%opts);
Gets the (base-2) scalar NID($sig,$sig2).
Computes:
NID(p,q) = (H(p+q)-min{H(p),H(q)}) / max{H(p),H(q)}
$alpha = $sig1->hoeffding($sig2,%opts);
Returns expected minimum alpha for which Hoeffding bound doesn't hold. $sig1 distribution is taken as "true" distribution for testing & expectation.
\%ngramHash = $sig->asHash();
Returns raw frequency hash.
$sig = $sig->fromHash(\%hash);
Assigns $sig signature from raw frequency hash-ref \%hash.
$bool = $sig->saveTextFile($filename_or_fh);
Save raw frequency data to a text file or handle.
$bool = $CLASS_OR_OBJECT->loadTextFile($filename_or_fh);
Load raw frequency data from a text file or handle.
Lingua::LangId(3pm)
Bryan Jurish <jurish@uni-potsdam.de>
Copyright (C) 2009 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.