
moot::mootHMM Class Reference

1st-order Hidden Markov Model Tagger/Disambiguator class.

#include <mootHMM.h>


Public Types

Atomic Types
Lexical class types
Lookup-Table Types
Viterbi Trellis Types

Public Methods

Constructor / Destructor
Reset / clear
Binary load / save
Accessors
Compilation / Initialization
Top-level Tagging Interface
Mid-level Viterbi algorithm API
Low/Mid-Level Viterbi Path Utilities
Low-level Viterbi iteration utilities
Low-level Trash-stack Utilities
ID Lookup
Probability Lookup
Error reporting
Debugging

Public Attributes

I/O-related Flags
Useful Constants
Smoothing Constants
ID Lookup Tables
Probability Lookup Tables
Viterbi Trellis Data
Statistics / Performance Tracking

Protected Attributes

Low-level data: trash stacks
Low-level data: temporaries

Detailed Description

All probabilities are stored internally as logarithms: this saves us a bit of runtime, and helps avoid datatype underflows.
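The underflow point can be illustrated with a minimal, library-independent sketch. The function names linear_product() and log_product() are made up for this example and are not part of mootHMM: multiplying many small probabilities directly collapses to zero, while summing their logarithms stays finite.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Product of probabilities computed directly in linear space.
// For long sequences of small values this underflows to 0.0.
double linear_product(const std::vector<double>& probs) {
    double p = 1.0;
    for (std::size_t i = 0; i < probs.size(); ++i) p *= probs[i];
    return p;
}

// The same quantity represented as a sum of logarithms.
// Multiplication becomes addition, and the result stays representable.
double log_product(const std::vector<double>& probs) {
    double lp = 0.0;
    for (std::size_t i = 0; i < probs.size(); ++i) {
        assert(probs[i] > 0.0);  // log(0) would be -infinity
        lp += std::log(probs[i]);
    }
    return lp;
}
```

With 500 factors of 0.001, the true product is 1e-1500, which is below the range of an IEEE double, whereas its logarithm is an ordinary finite number.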


Member Typedef Documentation

typedef ProbT* moot::mootHMM::BigramProbTable
 

Type for uni- and bigram probability lookup table: C-style 2D array. Bigram probabilities log(p(tagid|ptagid)) are indexed by ((ntags*ptagid)+tagid), and unigram probabilities log(p(tagid)) are indexed by tagid.

This winds up being a rather sparse table, but it should fit well in memory even for large (~= 2K tags) tagsets on contemporary machines, and lookup is Just Plain Quick.
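As a rough sketch of the flat-array indexing described above: bigram_index() and FlatBigramTable are hypothetical names, and ProbT is assumed here to be a float, which this page does not state.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef float ProbT;        // assumption: stand-in for the library's ProbT
typedef std::size_t TagID;  // assumption: stand-in for the library's TagID

// Offset of log(p(tagid|ptagid)) in a flat ntags*ntags array.
inline std::size_t bigram_index(std::size_t ntags, TagID ptagid, TagID tagid) {
    assert(ptagid < ntags && tagid < ntags);
    return ntags * ptagid + tagid;
}

// Hypothetical owner of the flat array; a raw ProbT* would be indexed the same way.
struct FlatBigramTable {
    std::size_t ntags;
    std::vector<ProbT> cells;

    FlatBigramTable(std::size_t n, ProbT init) : ntags(n), cells(n * n, init) {}

    ProbT& at(TagID ptagid, TagID tagid) {
        return cells[bigram_index(ntags, ptagid, tagid)];
    }
};
```

Under the 4-byte assumption, a 2000-tag table has 4,000,000 cells, or about 16 MB, and every lookup is a single multiply-add plus one memory access.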

typedef mootEnumID moot::mootHMM::ClassID
 

Typedef for a lexical ClassID. Zero indicates either a previously unknown class or the empty class.

typedef mootEnum<LexClass, LexClassHash, LexClassEqual> moot::mootHMM::ClassIDTable
 

Typedef for class-id lookup table

typedef set<TagID> moot::mootHMM::LexClass
 

Type for a lexical-class aka "ambiguity class". Intuitively, the lexical class associated with a given token is just the set of all a priori possible PoS tags for that token.
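To make the "set of possible tags" idea concrete, here is a small sketch using plain standard containers. collect_classes() is a hypothetical helper, not a mootHMM method, and the tag IDs used with it are arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

typedef unsigned int TagID;        // assumption: stand-in for the library's TagID
typedef std::set<TagID> LexClass;  // same shape as the typedef documented above

// Group (token, tag) observations into one lexical class per token.
std::map<std::string, LexClass>
collect_classes(const std::vector<std::pair<std::string, TagID> >& analyses) {
    std::map<std::string, LexClass> classes;
    for (std::size_t i = 0; i < analyses.size(); ++i) {
        assert(analyses[i].second != 0);  // 0 is reserved for the unknown tag
        classes[analyses[i].first].insert(analyses[i].second);
    }
    return classes;
}
```

Because a set ignores insertion order and duplicates, two tokens that admit the same tags end up with equal lexical classes regardless of the order in which their analyses were seen.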

typedef LexProbSubTable moot::mootHMM::LexClassProbSubTable
 

Type for lexical-class probability lookup subtable: tagid=>log(p(·|tagid))

typedef LexProbTable moot::mootHMM::LexClassProbTable
 

Type for lexical-class probability lookup table: classid=>(tagid=>log(p(classid|tagid))). Really just an alias for LexProbTable: at some point, we should capitalize on this and make things spiffy boffo stomach-lurching fast, but that requires more information than is currently stored in our models (specifically, foreknowledge of the token->class mapping for known tokens), and an assumption that this mapping is static, which it very well might not be at some unspecified future point in time.

typedef AssocVector<TagID,ProbT> moot::mootHMM::LexProbSubTable
 

Type for lexical probability lookup subtable: tagid=>log(p(·|tagid))

typedef vector<LexProbSubTable> moot::mootHMM::LexProbTable
 

Type for lexical probability lookup table: tokid=>(tagid=>log(p(tokid|tagid)))
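The two-level layout (a vector indexed by token ID, holding one small tag-to-probability map per token) might be sketched as follows. std::map stands in for AssocVector, and lookup_wordp() and LOG_ZERO are hypothetical names; how the library itself treats missing entries is not stated on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

typedef unsigned int TokID;
typedef unsigned int TagID;
typedef double ProbT;                          // assumption: stand-in for ProbT
typedef std::map<TagID, ProbT> LexSubTable;    // stand-in for AssocVector<TagID,ProbT>
typedef std::vector<LexSubTable> LexTable;     // tokid => (tagid => log-probability)

// Hypothetical floor returned when no entry exists (an assumption of this sketch).
const ProbT LOG_ZERO = -1e38;

// Look up log(p(tokid|tagid)), falling back to LOG_ZERO for unseen pairs.
ProbT lookup_wordp(const LexTable& table, TokID tokid, TagID tagid) {
    assert(LOG_ZERO < 0.0);
    if (tokid >= table.size()) return LOG_ZERO;
    LexSubTable::const_iterator it = table[tokid].find(tagid);
    return it == table[tokid].end() ? LOG_ZERO : it->second;
}
```

The outer vector gives constant-time access by token ID, while the inner map only stores the few tags actually observed with that token.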

typedef mootEnumID moot::mootHMM::TagID
 

Type for a tag-identifier. Zero indicates an unknown tag.

typedef mootEnum<mootTagString, hash<mootTagString>, equal_to<mootTagString> > moot::mootHMM::TagIDTable
 

Typedef for tag-id lookup table
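The convention that ID zero means "unknown" can be sketched with a small bidirectional name/ID table. NameEnum and its methods are hypothetical and only approximate the idea; they are not the mootEnum interface.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef unsigned int EnumID;

// Minimal name <-> ID table in which ID 0 is reserved for unknown names.
class NameEnum {
public:
    explicit NameEnum(const std::string& unknown_name = "@UNKNOWN") {
        id2name_.push_back(unknown_name);
        name2id_[unknown_name] = 0;
    }

    // Pure lookup: returns 0 for names that have not been registered.
    EnumID name2id(const std::string& name) const {
        std::map<std::string, EnumID>::const_iterator it = name2id_.find(name);
        return it == name2id_.end() ? 0 : it->second;
    }

    // Lookup that assigns the next free ID to names seen for the first time.
    EnumID get_or_create(const std::string& name) {
        std::map<std::string, EnumID>::const_iterator it = name2id_.find(name);
        if (it != name2id_.end()) return it->second;
        const EnumID id = static_cast<EnumID>(id2name_.size());
        id2name_.push_back(name);
        name2id_[name] = id;
        return id;
    }

    const std::string& id2name(EnumID id) const {
        assert(id < id2name_.size());
        return id2name_[id];
    }

private:
    std::map<std::string, EnumID> name2id_;
    std::vector<std::string> id2name_;
};
```

Keeping the read-only lookup separate from the creating one mirrors the distinction this class draws elsewhere between plain lookups and lookups with an autocreate flag.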

typedef mootEnumID moot::mootHMM::TokID
 

Type for a token-identifier. Zero indicates an unknown token.

typedef mootEnum<mootTokString, hash<mootTokString>, equal_to<mootTokString> > moot::mootHMM::TokIDTable
 

Typedef for token-id lookup table

typedef ViterbiNode moot::mootHMM::ViterbiRow
 


Member Enumeration Documentation

enum moot::mootHMM::VerbosityLevel
 

Symbolic verbosity levels

Enumeration values:
vlSilent  Be silent
vlErrors  Report errors
vlWarnings  Report warnings
vlProgress  Report progress
vlEverything  Report everything we can


Constructor & Destructor Documentation

moot::mootHMM::mootHMM void   
 

Default constructor

moot::mootHMM::~mootHMM void    [inline]
 

Destructor


Member Function Documentation

bool moot::mootHMM::_bindump mootio::mostream   obs,
const char *    filename = 0
 

Low-level: save guts to a binary stream

bool moot::mootHMM::_binload mootio::mistream   ibs,
const char *    filename = 0
 

Low-level: load guts from a binary stream

void moot::mootHMM::_viterbi_step_fallback TokID    tokid,
ViterbiColumn   col
 

Step a single Viterbi iteration, last-ditch effort: consider all tags in tagset. Implicitly called by other viterbi_step() methods.

void moot::mootHMM::assign_ids_cf const mootClassfreqs   classfreqs
 

Assign IDs for classes and tags from classfreqs: called by compile()

void moot::mootHMM::assign_ids_lf const mootLexfreqs   lexfreqs
 

Assign IDs for tokens and tags from lexfreqs: called by compile()

void moot::mootHMM::assign_ids_ng const mootNgrams   ngrams
 

Assign IDs for tags from ngrams: called by compile()

bool moot::mootHMM::build_suffix_trie const mootLexfreqs   lf,
const mootNgrams   ng,
bool    verbose = false
[inline]
 

Build suffix trie for unknown-word handling: NOT called by compile().

void moot::mootHMM::carp char *    fmt,
...   
 

Error reporting

ClassID moot::mootHMM::class2id const LexClass   lclass,
bool    autopopulate = true,
bool    autocreate = true
[inline]
 

Lookup the ClassID for the lexical-class lclass.

Parameters:
autopopulate  if true, new classes will be autopopulated with uniform distributions (implies autocreate).
autocreate  if true, new classes will be created and assigned class-ids.

const ProbT moot::mootHMM::classp const LexClass   lclass,
const mootTagString    tag
const [inline]
 

DEPRECATED

Looks up and returns lexical-class probability: p(class|tag) given class, tag -- no id auto-generation is performed!

const ProbT moot::mootHMM::classp const ClassID    classid,
const TagID    tagid
const [inline]
 

Looks up and returns lexical-class probability: p(classid|tagid)

void moot::mootHMM::clear bool    wipe_everything = true,
bool    unlogify = false
 

Reset/clear the object, freeing all dynamic data structures. If 'wipe_everything' is false, ID-tables and constants will be spared.

bool moot::mootHMM::compile const mootLexfreqs   lexfreqs,
const mootNgrams   ngrams,
const mootClassfreqs   classfreqs,
const mootTagString   start_tag_str = "__$"
 

Compile probabilities from raw frequency counts in 'lexfreqs', 'ngrams', and 'classfreqs'. Returns false on failure.

void moot::mootHMM::compile_unknown_lexclass const mootClassfreqs   classfreqs
 

Compile "unknown" lexical class : called by compile()

bool moot::mootHMM::compute_logprobs void   
 

Pre-compute runtime log-probability tables: NOT called by compile().

bool moot::mootHMM::estimate_clambdas const mootClassfreqs   cf
 

Estimate class smoothing constants: NOT called by compile().

bool moot::mootHMM::estimate_lambdas const mootNgrams   ngrams
 

Estimate ngram-smoothing constants: NOT called by compile().

bool moot::mootHMM::estimate_wlambdas const mootLexfreqs   lf
 

Estimate lexical smoothing constants: NOT called by compile().

bool moot::mootHMM::load mootio::mistream   ibs,
const char *    filename = 0
 

Load from a binary stream

bool moot::mootHMM::load const char *    filename = 0
 

Load from a binary file

bool moot::mootHMM::load_model const string &    modelname,
const mootTagString   start_tag_str = "__$",
const char *    myname = "mootHMM::load_model()",
bool    do_estimate_nglambdas = true,
bool    do_estimate_wlambdas = true,
bool    do_estimate_clambdas = true,
bool    do_build_suffix_trie = true,
bool    do_compute_logprobs = true
 

Top-level: load and compile a single model, and estimate all smoothing constants. Returns true on success, false on failure.

Parameters:
modelname  is a model name following the conventions in mootfiles(5)
start_tag_str  is the string form of the boundary tag.
myname  name to use for warnings/errors/info
If you want to load multiple models, you will need to first load the raw-frequency objects, then call the compile(), estimate_*(), build_suffix_trie(), and compute_logprobs() methods yourself.

bool moot::mootHMM::save mootio::mostream   obs,
const char *    filename = 0
 

Save to a binary stream

bool moot::mootHMM::save const char *    filename,
int    compression_level = -1
 

Save to a binary file

void moot::mootHMM::tag_io TokenReader   reader,
TokenWriter   writer
[inline]
 

Top-level tagging interface: TokenIO layer

void moot::mootHMM::tag_mark_best mootSentence   sentence
 

Mid-level tagging interface: mark 'best' tags in sentence structure: fills besttag datum of each mootToken element of sentence. Before calling this method, you should have done the following:

  • called viterbi_clear() to initialize the Viterbi trellis.
  • called viterbi_step(mootToken) once for each element of sentence.
  • called viterbi_finish() to push the boundary tag onto the Viterbi trellis.
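The clear / step / finish / read-off sequence above can be illustrated with a toy first-order decoder. This is a self-contained sketch, not the mootHMM implementation: ToyViterbi, to_log(), and all member names are hypothetical, tag 0 is assumed to be the boundary tag, and there is no smoothing, beam pruning, or unknown-word handling.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Convert plain probabilities to logarithms (log(0) becomes -infinity).
std::vector<double> to_log(std::vector<double> probs) {
    for (std::size_t i = 0; i < probs.size(); ++i) probs[i] = std::log(probs[i]);
    return probs;
}

// Toy first-order HMM decoder working on log-probabilities.
struct ToyViterbi {
    std::size_t ntags;
    std::vector<double> trans;                // log p(tag|prev) at [ntags*prev + tag]
    std::vector<std::vector<double> > score;  // best log-probability per column and tag
    std::vector<std::vector<int> > back;      // best previous tag per column and tag

    ToyViterbi(std::size_t n, const std::vector<double>& logtrans)
        : ntags(n), trans(logtrans) {
        assert(trans.size() == n * n);
        clear();
    }

    static double neg_inf() { return -std::numeric_limits<double>::infinity(); }

    // Reset the trellis to one column that holds only the boundary tag.
    void clear() {
        score.assign(1, std::vector<double>(ntags, neg_inf()));
        back.assign(1, std::vector<int>(ntags, -1));
        score[0][0] = 0.0;
    }

    // Append one column; emit[t] is log p(token|t) for the current token.
    void step(const std::vector<double>& emit) {
        assert(emit.size() == ntags);
        std::vector<double> col(ntags, neg_inf());
        std::vector<int> bp(ntags, -1);
        const std::vector<double>& prev = score.back();
        for (std::size_t t = 0; t < ntags; ++t) {
            for (std::size_t p = 0; p < ntags; ++p) {
                const double cand = prev[p] + trans[ntags * p + t] + emit[t];
                if (cand > col[t]) { col[t] = cand; bp[t] = static_cast<int>(p); }
            }
        }
        score.push_back(col);
        back.push_back(bp);
    }

    // Final iteration: force a transition into the boundary tag.
    void finish() {
        std::vector<double> emit(ntags, neg_inf());
        emit[0] = 0.0;
        step(emit);
    }

    // Follow back-pointers from the final boundary tag; returns tags in input order.
    // Assumes finish() has been called.
    std::vector<int> best_tags() const {
        std::vector<int> tags;
        int t = 0;
        for (std::size_t c = score.size() - 1; c > 1; --c) {
            t = back[c][t];
            assert(t >= 0);  // fails if no path reaches the final boundary
            tags.push_back(t);
        }
        std::reverse(tags.begin(), tags.end());
        return tags;
    }
};
```

In this analogy, clear() plays the role of viterbi_clear(), step() is called once per token like viterbi_step(), finish() corresponds to viterbi_finish(), and best_tags() does the job that tag_mark_best() performs on a mootSentence.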

void moot::mootHMM::tag_sentence mootSentence   sentence [inline]
 

Top-level tagging interface: mootSentence input & output (destructive). Calling this method will (re-)populate the besttag datum in the sentence argument.

const ProbT moot::mootHMM::tagp const mootTagString   prevtag,
const mootTagString   tag
const [inline]
 

DEPRECATED

Looks up and returns bigram probability: log(p(tag|prevtag)), string-version.

const ProbT moot::mootHMM::tagp const TagID    prevtagid,
const TagID    tagid
const [inline]
 

Looks up and returns bigram (log-)probability: log(p(tagid|prevtagid)), given tagid, prevtagid.

const ProbT moot::mootHMM::tagp const mootTagString   tag const [inline]
 

DEPRECATED

Looks up and returns unigram (log-)probability: log(p(tag)), string-version.

const ProbT moot::mootHMM::tagp const TagID    tagid const [inline]
 

Looks up and returns unigram (log-)probability: log(p(tagid)).

LexClass* moot::mootHMM::tagset2lexclass const mootTagSet   tagset,
LexClass   lclass = 0,
bool    add_tagids = false
[inline]
 

Convert string-form tagsets to lexical classes. If add_tagids is true, then tag-IDs will be assigned as needed for the element tags. If you specify NULL as the lexical class, a new one will be allocated and returned (you must then delete it yourself!)

Note:
lclass is NOT cleared by this method.

TokID moot::mootHMM::token2id const mootTokString   token const [inline]
 

Get the TokID for a given token, using type-based lookup

void moot::mootHMM::txtdump FILE *    file
 

Debugging method: dump basic HMM contents to a text file.

void moot::mootHMM::unknown_class_name const mootTagSet   tagset [inline]
 

void moot::mootHMM::unknown_tag_name const mootTokString   name [inline]
 

Set the unknown tag : this tag should never appear anyways

void moot::mootHMM::unknown_token_name const mootTokString   name [inline]
 

Set the unknown token name : UNSAFE!

ViterbiNode* moot::mootHMM::viterbi_best_node TagID    tagid [inline]
 

Get best current path from Viterbi state tables resulting in tag 'tagid'. The best full path to this node can be reconstructed (in reverse order) by traversing the 'pth_prev' pointers until (pth_prev==NULL).

ViterbiNode* moot::mootHMM::viterbi_best_node void    [inline]
 

Get best current node from Viterbi state tables, considering all possible current tags (all rows in current column). The best full path to this node can be reconstructed (in reverse order) by traversing the pth_prev pointers until (pth_prev==NULL) .
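The back-pointer traversal can be sketched as follows. The Node struct is a hypothetical stand-in for ViterbiNode: only the pth_prev member is taken from the description above, and the other fields and the path_tags() helper are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a trellis node; only pth_prev is named in the text above.
struct Node {
    unsigned int tagid;  // assumed field: tag of this node
    double lprob;        // assumed field: log-probability of the best path ending here
    Node* pth_prev;      // previous node on the best path, NULL at the start
};

// Walk pth_prev back to the start, then reverse to obtain input order.
std::vector<unsigned int> path_tags(const Node* last) {
    std::vector<unsigned int> tags;
    for (const Node* n = last; n != NULL; n = n->pth_prev) {
        tags.push_back(n->tagid);
    }
    std::reverse(tags.begin(), tags.end());
    // Postcondition: in input order, the path ends at the node we started from.
    assert(tags.empty() || tags.back() == last->tagid);
    return tags;
}
```

The traversal itself yields the path in reverse order, which is why an explicit reversal (or a separate path structure such as ViterbiPathNode) is needed to read it in input order.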

ViterbiPathNode* moot::mootHMM::viterbi_best_path const mootTagString   tagstr [inline]
 

Get current best path (in input order), considering only tag 'tagstr'

ViterbiPathNode* moot::mootHMM::viterbi_best_path TagID    tagid [inline]
 

Get current best path (in input order), considering only tag 'tagid'

ViterbiPathNode* moot::mootHMM::viterbi_best_path void    [inline]
 

Get current best path (in input order), considering all current tags

void moot::mootHMM::viterbi_clear void   
 

Clear Viterbi state table(s)

void moot::mootHMM::viterbi_clear_bestpath void    [inline]
 

Clear internal @vbestpath temporary

bool moot::mootHMM::viterbi_column_ok const ViterbiColumn   col const [inline]
 

Returns true iff @col is a valid (non-empty) Viterbi trellis column

void moot::mootHMM::viterbi_finish void    [inline]
 

Run final Viterbi iteration, using instance datum start_tagid as the final tag.

void moot::mootHMM::viterbi_finish const TagID    final_tagid [inline]
 

Run final Viterbi iteration, using final_tagid as the boundary tag

ViterbiColumn* moot::mootHMM::viterbi_get_column void    [inline]
 

Returns a pointer to an unused ViterbiColumn, possibly allocating a new one.

ViterbiNode* moot::mootHMM::viterbi_get_node void    [inline]
 

Returns a pointer to an unused ViterbiNode, possibly allocating a new one.

ViterbiPathNode* moot::mootHMM::viterbi_get_pathnode void    [inline]
 

Returns a pointer to an unused ViterbiPathNode, possibly allocating a new one.

ViterbiRow* moot::mootHMM::viterbi_get_row void    [inline]
 

Returns a pointer to an unused ViterbiRow, possibly allocating a new one.
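The "unused object, possibly newly allocated" pattern behind these viterbi_get_*() methods and the trash stacks can be sketched as a simple free list. NodePool, PoolNode, and their members are hypothetical names, and the real trellis types carry more data than this.

```cpp
#include <cassert>
#include <cstddef>

// Minimal recyclable object; 'next' doubles as the trash-stack link.
struct PoolNode {
    int payload;
    PoolNode* next;
};

// Free-list pool: get() reuses a recycled node when one is available.
class NodePool {
public:
    NodePool() : trash_(NULL), nallocated_(0) {}

    // Frees only nodes currently on the trash stack, so callers must
    // recycle every node they obtained before the pool is destroyed.
    ~NodePool() {
        while (trash_ != NULL) {
            PoolNode* n = trash_;
            trash_ = n->next;
            delete n;
        }
    }

    // Return an unused node, allocating a new one only if the stack is empty.
    PoolNode* get() {
        if (trash_ != NULL) {
            PoolNode* n = trash_;
            trash_ = n->next;
            n->next = NULL;
            return n;
        }
        ++nallocated_;
        return new PoolNode();
    }

    // Push a node that is no longer needed onto the trash stack.
    void recycle(PoolNode* n) {
        assert(n != NULL);
        n->next = trash_;
        trash_ = n;
    }

    std::size_t allocated() const { return nallocated_; }

private:
    PoolNode* trash_;
    std::size_t nallocated_;
    NodePool(const NodePool&);             // non-copyable
    NodePool& operator=(const NodePool&);  // non-copyable
};
```

Recycling avoids repeated heap allocation when a trellis is rebuilt for every sentence, since most sentences need no more nodes than an earlier one already allocated.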

ViterbiPathNode* moot::mootHMM::viterbi_node_path ViterbiNode   node [inline]
 

Useful utility: build a path (in input order) from a ViterbiNode. See caveats for 'struct ViterbiPathNode' -- return value is non-const for easy iteration.

Uses 'vbestpath' to store constructed path.

ViterbiColumn* moot::mootHMM::viterbi_populate_row TagID    curtagid,
ProbT    wordpr = 0.0,
ViterbiColumn   col = 0,
ProbT    probmin = 1.0
[inline]
 

Get and populate a new Viterbi-trellis row in column @col for destination Tag-ID @curtagid with lexical (log-)probability @wordpr. If @col is NULL (the default), a new column will be allocated. Returns a pointer to the trellis column, or NULL on failure.

If specified, @probmin can be used to override beam-pruning for non-NULL columns.

void moot::mootHMM::viterbi_step const mootTokString   toktext,
const mootTagString   tag
[inline]
 

DEPRECATED

Step a single Viterbi iteration, considering only the tag tag : string version.

void moot::mootHMM::viterbi_step TokID    tokid,
TagID    tagid,
ViterbiColumn   col = 0
 

Step a single Viterbi iteration, considering only the tag tagid.

void moot::mootHMM::viterbi_step const mootTokString   token_text,
const set< mootTagString > &    tags
[inline]
 

DEPRECATED

Step a single Viterbi iteration, considering only the tags in tags.

void moot::mootHMM::viterbi_step const mootTokString   token_text [inline]
 

DEPRECATED in favor of viterbi_step(mootToken)

Step a single Viterbi iteration, string version. Really just a wrapper for viterbi_step(TokID tokid).

void moot::mootHMM::viterbi_step TokID    tokid,
const mootTokString   toktext = ""
 

Step a single Viterbi iteration, considering all known tags for tokid as possible analyses. May be faster in cases where no further information (i.e. set of possible tags) is available.

void moot::mootHMM::viterbi_step TokID    tokid,
ClassID    classid,
const LexClass   lclass,
const mootTokString   toktext = ""
 

Step a single Viterbi iteration, considering only the tags in lclass

void moot::mootHMM::viterbi_step TokID    tokid,
const LexClass   lexclass,
const mootTokString   toktext = ""
[inline]
 

Step a single Viterbi iteration, considering only the tags in lexclass -- useful if you have some a priori information on the token.

void moot::mootHMM::viterbi_step const mootToken   token [inline]
 

Step a single Viterbi iteration, mootToken version. Really just a wrapper for viterbi_step(TokID,set<TagID>).

void moot::mootHMM::viterbi_txtdump TokenWriter   w,
int    ncols = 0
 

Debugging method: dump entire Viterbi trellis to a text file

void moot::mootHMM::viterbi_txtdump_col TokenWriter   w,
ViterbiColumn   col,
int    colnum = 0
 

Debugging method: dump single Viterbi column to a text file

const ProbT moot::mootHMM::wordp const mootTokString    token,
const mootTagString    tag
const [inline]
 

DEPRECATED

Looks up and returns lexical probability: p(token|tag) given token, tag.

const ProbT moot::mootHMM::wordp const TokID    tokid,
const TagID    tagid
const [inline]
 

Looks up and returns lexical probability: p(tokid|tagid) given tokid, tagid.


Member Data Documentation

ProbT moot::mootHMM::beamwd
 

(log) Beam-search width: during Viterbi search, heuristically prune paths whose probability is <= (1/beamwd)*p_best. A value of zero indicates no beam pruning.
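In log space the pruning rule becomes a subtraction: a path is dropped when log(p) <= log(p_best) - log(beamwd). The following sketch shows that test on a column of path scores. beam_prune() is a hypothetical helper, and it takes the beam width as a plain factor and converts it itself, whereas the member above is described as already stored in log form.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Keep only the log-probabilities strictly above log(p_best) - log(beamwd).
// beamwd is a plain (non-log) factor here; 0 disables pruning.
std::vector<double> beam_prune(const std::vector<double>& logps, double beamwd) {
    // A factor in (0,1] would prune the best path itself under the "<=" rule.
    assert(beamwd == 0.0 || beamwd > 1.0);
    if (beamwd == 0.0 || logps.empty()) return logps;

    double best = logps[0];
    for (std::size_t i = 1; i < logps.size(); ++i) {
        if (logps[i] > best) best = logps[i];
    }

    const double cutoff = best - std::log(beamwd);
    std::vector<double> kept;
    for (std::size_t i = 0; i < logps.size(); ++i) {
        if (logps[i] > cutoff) kept.push_back(logps[i]);
    }
    return kept;
}
```

With a beam width of 1000 and a best path of probability 0.5, for example, any path with probability at or below 0.0005 is discarded.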

ProbT moot::mootHMM::clambda0
 

(log) Smoothing constant for class probabilities

ProbT moot::mootHMM::clambda1
 

(log) Smoothing constant for class probabilities

ClassIDTable moot::mootHMM::classids
 

Class-ID lookup table

TokID moot::mootHMM::flavids[NTokFlavors]
 

LexClassProbTable moot::mootHMM::lcprobs
 

Lexical-class probability lookup table

LexProbTable moot::mootHMM::lexprobs
 

Lexical probability lookup table

size_t moot::mootHMM::n_classes
 

Number of known lexical classes

size_t moot::mootHMM::n_tags
 

Number of known tags: used to compute lookup indices

size_t moot::mootHMM::n_toks
 

Number of known tokens: used for sanity checks

size_t moot::mootHMM::ndots
 

Print a dot for every ndots tokens processed if reporting progress. Default=0 (no dot printing).

size_t moot::mootHMM::nfallbacks
 

Number of fallbacks in viterbi_step()

ProbT moot::mootHMM::nglambda1
 

(log) Smoothing constant for unigrams

ProbT moot::mootHMM::nglambda2
 

(log) Smoothing constant for bigrams

BigramProbTable moot::mootHMM::ngprobs2
 

N-gram (log-)probability lookup table: bigrams

size_t moot::mootHMM::nnewclasses
 

Number of unknown-class tokens processed

size_t moot::mootHMM::nnewtokens
 

Total number of unknown-tokens processed

size_t moot::mootHMM::nsents
 

Total number of sentences processed

size_t moot::mootHMM::ntokens
 

Total number of tokens processed

size_t moot::mootHMM::nunclassed
 

Number of classless tokens processed

size_t moot::mootHMM::nunknown
 

Number of totally unknown (token,class) pairs processed

bool moot::mootHMM::save_ambiguities
 

Add contents of Viterbi trellis to @analyses members of mootToken elements on tag_mark_best()

bool moot::mootHMM::save_dump_trellis
 

Save Viterbi trellis on tag_sentence()

bool moot::mootHMM::save_flavors
 

Add flavor names to @analyses members of mootToken elements on tag_mark_best()

bool moot::mootHMM::save_mark_unknown
 

Mark unknown tokens with a single analysis '*' on tag_mark_best()

TagID moot::mootHMM::start_tagid
 

Boundary tag, used during compilation, viterbi_start(), and viterbi_finish(). This gets set by the start_tag_str argument to compile(). Whatever it is, it should be consistent with what you trained on. Default = "__$".

SuffixTrie moot::mootHMM::suftrie
 

string-suffix (log-)probability trie

TagIDTable moot::mootHMM::tagids
 

Tag-ID lookup table

TokIDTable moot::mootHMM::tokids
 

Token-ID lookup table

ViterbiColumn* moot::mootHMM::trash_columns [protected]
 

Recycling bin for Viterbi trellis columns

ViterbiNode* moot::mootHMM::trash_nodes [protected]
 

Recycling bin for Viterbi trellis nodes

ViterbiPathNode* moot::mootHMM::trash_pathnodes [protected]
 

Recycling bin for Viterbi path-nodes

LexClass moot::mootHMM::uclass
 

LexClass to use for unknown tokens with no analyses. This gets set at compile-time. You can re-assign it after that if you are so inclined.

ProbT moot::mootHMM::unknown_class_threshhold
 

"Unknown" lexical-class threshold: used during compilation to determine whether a class's statistics are recorded as "pure" class probabilities or as probabilities for the "unknown" class. This is just a raw count: the minimum number of times a class must have occurred in the training data in order for us to record statistics about it as "pure" lexical-class probabilities. Default=1.

ProbT moot::mootHMM::unknown_lex_threshhold
 

"Unknown" lexical threshold: used during compilation to determine whether a token's statistics are recorded as "pure" lexical probabilities or as probabilities for the "unknown" token. This is just a raw count: the minimum number of times a token must have occurred in the training data in order for us to record statistics about it as "pure" lexical probabilities. Default=1.

bool moot::mootHMM::use_lex_classes
 

Whether to use class probabilities (Default=true)

Warning:
Don't set this to true unless your input files actually contain a priori analyses generated by the same method on which you trained your model; otherwise, expect abominable accuracy.

ViterbiPathNode* moot::mootHMM::vbestpath [protected]
 

For node->path conversion

ViterbiNode* moot::mootHMM::vbestpn [protected]
 

Best previous node for viterbi_step()

ProbT moot::mootHMM::vbestpr [protected]
 

Best (log-)probability for viterbi_step()

int moot::mootHMM::verbose
 

Verbosity level. See the VerbosityLevel enum. Default=1. Not yet respected by all warnings.

ViterbiColumn* moot::mootHMM::vtable
 

Low-level trellis structure for Viterbi algorithm

TagID moot::mootHMM::vtagid [protected]
 

Current tag-id under consideration for viterbi_step()

ProbT moot::mootHMM::vtagpr [protected]
 

(log-)Probability for current tag-id for viterbi_step()

ProbT moot::mootHMM::vwordpr [protected]
 

Save (log-)word-probability

ProbT moot::mootHMM::wlambda0
 

(log) Smoothing constant for lexical probabilities

ProbT moot::mootHMM::wlambda1
 

(log) Smoothing constant for lexical probabilities


The documentation for this class was generated from the following file:
Generated on Mon Sep 11 16:10:36 2006 for libmoot by doxygen1.2.18