This manpage describes various file formats used by the moot PoS tagging utilities.
Most moot utility programs support global and user-specific configuration files which can be used to set system defaults and/or user preferences for values of program options.
Configuration files are expected to contain lines of the form:
LONG_OPTION_NAME OPTION_VALUE
where LONG_OPTION_NAME is the long name of one of the program's options, without the leading '--', and OPTION_VALUE is the value for that option, if any. Fields are whitespace-separated. Blank lines and comments (lines beginning with a '#' character) are ignored.
The following configuration files are read by default, where ${PROGNAME} is the name of a moot utility program, and ${HOME} is the home directory of the current user:
System defaults file; read first.
User preferences file; can be used to override system defaults.
Any options specified on the command-line override defaults from a program configuration file.
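For concreteness, a hypothetical user preferences file might look like the following. The option names shown are the long names of I/O format options described later in this manpage; whether a given option is supported depends on the utility in question:

```
# hypothetical user preferences for a moot utility
# LONG_OPTION_NAME  OPTION_VALUE
input-format   medium
output-format  medium
```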
A "raw" text file is just that: any file consisting of (8-bit) characters. Such files may be processed by the mootpp(1) preprocessor to produce "rare cooked" (-tagged, -analyzed) text files. An example "raw" text file is:
This is a test. This too.
A "cooked" text file is a text file which encodes information such as token boundaries, sentence boundaries, part-of-speech tags, and/or potential analyses. The moot utilities distinguish between several different types of cooked text file: in order of ascending informational content, these are: "rare" (-tagged, -analyzed), "medium rare" (-tagged, +analyzed), "medium" (+tagged, -analyzed), "well done" (+tagged, +analyzed), and "refried" (+tagged, +analyzed, +evaluated). Different moot utilities require their input files to be more or less "cooked" -- see the documentation of the individual utilities for details.
Native "cooked" text files are conventionally identified by the filename infix ".moot".
The most basic level of "cookedness", a "rare" text file encodes only token- and sentence-boundaries. By convention, "rare" filenames carry the extension ".t". The syntax is:
RARE_FILE  ::= {RARE_LINE}*
RARE_LINE  ::= ({COMMENT} | {EOS} | {RARE_TOKEN}) {NEWLINE}
COMMENT    ::= {SPACE}* "%%" ([^{NEWLINE}])*
EOS        ::= ( {SPACE}* {NEWLINE} )+
RARE_TOKEN ::= {TOKEN_TEXT}
TOKEN_TEXT ::= ( {WORDCHAR} | {SPACE} )+
SPACE      ::= " "
NEWLINE    ::= "\n" | "\r"
WORDCHAR   ::= [^{SPACE}{NEWLINE}]
Leading and trailing spaces are stripped from token text; it is thus impossible to declare an "empty" token. An example "rare cooked" file is:
%% Example rare cooked file for moot
%% Sentence 1
This
is
a
test
.

%% Sentence 2
This
too
.
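Reading a "rare" file reduces to grouping non-blank, non-comment lines into sentences. The following is a minimal sketch of such a reader (mine, not moot's own implementation); it takes any iterable of lines:

```python
def read_rare(lines):
    """Group moot 'rare' cooked input lines into sentences (lists of tokens)."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if line.startswith("%%"):       # comment line: ignored
            continue
        if not line:                    # blank line marks end-of-sentence
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(line)        # token text, whitespace-stripped
    if current:                         # final sentence may lack a trailing blank line
        sentences.append(current)
    return sentences
```

Applied to the example above, this yields two sentences of five and three tokens respectively.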
A "medium rare" file is at least as informative as a "rare" file -- that is, it encodes everything that a "rare" file encodes, and in exactly the same fashion. Additionally, a "medium rare" file may contain for each token a set of (TAB-separated) possible analyses for that token, where an "analysis" contains at least a part-of-speech tag, and possibly also a numeric cost and arbitrary analysis details. Somewhat counter-intuitively, every "rare" file is also a "medium rare" file in which every token is associated with an empty set of possible analyses. By convention, "medium rare" filenames carry the extension ".mrt".
MED_RARE_FILE  ::= {MED_RARE_LINE}*
MED_RARE_LINE  ::= ({COMMENT} | {EOS} | {MED_RARE_TOKEN}) {NEWLINE}
MED_RARE_TOKEN ::= {TOKEN_TEXT} ( {TAB} {ANALYSIS} )*
ANALYSIS       ::= {DETAIL_PREFIX}? {TAG} {DETAIL_SUFFIX}?
DETAIL_PREFIX  ::= ( {WORDCHAR} | {SPACE} | {COST} )* ("[" ("_"?))?
COST           ::= "<" ("-"|"+")? ([0-9]* ".")? [0-9]+ ">"
TAG            ::= {TAGCHAR}+
DETAIL_SUFFIX  ::= ( {WORDCHAR} | {SPACE} | {COST} )*
TAGCHAR        ::= [^{SPACE}{TAB}{NEWLINE}"]"]
TAB            ::= "\t"
Leading and trailing spaces are stripped from token text, as well as from analysis-detail and -tag text. It should be noted that the TAG component of each ANALYSIS is "greedy" -- if an analysis contains no left-bracket to mark the beginning of a tag, then the whole analysis (up to the first right-bracket or space) is considered the tag. Also, if the analysis contains multiple left-brackets, only the first is considered to introduce the TAG component. An example "medium rare" file is:
%% Example medium-rare cooked file for moot
%% Sentence 1 : possible analyses are tags only
This	NE	PDAT	PDIS
is	VAFIN	VVFIN
a	ART	CARD
test	NN	VVFIN
.	$.

%% Sentence 2 : detailed analyses, with unknown word "foo".
This	This [NE type="name"] <420>	<24.7> this [_PDAT][_sg]
foo
.	. [$.] <-42>
Tokens in "medium rare" files with empty analysis sets (i.e. RARE_TOKENs) are called "unrecognized" tokens.
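The "greedy" TAG extraction rule described above can be sketched in a few lines. This is my own approximation of the grammar, not moot's parser: the first '[' (optionally followed by '_') introduces the tag; with no '[', the tag runs to the first ']' or whitespace:

```python
import re

def analysis_tag(analysis):
    """Extract the TAG component of a moot analysis string (greedy rule).
    Sketch of the grammar above, not the reference implementation."""
    s = analysis.strip()
    lb = s.find("[")
    if lb >= 0:
        s = s[lb + 1:]            # only the first '[' introduces the TAG
        if s.startswith("_"):     # a "[_" prefix: the underscore is detail, skip it
            s = s[1:]
    m = re.match(r"[^\s\]]+", s)  # TAGCHAR+: no space, tab, newline, or ']'
    return m.group(0) if m else ""
```

For the detailed analyses in the example above, this yields "NE", "PDAT", and "$." respectively.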
A "medium" file can be understood as a "medium rare" file which associates exactly one analysis with each token. The tag for this analysis is considered the "best" tag for the associated token. By convention, "medium" filenames carry the extension ".tt" (tagger output) or ".ttt" (gold standard).
MEDIUM_FILE   ::= {MEDIUM_LINE}*
MEDIUM_LINE   ::= ({COMMENT} | {EOS} | {MEDIUM_TOKEN}) {NEWLINE}
MEDIUM_TOKEN  ::= {TOKEN_TEXT} {TAB} {BEST_ANALYSIS}
BEST_ANALYSIS ::= {ANALYSIS}
As before, leading and trailing spaces are stripped from token text and analyses, and the TAG component of each ANALYSIS is "greedy". An example "medium" file is:
%% Example medium cooked file for moot
%% Sentence 1 : best analyses are tags only
This	PDAT
is	VVFIN
a	ART
test	NN
.	$.

%% Sentence 2 : tags embedded in detailed analyses
This	<24.7> this [PDAT num="sg"]
too	<0.0> too [ADV]
.	<-42> . [$.]
A "well done" file can be understood as the synthesis of a "medium rare" and a "medium" file: it contains a "best" analysis for each token (the first one), as well as a set of a priori potential analyses for that token. By convention, "well done" filenames carry the extension ".wd" (tagger output) or ".wdt" (gold standard).
WELL_DONE_FILE  ::= {WELL_DONE_LINE}*
WELL_DONE_LINE  ::= ({COMMENT} | {EOS} | {WELL_DONE_TOKEN}) {NEWLINE}
WELL_DONE_TOKEN ::= {TOKEN_TEXT} {TAB} {BEST_ANALYSIS} ( {TAB} {ANALYSIS} )*
As before, leading and trailing spaces are stripped from token text and analyses, and the TAG component of each ANALYSIS is "greedy". An example "well done" file is:
%% Example well-done cooked file for moot
%% Sentence 1 : analysis-set tags bracketed for clarity
This	PDAT	[NE]	[PDAT]	[PDIS]
is	VVFIN	[VAFIN]	[VVFIN]
a	ART	[ART]	[CARD]
test	NN	[NN]	[VVFIN]
.	$.	[$.]

%% Sentence 2 : analysis-tags embedded in complete analyses
This	PDAT	[NE type="last"] This <420>	[PDAT num="sg"] this <24.7>
too	ADV	[ADV] too <0.0>
.	$.	[$.] . <-42>
A "refried" file is essentially the synthesis of a pair of "medium" or "well done" files. "Refried" files can be created by the mooteval(1) program from a pair of parallel cooked files. Each line of a "refried" file contains a status code and a pair of "well-done" style token analyses, separated by tabs and a single slash '/'.
REFRIED_FILE     ::= {REFRIED_LINE}*
REFRIED_LINE     ::= ( {COMMENT} | {EOS} | {REFRIED_TOKEN} ) {NEWLINE}
REFRIED_TOKEN    ::= {STATUS_CODE} {TAB} {REFRIED_SOURCES}
REFRIED_SOURCES  ::= {WELL_DONE_TOKEN} {TAB} "/" {TAB} {WELL_DONE_TOKEN}
STATUS_CODE      ::= {BASIC_FLAGS} ":" {FILE1_FLAGS} ":" {FILE2_FLAGS}
BASIC_FLAGS      ::= {TOKMATCH_FLAG} {BESTMATCH_FLAG}
TOKMATCH_FLAG    ::= "-" | "t"
BESTMATCH_FLAG   ::= "-" | "b"
FILE1_FLAGS      ::= {FILE_FLAGS}
FILE2_FLAGS      ::= {FILE_FLAGS}
FILE_FLAGS       ::= {EMPTY_FLAG} {IMPOSSIBLE_FLAG} {XIMPOSSIBLE_FLAG}
EMPTY_FLAG       ::= "-" | "e"
IMPOSSIBLE_FLAG  ::= "-" | "i"
XIMPOSSIBLE_FLAG ::= "-" | "x"
As before, leading and trailing spaces are stripped from token text and analyses, and the TAG component of each ANALYSIS is "greedy".
The STATUS_CODE component of each REFRIED_TOKEN encodes a number of flags concerning which part (if any) of the compared tokens did not match. The general convention is that a '-' character indicates that the compared components matched (or at least were compatible).
'-' if token text components matched, otherwise 't'.
'-' if best-tag components matched, otherwise 'b'.
'-' if token ANALYSES were non-empty (for the given file), otherwise 'e'.
'-' if token ANALYSES included token BESTTAG (for the corresponding file), otherwise 'i'.
'-' if token ANALYSES included token BESTTAG for the other file, otherwise 'x'.
An example "refried" file is:
%% Example refried file for moot
%% FLAGS	TOK1	TOK1TAG1 ...	/	TOK2	TOK2TAG1 ...
%%------------------------------------------------------------------------------------
t-:---:---	Dis	PDAT	[PDAT]	[PDIS]	/	This	PDAT	[PDAT]	[PDIS]
-b:---:---	is	VAFIN	[VAFIN]	[VVFIN]	/	is	VVFIN	[VAFIN]	[VVFIN]
--:e--:---	a	ART	/	a	ART	[ART]	[CARD]
-b:-i-:---	test	NN	[VVFIN]	/	test	VVFIN	[NN]	[VVFIN]
--:---:---	.	$.	[$.]	/	.	$.	[$.]
-b:--x:---	This	PDAT	[PDAT]	/	This	PDIS	[PDAT]	[PDIS]
--:---:-ix	too	ADV	[ADV]	[PTKA]	/	too	ADV	[CONJ]
--:---:e--	.	$.	[$.]	/	.	$.
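The STATUS_CODE fields can be decoded mechanically. The following sketch follows the flag positions defined by the grammar above; the field names in the returned dictionary are my own, not moot's:

```python
def decode_status(code):
    """Decode a 'refried' STATUS_CODE such as '-b:-i-:--x' into named booleans."""
    basic, f1, f2 = code.split(":")
    def file_flags(s):
        return {"empty":        s[0] == "e",   # analysis set was empty
                "impossible":   s[1] == "i",   # besttag not in own analysis set
                "x_impossible": s[2] == "x"}   # other file's besttag not in set
    return {"tok_mismatch":  basic[0] == "t",  # token text differed
            "best_mismatch": basic[1] == "b",  # best tags differed
            "file1": file_flags(f1),
            "file2": file_flags(f2)}
```

For instance, decoding "-b:-i-:---" (from the 'test' line above) reports a best-tag mismatch plus an "impossible" best tag for the first file.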
moot currently uses the (extremely cool and amazingly fast) Expat XML parser library by James Clark for incremental processing of XML documents. (A previous implementation used libxml2 -- also extremely cool, but not quite as amazingly fast as Expat -- but moot's libxml2 support is no longer maintained and is disabled by default.) Output recoding is handled by librecode by François Pinard. Both Expat and librecode support are compile-time options -- check the contents of 'mootConfig.h' to see whether they are enabled on your system.
When working with "cooked" XML (see below), it is critical to remember that the moot internal processing routines always receive token and PoS-tag text encoded in UTF-8, regardless of the document encoding. This is of particular importance when converting from native to XML format, i.e. with 'mootchurn' -- it is highly recommended that you use the 'recode' command-line utility (distributed with 'librecode') to ensure that your native text data is true UTF-8 before passing it to 'mootchurn' for XML output.
Similarly, HMM model data (see HMM MODEL FILE FORMATS) must be UTF-8 encoded for tagging in XML mode. There is currently no way to directly convert the encoding of a binary model file, but text model files can be converted with the 'recode' command-line utility.
Future implementations might use locale information to (partially) automate the recoding process. If all of your data (training corpus, test corpus, and runtime corpora) are parsed in XML mode, none of the above should present a problem.
XML files are identified by the filename infix '.xml'.
A "raw" XML file is just like a "raw" text file. The 'mootpp' program supports rudimentary recognition and removal of (SG|HT|X)ML markup.
As of version 2.0.0, the moot utilities support "cooked" XML files, in addition to the native text format(s). See "Cooked Text Files" above for more details on the native formats and the information content corresponding to the various subtypes.
All "cooked" XML formats share the same structure (much as the "cooked" text formats are defined in terms of one another). The preliminary syntax (subject to change without notice) is:
COOKED_XML_FILE    ::= {XML_DECL}? {XML_CONTENT}*
XML_DECL           ::= "<?xml " ... "?>"
XML_CONTENT        ::= {XML_EOS} | {XML_RAW} | {XML_TOKEN}
XML_EOS            ::= "<eos/>"
XML_RAW            ::= ...
XML_TOKEN          ::= "<token>" {XML_TOKEN_CONTENT} "</token>"
XML_TOKEN_CONTENT  ::= ({XML_TOKEN_TEXT} | {XML_TOKEN_ANALYSIS} | {XML_TOKEN_BESTTAG} | {XML_RAW})*
XML_TOKEN_TEXT     ::= "<text>" {TOKEN_TEXT} "</text>"
XML_TOKEN_BESTTAG  ::= "<moot.tag>" {TOKEN_BESTTAG} "</moot.tag>"
XML_TOKEN_ANALYSIS ::= '<analysis pos="' {ANALYSIS_TAG} '">' {ANALYSIS_DETAILS} "</analysis>"
ANALYSIS_DETAILS   ::= {XML_RAW}*
The document structure is thus expected to be something like the following (in a bastard notation born of BNF and XPath):
SENTENCE_BOUNDARY ::= //eos                            # really only end-elts
TOKEN_TEXT        ::= //token//text/text()             # should be accurate
ANALYSIS_TAG      ::= //token//analysis/@pos           # uses attribute value (not full node)
ANALYSIS_DETAILS  ::= //token//analysis/text()         # buggy -- actually ignored!
TOKEN_BESTTAG     ::= //token//moot.tag[last()]/text() # should be accurate
Contact the author if you need any of the following done:
Pull up literal element name parameters from TokenReaderExpat to user-level.
Add a DTD for the default XML format to the distribution.
An example "cooked" XML document is the following:
<?xml version="1.0"?>
<doc>
 <!-- Sentence-1 : Well Done, Medium, and Medium Rare -->
 <token>
  <!-- A 'well done' token with minimal structure -->
  <text>This</text>
  <moot.tag>PDAT</moot.tag>
  <analysis pos="NE"/>
  <analysis pos="NN"/>
  <analysis pos="PDAT"/>
  <analysis pos="PDS"/>
 </token>
 <token>
  <!-- A 'well done' token with extra structure -->
  <text>is</text>
  <extraneous.element>
   <analysis pos="VAFIN"/>
   <moot.tag>VVFIN</moot.tag>
   <analysis pos="VVFIN"/>
  </extraneous.element>
 </token>
 <token>
  <!-- Yet another 'well done' token -->
  <text>a</text>
  <other_extraneous_element>
   <analysis pos="ART"/>
  </other_extraneous_element>
  <moot.tag>ART</moot.tag>
 </token>
 <token>
  <!-- A 'medium' token -->
  <text>Test</text>
  <moot.tag>NN</moot.tag>
 </token>
 <token>
  <!-- A 'Medium Rare' token -->
  <text>.</text>
  <analysis pos="$."/>
 </token>
 <eos/>

 <!-- Sentence-2 : Rare tokens only -->
 <token><text>This</text></token>
 <token><text>too</text></token>
 <token><text>.</text></token>
 <eos/>
</doc>
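The document structure sketched above can be traversed with any XML library. The following is a minimal illustrative reader using Python's ElementTree (moot itself uses an Expat-based reader; the tuple layout here is my own):

```python
import xml.etree.ElementTree as ET

def read_cooked_xml(xml_text):
    """Collect (text, besttag, analysis_tags) for each <token> element."""
    tokens = []
    for tok in ET.fromstring(xml_text).iter("token"):
        texts = [t.text or "" for t in tok.iter("text")]     # //token//text
        best = [b.text for b in tok.iter("moot.tag")]        # //token//moot.tag
        tags = [a.get("pos") for a in tok.iter("analysis")]  # //token//analysis/@pos
        tokens.append((texts[0].strip() if texts else "",
                       best[-1] if best else None,           # last moot.tag wins
                       tags))
    return tokens
```

Note that using iter() rather than direct children tolerates "extraneous" wrapper elements, as in the example document above.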
Several moot utilities are capable of processing input in a number of different formats, typically specified by '--input-format' (-I) and '--output-format' (-O) command-line options. The following list briefly describes the (case-insensitive) format flags which may occur as individual elements of the comma-separated list passed as an argument to these format options. Each format flag may be preceded by an exclamation point "!" to indicate the negation of the respective format property. Note that at the current time, not all formats support all available flags.
If no format flags are specified by the user, the moot utilities will attempt to guess an appropriate format based on the filename and on the requirements for the particular utility in question.
No flags at all. This should never really happen at runtime, and should cause a default format to be assumed and/or an appropriate format to be guessed from the relevant filename(s).
If you specify 'null' as an output format, no output will actually be written (useful for testing and benchmarking the input layer).
Unknown format. This should never ever happen, and should cause a reversion to some default format.
Specifies native text format I/O, as opposed to XML.
Specifies XML format I/O, as opposed to a native text format.
Beautified XML format. Useful for human-readable XML output. Not all XML I/O modes support cosmetic surgery.
Conservative XML format: attempt to preserve as much of the input document structure as possible. Only meaningful if both XML input and XML output are requested.
Read/write token text (all formats).
Read/write token analyses ('medium rare' or 'well done' formats only).
Read/write 'best tags' ('medium' or 'well done' formats only).
Read/write token locations as logical pairs (BYTE_OFFSET,BYTE_LENGTH) from/to the input stream as the first non-tag analysis. Useful if you need to refer back to earlier stages of a token processing pipeline.
Read/write analysis "costs" from/to analysis "<NUMBER>" suffixes. This flag may be set by default in future versions.
If set as an output format flag, causes a verbose dump of the Viterbi trellis to be spliced into every tagged sentence as post-token comments. Does nothing as an input flag (yet).
For 'well done' formats, ignore analyses which do not correspond to the 'best' tag.
Alias for 'Text'.
Alias for 'Text,Analyzed'.
Alias for 'Text,Tagged'.
Alias for 'Text,Tagged,Analyzed'.
Read input as native rare text (tokens only), write output as medium (best-tagged) native text:
moot --input-format=native,text --output-format=native,text,tagged
Same thing, only shorter:
moot --input-format=rare --output-format=medium
Same thing, even shorter:
moot -Ir -Om
Same thing, using filename conventions:
moot input.moot.t -o output.moot.tt
Read medium rare (pre-analyzed) XML, write well-done native text:
moot -I xml,mediumrare -O native,welldone
Same thing, using filename conventions:
moot input.mr.xml -o output.wd.moot
The moothmm(1) program can use either text-format or native binary-format model files, which encode raw frequency counts (text model files) or probability tables and compile-time flags for the Hidden Markov Model (binary model files), respectively.
A "Text Model" is completely specified by three files: a lexical frequency file (*.lex), an n-gram frequency file (*.123), and an optional lexical-class frequency file (*.clx).
When specifying a text model name to a moot utility program, you may specify the model name as TMODEL in order to use the files TMODEL.lex , TMODEL.123 , and TMODEL.clx (if present). Otherwise, you may specify a composite model name as a comma-separated list of the individual component filenames: mylex.lex,myngrams.123,myclasses.clx.
Lexical frequency files store raw frequencies for known tokens and (token,tag) pairs. The format used is ca. 99.998% compatible with that generated by the tnt-para(1) program:
LEX_FILE    ::= ({COMMENT} | {BLANK_LINE} | {LEX_ENTRY})*
COMMENT     ::= {SPACE}* "%%" ([^{NEWLINE}])* {NEWLINE}
BLANK_LINE  ::= {SPACE}* {NEWLINE}
LEX_ENTRY   ::= {TOKEN_TEXT} {TAB} {TOKEN_TOTAL} ( {TAB} {TAG_COUNT} )*
TAG_COUNT   ::= {TAG_TEXT} {TAB} {TOK_TAG_CT}
TOKEN_TOTAL ::= {COUNT}
TOK_TAG_CT  ::= {COUNT}
TOKEN_TEXT  ::= {STRING} | {SPECIAL_TOK}
TAG_TEXT    ::= {STRING}
STRING      ::= ( [^{TAB}{NEWLINE}] )+
COUNT       ::= ("-"|"+")? ([0-9]* ".")? [0-9]+
NEWLINE     ::= "\n" | "\r"
TAB         ::= "\t"
SPECIAL_TOK ::= "@UNKNOWN" | "@CARD" | "@CARDSEPS" | "@CARDPUNCT" | "@CARDSUFFIX"
Leading and trailing spaces are stripped from token and tag text.
The special tokens whose text begins with an '@' character declare counts for special token types:
Declares frequency counts to be used when no other training data is available (i.e. for alphabetic tokens which did not occur in the training corpus).
Declares frequency counts to be used for tokens consisting only of digits -- tokens which match the regex:
[0-9]+
Declares frequency counts to be used for tokens which contain digits and separators. The regex matching such tokens is (?):
([[:digit:]]+)([\.\,\-]|[[:digit:]])*
Declares frequency counts to be used for tokens which contain digits followed by punctuation. The regex matching such tokens is (?):
([[:digit:]]+)([[:punct:]])
Declares frequency counts to be used for tokens which contain digits followed by some suffix. The regex matching these tokens depends on whether the moot_TNT_COMPAT macro was defined when you compiled libmoot:
If moot_TNT_COMPAT was defined, then the suffix of "@CARDSUFFIX" tokens is required to be of maximum length 3, thus matching the regex:
([[:digit:]]+)(.{1,3})
Otherwise, the suffix for "@CARDSUFFIX" tokens may be of arbitrary length:
([[:digit:]]+)(.*)
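The special-token classification above can be sketched as a first-match scan over anchored regexes. This is my own illustration, not moot's code: Python character classes stand in for the POSIX [[:digit:]] and [[:punct:]] classes, the arbitrary-suffix (non-moot_TNT_COMPAT) variant of @CARDSUFFIX is assumed, and the precedence order is an assumption as well:

```python
import re
import string

PUNCT = re.escape(string.punctuation)

# Most specific patterns first; the ordering is an assumption, not moot's.
SPECIALS = [
    ("@CARD",       re.compile(r"^[0-9]+$")),
    ("@CARDPUNCT",  re.compile(r"^[0-9]+[" + PUNCT + r"]$")),
    ("@CARDSEPS",   re.compile(r"^[0-9]+([.,\-]|[0-9])*$")),
    ("@CARDSUFFIX", re.compile(r"^[0-9]+.*$")),   # arbitrary-length suffix
]

def special_type(token):
    """Return the special-token class whose counts would apply to 'token'."""
    for name, rx in SPECIALS:
        if rx.match(token):
            return name
    return "@UNKNOWN"   # fall back to unknown-token counts
```

For example, "1984" classifies as @CARD, "12,000" as @CARDSEPS, and "20th" as @CARDSUFFIX.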
An example lexical frequency file is:
%% Example lexical frequency file
This	4	PDAT	4
is	1.0	VVFIN	0.7	VAFIN	0.3
a	365	ART	350	CARD	5
test	1	NN	0.5	VVFIN	0.5
too	1	ADV	1
.	42	$.	42
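Each LEX_ENTRY line is a simple tab-separated record: token, total count, then alternating (tag, count) fields. A minimal parsing sketch (mine, not moot's own reader):

```python
def parse_lex_entry(line):
    """Parse one LEX_ENTRY line into (token, total, {tag: count})."""
    fields = line.rstrip("\r\n").split("\t")
    token = fields[0].strip()          # leading/trailing spaces are stripped
    total = float(fields[1])           # COUNTs may be fractional
    pairs = fields[2:]                 # alternating TAG_TEXT / TOK_TAG_CT fields
    tag_counts = {pairs[i].strip(): float(pairs[i + 1])
                  for i in range(0, len(pairs), 2)}
    return token, total, tag_counts
```

Applied to the 'is' line above, this yields the token "is", the total 1.0, and the tag counts {"VVFIN": 0.7, "VAFIN": 0.3}.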
An n-gram frequency file stores raw frequency counts for uni-, bi-, and tri-grams. An n-gram file may be in either "long" or "short" format, both of which are compatible with the respective formats produced by the tnt-para(1) program:
NGRAM_FILE  ::= ({COMMENT} | {BLANK_LINE} | {NGRAM_ENTRY})*
COMMENT     ::= {SPACE}* "%%" ([^{NEWLINE}])* {NEWLINE}
BLANK_LINE  ::= {SPACE}* {NEWLINE}
NGRAM_ENTRY ::= {UNIGRAM} | {BIGRAM} | {TRIGRAM}
UNIGRAM     ::= {TAG} {TAB} {COUNT}
BIGRAM      ::= {TAG} {TAB} {TAG} {TAB} {COUNT}
TRIGRAM     ::= {TAG} {TAB} {TAG} {TAB} {TAG} {TAB} {COUNT}
TAG         ::= EOS_TAG | ( [^{TAB}{NEWLINE}] )*
EOS_TAG     ::= "__$"
COUNT       ::= ("-"|"+")? ([0-9]* ".")? [0-9]+
NEWLINE     ::= "\n" | "\r"
TAB         ::= "\t"
Leading and trailing spaces are stripped from tags. An empty TAG component is populated with the tag in the corresponding position from the last n-gram parsed -- exhaustive use of this feature produces "short" format n-gram files. Non-use of this feature produces "long" format n-gram files.
An example "long" format n-gram file is:
%% Example n-gram frequency file in "long" format
__$	2
__$	PDAT	2
__$	PDAT	VVFIN	1
__$	PDAT	ADV	1
ADV	1
ADV	$.	1
ADV	$.	__$	1
ART	1
ART	NN	1
ART	NN	$.	1
PDAT	2
PDAT	VVFIN	1
PDAT	VVFIN	ART	1
PDAT	ADV	1
PDAT	ADV	$.	1
VVFIN	1
VVFIN	ART	1
VVFIN	ART	NN	1
NN	1
NN	$.	1
NN	$.	__$	1
The same data in "short" format:
%% Example n-gram frequency file in "short" format
__$	2
	PDAT	2
		VVFIN	1
		ADV	1
ADV	1
	$.	1
		__$	1
ART	1
	NN	1
		$.	1
PDAT	2
	VVFIN	1
		ART	1
	ADV	1
		$.	1
VVFIN	1
	ART	1
		NN	1
NN	1
	$.	1
		__$	1
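The short-to-long expansion rule described above (an empty TAG field inherits the tag at the same position from the previous n-gram) can be sketched as follows; this is my own illustration, not moot's reader:

```python
def expand_ngrams(lines):
    """Expand 'short'-format n-gram lines into 'long'-format field lists."""
    prev, out = [], []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        tags, count = fields[:-1], fields[-1]
        # empty TAG fields inherit from the same position of the previous n-gram
        tags = [t.strip() or prev[i] for i, t in enumerate(tags)]
        prev = tags
        out.append(tags + [count])
    return out
```

For instance, the short line "\t\tVVFIN\t1" following "__$\t2" and "\tPDAT\t2" expands to the long-format trigram "__$ PDAT VVFIN 1".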
Lexical-class frequency files store raw frequencies for known lexical classes (read "sets of possible part-of-speech tags") and (class,tag) pairs. The format is a direct extension of the format for lexical frequency files (see "Lexical Frequency Files", above):
CLASS_FILE  ::= ({COMMENT} | {BLANK_LINE} | {CLASS_ENTRY})*
CLASS_ENTRY ::= {CLASS_ELTS} {TAB} {CLASS_TOTAL} ( {TAB} {TAG_COUNT} )*
CLASS_ELTS  ::= ( {CLASS_TAG} {SPACE} )*
CLASS_TAG   ::= ( [^{SPACE}{TAB}{NEWLINE}] )+
As for lexical frequency files, leading and trailing whitespaces are stripped from class and tag text.
The CLASS_ELTS component specifies a (space-separated) list of tags belonging to the lexical class. All other (tab-separated) fields are as for a lexical frequency file.
A pair (CLASS,TAG) such that TAG is not an element of CLASS is called a "contradictory pair" or an "impossible pair". It is not required that the tags in the TAG_COUNT components of a CLASS_ENTRY are "possible" in this sense, although it certainly helps if this is the case.
An example lexical class frequency file is:
%% Example lexical-class frequency file
PDAT NE	4	PDAT	4
VVFIN VAFIN	1.0	VVFIN	0.7	VAFIN	0.3
ART CARD	365	ART	350	CARD	5
NN VVFIN	1	NN	0.5	VVFIN	0.5
ADV	1	ADV	1
$.	42	$.	42
A "Binary Model" BINMODEL is a (compressed) binary format file storing a compiled Hidden Markov Model (probabilities and constants). It is completely specified by its filename BINMODEL. By convention, HMM binary model files carry the suffix ".hmm".
When specifying an HMM model file, note that the existence of a file BINMODEL overrides any text models which might exist in files BINMODEL.lex , BINMODEL.123 , BINMODEL.clx. Use of a conventional suffix (such as ".hmm") to identify binary models eliminates such problems, since MODEL.hmm will not clash with a text model MODEL.lex, ...
An HMM dump is a plain text file containing all the information stored in a compiled HMM. The format exists solely for purposes of debugging.
Development of this package was supported by the project 'Kollokationen im Wörterbuch' ( "collocations in the dictionary", http://www.bbaw.de/forschung/kollokationen ) in association with the project 'Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS)' ( "digital dictionary of the German language of the 20th century", http://www.dwds.de ) at the Berlin-Brandenburgische Akademie der Wissenschaften ( http://www.bbaw.de ) with funding from the Alexander von Humboldt Stiftung ( http://www.avh.de ) and from the Zukunftsinvestitionsprogramm of the German federal government.
I am grateful to Christiane Fellbaum, Alexander Geyken, Gerald Neumann, Edmund Pohl, Alexey Sokirko, and others for offering useful insights in the course of development of this package.
Thomas Hanneforth wrote and maintains the libFSM C++ library for finite-state device operations used in the development of the class-based HMM tagger / disambiguator. Alexander Geyken and Thomas Hanneforth developed the rule-based morphological analysis system for German which was used in the development of the class-based HMM tagger / disambiguator.
Bryan Jurish <jurish@uni-potsdam.de>