moot FILE FORMATS

This manpage describes various file formats used by the moot PoS tagging utilities.


PROGRAM CONFIGURATION FILES

Most moot utility programs support global and user-specific configuration files which can be used to set system defaults and/or user preferences for values of program options.

Configuration files are expected to contain lines of the form:

 LONG_OPTION_NAME    OPTION_VALUE

where LONG_OPTION_NAME is the long name of one of the program's options, without the leading '--', and OPTION_VALUE is the value for that option, if any. Fields are whitespace-separated. Blank lines and comments (lines beginning with a '#' character) are ignored.

The following configuration files are read by default, where ${PROGNAME} is the name of a moot utility program, and ${HOME} is the home directory of the current user:

/etc/${PROGNAME}rc
System defaults file; read first.

${HOME}/.${PROGNAME}rc
User preferences file, can be used to override system defaults.

Any options specified on the command-line override defaults from a program configuration file.


TEXT FILE FORMATS (NATIVE)

Raw Text Files

A ``raw'' text file is just that: any file consiting of (8-bit) characters. Such files may be processed by the the mootpp manpage preprocessor to produce ``rare cooked'' (-tagged, -analyzed) text files. An example ``raw'' text file is:

 This is a test.  This too.

Cooked Text Files

A ``cooked'' text file is a text file which encodes information such as token boundaries, sentence boundaries, part-of-speech tag, and/or potential analyses. The moot utilities distinguish between several different types of cooked text file: in order of ascending informational content, these are: ``rare'' (-tagged, -analyzed), ``medium rare'' (-tagged, +analyzed), ``medium'' (+tagged, -analyzed), ``well done'' (+tagged, +analyzed), and ``refried'' (+tagged, +analyzed, +evaluated). Differnent moot utilities require their input files to be more or less ``cooked'' -- see the documentation of the individual utilities for details.

Native ``cooked'' text files are conventionally identified by the filename infix ``.moot''.

Rare
(-tagged, -analyzed)
  • The most basic level of ``cookedness'', a ``rare'' text file encodes only token- and sentence-boundaries. By convention, ``rare'' filenames carry the extension ``.t''. The syntax is:
     RARE_FILE  ::= {RARE_LINE}*
     RARE_LINE  ::= ({COMMENT} | {EOS} | {RARE_TOKEN}) {NEWLINE}
     COMMENT    ::= {SPACE}* "%%" ([^{NEWLINE}])*
     EOS        ::= ( {SPACE}* {NEWLINE} )+
     RARE_TOKEN ::= {TOKEN_TEXT}
     TOKEN_TEXT ::= ( {WORDCHAR} | {SPACE} )+
     SPACE      ::= " "
     NEWLINE    ::= "\n" | "\r"
     WORDCHAR   ::= [^{SPACE}{NEWLINE}]

    Leading and trailing spaces are stripped from token text; it is thus impossible to declare an ``empty'' token. An example ``rare cooked'' file is:

     %% Example rare cooked file for moot
     %% Sentence 1
     This
     is
     a
     test
     .
     %% Sentence 2
     This
     too
     .

    Medium Rare
    (-tagged, +analyzed)
  • A ``medium rare'' file is at least as informative as a ``rare'' file -- that is, it encodes everything that a ``rare'' file encodes, and in exactly the same fashion. Additionally, a ``medium rare'' file may contain for each token a set of (TAB-separated) possible analyses for that token, where an ``analysis'' contains at least a part-of-speech tag, and possibly also a numeric cost and arbitrary analysis details. Somewhat counter-intuitively, every ``rare'' file is also a ``medium rare'' file for which every token is associated with an empty set of possible analyses. By convention, ``medium rare'' filenames carry the extension ``.mrt''
     MED_RARE_FILE  ::= {MED_RARE_LINE}*
     MED_RARE_LINE  ::= ({COMMENT} | {EOS} | {MED_RARE_TOKEN}) {NEWLINE}
     MED_RARE_TOKEN ::= {TOKEN_TEXT} ( {TAB} {ANALYSIS} )*
     ANALYSIS       ::= {DETAIL_PREFIX}? {TAG} {DETAIL_SUFFIX}?
     DETAIL_PREFIX  ::= ( {WORDCHAR} | {SPACE} | {COST} )* ("[" ("_"?))?
     COST           ::= "<" ("-"|"+")? ([0-9]* ".")? [0-9]+ ">"
     TAG            ::= {TAGCHAR}+
     DETAIL_SUFFIX  ::= ( {WORDCHAR} | {SPACE} | {COST} )*
     TAGCHAR        ::= [^{SPACE}{TAB}{NEWLINE}"]"]
     TAB            ::= "\t"

    Leading and trailing spaces are stripped from token text, as well as from analysis-detail and -tag text. It should be noted that the TAG component of each ANALYSIS is ``greedy'' -- if an analysis contains no left-bracket to mark the beginning of a tag, then the whole analysis (up to the first right-bracket or space) is considered the tag. Also, if the analysis contains multiple left-brackets, only the first is considered to introduce the TAG component. An example ``medium rare'' file is:

     %% Example medium-rare cooked file for moot
     %% Sentence 1 : possible analyses are tags only
     This   NE      PDAT    PDIS
     is     VAFIN   VVFIN
     a      ART     CARD
     test   NN      VVIN
     .      $.
     %% Sentence 2 : detailed analyses, with unknown word "foo".
     This   This [NE type="name"] <420>      <24.7> this [_PDAT][_sg]
     foo
     .      . [$.] <-42>

    Tokens in ``medium rare'' files with empty analysis sets (i.e. RARE_TOKENs) are called ``unrecognized'' tokens.

    Medium
    (+tagged, -analyzed)
  • A ``medium'' file can be understood as a ``medium rare'' file which associates exactly one analysis with each token. The tag for this analysis is considered the ``best'' tag for the associated token. By convention, ``medium'' filenames carry the extension ``.tt'' (tagger output) or ``.ttt'' (gold standard).
     MEDIUM_FILE    ::= {MEDIUM_LINE}*
     MEDIUM_LINE    ::= ({COMMENT} | {EOS} | {MEDIUM_TOKEN}) {NEWLINE}
     MEDIUM_TOKEN   ::= {TOKEN_TEXT} {TAB} {BEST_ANALYSIS}
     BEST_ANALYSIS  ::= {ANALYSIS}

    As before, leading and trailing spaces are stripped from token text and analyses, and the TAG component of each ANALYSIS is ``greedy''. An example ``medium'' file is:

     %% Example medium cooked file for moot
     %% Sentence 1 : best analyses are tags only
     This   PDAT
     is     VVFIN
     a      ART
     test   NN
     .      $.
     %% Sentence 2 : tags embedded in detailed analyses
     This   <24.7> this [PDAT num="sg"]
     too    <0.0> too [ADV]
     .      <-42> . [$.]

    Well Done
    (+tagged, +analyzed)
  • A ``well done'' file can be understood as the synthesis of a ``medium rare'' and a ``medium'' file: it contains a ``best'' analysis for each token (the first one), as well as a set of a priori potential analyses analyses for that token. By convention, ``well done'' filenames carry the extension ``.wd'' (tagger output) or ``.wdt'' (gold standard).
     WELL_DONE_FILE  ::= {WELL_DONE_LINE}*
     WELL_DONE_LINE  ::= ({COMMENT} | {EOS} | {WELL_DONE_TOKEN}) {NEWLINE}
     WELL_DONE_TOKEN ::= {TOKEN_TEXT} {TAB} {BEST_ANALYSIS} ( {TAB} {ANALYSIS} )*

    As before, leading and trailing spaces are stripped from token text and analyses, and the TAG component of each ANALYSIS is ``greedy''. An example ``well done'' file is:

     %% Example well-done cooked file for moot
     %% Sentence 1 : analysis-set tags bracketed for clarity
     This   PDAT    [NE]    [PDAT]    [PDIS]
     is     VVFIN   [VAFIN] [VVFIN]
     a      ART     [ART]   [CARD]
     test   NN      [NN]    [VVFIN]
     .      $.      [$.]
     %% Sentence 2 : analysis-tags embedded in complete analyses
     This   PDAT    [NE type="last"] This <420>  [PDAT num="sg"] this <24.7>
     too    ADV     [ADV] too <0.0>
     .      $.      [$.] . <-42>

    Refried
    (+tagged, +analyzed, +evaluated)
    A ``refried'' file is basically the synthesis of a pair of ``medium'' or ``well done'' files. ``Refried'' files can be created by the mooteval program from a pair of parallel cooked files. Each line of a ``refried'' file contains an status code, and a pair of ``well-done'' style token analyses separated by tabs and a single slash '/'.
     REFRIED_FILE     ::= {REFRIED_LINE}*
     REFRIED_LINE     ::= ( {COMMENT} | {EOS} | {REFRIED_TOKEN} ) {NEWLINE}
     REFRIED_TOKEN    ::= {STATUS_CODE} {TAB} {REFRIED_SOURCES}
     REFRIED_SOURCES  ::= {WELL_DONE_TOKEN} {TAB} "/" {TAB} {WELL_DONE_TOKEN}
     STATUS_CODE      ::= {BASIC_FLAGS} ":" {FILE1_FLAGS} ":" {FILE2_FLAGS}
     BASIC_FLAGS      ::= {TOKMATCH_FLAG} {BESTMATCH_FLAG}
     TOKMATCH_FLAG    ::= "-" | "t"
     BESTMATCH_FLAG   ::= "-" | "b"
     FILE1_FLAGS      ::= {FILE_FLAGS}
     FILE2_FLAGS      ::= {FILE_FLAGS}
     FILE_FLAGS       ::= {EMPTY_FLAG} {IMPOSSIBLE_FLAG} {XIMPOSSIBLE_FLAG}
     EMPTY_FLAG       ::= "-" | "e"
     IMPOSSIBLE_FLAG  ::= "-" | "i"
     XIMPOSSIBLE_FLAG ::= "-" | "x"

    As before, leading and trailing spaces are stripped from token text and analyses, and the TAG component of each ANALYSIS is ``greedy''.

    The STATUS_CODE component of each REFRIED_TOKEN encodes a number of flags concerning which part (if any) of the tokens compared did not match. The general convention is use of a '-' character to indicate that the compared tokens matched (or at least were compatible).

    TOKMATCH_FLAG
    '-' if token text components matched, otherwise 't'.

    BESTMATCH_FLAG
    '-' if best-tag components matched, otherwise 'b'.

    EMPTY_FLAG
    '-' if token ANALYSES were non-empty (for the given file), otherwise 'e'.

    IMPOSSIBLE_FLAG
    '-' if token ANALYSES included token BESTTAG (for the corresponding file), otherwise 'i'.

    XIMPOSSIBLE_FLAG
    '-' if token ANALYSES included token BESTTAG for the other file, otherwise 'x'.

    An example ``refried'' file is:

     %% Example refried file for moot
     %% FLAGS       TOK1    TOK1TAG1 ...            /       TOK2    TOK2TAG1 ...
     %%------------------------------------------------------------------------------------
     t-:---:---     Dis     PDAT    [PDAT]  [PDIS]  /       This    PDAT    [PDAT]  [PDIS]
     -b:---:---     is      VAFIN   [VAFIN] [VVFIN] /       is      VVFIN   [VAFIN] [VVFIN]
     --:e--:---     a       ART     /       a       ART     [ART]   [CARD]
     -b:-i-:---     test    NN      [VVFIN] /       test    VVFIN   [NN]    [VVFIN]
     --:---:---     .       $.      [$.]    /       .       $.      [$.]
     
     -b:--x:---     This    PDAT    [PDAT]  /       This    PDIS    [PDAT]  [PDIS]
     --:---:-ix     too     ADV     [ADV]   [PTKA]  /       too     ADV     [CONJ]
     --:---:e--     .       $.      [$.]    /       .       $.


    XML FILE FORMATS

    moot currently uses the (extremely cool and amazingly fast) Expat XML parser library by James Clark for incremental processing of XML documents, (a previous implementation used libxml2 (also extremely cool but not quite as amazingly fast as expat), but the moot libxml2 support is no longer maintained, and is disabled by default), as well as output recoding using librecode by François Pinard. Both expat and librecode support are compile-time options -- check the contents of 'mootConfig.h' to see whether they are enabled on your system.

    When working with ``cooked'' XML (see below), it is critical to remember that the moot internal processing routines always receive token and PoS-tag text encoded in UTF-8, regardless of the document encoding. This is of particular importance when converting from native to XML format i.e. with 'mootchurn' -- it is highly reccommended that you use the 'recode' command-line utility (distributed with 'librecode') to ensure that your native text data is true UTF-8 before passing it to 'mootchurn' for XML output. Future implementations might use locale information to automate this process. If all of your data (training corpus, test corpus, and runtime corpora) are parsed in XML mode, none of the above should present a problem.

    XML files are identified by the filename infix '.xml'.

    Raw XML Files

    A ``raw'' XML file is just like a ``raw'' text file. The 'mootpp' program supports rudimentary recognition and removal of (SG|HT|X)ML markup.

    Cooked XML Files

    As of version 2.0.0, the moot utilities support ``cooked'' XML files, in addition to the native text format(s). See ``Cooked Text Files'' above for more details on the native formats and the information content corresponding to the various subtypes.

    All ``cooked'' XML formats share the same structure (much as the ``cooked'' text formats are defined in terms of one another). The preliminary syntax (subject to change without notice) is:

     COOKED_XML_FILE    ::= {XML_DECL}? {XML_CONTENT}*
     XML_DECL           ::= "<?xml " ... "?>"
     XML_CONTENT        ::= {XML_EOS} | {XML_RAW} | {XML_TOKEN}
     XML_EOS            ::= "<eos/>"
     XML_RAW            ::= ...
     XML_TOKEN          ::= "<token>" {XML_TOKEN_CONTENT} "</token>"
     XML_TOKEN_CONTENT  ::= ({XML_TOKEN_TEXT}
                             | {XML_TOKEN_ANALYSIS}
                             | {XML_TOKEN_BESTTAG}
                             | {XML_RAW})*
     XML_TOKEN_TEXT     ::= "<text>" {TOKEN_TEXT} "</text>"
     XML_TOKEN_BESTTAG  ::= "<moot.tag>" {TOKEN_BESTTAG} "</moot.tag>"
     XML_TOKEN_ANALYSIS ::= '<analysis pos="' {ANALYSIS_TAG} '">' {ANALYSIS_DETAILS} "</analysis>"
     ANALYSIS_DETAILS   ::= {XML_RAW}*

    The document structure is thus expected to be something like the following (in a bastard notation born of BNF and XPath):

     SENTENCE_BOUNDARY  ::= //eos                            # really only end-elts
     TOKEN_TEXT         ::= //token//text/text()             # should be accurate
     ANALYIS_TAG        ::= //token//analysis/@pos           # uses attribute value (not full node)
     ANALYSIS_DETAILS   ::= //token//analysis/text()         # buggy -- actually ignored!
     TOKEN_BESTTAG      ::= //token//moot.tag[last()]/text() # should be accurate

    TODO: pull up literal element name parameters from TokenReaderExpat to user-level.

    TODO: add a DTD for the default XML format to the distribution.

    An example ``cooked'' XML document is the following:

     <?xml version="1.0"?>
     <doc>
      <!-- Sentence-1 : Well Done, Medium, and Medium Rare -->
      <token>
        <!-- A 'well done' token with minimal structure -->
        <text>This</text>
        <moot.tag>PDAT</moot.tag>
        <analysis pos="NE"/>
        <analysis pos="NN"/>
        <analysis pos="PDAT"/>
        <analysis pos="PDS"/>
      </token>
      <token>
        <!-- A 'well done' token with extra structure -->
        <text>is</text>
        <extraneous.element>
          <analysis pos="VAFIN"/>
          <moot.tag>VVFIN</moot.tag>
          <analysis pos="VVFIN"/>
        </extraneous.element>
      </token>
      <token>
        <!-- Yet another 'well done' token  -->
        <text>a</text>
        <other_extraneous_element>
          <analysis pos="ART"/>
        </other_extraneous_element>
        <moot.tag>ART</moot.tag>
      </token>
      <token>
        <!-- A 'medium' token -->
        <text>Test</text>
        <moot.tag>NN</moot.tag>
      </token>
      <token>
        <!-- A 'Medium Rare' token -->
        <text>.</text>
        <analysis pos="$."/>
      </token>
      <eos/>
      <!-- Sentence-2 : Rare tokens only -->
      <token><text>This</text></token>
      <token><text>too</text></token>
      <token><text>.</text></token>
      <eos/>
     </doc>

    I/O Format Flags

    Several moot utilities are capable of processing input in a number of different formats, typically specified by '--input-format' (-I) and '--output-format' (-O) command-line options The following list briefly describes the (case-insensitive) format flags which may occur as individual elements of the comma-separated list passed as an argument to these format options. Each format flag may be preceeded by an exclamation point ``!'' to indicate the negation of the respective format property. Note that at the current time, not all formats support all available flags.

    If no format flags are specified by the user, the moot utilities will attempt to guess an appropriate format based on the filename and on the requirements for the particular utility in question.

    Basic Flags
    None
    No flags at all. This should never really happen at runtime, and should cause a default format to be assumed.

    Null
    If you specify 'null' as an output format, no output will actually be written (useful for testing and benchmarking the input layer).

    Unknown
    Unknown format. This should never ever happen, and should cause a reversion to some default format.

    Native
    Specifies native text format I/O, as opposed to XML.

    XML
    Specifies XML format I/O, as opposed to a native text format.

    Pretty
    Beautified XML format. Useful for human-readable XML output. Not all XML I/O modes support cosmetic surgery.

    Conserve
    Conservative XML format: attempt to preserve as much of the input document structure as possible. Only meaningful if both XML input and XML output are requested.

    Text
    Read/write token text (all formats).

    Analyzed
    Read/write token analyses ('medium rare' or 'well done' formats only).

    Tagged
    Read/write 'best tags' ('medium' or 'well done' formats only).

    Pruned
    For 'well done' formats, ignore analyses which do not correspond to the 'best' tag.

    Compound Flags
    Rare
    R
    Alias for 'Text'.

    MediumRare
    MR
    Alias for 'Text,Analyzed'.

    Medium
    M
    Alias for 'Text,Tagged'.

    'WellDone'
    WD
    Alias for 'Text,Tagged,Analyzed'

    Examples


    HMM MODEL FILE FORMATS

    The moothmm(1) program can use either text- or native binary-format model files, which encode raw frequency counts (text model files), or probability tables and compile-time flags for the Hidden Markov Model (binary model files), respectively.

    Text Models

    A ``Text Model'' is completely specified by three files: a lexical freqency file (*.lex), an n-gram frequency file (*.123), and an optional lexical-class frequency file (*.clx).

    When specifiying a text model name to a moot utility program, you may specify the model name as TMODEL in order to use the files TMODEL.lex , TMODEL.123 , and TMODEL.clx (if present). Otherwise, you may specifiy a composite model name as a comma-separated list of the individual component filenames: mylex.lex,myngrams.123,myclasses.clx.

    Lexical Frequency Files
    Lexical frequency files store raw frequencies for known tokens and (token,tag) pairs. The format use is ca. 99.998% compatible with that generated by the tnt-para(1) program:
     LEX_FILE    ::= ({COMMENT} | {BLANK_LINE} | {LEX_ENTRY})*
     COMMENT     ::= {SPACE}* "%%" ([^{NEWLINE}])*  {NEWLINE}
     BLANK_LINE  ::= {SPACE}* {NEWLINE}
     LEX_ENTRY   ::= {TOKEN_TEXT} {TAB} {TOKEN_TOTAL} ( {TAB} {TAG_COUNT} )*
     TAG_COUNT   ::= {TAG_TEXT} {TAB} {TOK_TAG_CT}
     TOKEN_TOTAL ::= {COUNT}
     TOK_TAG_CT  ::= {COUNT}
     TOKEN_TEXT  ::= {STRING} | {SPECIAL_TOK}
     TAG_TEXT    ::= {STRING}
     STRING      ::= ( [^{TAB}{NEWLINE}] )+
     COUNT       ::=  ("-"|"+")? ([0-9]* ".")? [0-9]+
     NEWLINE     ::= "\n" | "\r"
     TAB         ::= "\t"
     SPECIAL_TOK ::= "@UNKNOWN"
                     | "@CARD"
                     | "@CARDSEPS"
                     | "@CARDPUNCT"
                     | "@CARDSUFFIX"

    Leading and trailing spaces are stripped from token and tag text.

    The special tokens whose text begins with an '@' character declare counts for special token types:

    An example lexical frequency file is:

     %% Example lexical frequency file
     This   4       PDAT    4
     is     1.0     VVFIN   0.7     VAFIN   0.3
     a      365     ART     350     CARD    5
     test   1       NN      0.5     VVFIN   0.5
     too    1       ADV     1
     .      42      $.      42
    Ngram Frequency Files
    An n-gram frequency file stores raw frequency counts for uni-, bi-, and tri-grams. An n-gram file may be in either ``long'' or ``short'' format, both of which are compatible with the respective formats produced by the tnt-para(1) program:
     NGRAM_FILE  ::= ({COMMENT} | {BLANK_LINE} | {NGRAM_ENTRY})*
     COMMENT     ::= {SPACE}* "%%" ([^{NEWLINE}])*  {NEWLINE}
     BLANK_LINE  ::= {SPACE}* {NEWLINE}
     NGRAM_ENTRY ::= {UNIGRAM} | {BIGRAM} | {TRIGRAM}
     UNIGRAM     ::= {TAG} {TAB} {COUNT}
     BIGRAM      ::= {TAG} {TAB} {TAG} {TAB} {COUNT}
     TRIGRAM     ::= {TAG} {TAB} {TAG} {TAB} {TAG} {TAB} {COUNT}
     TAG         ::= EOS_TAG | ( [^{TAB}{NEWLINE}] )*
     EOS_TAG     ::= "__$"
     COUNT       ::=  ("-"|"+")? ([0-9]* ".")? [0-9]+
     NEWLINE     ::= "\n" | "\r"
     TAB         ::= "\t"

    Leading and trailing spaces are stripped from tags. An empty TAG component is populated with the tag in the corresponding position from the last n-gram parsed -- exhaustive use of this feature produces ``short'' format n-gram files. Non-use of this feature produces ``long'' format n-gram files.

    An example ``long'' format n-gram file is:

     %% Example n-gram frequency file in "long" format
     __$    2
     __$    PDAT    2
     __$    PDAT    VVFIN   1
     __$    PDAT    ADV     1
     ADV    1
     ADV    $.      1
     ADV    $.      __$     1
     ART    1
     ART    NN      1
     ART    NN      $.      1
     PDAT   2
     PDAT   VVFIN   1
     PDAT   VVFIN   ART     1
     PDAT   ADV     1
     PDAT   ADV     $.      1
     VVFIN  1
     VVFIN  ART     1
     VVFIN  ART     NN      1
     NN     1
     NN     $.      1
     NN     $.      __$     1

    The same data in ``short'' format:

     %% Example n-gram frequency file in "short" format
     __$    2
            PDAT    2
                    VVFIN   1
                    ADV     1
     ADV    1
            $.      1
                    __$     1
     ART    1
                    1
                    $.      1
     PDAT   2
            VVFIN   1
                    ART     1
            ADV     1
                    $.      1
     VVFIN  1
            ART     1
                    NN      1
     NN     1
            $.      1
                    __$     1

    Lexical-Class Frequency Files
    Lexical-class frequency files store raw frequencies for known lexical classes (read ``sets of possible part-of-speech tags'') and (class,tag) pairs. The format is a direct extension of the format for lexical frequency files (see ``Lexical Frequency Files'', above):
     CLASS_FILE  ::= ({COMMENT} | {BLANK_LINE} | {CLASS_ENTRY})*
     CLASS_ENTRY ::= {CLASS_ELTS} {TAB} {CLASS_TOTAL} ( {TAB} {TAG_COUNT} )*
     CLASS_ELTS  ::= ( {CLASS_TAG} {SPACE} )*
     CLASS_TAG   ::= ( [^{SPACE}{TAB}{NEWLINE}] )+

    As for lexical frequency files, leading and trailing whitespaces are stripped from class and tag text.

    The CLASS_ELTS component specifies a (space-separated) list of tags belonging to the lexical class. All other (tab-separated) fields are as for a lexical frequency file.

    A pair (CLASS,TAG) such that TAG is not an element of CLASS is called an ``contradictory pair'' or an ``impossible pair''. It is not required that the the tags in the TAG_COUNT components of a CLASS_ENTRY are ``possible'' in this sense, although it certainly helps if this is the case.

    An example lexical class frequency file is:

     %% Example lexical frequency file
     PDAT NE        4       PDAT    4
     VVFIN VAFIN    1.0     VVFIN   0.7     VAFIN   0.3
     ART CARD       365     ART     350     CARD    5
     NN VVFIN       1       NN      0.5     VVFIN   0.5
     ADV            1       ADV     1
     $.             42      $.      42

    HMM Binary Model Files

    A ``Binary Model'' BINMODEL is a (compressed) binary format file storing a compiled Hidden Markov Model (probabilities and constants). It is completely specified by its filename BINMODEL. By convention, HMM binary model files carry the suffix ``.hmm''.

    When specifying an HMM model file, note that the existence of a file BINMODEL overrides any text models which might exists in files BINMODEL.lex , BINMODEL.123 , BINMODEL.clx.

    HMM Dumps

    An HMM dump is a plain text file containing all the information stored in a compiled HMM. The format exists solely for purposes of debugging.


    ACKNOWLEDGEMENTS

    Development of this package was supported by the project 'Kollokationen im Wörterbuch' ( ``collocations in the dictionary'', http://www.bbaw.de/forschung/kollokationen ) in association with the project 'Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS)' ( ``digital dictionary of the German language of the 20th century'', http://www.dwds.de ) at the Berlin-Brandenburgische Akademie der Wissenschaften ( http://www.bbaw.de ) with funding from the Alexander von Humboldt Stiftung ( http://www.avh.de ) and from the Zukunftsinvestitionsprogramm of the German federal government.

    I am grateful to Christiane Fellbaum, Alexander Geyken, Gerald Neumann, Edmund Pohl, Alexey Sokirko, and others for offering useful insights in the course of development of this package.

    Thomas Hanneforth wrote and maintains the libFSM C++ library for finite-state device operations used in the development of the class-based HMM tagger / disambiguator. Alexander Geyken and Thomas Hanneforth developed the rule-based morphological analysis system for German which was used in the development of the class-based HMM tagger / disambiguator.


    AUTHOR

    Bryan Jurish <moocow@ling.uni-potsdam.de>


    SEE ALSO

    the mootutils manpage