moot FILE FORMATS

This manpage describes various file formats used by the moot PoS tagging utilities.

PROGRAM CONFIGURATION FILES

Most moot utility programs support global and user-specific configuration files which can be used to set system defaults and/or user preferences for values of program options.

Configuration files are expected to contain lines of the form:

 LONG_OPTION_NAME    OPTION_VALUE

where LONG_OPTION_NAME is the long name of one of the program's options, without the leading '--', and OPTION_VALUE is the value for that option, if any. Fields are whitespace-separated. Blank lines and comments (lines beginning with a '#' character) are ignored.

The following configuration files are read by default, where ${PROGNAME} is the name of a moot utility program, and ${HOME} is the home directory of the current user:

/etc/${PROGNAME}rc

System defaults file; read first.

${HOME}/.${PROGNAME}rc

User preferences file, can be used to override system defaults.

Any options specified on the command-line override defaults from a program configuration file.

TEXT FILE FORMATS (NATIVE)

Raw Text Files

A "raw" text file is just that: any file consiting of (8-bit or variable-width encoded) characters. Such files may be processed by the mootpp preprocessor to produce "rare cooked" (-tagged, -analyzed) text files, or by the waste tokenizer with an appropriate tokenization model. An example "raw" text file is:

 This is a test.  This too.

Raw text files passed to the waste scanner may additionally contain comments, escapes, and word- and sentence-break hints:

 %% COMMENT %%      /* an inline comment string */
 $%$                /* literal percent sign (escape) */
 $WB$               /* word-break hint */
 $SB$               /* sentence-break hint */

An example input file for the waste scanner is:

 This is a %% comment %% text.
 %% comments can contain embedded newlines, %
    provided that they're escaped by a single '%' sign. %%
 This is an escaped percent sign $%$.
 This is a word-break hint: word-$WB$break.
 This is a sentence-break hint: sentence-$SB$break.

... which should be tokenized as:

 This is a text.
 This an escaped percent sign %.
 This is a word-break hint: word-break.
 This is a sentence-break hint: sentence-break.

With token- and sentence boundaries forced at the positions where the respective hints were found in the input.

Cooked Text Files

A "cooked" text file is a text file which encodes information such as token boundaries, sentence boundaries, part-of-speech tag, and/or potential analyses. The moot utilities distinguish between several different types of cooked text file: in order of ascending informational content, these are:

Differnent moot utilities require their input files to be more or less "cooked" -- see the documentation of the individual utilities for details.

Native "cooked" text files are conventionally identified by the filename infix ".moot".

Rare
(-tagged, -analyzed)
Files: *.t, *.r, *.rt

The most basic level of "cookedness", a "rare" text file encodes only token- and sentence-boundaries. By convention, "rare" filenames carry the extension ".t". The syntax is:

 RARE_FILE  ::= {RARE_LINE}*
 RARE_LINE  ::= ({COMMENT} | {EOS} | {RARE_TOKEN}) {NEWLINE}
 COMMENT    ::= {SPACE}* "%%" ([^{NEWLINE}])*
 EOS        ::= ( {SPACE}* {NEWLINE} )+
 RARE_TOKEN ::= {TOKEN_TEXT}
 TOKEN_TEXT ::= ( {WORDCHAR} | {SPACE} )+
 SPACE      ::= " "
 NEWLINE    ::= "\n" | "\r"
 WORDCHAR   ::= [^{SPACE}{NEWLINE}]

Leading and trailing spaces are stripped from token text; it is thus impossible to declare an "empty" token. An example "rare cooked" file is:

 %% Example rare cooked file for moot
 %% Sentence 1
 This
 is
 a
 test
 .
 
 %% Sentence 2
 This
 too
 .
Medium Rare
(-tagged, +analyzed)
Files: *.mr, *.mrt

A "medium rare" file is at least as informative as a "rare" file -- that is, it encodes everything that a "rare" file encodes, and in exactly the same fashion. Additionally, a "medium rare" file may contain for each token a set of (TAB-separated) possible analyses for that token, where an "analysis" contains at least a part-of-speech tag, and possibly also a numeric cost and arbitrary analysis details. Somewhat counter-intuitively, every "rare" file is also a "medium rare" file for which every token is associated with an empty set of possible analyses. By convention, "medium rare" filenames carry the extension ".mrt"

 MED_RARE_FILE  ::= {MED_RARE_LINE}*
 MED_RARE_LINE  ::= ({COMMENT} | {EOS} | {MED_RARE_TOKEN}) {NEWLINE}
 MED_RARE_TOKEN ::= {TOKEN_TEXT} ( {TAB} {ANALYSIS} )*
 ANALYSIS       ::= {DETAIL_PREFIX}? {TAG} {DETAIL_SUFFIX}?
 DETAIL_PREFIX  ::= ( {WORDCHAR} | {SPACE} | {COST} )* ("[" ("_"?))?
 COST           ::= "<" ("-"|"+")? ([0-9]* ".")? [0-9]+ ">"
 TAG            ::= {TAGCHAR}+
 DETAIL_SUFFIX  ::= ( {WORDCHAR} | {SPACE} | {COST} )*
 TAGCHAR        ::= [^{SPACE}{TAB}{NEWLINE}"]"]
 TAB            ::= "\t"

Leading and trailing spaces are stripped from token text, as well as from analysis-detail and -tag text. It should be noted that the TAG component of each ANALYSIS is "greedy" -- if an analysis contains no left-bracket to mark the beginning of a tag, then the whole analysis (up to the first right-bracket or space) is considered the tag. Also, if the analysis contains multiple left-brackets, only the first is considered to introduce the TAG component. An example "medium rare" file is:

 %% Example medium-rare cooked file for moot
 %% Sentence 1 : possible analyses are tags only
 This   NE      PDAT    PDIS
 is     VAFIN   VVFIN
 a      ART     CARD
 test   NN      VVIN
 .      $.
 
 %% Sentence 2 : detailed analyses, with unknown word "foo".
 This   This [NE type="name"] <420>      <24.7> this [_PDAT][_sg]
 foo
 .      . [$.] <-42>

Tokens in "medium rare" files with empty analysis sets (i.e. RARE_TOKENs) are called "unrecognized" tokens.

Medium
(+tagged, -analyzed)
Files: *.tt, *.ttt, *.m, *.mt

A "medium" file can be understood as a "medium rare" file which associates exactly one analysis with each token. The tag for this analysis is considered the "best" tag for the associated token. By convention, "medium" filenames carry the extension ".tt" (tagger output) or ".ttt" (gold standard).

 MEDIUM_FILE    ::= {MEDIUM_LINE}*
 MEDIUM_LINE    ::= ({COMMENT} | {EOS} | {MEDIUM_TOKEN}) {NEWLINE}
 MEDIUM_TOKEN   ::= {TOKEN_TEXT} {TAB} {BEST_ANALYSIS}
 BEST_ANALYSIS  ::= {ANALYSIS}

As before, leading and trailing spaces are stripped from token text and analyses, and the TAG component of each ANALYSIS is "greedy". An example "medium" file is:

 %% Example medium cooked file for moot
 %% Sentence 1 : best analyses are tags only
 This   PDAT
 is     VVFIN
 a      ART
 test   NN
 .      $.
 
 %% Sentence 2 : tags embedded in detailed analyses
 This   <24.7> this [PDAT num="sg"]
 too    <0.0> too [ADV]
 .      <-42> . [$.]
Well Done
(+tagged, +analyzed)
Files: *.wd, *.wdt

A "well done" file can be understood as the synthesis of a "medium rare" and a "medium" file: it contains a "best" analysis for each token (the first one), as well as a set of a priori potential analyses analyses for that token. By convention, "well done" filenames carry the extension ".wd" (tagger output) or ".wdt" (gold standard).

 WELL_DONE_FILE  ::= {WELL_DONE_LINE}*
 WELL_DONE_LINE  ::= ({COMMENT} | {EOS} | {WELL_DONE_TOKEN}) {NEWLINE}
 WELL_DONE_TOKEN ::= {TOKEN_TEXT} {TAB} {BEST_ANALYSIS} ( {TAB} {ANALYSIS} )*

As before, leading and trailing spaces are stripped from token text and analyses, and the TAG component of each ANALYSIS is "greedy". An example "well done" file is:

 %% Example well-done cooked file for moot
 %% Sentence 1 : analysis-set tags bracketed for clarity
 This   PDAT    [NE]    [PDAT]    [PDIS]
 is     VVFIN   [VAFIN] [VVFIN]
 a      ART     [ART]   [CARD]
 test   NN      [NN]    [VVFIN]
 .      $.      [$.]
 
 %% Sentence 2 : analysis-tags embedded in complete analyses
 This   PDAT    [NE type="last"] This <420>  [PDAT num="sg"] this <24.7>
 too    ADV     [ADV] too <0.0>
 .      $.      [$.] . <-42>
Refried
(+tagged, +analyzed, +evaluated)

A "refried" file is basically the synthesis of a pair of "medium" or "well done" files. "Refried" files can be created by the mooteval program from a pair of parallel cooked files. Each line of a "refried" file contains an status code, and a pair of "well-done" style token analyses separated by tabs and a single slash '/'.

 REFRIED_FILE     ::= {REFRIED_LINE}*
 REFRIED_LINE     ::= ( {COMMENT} | {EOS} | {REFRIED_TOKEN} ) {NEWLINE}
 REFRIED_TOKEN    ::= {STATUS_CODE} {TAB} {REFRIED_SOURCES}
 REFRIED_SOURCES  ::= {WELL_DONE_TOKEN} {TAB} "/" {TAB} {WELL_DONE_TOKEN}
 STATUS_CODE      ::= {BASIC_FLAGS} ":" {FILE1_FLAGS} ":" {FILE2_FLAGS}
 BASIC_FLAGS      ::= {TOKMATCH_FLAG} {BESTMATCH_FLAG}
 TOKMATCH_FLAG    ::= "-" | "t"
 BESTMATCH_FLAG   ::= "-" | "b"
 FILE1_FLAGS      ::= {FILE_FLAGS}
 FILE2_FLAGS      ::= {FILE_FLAGS}
 FILE_FLAGS       ::= {EMPTY_FLAG} {IMPOSSIBLE_FLAG} {XIMPOSSIBLE_FLAG}
 EMPTY_FLAG       ::= "-" | "e"
 IMPOSSIBLE_FLAG  ::= "-" | "i"
 XIMPOSSIBLE_FLAG ::= "-" | "x"

As before, leading and trailing spaces are stripped from token text and analyses, and the TAG component of each ANALYSIS is "greedy".

The STATUS_CODE component of each REFRIED_TOKEN encodes a number of flags concerning which part (if any) of the tokens compared did not match. The general convention is use of a '-' character to indicate that the compared tokens matched (or at least were compatible).

TOKMATCH_FLAG

'-' if token text components matched, otherwise 't'.

BESTMATCH_FLAG

'-' if best-tag components matched, otherwise 'b'.

EMPTY_FLAG

'-' if token ANALYSES were non-empty (for the given file), otherwise 'e'.

IMPOSSIBLE_FLAG

'-' if token ANALYSES included token BESTTAG (for the corresponding file), otherwise 'i'.

XIMPOSSIBLE_FLAG

'-' if token ANALYSES included token BESTTAG for the other file, otherwise 'x'.

An example "refried" file is:

 %% Example refried file for moot
 %% FLAGS       TOK1    TOK1TAG1 ...            /       TOK2    TOK2TAG1 ...
 %%------------------------------------------------------------------------------------
 t-:---:---     Dis     PDAT    [PDAT]  [PDIS]  /       This    PDAT    [PDAT]  [PDIS]
 -b:---:---     is      VAFIN   [VAFIN] [VVFIN] /       is      VVFIN   [VAFIN] [VVFIN]
 --:e--:---     a       ART     /       a       ART     [ART]   [CARD]
 -b:-i-:---     test    NN      [VVFIN] /       test    VVFIN   [NN]    [VVFIN]
 --:---:---     .       $.      [$.]    /       .       $.      [$.]
 
 -b:--x:---     This    PDAT    [PDAT]  /       This    PDIS    [PDAT]  [PDIS]
 --:---:-ix     too     ADV     [ADV]   [PTKA]  /       too     ADV     [CONJ]
 --:---:e--     .       $.      [$.]    /       .       $.

Re-formatting for better human readabilty produces:

 %% Example refried file for moot
 %% FLAGS       TOK1    TOK1TAG1 ...            /       TOK2    TOK2TAG1 ...
 %%------------------------------------------------------------------------------------
 t-:---:---     Dis     PDAT    [PDAT]  [PDIS]  /       This    PDAT    [PDAT]  [PDIS]
 -b:---:---     is      VAFIN   [VAFIN] [VVFIN] /       is      VVFIN   [VAFIN] [VVFIN]
 --:e--:---     a       ART                     /       a       ART     [ART]   [CARD]
 -b:-i-:---     test    NN      [VVFIN]         /       test    VVFIN   [NN]    [VVFIN]
 --:---:---     .       $.      [$.]            /       .       $.      [$.]
 
 -b:--x:---     This    PDAT    [PDAT]          /       This    PDIS    [PDAT]  [PDIS]
 --:---:-ix     too     ADV     [ADV]   [PTKA]  /       too     ADV     [CONJ]
 --:---:e--     .       $.      [$.]            /       .       $.

XML FILE FORMATS

moot currently uses the (extremely cool and amazingly fast) Expat XML parser library by James Clark for incremental processing of XML documents, (a previous implementation used libxml2 (also extremely cool but not quite as amazingly fast as expat), but the moot libxml2 support is no longer maintained, and is disabled by default), as well as output recoding using librecode by François Pinard. Both expat and librecode support are compile-time options -- check the contents of 'mootConfig.h' to see whether they are enabled on your system.

When working with "cooked" XML (see below), it is critical to remember that the moot internal processing routines always receive token and PoS-tag text encoded in UTF-8, regardless of the document encoding. This is of particular importance when converting from native to XML format i.e. with 'mootchurn' -- it is highly reccommended that you use the 'recode' command-line utility (distributed with 'librecode') to ensure that your native text data is true UTF-8 before passing it to 'mootchurn' for XML output.

Similarly, HMM model data (see "HMM MODEL FILE FORMATS") must be UTF-8 encoded for tagging in XML mode. There is currently no way to directly convert the encoding of a binary model file, but text model files can be converted with the 'recode' command-line utility.

Future implementations might use locale information to (partially) automate the recoding process. If all of your data (training corpus, test corpus, and runtime corpora) are parsed in XML mode, none of the above should present a problem.

XML files are identified by the filename infix '.xml'.

Raw XML Files

A "raw" XML file is just like a "raw" text file. The 'mootpp' program supports rudimentary recognition and removal of (SG|HT|X)ML markup.

Cooked XML Files

As of version 2.0.0, the moot utilities support "cooked" XML files, in addition to the native text format(s). See "Cooked Text Files" above for more details on the native formats and the information content corresponding to the various subtypes.

All "cooked" XML formats share the same structure (much as the "cooked" text formats are defined in terms of one another). The preliminary syntax (subject to change without notice) is:

 COOKED_XML_FILE    ::= {XML_DECL}? {XML_CONTENT}*
 XML_DECL           ::= "<?xml " ... "?>"
 XML_CONTENT        ::= {XML_EOS} | {XML_RAW} | {XML_TOKEN}
 XML_EOS            ::= "<eos/>"
 XML_RAW            ::= ...
 XML_TOKEN          ::= "<token>" {XML_TOKEN_CONTENT} "</token>"
 XML_TOKEN_CONTENT  ::= ({XML_TOKEN_TEXT}
                         | {XML_TOKEN_ANALYSIS}
                         | {XML_TOKEN_BESTTAG}
                         | {XML_RAW})*
 XML_TOKEN_TEXT     ::= "<text>" {TOKEN_TEXT} "</text>"
 XML_TOKEN_BESTTAG  ::= "<moot.tag>" {TOKEN_BESTTAG} "</moot.tag>"
 XML_TOKEN_ANALYSIS ::= '<analysis pos="' {ANALYSIS_TAG} '">' {ANALYSIS_DETAILS} "</analysis>"
 ANALYSIS_DETAILS   ::= {XML_RAW}*

The document structure is thus expected to be something like the following (in a bastard notation born of BNF and XPath):

 SENTENCE_BOUNDARY  ::= //eos                            # really only end-elts
 TOKEN_TEXT         ::= //token//text/text()             # should be accurate
 ANALYIS_TAG        ::= //token//analysis/@pos           # uses attribute value (not full node)
 ANALYSIS_DETAILS   ::= //token//analysis/text()         # buggy -- actually ignored!
 TOKEN_BESTTAG      ::= //token//moot.tag[last()]/text() # should be accurate

Contact the author if you need any of the following done:

TODO

Pull up literal element name parameters from TokenReaderExpat to user-level.

TODO

Add a DTD for the default XML format to the distribution.

An example "cooked" XML document is the following:

 <?xml version="1.0"?>
 <doc>
  <!-- Sentence-1 : Well Done, Medium, and Medium Rare -->
  <token>
    <!-- A 'well done' token with minimal structure -->
    <text>This</text>
    <moot.tag>PDAT</moot.tag>
    <analysis pos="NE"/>
    <analysis pos="NN"/>
    <analysis pos="PDAT"/>
    <analysis pos="PDS"/>
  </token>
  <token>
    <!-- A 'well done' token with extra structure -->
    <text>is</text>
    <extraneous.element>
      <analysis pos="VAFIN"/>
      <moot.tag>VVFIN</moot.tag>
      <analysis pos="VVFIN"/>
    </extraneous.element>
  </token>
  <token>
    <!-- Yet another 'well done' token  -->
    <text>a</text>
    <other_extraneous_element>
      <analysis pos="ART"/>
    </other_extraneous_element>
    <moot.tag>ART</moot.tag>
  </token>
  <token>
    <!-- A 'medium' token -->
    <text>Test</text>
    <moot.tag>NN</moot.tag>
  </token>
  <token>
    <!-- A 'Medium Rare' token -->
    <text>.</text>
    <analysis pos="$."/>
  </token>
  <eos/>
  <!-- Sentence-2 : Rare tokens only -->
  <token><text>This</text></token>
  <token><text>too</text></token>
  <token><text>.</text></token>
  <eos/>
 </doc>

I/O Format Flags

Several moot utilities are capable of processing input in a number of different formats, typically specified by '--input-format' (-I) and '--output-format' (-O) command-line options The following list briefly describes the (case-insensitive) format flags which may occur as individual elements of the comma-separated list passed as an argument to these format options. Each format flag may be preceeded by an exclamation point "!" to indicate the negation of the respective format property. Note that at the current time, not all formats support all available flags.

If no format flags are specified by the user, the moot utilities will attempt to guess an appropriate format based on the filename and on the requirements for the particular utility in question.

Basic Flags
None

No flags at all. This should never really happen at runtime, and should cause a default format to be assumed and/or an appropriate format to be guessed from the relevant filename(s).

Null

If you specify 'null' as an output format, no output will actually be written (useful for testing and benchmarking the input layer).

Unknown

Unknown format. This should never ever happen, and should cause a reversion to some default format.

Native

Specifies native text format I/O, as opposed to XML.

XML

Specifies XML format I/O, as opposed to a native text format.

Pretty

Beautified XML format. Useful for human-readable XML output. Not all XML I/O modes support cosmetic surgery.

Conserve

Conservative XML format: attempt to preserve as much of the input document structure as possible. Only meaningful if both XML input and XML output are requested.

Text

Read/write token text (all formats).

Analyzed

Read/write token analyses ('medium rare' or 'well done' formats only).

Tagged

Read/write 'best tags' ('medium' or 'well done' formats only).

Location

Read/write token locations as logical pairs (BYTE_OFFSET,BYTE_LENGTH) from/to the input stream as the first non-tag analysis. Useful if you need to refer back to earlier stages of a token processing pipeline.

Cost

Read/write analysis "costs" from/to analysis "<NUMBER>" suffixes. This flag may be set by default in future versions.

Pruned

For 'well done' formats, ignore analyses which do not correspond to the 'best' tag.

Trace

If set as an output format flag, causes a verbose dump of the Viterbi trellis to be spliced into every tagged sentence as post-token comments. Does nothing as an input flag (yet). Implies "Flush".

Predict

If set as an output flag, cases a verbose dump of Viterbi trellis-based predictions to be spliced into every tagged sentence as post-token comments. Does nothing as an input flag (yet). Implies both "Trace" and "Flush".

Flush

If set as an output flag, causes the underlying output stream to be implicitly flushed after each write operation. Currently only meaningful for native output mode. Does nothing as an input flag (yet).

Compound Flags
Rare
R

Alias for 'Text'.

MediumRare
MR

Alias for 'Text,Analyzed'.

Medium
M

Alias for 'Text,Tagged'.

WellDone
WD

Alias for 'Text,Tagged,Analyzed'

Examples

HMM MODEL FILE FORMATS

The moothmm(1) program can use either text- or native binary-format model files, which encode raw frequency counts (text model files), or probability tables and compile-time flags for the Hidden Markov Model (binary model files), respectively.

Text Models

A "Text Model" is completely specified by up to four files: a lexical freqency file (*.lex), an n-gram frequency file (*.123), an optional lexical-class frequency file (*.clx), and an optional surface/typographical heuristic `flavor' rule file (*.fla).

When specifiying a text model name to a moot utility program, you may specify the model name as TMODEL in order to use the files TMODEL.lex , TMODEL.123 , TMODEL.clx , and TMODEL.fla (if present). Otherwise, you may specifiy a composite model name as a comma-separated list of the individual component filenames: mylex.lex,myngrams.123,myclasses.clx,myclasses.fla. Any positional field in the specification may be left blank to omit loading the associated data; e.g. to omit lexical classes but include flavor definitions, you can specify a model as mylex.lex,myngrams.123,,myclasses.fla.

Lexical Frequency Files

Lexical frequency files store raw frequencies for known tokens and (token,tag) pairs. The format use is ca. 99.998% compatible with that generated by the tnt-para(1) program:

 LEX_FILE    ::= ({COMMENT} | {BLANK_LINE} | {LEX_ENTRY})*
 COMMENT     ::= {SPACE}* "%%" ([^{NEWLINE}])*  {NEWLINE}
 BLANK_LINE  ::= {SPACE}* {NEWLINE}
 LEX_ENTRY   ::= {TOKEN_TEXT} {TAB} {TOKEN_TOTAL} ( {TAB} {TAG_COUNT} )*
 TAG_COUNT   ::= {TAG_TEXT} {TAB} {TOK_TAG_CT}
 TOKEN_TOTAL ::= {COUNT}
 TOK_TAG_CT  ::= {COUNT}
 TOKEN_TEXT  ::= {STRING} | {SPECIAL_TOK}
 TAG_TEXT    ::= {STRING}
 STRING      ::= ( [^{TAB}{NEWLINE}] )+
 COUNT       ::=  ("-"|"+")? ([0-9]* ".")? [0-9]+
 NEWLINE     ::= "\n" | "\r"
 TAB         ::= "\t"
 SPECIAL_TOK ::= "@UNKNOWN"
                 | {FLAVOR_LABEL}

Leading and trailing spaces are stripped from token and tag text.

The special tokens whose text conventionally begins with an '@' character declare counts for special pseudo-tokens. In particular, the entry for @UNKNOWN -- if it exists -- declares frequency counts to be used when no other training data is available (i.e. for alphabetic tokens which did not occur in the training corpus).

Other known pseudo-tokens represent training counts to use for unknown tokens which match one of the model's typographical classification rules. The entries are identified in the lexical frequency file by the flavor's label (e.g. "@CARD", "@CARDSEPS", "@CARDPUNCT", or "@CARDSUFFIX" for the default built-in flavor rules). See "Flavor Definition Files" for details on the typographical classification heuristics supported by moot.

An example lexical frequency file is:

 %% Example lexical frequency file
 This   4       PDAT    4
 is     1.0     VVFIN   0.7     VAFIN   0.3
 a      365     ART     350     CARD    5
 test   1       NN      0.5     VVFIN   0.5
 too    1       ADV     1
 .      42      $.      42
Ngram Frequency Files

An n-gram frequency file stores raw frequency counts for uni-, bi-, and tri-grams. An n-gram file may be in either "long" or "short" format, both of which are compatible with the respective formats produced by the tnt-para(1) program:

 NGRAM_FILE  ::= ({COMMENT} | {BLANK_LINE} | {NGRAM_ENTRY})*
 COMMENT     ::= {SPACE}* "%%" ([^{NEWLINE}])*  {NEWLINE}
 BLANK_LINE  ::= {SPACE}* {NEWLINE}
 NGRAM_ENTRY ::= {UNIGRAM} | {BIGRAM} | {TRIGRAM}
 UNIGRAM     ::= {TAG} {TAB} {COUNT}
 BIGRAM      ::= {TAG} {TAB} {TAG} {TAB} {COUNT}
 TRIGRAM     ::= {TAG} {TAB} {TAG} {TAB} {TAG} {TAB} {COUNT}
 TAG         ::= EOS_TAG | ( [^{TAB}{NEWLINE}] )*
 EOS_TAG     ::= "__$"
 COUNT       ::=  ("-"|"+")? ([0-9]* ".")? [0-9]+
 NEWLINE     ::= "\n" | "\r"
 TAB         ::= "\t"

Leading and trailing spaces are stripped from tags. An empty TAG component is populated with the tag in the corresponding position from the last n-gram parsed -- exhaustive use of this feature produces "short" format n-gram files. Non-use of this feature produces "long" format n-gram files.

An example "long" format n-gram file is:

 %% Example n-gram frequency file in "long" format
 __$    2
 __$    PDAT    2
 __$    PDAT    VVFIN   1
 __$    PDAT    ADV     1
 ADV    1
 ADV    $.      1
 ADV    $.      __$     1
 ART    1
 ART    NN      1
 ART    NN      $.      1
 PDAT   2
 PDAT   VVFIN   1
 PDAT   VVFIN   ART     1
 PDAT   ADV     1
 PDAT   ADV     $.      1
 VVFIN  1
 VVFIN  ART     1
 VVFIN  ART     NN      1
 NN     1
 NN     $.      1
 NN     $.      __$     1

The same data in "short" format:

 %% Example n-gram frequency file in "short" format
 __$    2
        PDAT    2
                VVFIN   1
                ADV     1
 ADV    1
        $.      1
                __$     1
 ART    1
                1
                $.      1
 PDAT   2
        VVFIN   1
                ART     1
        ADV     1
                $.      1
 VVFIN  1
        ART     1
                NN      1
 NN     1
        $.      1
                __$     1
Lexical-Class Frequency Files

Lexical-class frequency files store raw frequencies for known lexical classes (read "sets of possible part-of-speech tags") and (class,tag) pairs. The format is a direct extension of the format for lexical frequency files (see "Lexical Frequency Files", above):

 CLASS_FILE  ::= ({COMMENT} | {BLANK_LINE} | {CLASS_ENTRY})*
 CLASS_ENTRY ::= {CLASS_ELTS} {TAB} {CLASS_TOTAL} ( {TAB} {TAG_COUNT} )*
 CLASS_ELTS  ::= ( {CLASS_TAG} {SPACE} )*
 CLASS_TAG   ::= ( [^{SPACE}{TAB}{NEWLINE}] )+

As for lexical frequency files, leading and trailing whitespaces are stripped from class and tag text.

The CLASS_ELTS component specifies a (space-separated) list of tags belonging to the lexical class. All other (tab-separated) fields are as for a lexical frequency file.

A pair (CLASS,TAG) such that TAG is not an element of CLASS is called an "contradictory pair" or an "impossible pair". It is not required that the the tags in the TAG_COUNT components of a CLASS_ENTRY are "possible" in this sense, although it certainly helps if this is the case.

An example lexical class frequency file is:

 %% Example lexical frequency file
 PDAT NE        4       PDAT    4
 VVFIN VAFIN    1.0     VVFIN   0.7     VAFIN   0.3
 ART CARD       365     ART     350     CARD    5
 NN VVFIN       1       NN      0.5     VVFIN   0.5
 ADV            1       ADV     1
 $.             42      $.      42
Flavor Definition Files

A flavor definition file stores heuristic rules for surface-typographical classification of otherwise unknown tokens via the C++ mootTaster API. The syntax for flavor definition files is:

 FLAVOR_FILE   ::= ({COMMENT} | {NOTAB_LINE} | {CONTENT_LINE})*
 COMMENT       ::= "%%" ([^{NEWLINE}])* {NEWLINE}
 NOTAB_LINE    ::= ([^{TAB}])*
 CONTENT_LINE  ::= {DEFAULT_LINE} | {FLAVOR_RULE}
 DEFAULT_LINE  ::= "DEFAULT" {TAB} {DEFAULT_LABEL}
 DEFAULT_LABEL ::= {LABEL}
 FLAVOR_RULE   ::= {LABEL} {TAB} {REGEX}
 NEWLINE       ::= "\n"
 TAB           ::= "\t"
 LABEL         ::= [^\t\r\n]*
 REGEX         ::= [^\r\r\n]+

Content lines are those non-comment lines containing at least one TAB character. A content line may define a default label DEFAULT_LABEL for the classifier (if unspecified, the default label defaults to the empty string), or an explicit classification rule. Each explicit classification rule has a label LABEL as well as an associated POSIX.2 regular expression REGEX (see regex(7)). By convention, flavor LABEL values begin with "@" and otherwise contain only upper-case ASCII characters, but thse conventions are currently not enforced. Flavor definitions are used by the runtime tagger to obtain lexical probability estimates for otherwise unknown input tokens in the following manner:

If no flavor definition file is specified, moot uses a built-in set of classification heuristics equivalent to the following flavor definition file:

 %% Flavor definition file for moot
 %%   LC_CTYPE=UTF-8
 @ALPHA         ^[^0-9]
 @CARD          ^([0-9]+)$
 @CARDPUNCT     ^([0-9]+)([,\.\-])$
 @CARDSEPS      ^([0-9])([0-9,\.\-]+)$
 @CARDSUFFIX    ^([0-9])([0-9,\.\-]*)([^0-9,\.\-])(.{0,3})$
 DEFAULT        @ALPHA

Note that the process locale settings (see locale(1), setlocale(3), locale(7)) -- in particular the value of LC_CTYPE -- can and do influence the behavior of the regular expression matching engine. The command-line moot utilities initialize the locale according to the user's environment variables on start-up. Expect the unexpected if your runtime locale settings differ from those used when training the model, however. For reference, the value of LC_CTYPE at training time is written as a comment to the flavor definition files produced by mootrain, but this comment in and of itself has no effect on the operation of the runtime tagger; the user is responsible for ensuring that his or her locale settings are compatible with those expected by the model.

The program moottaste(1) can be used to test and debug flavor definition files.

HMM Binary Model Files

A "Binary Model" BINMODEL is a (compressed) binary format file storing a compiled Hidden Markov Model (probabilities and constants). It is completely specified by its filename BINMODEL. By convention, HMM binary model files carry the suffix ".hmm".

When specifying an HMM model file, note that the existence of a file BINMODEL overrides any text models which might exists in files BINMODEL.lex , BINMODEL.123 , BINMODEL.clx. Use of a conventional suffix (such as ".hmm") to identify binary models eliminates such problems, since MODEL.hmm will not clash with a text model MODEL.lex, ...

HMM Dumps

An HMM dump is a plain text file containing all the information stored in a compiled HMM. The format exists solely for purposes of debugging.

ACKNOWLEDGEMENTS

Development of this package was supported by the project 'Kollokationen im Wörterbuch' ( "collocations in the dictionary", http://www.bbaw.de/forschung/kollokationen ) in association with the project 'Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS)' ( "digital dictionary of the German language of the 20th century", http://www.dwds.de ) at the Berlin-Brandenburgische Akademie der Wissenschaften ( http://www.bbaw.de ) with funding from the Alexander von Humboldt Stiftung ( http://www.avh.de ) and from the Zukunftsinvestitionsprogramm of the German federal government.

I am grateful to Christiane Fellbaum, Alexander Geyken, Gerald Neumann, Edmund Pohl, Alexey Sokirko, and others for offering useful insights in the course of development of this package.

Thomas Hanneforth wrote and maintains the libFSM C++ library for finite-state device operations used in the development of the class-based HMM tagger / disambiguator. Alexander Geyken and Thomas Hanneforth developed the rule-based morphological analysis system for German which was used in the development of the class-based HMM tagger / disambiguator.

AUTHOR

Bryan Jurish <moocow@cpan.org>

SEE ALSO

mootutils