moot FILE FORMATS

This manpage describes various file formats used by the moot PoS tagging utilities.

PROGRAM CONFIGURATION FILES

Most moot utility programs support global and user-specific configuration files which can be used to set system defaults and/or user preferences for values of program options.

Configuration files are expected to contain lines of the form:

 LONG_OPTION_NAME    OPTION_VALUE

where LONG_OPTION_NAME is the long name of one of the program's options, without the leading '--', and OPTION_VALUE is the value for that option, if any. Fields are whitespace-separated. Blank lines and comments (lines beginning with a '#' character) are ignored.

The following configuration files are read by default, where ${PROGNAME} is the name of a moot utility program, and ${HOME} is the home directory of the current user:

/etc/${PROGNAME}rc

System defaults file; read first.

${HOME}/.${PROGNAME}rc

User preferences file; can be used to override system defaults.

Any options specified on the command-line override defaults from a program configuration file.
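The parsing and precedence rules above can be sketched roughly as follows. This is an illustrative Python sketch, not the code the moot utilities actually use: the function names parse_rc_text and effective_options are hypothetical, and tolerating leading whitespace before a '#' comment is an assumption.

```python
def parse_rc_text(text):
    """Parse configuration text into {long_option_name: value or None}."""
    options = {}
    for line in text.splitlines():
        stripped = line.strip()
        # Blank lines and comment lines are ignored.
        # (Assumption: leading whitespace before '#' is tolerated.)
        if not stripped or stripped.startswith("#"):
            continue
        # Fields are whitespace-separated: option name, then optional value.
        fields = stripped.split(None, 1)
        name = fields[0]
        value = fields[1].strip() if len(fields) > 1 else None
        options[name] = value
    return options


def effective_options(system_text, user_text, command_line):
    """Merge settings in the documented order of precedence.

    The system defaults are read first, the user preferences override
    them, and options given on the command line override both.
    """
    merged = parse_rc_text(system_text)
    merged.update(parse_rc_text(user_text))
    merged.update(command_line)
    return merged
```

Here system_text and user_text stand for the contents of /etc/${PROGNAME}rc and ${HOME}/.${PROGNAME}rc respectively, and command_line is a dictionary of options already parsed from the command line.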

TEXT FILE FORMATS (NATIVE)

Raw Text Files

A "raw" text file is just that: any file consisting of (8-bit or variable-width encoded) characters. Such files may be processed by the mootpp preprocessor to produce "rare" cooked text files (neither tagged nor analyzed), or by the waste tokenizer with an appropriate tokenization model. An example "raw" text file is:

 This is a test.  This too.

Cooked Text Files

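A "cooked" text file is a text file which encodes information such as token boundaries, sentence boundaries, part-of-speech tags, and/or potential analyses. The moot utilities distinguish between several different types of cooked text file. In order of ascending informational content, these are "rare" (token text only), "medium rare" (text and analyses), "medium" (text and best tags), and "well done" (text, best tags, and analyses); see "Compound Flags" under "I/O Format Flags" below.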

Different moot utilities require their input files to be more or less "cooked" -- see the documentation of the individual utilities for details.

Native "cooked" text files are conventionally identified by the filename infix ".moot".

XML FILE FORMATS

moot currently uses the (extremely cool and amazingly fast) Expat XML parser library by James Clark for incremental processing of XML documents, and librecode by François Pinard for output recoding. A previous implementation used libxml2 (also extremely cool, but not quite as amazingly fast as expat); the moot libxml2 support is no longer maintained and is disabled by default. Both expat and librecode support are compile-time options -- check the contents of 'mootConfig.h' to see whether they are enabled on your system.

When working with "cooked" XML (see below), it is critical to remember that the moot internal processing routines always receive token and PoS-tag text encoded in UTF-8, regardless of the document encoding. This is of particular importance when converting from native to XML format, i.e. with 'mootchurn' -- it is highly recommended that you use the 'recode' command-line utility (distributed with 'librecode') to ensure that your native text data is true UTF-8 before passing it to 'mootchurn' for XML output.

Similarly, HMM model data (see "HMM MODEL FILE FORMATS") must be UTF-8 encoded for tagging in XML mode. There is currently no way to directly convert the encoding of a binary model file, but text model files can be converted with the 'recode' command-line utility.

Future implementations might use locale information to (partially) automate the recoding process. If all of your data (training corpus, test corpus, and runtime corpora) are parsed in XML mode, none of the above should present a problem.
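As an illustration of the recoding step described above, the following Python sketch checks whether some native text data is valid UTF-8 and converts it from a known source encoding. It does not use 'recode' or librecode, and it assumes that you already know the source encoding (Latin-1 in the usage note below); the function names are hypothetical.

```python
def is_valid_utf8(data):
    """Return True if the byte string `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def recode_to_utf8(data, source_encoding):
    """Re-encode bytes from `source_encoding` (assumed known) to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")
```

For example, recode_to_utf8(raw_bytes, "latin-1") yields UTF-8 data for a Latin-1 encoded native file whose contents have been read into raw_bytes. Note that pure ASCII data is already valid UTF-8 and needs no conversion.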

XML files are identified by the filename infix '.xml'.

Raw XML Files

A "raw" XML file is just like a "raw" text file. The 'mootpp' program supports rudimentary recognition and removal of (SG|HT|X)ML markup.

Cooked XML Files

As of version 2.0.0, the moot utilities support "cooked" XML files, in addition to the native text format(s). See "Cooked Text Files" above for more details on the native formats and the information content corresponding to the various subtypes.

All "cooked" XML formats share the same structure (much as the "cooked" text formats are defined in terms of one another). The preliminary syntax (subject to change without notice) is:

 COOKED_XML_FILE    ::= {XML_DECL}? {XML_CONTENT}*
 XML_DECL           ::= "<?xml " ... "?>"
 XML_CONTENT        ::= {XML_EOS} | {XML_RAW} | {XML_TOKEN}
 XML_EOS            ::= "<eos/>"
 XML_RAW            ::= ...
 XML_TOKEN          ::= "<token>" {XML_TOKEN_CONTENT} "</token>"
 XML_TOKEN_CONTENT  ::= ({XML_TOKEN_TEXT}
                         | {XML_TOKEN_ANALYSIS}
                         | {XML_TOKEN_BESTTAG}
                         | {XML_RAW})*
 XML_TOKEN_TEXT     ::= "<text>" {TOKEN_TEXT} "</text>"
 XML_TOKEN_BESTTAG  ::= "<moot.tag>" {TOKEN_BESTTAG} "</moot.tag>"
 XML_TOKEN_ANALYSIS ::= '<analysis pos="' {ANALYSIS_TAG} '">' {ANALYSIS_DETAILS} "</analysis>"
 ANALYSIS_DETAILS   ::= {XML_RAW}*

The document structure is thus expected to be something like the following (in a bastard notation born of BNF and XPath):

 SENTENCE_BOUNDARY  ::= //eos                            # really only end-elts
 TOKEN_TEXT         ::= //token//text/text()             # should be accurate
 ANALYSIS_TAG       ::= //token//analysis/@pos           # uses attribute value (not full node)
 ANALYSIS_DETAILS   ::= //token//analysis/text()         # buggy -- actually ignored!
 TOKEN_BESTTAG      ::= //token//moot.tag[last()]/text() # should be accurate

Contact the author if you need any of the following done:

TODO

Pull up literal element name parameters from TokenReaderExpat to user-level.

TODO

Add a DTD for the default XML format to the distribution.

An example "cooked" XML document is the following:

 <?xml version="1.0"?>
 <doc>
  <!-- Sentence-1 : Well Done, Medium, and Medium Rare -->
  <token>
    <!-- A 'well done' token with minimal structure -->
    <text>This</text>
    <moot.tag>PDAT</moot.tag>
    <analysis pos="NE"/>
    <analysis pos="NN"/>
    <analysis pos="PDAT"/>
    <analysis pos="PDS"/>
  </token>
  <token>
    <!-- A 'well done' token with extra structure -->
    <text>is</text>
    <extraneous.element>
      <analysis pos="VAFIN"/>
      <moot.tag>VVFIN</moot.tag>
      <analysis pos="VVFIN"/>
    </extraneous.element>
  </token>
  <token>
    <!-- Yet another 'well done' token  -->
    <text>a</text>
    <other_extraneous_element>
      <analysis pos="ART"/>
    </other_extraneous_element>
    <moot.tag>ART</moot.tag>
  </token>
  <token>
    <!-- A 'medium' token -->
    <text>Test</text>
    <moot.tag>NN</moot.tag>
  </token>
  <token>
    <!-- A 'Medium Rare' token -->
    <text>.</text>
    <analysis pos="$."/>
  </token>
  <eos/>
  <!-- Sentence-2 : Rare tokens only -->
  <token><text>This</text></token>
  <token><text>too</text></token>
  <token><text>.</text></token>
  <eos/>
 </doc>
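To make the XPath-like description above more concrete, here is a minimal Python sketch which extracts sentences and tokens from such a document using the standard xml.etree.ElementTree module. It is not moot's own reader (moot uses Expat incrementally, via TokenReaderExpat); the function name read_cooked_xml and the dictionary layout are hypothetical, and the whole document is loaded into memory for simplicity.

```python
import xml.etree.ElementTree as ET


def read_cooked_xml(xml_text):
    """Return a list of sentences, each a list of token dictionaries."""
    root = ET.fromstring(xml_text)
    sentences, current = [], []
    # root.iter() visits all elements in document order.
    for elt in root.iter():
        if elt.tag == "token":
            text_elt = elt.find(".//text")                  # //token//text
            best = [t.text for t in elt.iter("moot.tag")]   # //token//moot.tag
            current.append({
                "text": text_elt.text if text_elt is not None else None,
                # Only the 'pos' attribute of each analysis is used.
                "analyses": [a.get("pos") for a in elt.iter("analysis")],
                # The last <moot.tag> in the token wins, as in [last()].
                "besttag": best[-1] if best else None,
            })
        elif elt.tag == "eos":
            # <eos/> closes the current sentence.
            sentences.append(current)
            current = []
    if current:
        # Assumption: trailing tokens without a final <eos/> form a sentence.
        sentences.append(current)
    return sentences
```

Applied to the example document above, this should yield two sentences: the first with five tokens of varying "doneness", the second with three "rare" tokens for which "analyses" is empty and "besttag" is None.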

I/O Format Flags

Several moot utilities are capable of processing input in a number of different formats, typically specified by the '--input-format' (-I) and '--output-format' (-O) command-line options. The following list briefly describes the (case-insensitive) format flags which may occur as individual elements of the comma-separated list passed as an argument to these format options. Each format flag may be preceded by an exclamation point "!" to indicate the negation of the respective format property. Note that at the current time, not all formats support all available flags.

If no format flags are specified by the user, the moot utilities will attempt to guess an appropriate format based on the filename and on the requirements for the particular utility in question.

Basic Flags

None

No flags at all. This should never really happen at runtime, and should cause a default format to be assumed and/or an appropriate format to be guessed from the relevant filename(s).

Null

If you specify 'null' as an output format, no output will actually be written (useful for testing and benchmarking the input layer).

Unknown

Unknown format. This should never ever happen, and should cause a reversion to some default format.

Native

Specifies native text format I/O, as opposed to XML.

XML

Specifies XML format I/O, as opposed to a native text format.

Pretty

Beautified XML format. Useful for human-readable XML output. Not all XML I/O modes support cosmetic surgery.

Conserve

Conservative XML format: attempt to preserve as much of the input document structure as possible. Only meaningful if both XML input and XML output are requested.

Text

Read/write token text (all formats).

Analyzed

Read/write token analyses ('medium rare' or 'well done' formats only).

Tagged

Read/write 'best tags' ('medium' or 'well done' formats only).

Location

Read/write token locations as logical pairs (BYTE_OFFSET,BYTE_LENGTH) from/to the input stream as the first non-tag analysis. Useful if you need to refer back to earlier stages of a token processing pipeline.

Cost

Read/write analysis "costs" from/to analysis "<NUMBER>" suffixes. This flag may be set by default in future versions.

Pruned

For 'well done' formats, ignore analyses which do not correspond to the 'best' tag.

Trace

If set as an output format flag, causes a verbose dump of the Viterbi trellis to be spliced into every tagged sentence as post-token comments. Does nothing as an input flag (yet). Implies "Flush".

Predict

If set as an output flag, causes a verbose dump of Viterbi trellis-based predictions to be spliced into every tagged sentence as post-token comments. Does nothing as an input flag (yet). Implies both "Trace" and "Flush".

Flush

If set as an output flag, causes the underlying output stream to be implicitly flushed after each write operation. Currently only meaningful for native output mode. Does nothing as an input flag (yet).

Compound Flags

Rare
R

Alias for 'Text'.

MediumRare
MR

Alias for 'Text,Analyzed'.

Medium
M

Alias for 'Text,Tagged'.

WellDone
WD

Alias for 'Text,Tagged,Analyzed'.

Examples
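The following Python sketch illustrates how a comma-separated, case-insensitive flag list with "!" negation and compound aliases might be expanded into basic flags. It is only an illustration of the rules described in this section, not moot's actual option parser: the function name parse_format_flags is hypothetical, and the treatment of a negated compound flag (negating each of its component flags) is an assumption.

```python
BASIC_FLAGS = {
    "none", "null", "unknown", "native", "xml", "pretty", "conserve",
    "text", "analyzed", "tagged", "location", "cost", "pruned",
    "trace", "predict", "flush",
}

# Compound flags and their short aliases, as listed above.
COMPOUND_FLAGS = {
    "rare": ["text"], "r": ["text"],
    "mediumrare": ["text", "analyzed"], "mr": ["text", "analyzed"],
    "medium": ["text", "tagged"], "m": ["text", "tagged"],
    "welldone": ["text", "tagged", "analyzed"],
    "wd": ["text", "tagged", "analyzed"],
}

# Flags which imply other flags when enabled.
IMPLIED = {"trace": ["flush"], "predict": ["trace", "flush"]}


def parse_format_flags(spec):
    """Return (enabled, disabled) sets of basic flag names for `spec`."""
    enabled, disabled = set(), set()
    for item in spec.split(","):
        item = item.strip().lower()      # flags are case-insensitive
        if not item:
            continue
        negated = item.startswith("!")
        name = item[1:] if negated else item
        if name in COMPOUND_FLAGS:
            parts = list(COMPOUND_FLAGS[name])
        elif name in BASIC_FLAGS:
            parts = [name] + ([] if negated else IMPLIED.get(name, []))
        else:
            raise ValueError("unknown format flag: %r" % item)
        for part in parts:
            # Later items in the list override earlier ones (assumption).
            if negated:
                enabled.discard(part)
                disabled.add(part)
            else:
                disabled.discard(part)
                enabled.add(part)
    return enabled, disabled
```

Under these assumptions, a specification such as 'XML,WD,!Analyzed,Pretty' enables the flags xml, text, tagged, and pretty, and disables analyzed.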

HMM MODEL FILE FORMATS

The moothmm(1) program can use either text- or native binary-format model files, which encode raw frequency counts (text model files), or probability tables and compile-time flags for the Hidden Markov Model (binary model files), respectively.

Text Models

A "Text Model" is completely specified by up to four files: a lexical frequency file (*.lex), an n-gram frequency file (*.123), an optional lexical-class frequency file (*.clx), and an optional surface/typographical heuristic 'flavor' rule file (*.fla).

When specifying a text model name to a moot utility program, you may specify the model name as TMODEL in order to use the files TMODEL.lex, TMODEL.123, TMODEL.clx, and TMODEL.fla (if present). Otherwise, you may specify a composite model name as a comma-separated list of the individual component filenames: mylex.lex,myngrams.123,myclasses.clx,myclasses.fla. Any positional field in the specification may be left blank to omit loading the associated data; e.g. to omit lexical classes but include flavor definitions, you can specify a model as mylex.lex,myngrams.123,,myclasses.fla.

HMM Binary Model Files

A "Binary Model" BINMODEL is a (compressed) binary format file storing a compiled Hidden Markov Model (probabilities and constants). It is completely specified by its filename BINMODEL. By convention, HMM binary model files carry the suffix ".hmm".

When specifying an HMM model file, note that the existence of a file BINMODEL overrides any text models which might exist in files BINMODEL.lex, BINMODEL.123, BINMODEL.clx. Use of a conventional suffix (such as ".hmm") to identify binary models eliminates such problems, since MODEL.hmm will not clash with a text model MODEL.lex, ...
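The model naming conventions described in this section can be summarized by the following Python sketch. It is a simplified illustration under stated assumptions, not the lookup code used by moothmm: the function name resolve_model_name and the returned dictionary layout are hypothetical, and fields missing from the end of a composite name are assumed to be treated like blank fields.

```python
import os

# Component suffixes in positional order: lexical, n-gram, class, flavor.
TEXT_MODEL_SUFFIXES = ("lex", "123", "clx", "fla")


def resolve_model_name(name, file_exists=os.path.isfile):
    """Map a model name to the file(s) it designates."""
    # An existing file with exactly this name is taken as a binary model
    # and overrides any text model files NAME.lex, NAME.123, ...
    if file_exists(name):
        return {"binary": name}
    # Composite name: comma-separated component filenames, where a blank
    # field omits the corresponding component.
    if "," in name:
        fields = name.split(",")
        fields += [""] * (len(TEXT_MODEL_SUFFIXES) - len(fields))
        return {suffix: (field or None)
                for suffix, field in zip(TEXT_MODEL_SUFFIXES, fields)}
    # Plain text model name: use NAME.lex, NAME.123, ... if present.
    result = {}
    for suffix in TEXT_MODEL_SUFFIXES:
        candidate = "%s.%s" % (name, suffix)
        result[suffix] = candidate if file_exists(candidate) else None
    return result
```

The file_exists parameter exists only so that the sketch can be tried out without touching the file system; by default, os.path.isfile is used.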

HMM Dumps

An HMM dump is a plain text file containing all the information stored in a compiled HMM. The format exists solely for purposes of debugging.

ACKNOWLEDGEMENTS

Development of this package was supported by the project 'Kollokationen im Wörterbuch' ( "collocations in the dictionary", http://www.bbaw.de/forschung/kollokationen ) in association with the project 'Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS)' ( "digital dictionary of the German language of the 20th century", http://www.dwds.de ) at the Berlin-Brandenburgische Akademie der Wissenschaften ( http://www.bbaw.de ) with funding from the Alexander von Humboldt Stiftung ( http://www.avh.de ) and from the Zukunftsinvestitionsprogramm of the German federal government.

I am grateful to Christiane Fellbaum, Alexander Geyken, Gerald Neumann, Edmund Pohl, Alexey Sokirko, and others for offering useful insights in the course of development of this package.

Thomas Hanneforth wrote and maintains the libFSM C++ library for finite-state device operations used in the development of the class-based HMM tagger / disambiguator. Alexander Geyken and Thomas Hanneforth developed the rule-based morphological analysis system for German which was used in the development of the class-based HMM tagger / disambiguator.

AUTHOR

Bryan Jurish <moocow@cpan.org>

SEE ALSO

mootutils
