This manpage describes various file formats used by the moot PoS tagging utilities.
Most moot utility programs support global and user-specific configuration files which can be used to set system defaults and/or user preferences for values of program options.
Configuration files are expected to contain lines of the form:
LONG_OPTION_NAME OPTION_VALUE
where LONG_OPTION_NAME is the long name of one of the program's options, without the leading '--', and OPTION_VALUE is the value for that option, if any. Fields are whitespace-separated. Blank lines and comments (lines beginning with a '#' character) are ignored.
The following configuration files are read by default, where ${PROGNAME} is the name of a moot utility program, and ${HOME} is the home directory of the current user:
Any options specified on the command-line override defaults from a program configuration file.
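The configuration rules above can be sketched in a few lines of Python. This is only an illustration, not the moot implementation: the function names and the option names in the example are hypothetical, and the sketch assumes that a comment line may be preceded by whitespace.

```python
def parse_config(text):
    """Parse 'LONG_OPTION_NAME OPTION_VALUE' lines into a dict."""
    options = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are ignored
        fields = stripped.split(None, 1)  # fields are whitespace-separated
        name = fields[0]
        value = fields[1].strip() if len(fields) > 1 else None  # value is optional
        options[name] = value
    return options


def effective_options(config_text, cli_options):
    """Command-line options override defaults read from a configuration file."""
    merged = parse_config(config_text)
    merged.update(cli_options)
    return merged


if __name__ == "__main__":
    # Hypothetical option names, for illustration only.
    example = "# user preferences\nverbose 2\n\nsome-flag\n"
    print(effective_options(example, {"verbose": "0"}))
```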
A ``raw'' text file is just that: any file consisting of (8-bit) characters. Such files may be processed by the mootpp preprocessor to produce ``rare cooked'' (-tagged, -analyzed) text files. An example ``raw'' text file is:
This is a test. This too.
A ``cooked'' text file is a text file which encodes information such as token boundaries, sentence boundaries, part-of-speech tags, and/or potential analyses. The moot utilities distinguish between several different types of cooked text file. In order of ascending informational content, these are: ``rare'' (-tagged, -analyzed), ``medium rare'' (-tagged, +analyzed), ``medium'' (+tagged, -analyzed), ``well done'' (+tagged, +analyzed), and ``refried'' (+tagged, +analyzed, +evaluated). Different moot utilities require their input files to be more or less ``cooked'' -- see the documentation of the individual utilities for details.
Native ``cooked'' text files are conventionally identified by the filename infix ``.moot''.
RARE_FILE  ::= {RARE_LINE}*
RARE_LINE  ::= ({COMMENT} | {EOS} | {RARE_TOKEN}) {NEWLINE}
COMMENT    ::= {SPACE}* "%%" ([^{NEWLINE}])*
EOS        ::= ( {SPACE}* {NEWLINE} )+
RARE_TOKEN ::= {TOKEN_TEXT}
TOKEN_TEXT ::= ( {WORDCHAR} | {SPACE} )+
SPACE      ::= " "
NEWLINE    ::= "\n" | "\r"
WORDCHAR   ::= [^{SPACE}{NEWLINE}]
Leading and trailing spaces are stripped from token text; it is thus impossible to declare an ``empty'' token. An example ``rare cooked'' file is:
%% Example rare cooked file for moot
%% Sentence 1
This
is
a
test
.

%% Sentence 2
This
too
.
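As a rough illustration of the ``rare'' format, the following Python sketch splits such a file into sentences. It is not the moot reader: the function name is hypothetical, and the sketch assumes "\n" line endings (a trailing "\r" is simply stripped).

```python
def read_rare(text):
    """Return a list of sentences, each a list of token strings."""
    sentences, current = [], []
    for line in text.split("\n"):
        if line.lstrip(" ").startswith("%%"):
            continue  # COMMENT: optional spaces, then "%%"
        token = line.strip(" \r")  # leading/trailing spaces are stripped
        if token:
            current.append(token)  # token text may contain inner spaces
        elif current:
            sentences.append(current)  # a blank line marks end-of-sentence
            current = []
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    sample = "%% Sentence 1\nThis\nis\na\ntest\n.\n\n%% Sentence 2\nThis\ntoo\n.\n"
    for sentence in read_rare(sample):
        print(sentence)
```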
MED_RARE_FILE  ::= {MED_RARE_LINE}*
MED_RARE_LINE  ::= ({COMMENT} | {EOS} | {MED_RARE_TOKEN}) {NEWLINE}
MED_RARE_TOKEN ::= {TOKEN_TEXT} ( {TAB} {ANALYSIS} )*
ANALYSIS       ::= {DETAIL_PREFIX}? {TAG} {DETAIL_SUFFIX}?
DETAIL_PREFIX  ::= ( {WORDCHAR} | {SPACE} | {COST} )* ("[" ("_"?))?
COST           ::= "<" ("-"|"+")? ([0-9]* ".")? [0-9]+ ">"
TAG            ::= {TAGCHAR}+
DETAIL_SUFFIX  ::= ( {WORDCHAR} | {SPACE} | {COST} )*
TAGCHAR        ::= [^{SPACE}{TAB}{NEWLINE}"]"]
TAB            ::= "\t"
Leading and trailing spaces are stripped from token text, as well as from analysis-detail and -tag text. It should be noted that the TAG component of each ANALYSIS is ``greedy'' -- if an analysis contains no left-bracket to mark the beginning of a tag, then the whole analysis (up to the first right-bracket or space) is considered the tag. Also, if the analysis contains multiple left-brackets, only the first is considered to introduce the TAG component. An example ``medium rare'' file is:
%% Example medium-rare cooked file for moot
%% Sentence 1 : possible analyses are tags only
This	NE	PDAT	PDIS
is	VAFIN	VVFIN
a	ART	CARD
test	NN	VVIN
.	$.

%% Sentence 2 : detailed analyses, with unknown word "foo".
This	This [NE type="name"] <420>	<24.7> this [_PDAT][_sg]
foo
.	. [$.] <-42>
Tokens in ``medium rare'' files with empty analysis sets (i.e. RARE_TOKENs) are called ``unrecognized'' tokens.
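The ``greedy'' tag rule can be sketched as follows. This is a simplified reading of the grammar rather than the moot parser: the function names are hypothetical, and COST elements are not treated specially when an analysis contains no left-bracket.

```python
import re


def extract_tag(analysis):
    """Return the TAG component of one ANALYSIS string."""
    analysis = analysis.strip(" ")
    start = analysis.find("[")  # only the first '[' introduces the tag
    if start >= 0:
        start += 1
        if analysis.startswith("_", start):
            start += 1  # skip the optional '_' after '['
    else:
        start = 0  # no '[': the tag begins at the start of the analysis
    # TAGCHAR excludes space, tab, newline and ']'
    return re.match(r"[^ \t\n\]]*", analysis[start:]).group(0)


def parse_medium_rare_token(line):
    """Split a token line into (text, [(tag, analysis), ...])."""
    fields = [f.strip(" ") for f in line.split("\t")]
    text, analyses = fields[0], [f for f in fields[1:] if f]
    return text, [(extract_tag(a), a) for a in analyses]


if __name__ == "__main__":
    print(parse_medium_rare_token("This\tThis [NE type=\"name\"] <420>\t<24.7> this [_PDAT][_sg]"))
    print(parse_medium_rare_token("foo"))  # empty analysis list: an ``unrecognized'' token
```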
MEDIUM_FILE   ::= {MEDIUM_LINE}*
MEDIUM_LINE   ::= ({COMMENT} | {EOS} | {MEDIUM_TOKEN}) {NEWLINE}
MEDIUM_TOKEN  ::= {TOKEN_TEXT} {TAB} {BEST_ANALYSIS}
BEST_ANALYSIS ::= {ANALYSIS}
As before, leading and trailing spaces are stripped from token text and analyses, and the TAG component of each ANALYSIS is ``greedy''. An example ``medium'' file is:
%% Example medium cooked file for moot
%% Sentence 1 : best analyses are tags only
This	PDAT
is	VVFIN
a	ART
test	NN
.	$.

%% Sentence 2 : tags embedded in detailed analyses
This	<24.7> this [PDAT num="sg"]
too	<0.0> too [ADV]
.	<-42> . [$.]
WELL_DONE_FILE  ::= {WELL_DONE_LINE}*
WELL_DONE_LINE  ::= ({COMMENT} | {EOS} | {WELL_DONE_TOKEN}) {NEWLINE}
WELL_DONE_TOKEN ::= {TOKEN_TEXT} {TAB} {BEST_ANALYSIS} ( {TAB} {ANALYSIS} )*
As before, leading and trailing spaces are stripped from token text and analyses, and the TAG component of each ANALYSIS is ``greedy''. An example ``well done'' file is:
%% Example well-done cooked file for moot
%% Sentence 1 : analysis-set tags bracketed for clarity
This	PDAT	[NE]	[PDAT]	[PDIS]
is	VVFIN	[VAFIN]	[VVFIN]
a	ART	[ART]	[CARD]
test	NN	[NN]	[VVFIN]
.	$.	[$.]

%% Sentence 2 : analysis-tags embedded in complete analyses
This	PDAT	[NE type="last"] This <420>	[PDAT num="sg"] this <24.7>
too	ADV	[ADV] too <0.0>
.	$.	[$.] . <-42>
REFRIED_FILE     ::= {REFRIED_LINE}*
REFRIED_LINE     ::= ( {COMMENT} | {EOS} | {REFRIED_TOKEN} ) {NEWLINE}
REFRIED_TOKEN    ::= {STATUS_CODE} {TAB} {REFRIED_SOURCES}
REFRIED_SOURCES  ::= {WELL_DONE_TOKEN} {TAB} "/" {TAB} {WELL_DONE_TOKEN}
STATUS_CODE      ::= {BASIC_FLAGS} ":" {FILE1_FLAGS} ":" {FILE2_FLAGS}
BASIC_FLAGS      ::= {TOKMATCH_FLAG} {BESTMATCH_FLAG}
TOKMATCH_FLAG    ::= "-" | "t"
BESTMATCH_FLAG   ::= "-" | "b"
FILE1_FLAGS      ::= {FILE_FLAGS}
FILE2_FLAGS      ::= {FILE_FLAGS}
FILE_FLAGS       ::= {EMPTY_FLAG} {IMPOSSIBLE_FLAG} {XIMPOSSIBLE_FLAG}
EMPTY_FLAG       ::= "-" | "e"
IMPOSSIBLE_FLAG  ::= "-" | "i"
XIMPOSSIBLE_FLAG ::= "-" | "x"
As before, leading and trailing spaces are stripped from token text and analyses, and the TAG component of each ANALYSIS is ``greedy''.
The STATUS_CODE component of each REFRIED_TOKEN encodes a number of flags concerning which part (if any) of the tokens compared did not match. The general convention is use of a '-' character to indicate that the compared tokens matched (or at least were compatible).
An example ``refried'' file is:
%% Example refried file for moot
%% FLAGS TOK1 TOK1TAG1 ... / TOK2 TOK2TAG1 ...
%%------------------------------------------------------------------------------------
t-:---:---	Dis	PDAT	[PDAT]	[PDIS]	/	This	PDAT	[PDAT]	[PDIS]
-b:---:---	is	VAFIN	[VAFIN]	[VVFIN]	/	is	VVFIN	[VAFIN]	[VVFIN]
--:e--:---	a	ART	/	a	ART	[ART]	[CARD]
-b:-i-:---	test	NN	[VVFIN]	/	test	VVFIN	[NN]	[VVFIN]
--:---:---	.	$.	[$.]	/	.	$.	[$.]
-b:--x:---	This	PDAT	[PDAT]	/	This	PDIS	[PDAT]	[PDIS]
--:---:-ix	too	ADV	[ADV]	[PTKA]	/	too	ADV	[CONJ]
--:---:e--	.	$.	[$.]	/	.	$.
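A STATUS_CODE can be unpacked into its individual flags along the lines of the sketch below. The function name and the dictionary keys are hypothetical (the keys merely echo the grammar's flag names); following the convention above, a value of True means the flag character was present instead of '-'.

```python
def decode_status(code):
    """Decode a STATUS_CODE such as '-b:-i-:---' into booleans."""
    basic, file1, file2 = code.split(":")
    if (len(basic), len(file1), len(file2)) != (2, 3, 3):
        raise ValueError("malformed STATUS_CODE: %r" % code)

    def file_flags(flags):
        # EMPTY_FLAG, IMPOSSIBLE_FLAG, XIMPOSSIBLE_FLAG, in that order
        return {
            "empty": flags[0] == "e",
            "impossible": flags[1] == "i",
            "ximpossible": flags[2] == "x",
        }

    return {
        "tok_flag": basic[0] == "t",   # TOKMATCH_FLAG is set
        "best_flag": basic[1] == "b",  # BESTMATCH_FLAG is set
        "file1": file_flags(file1),
        "file2": file_flags(file2),
    }


if __name__ == "__main__":
    print(decode_status("t-:---:---"))
    print(decode_status("--:---:-ix"))
```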
moot currently uses the (extremely cool and amazingly fast) Expat XML parser library by James Clark for incremental processing of XML documents, as well as librecode by François Pinard for output recoding. A previous implementation used libxml2 (also extremely cool, but not quite as amazingly fast as expat); the moot libxml2 support is no longer maintained and is disabled by default. Both expat and librecode support are compile-time options -- check the contents of 'mootConfig.h' to see whether they are enabled on your system.
When working with ``cooked'' XML (see below), it is critical to remember that the moot internal processing routines always receive token and PoS-tag text encoded in UTF-8, regardless of the document encoding. This is of particular importance when converting from native to XML format, i.e. with 'mootchurn' -- it is highly recommended that you use the 'recode' command-line utility (distributed with 'librecode') to ensure that your native text data is true UTF-8 before passing it to 'mootchurn' for XML output. Future implementations might use locale information to automate this process. If all of your data (training corpus, test corpus, and runtime corpora) are parsed in XML mode, none of the above should present a problem.
XML files are identified by the filename infix '.xml'.
A ``raw'' XML file is just like a ``raw'' text file. The 'mootpp' program supports rudimentary recognition and removal of (SG|HT|X)ML markup.
As of version 2.0.0, the moot utilities support ``cooked'' XML files, in addition to the native text format(s). See ``Cooked Text Files'' above for more details on the native formats and the information content corresponding to the various subtypes.
All ``cooked'' XML formats share the same structure (much as the ``cooked'' text formats are defined in terms of one another). The preliminary syntax (subject to change without notice) is:
COOKED_XML_FILE    ::= {XML_DECL}? {XML_CONTENT}*
XML_DECL           ::= "<?xml " ... "?>"
XML_CONTENT        ::= {XML_EOS} | {XML_RAW} | {XML_TOKEN}
XML_EOS            ::= "<eos/>"
XML_RAW            ::= ...
XML_TOKEN          ::= "<token>" {XML_TOKEN_CONTENT} "</token>"
XML_TOKEN_CONTENT  ::= ({XML_TOKEN_TEXT} | {XML_TOKEN_ANALYSIS} | {XML_TOKEN_BESTTAG} | {XML_RAW})*
XML_TOKEN_TEXT     ::= "<text>" {TOKEN_TEXT} "</text>"
XML_TOKEN_BESTTAG  ::= "<moot.tag>" {TOKEN_BESTTAG} "</moot.tag>"
XML_TOKEN_ANALYSIS ::= '<analysis pos="' {ANALYSIS_TAG} '">' {ANALYSIS_DETAILS} "</analysis>"
ANALYSIS_DETAILS   ::= {XML_RAW}*
The document structure is thus expected to be something like the following (in a bastard notation born of BNF and XPath):
SENTENCE_BOUNDARY ::= //eos                            # really only end-elts
TOKEN_TEXT        ::= //token//text/text()             # should be accurate
ANALYSIS_TAG      ::= //token//analysis/@pos           # uses attribute value (not full node)
ANALYSIS_DETAILS  ::= //token//analysis/text()         # buggy -- actually ignored!
TOKEN_BESTTAG     ::= //token//moot.tag[last()]/text() # should be accurate
TODO: pull up literal element name parameters from TokenReaderExpat to user-level.
TODO: add a DTD for the default XML format to the distribution.
An example ``cooked'' XML document is the following:
<?xml version="1.0"?>
<doc>
  <!-- Sentence-1 : Well Done, Medium, and Medium Rare -->
  <token>
    <!-- A 'well done' token with minimal structure -->
    <text>This</text>
    <moot.tag>PDAT</moot.tag>
    <analysis pos="NE"/>
    <analysis pos="NN"/>
    <analysis pos="PDAT"/>
    <analysis pos="PDS"/>
  </token>
  <token>
    <!-- A 'well done' token with extra structure -->
    <text>is</text>
    <extraneous.element>
      <analysis pos="VAFIN"/>
      <moot.tag>VVFIN</moot.tag>
      <analysis pos="VVFIN"/>
    </extraneous.element>
  </token>
  <token>
    <!-- Yet another 'well done' token -->
    <text>a</text>
    <other_extraneous_element>
      <analysis pos="ART"/>
    </other_extraneous_element>
    <moot.tag>ART</moot.tag>
  </token>
  <token>
    <!-- A 'medium' token -->
    <text>Test</text>
    <moot.tag>NN</moot.tag>
  </token>
  <token>
    <!-- A 'Medium Rare' token -->
    <text>.</text>
    <analysis pos="$."/>
  </token>
  <eos/>
  <!-- Sentence-2 : Rare tokens only -->
  <token><text>This</text></token>
  <token><text>too</text></token>
  <token><text>.</text></token>
  <eos/>
</doc>
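To make the document structure concrete, here is a minimal sketch that reads such a file with Python's standard xml.etree.ElementTree module. It is not the Expat-based reader used by moot; the function name and the returned dictionary layout are hypothetical, and analysis details are ignored, as noted in the notation above.

```python
import xml.etree.ElementTree as ET


def read_cooked_xml(xml_text):
    """Return a list of sentences; each token is a dict of text, besttag, analyses."""
    root = ET.fromstring(xml_text)
    sentences, current = [], []
    for elt in root.iter():  # document order, at any depth
        if elt.tag == "eos":
            sentences.append(current)
            current = []
        elif elt.tag == "token":
            text_elt = next(elt.iter("text"), None)   # //token//text
            best = list(elt.iter("moot.tag"))         # //token//moot.tag
            current.append({
                "text": (text_elt.text or "") if text_elt is not None else "",
                "besttag": best[-1].text if best else None,  # [last()]
                "analyses": [a.get("pos") for a in elt.iter("analysis")],  # @pos
            })
    if current:
        sentences.append(current)  # tokens after the final <eos/>, if any
    return sentences


if __name__ == "__main__":
    sample = ('<doc><token><text>This</text><moot.tag>PDAT</moot.tag>'
              '<analysis pos="NE"/></token><eos/></doc>')
    print(read_cooked_xml(sample))
```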
Several moot utilities are capable of processing input in a number of different formats, typically specified by '-I' and '-O' command-line options. The following list briefly describes the (case-insensitive) format flags which may be passed to such format options. Note that at the current time, not all formats support all available flags.
If no format flags are specified by the user, the moot utilities will attempt to guess an appropriate format based on the filename and on the requirements for the particular utility in question.
The moothmm(1) program can use either text- or native binary-format model files, which encode raw frequency counts (text model files), or probability tables and compile-time flags for the Hidden Markov Model (binary model files), respectively.
A ``Text Model'' is completely specified by three files: a lexical frequency file (*.lex), an n-gram frequency file (*.123), and an optional lexical-class frequency file (*.clx).
When specifying a text model name to a moot utility program, you may specify the model name as TMODEL in order to use the files TMODEL.lex, TMODEL.123, and TMODEL.clx (if present). Otherwise, you may specify a composite model name as a comma-separated list of the individual component filenames: mylex.lex,myngrams.123,myclasses.clx.
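The naming convention can be sketched as below. The function name is hypothetical; the sketch only derives filenames and does not check whether the files exist. It also assumes that a composite name lists its components in the order lexicon, n-grams, classes, and it pads missing components with None, which the text above does not specify.

```python
def resolve_text_model(name):
    """Map a text model name to its component filenames."""
    if "," in name:
        # composite name: comma-separated component filenames
        parts = name.split(",")
        parts += [None] * (3 - len(parts))  # assumption: trailing parts may be absent
        lex, ngrams, classes = parts[:3]
    else:
        # simple name: TMODEL -> TMODEL.lex, TMODEL.123, TMODEL.clx
        lex, ngrams, classes = name + ".lex", name + ".123", name + ".clx"
    return {"lex": lex, "ngrams": ngrams, "classes": classes}


if __name__ == "__main__":
    print(resolve_text_model("TMODEL"))
    print(resolve_text_model("mylex.lex,myngrams.123,myclasses.clx"))
```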
LEX_FILE    ::= ({COMMENT} | {BLANK_LINE} | {LEX_ENTRY})*
COMMENT     ::= {SPACE}* "%%" ([^{NEWLINE}])* {NEWLINE}
BLANK_LINE  ::= {SPACE}* {NEWLINE}
LEX_ENTRY   ::= {TOKEN_TEXT} {TAB} {TOKEN_TOTAL} ( {TAB} {TAG_COUNT} )*
TAG_COUNT   ::= {TAG_TEXT} {TAB} {TOK_TAG_CT}
TOKEN_TOTAL ::= {COUNT}
TOK_TAG_CT  ::= {COUNT}
TOKEN_TEXT  ::= {STRING} | {SPECIAL_TOK}
TAG_TEXT    ::= {STRING}
STRING      ::= ( [^{TAB}{NEWLINE}] )+
COUNT       ::= ("-"|"+")? ([0-9]* ".")? [0-9]+
NEWLINE     ::= "\n" | "\r"
TAB         ::= "\t"
SPECIAL_TOK ::= "@UNKNOWN" | "@CARD" | "@CARDSEPS" | "@CARDPUNCT" | "@CARDSUFFIX"
Leading and trailing spaces are stripped from token and tag text.
The special tokens whose text begins with an '@' character declare counts for special token types:
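``@CARD'' tokens match the regex:
[0-9]+
``@CARDSEPS'' tokens match the regex:
([[:digit:]]+)([\.\,\-]|[[:digit:]])*
``@CARDPUNCT'' tokens match the regex:
([[:digit:]]+)([[:punct:]])
For ``@CARDSUFFIX'' tokens, what constitutes a valid suffix depends on whether the moot_TNT_COMPAT macro was defined when you compiled libmoot:
If moot_TNT_COMPAT was defined, then the suffix of ``@CARDSUFFIX'' tokens is required to be of maximum length 3, thus matching the regex:
([[:digit:]]+)(.{1,3})
Otherwise, the suffix for ``@CARDSUFFIX'' tokens may be of arbitrary length:
([[:digit:]]+)(.*)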
An example lexical frequency file is:
%% Example lexical frequency file
This	4	PDAT	4
is	1.0	VVFIN	0.7	VAFIN	0.3
a	365	ART	350	CARD	5
test	1	NN	0.5	VVFIN	0.5
too	1	ADV	1
.	42	$.	42
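A lexical frequency file can be loaded with a short routine such as the following sketch. The function name and return layout are hypothetical, and "\n" line endings are assumed; special ``@'' tokens are stored like any other token text.

```python
def parse_lex(text):
    """Return {token: (total, {tag: count})} from lexical frequency data."""
    lexicon = {}
    for line in text.split("\n"):
        line = line.rstrip("\r")
        if not line.strip() or line.lstrip(" ").startswith("%%"):
            continue  # blank line or comment
        # fields are tab-separated; surrounding spaces are stripped
        fields = [f.strip(" ") for f in line.split("\t")]
        token, total = fields[0], float(fields[1])
        pairs = fields[2:]  # alternating TAG_TEXT, TOK_TAG_CT
        counts = {pairs[i]: float(pairs[i + 1]) for i in range(0, len(pairs) - 1, 2)}
        lexicon[token] = (total, counts)
    return lexicon


if __name__ == "__main__":
    sample = "%% Example\nis\t1.0\tVVFIN\t0.7\tVAFIN\t0.3\ntoo\t1\tADV\t1\n"
    print(parse_lex(sample))
```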
NGRAM_FILE  ::= ({COMMENT} | {BLANK_LINE} | {NGRAM_ENTRY})*
COMMENT     ::= {SPACE}* "%%" ([^{NEWLINE}])* {NEWLINE}
BLANK_LINE  ::= {SPACE}* {NEWLINE}
NGRAM_ENTRY ::= {UNIGRAM} | {BIGRAM} | {TRIGRAM}
UNIGRAM     ::= {TAG} {TAB} {COUNT}
BIGRAM      ::= {TAG} {TAB} {TAG} {TAB} {COUNT}
TRIGRAM     ::= {TAG} {TAB} {TAG} {TAB} {TAG} {TAB} {COUNT}
TAG         ::= EOS_TAG | ( [^{TAB}{NEWLINE}] )*
EOS_TAG     ::= "__$"
COUNT       ::= ("-"|"+")? ([0-9]* ".")? [0-9]+
NEWLINE     ::= "\n" | "\r"
TAB         ::= "\t"
Leading and trailing spaces are stripped from tags. An empty TAG component is populated with the tag in the corresponding position from the last n-gram parsed -- exhaustive use of this feature produces ``short'' format n-gram files. Non-use of this feature produces ``long'' format n-gram files.
An example ``long'' format n-gram file is:
%% Example n-gram frequency file in "long" format
__$	2
__$	PDAT	2
__$	PDAT	VVFIN	1
__$	PDAT	ADV	1
ADV	1
ADV	$.	1
ADV	$.	__$	1
ART	1
ART	NN	1
ART	NN	$.	1
PDAT	2
PDAT	VVFIN	1
PDAT	VVFIN	ART	1
PDAT	ADV	1
PDAT	ADV	$.	1
VVFIN	1
VVFIN	ART	1
VVFIN	ART	NN	1
NN	1
NN	$.	1
NN	$.	__$	1
The same data in ``short'' format:
%% Example n-gram frequency file in "short" format
__$	2
	PDAT	2
		VVFIN	1
		ADV	1
ADV	1
	$.	1
		__$	1
ART	1
	NN	1
		$.	1
PDAT	2
	VVFIN	1
		ART	1
	ADV	1
		$.	1
VVFIN	1
	ART	1
		NN	1
NN	1
	$.	1
		__$	1
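The relation between the two formats can be illustrated by expanding ``short'' entries into ``long'' ones: each empty TAG field is filled with the tag in the same position of the previously parsed n-gram. The sketch below is a hypothetical helper, not the moot reader, and assumes "\n" line endings.

```python
def expand_ngrams(text):
    """Return [(tags_tuple, count), ...] with empty tags filled in."""
    entries, last = [], []
    for line in text.split("\n"):
        line = line.rstrip("\r")
        if not line.strip() or line.lstrip(" ").startswith("%%"):
            continue  # blank line or comment
        # do not strip tabs: leading empty fields are significant
        fields = [f.strip(" ") for f in line.split("\t")]
        tags, count = fields[:-1], float(fields[-1])
        # an empty tag inherits the tag at the same position of the last n-gram
        tags = [tag if tag else last[i] for i, tag in enumerate(tags)]
        last = tags
        entries.append((tuple(tags), count))
    return entries


if __name__ == "__main__":
    short = "__$\t2\n\tPDAT\t2\n\t\tVVFIN\t1\n\t\tADV\t1\nADV\t1\n\t$.\t1\n"
    for tags, count in expand_ngrams(short):
        print("\t".join(tags), count)
```

Input that is already in ``long'' format passes through unchanged, since it contains no empty tags.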
CLASS_FILE  ::= ({COMMENT} | {BLANK_LINE} | {CLASS_ENTRY})*
CLASS_ENTRY ::= {CLASS_ELTS} {TAB} {CLASS_TOTAL} ( {TAB} {TAG_COUNT} )*
CLASS_ELTS  ::= ( {CLASS_TAG} {SPACE} )*
CLASS_TAG   ::= ( [^{SPACE}{TAB}{NEWLINE}] )+
As for lexical frequency files, leading and trailing whitespace is stripped from class and tag text.
The CLASS_ELTS component specifies a (space-separated) list of tags belonging to the lexical class. All other (tab-separated) fields are as for a lexical frequency file.
A pair (CLASS,TAG) such that TAG is not an element of CLASS is called a ``contradictory pair'' or an ``impossible pair''. It is not required that the tags in the TAG_COUNT components of a CLASS_ENTRY are ``possible'' in this sense, although it certainly helps if this is the case.
An example lexical class frequency file is:
%% Example lexical class frequency file
PDAT NE	4	PDAT	4
VVFIN VAFIN	1.0	VVFIN	0.7	VAFIN	0.3
ART CARD	365	ART	350	CARD	5
NN VVFIN	1	NN	0.5	VVFIN	0.5
ADV	1	ADV	1
$.	42	$.	42
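The notion of an ``impossible pair'' can be checked mechanically. The following sketch (a hypothetical helper, assuming "\n" line endings) lists every (CLASS,TAG) pair in a class frequency file whose tag is not an element of the class.

```python
def impossible_pairs(text):
    """Return [(class_tags_tuple, tag), ...] for tags outside their class."""
    pairs = []
    for line in text.split("\n"):
        line = line.rstrip("\r")
        if not line.strip() or line.lstrip(" ").startswith("%%"):
            continue  # blank line or comment
        fields = [f.strip(" ") for f in line.split("\t")]
        class_elts = tuple(fields[0].split())  # space-separated class tags
        tags = fields[2::2]  # fields[1] is the class total; tags alternate with counts
        pairs.extend((class_elts, tag) for tag in tags if tag not in class_elts)
    return pairs


if __name__ == "__main__":
    sample = "PDAT NE\t4\tPDAT\t4\nADV\t2\tADV\t1\tNN\t1\n"
    print(impossible_pairs(sample))
```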
A ``Binary Model'' BINMODEL is a (compressed) binary format file storing a compiled Hidden Markov Model (probabilities and constants). It is completely specified by its filename BINMODEL. By convention, HMM binary model files carry the suffix ``.hmm''.
When specifying an HMM model file, note that the existence of a file BINMODEL overrides any text models which might exist in files BINMODEL.lex, BINMODEL.123, BINMODEL.clx.
An HMM dump is a plain text file containing all the information stored in a compiled HMM. The format exists solely for purposes of debugging.
Development of this package was supported by the project 'Kollokationen im Wörterbuch' ( ``collocations in the dictionary'', http://www.bbaw.de/forschung/kollokationen ) in association with the project 'Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS)' ( ``digital dictionary of the German language of the 20th century'', http://www.dwds.de ) at the Berlin-Brandenburgische Akademie der Wissenschaften ( http://www.bbaw.de ) with funding from the Alexander von Humboldt Stiftung ( http://www.avh.de ) and from the Zukunftsinvestitionsprogramm of the German federal government.
I am grateful to Christiane Fellbaum, Alexander Geyken, Gerald Neumann, Edmund Pohl, Alexey Sokirko, and others for offering useful insights in the course of development of this package.
Thomas Hanneforth wrote and maintains the libFSM C++ library for finite-state device operations used in the development of the class-based HMM tagger / disambiguator. Alexander Geyken and Thomas Hanneforth developed the rule-based morphological analysis system for German which was used in the development of the class-based HMM tagger / disambiguator.
Bryan Jurish <moocow@ling.uni-potsdam.de>
the mootpp manpage, the mootrain manpage, mootm, moothmm, the moot manpage