libmoot: moot PoS tagging library
Author
Bryan Jurish moocow@cpan.org
Version
2.0.14

Introduction

libmoot is a C++ library for Part-of-Speech (PoS) tagging. In addition to traditional bigram tagging routines, libmoot supports user-specified a priori sets of possible analyses for each input token ("lexical classes"), an approach which has been shown to reduce errors by up to 32% with respect to traditional Hidden Markov Model (HMM) methods.

Preprocessing

libmoot includes a rudimentary preprocessor for raw text, which tokenizes an input stream and removes most SGML markup.
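To make the idea concrete, here is a minimal sketch of markup-stripping tokenization. It is not the mootPPLexer implementation: the function name sketch_tokenize and its rules (anything between '<' and '>' is dropped, whitespace separates tokens, each punctuation character becomes its own token) are assumptions chosen for illustration only.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper, not part of libmoot: split raw text into tokens,
// discarding everything between '<' and '>' as markup.
std::vector<std::string> sketch_tokenize(const std::string &raw) {
  std::vector<std::string> tokens;
  std::string current;
  bool in_markup = false;

  // Emit the pending word, if any.
  auto flush = [&]() {
    if (!current.empty()) {
      tokens.push_back(current);
      current.clear();
    }
  };

  for (char ch : raw) {
    unsigned char c = static_cast<unsigned char>(ch);
    if (in_markup) {
      if (ch == '>') in_markup = false;      // end of a tag
    } else if (ch == '<') {
      flush();                               // a tag ends any pending word
      in_markup = true;
    } else if (std::isspace(c)) {
      flush();                               // whitespace separates tokens
    } else if (std::ispunct(c)) {
      flush();                               // punctuation is its own token
      tokens.push_back(std::string(1, ch));
    } else {
      current.push_back(ch);                 // ordinary word character
    }
  }
  flush();
  return tokens;
}
```

Under these assumed rules, the input `<p>Hello, <b>world</b>!</p>` yields the four tokens `Hello`, `,`, `world` and `!`. A real preprocessor would also have to handle entities, abbreviations and sentence boundaries, which this sketch ignores.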

See also
mootPPLexer
mootpp(1)

HMM Tagging

The mootHMM class implements traditional HMM tagging and disambiguation, optionally extended by lexical-class probabilities. The extension can be helpful if you have prior information about which tags your input tokens might carry.
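The following sketch illustrates the general technique rather than the mootHMM code: a bigram Viterbi search in log space in which each input position may carry an a priori set of candidate tags. All names (SketchHMM, SketchObservation, sketch_viterbi) are hypothetical. As a simplifying assumption, the candidate set is used only as a hard filter on the search; it does not model lexical-class probabilities.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical model layout, not the mootHMM data structures.
struct SketchHMM {
  std::vector<double> log_init;                // log P(tag) at position 0
  std::vector<std::vector<double>> log_trans;  // log P(tag | previous tag)
};

// One input position: emission scores plus an optional candidate set.
struct SketchObservation {
  std::vector<double> log_emit;  // log P(word | tag), one entry per tag
  std::vector<int> allowed;      // a priori candidate tags; empty = all tags
};

// Returns the best tag index per position, or an empty vector if no
// tag sequence is compatible with the candidate sets.
std::vector<int> sketch_viterbi(const SketchHMM &hmm,
                                const std::vector<SketchObservation> &sent) {
  const double NEG_INF = -std::numeric_limits<double>::infinity();
  const std::size_t T = hmm.log_init.size();
  const std::size_t n = sent.size();
  if (n == 0 || T == 0) return {};

  auto is_allowed = [&](std::size_t i, std::size_t t) -> bool {
    const std::vector<int> &a = sent[i].allowed;
    if (a.empty()) return true;
    for (int x : a)
      if (x >= 0 && static_cast<std::size_t>(x) == t) return true;
    return false;
  };

  std::vector<std::vector<double>> score(n, std::vector<double>(T, NEG_INF));
  std::vector<std::vector<int>> back(n, std::vector<int>(T, -1));

  for (std::size_t t = 0; t < T; ++t)
    if (is_allowed(0, t)) score[0][t] = hmm.log_init[t] + sent[0].log_emit[t];

  for (std::size_t i = 1; i < n; ++i) {
    for (std::size_t t = 0; t < T; ++t) {
      if (!is_allowed(i, t)) continue;  // tag excluded by the candidate set
      double best = NEG_INF;
      int arg = -1;
      for (std::size_t p = 0; p < T; ++p) {
        double cand = score[i - 1][p] + hmm.log_trans[p][t];
        if (cand > best) { best = cand; arg = static_cast<int>(p); }
      }
      if (arg >= 0) {
        score[i][t] = best + sent[i].log_emit[t];
        back[i][t] = arg;
      }
    }
  }

  // Pick the best final tag, then follow the back-pointers.
  int last = -1;
  double best_final = NEG_INF;
  for (std::size_t t = 0; t < T; ++t)
    if (score[n - 1][t] > best_final) {
      best_final = score[n - 1][t];
      last = static_cast<int>(t);
    }
  if (last < 0) return {};

  std::vector<int> tags(n);
  int cur = last;
  for (std::size_t i = n; i-- > 0;) {
    tags[i] = cur;
    if (i > 0) cur = back[i][cur];
  }
  return tags;
}
```

Restricting the candidates shrinks the search space at each position and rules out tags that the prior analysis considers impossible, which is the intuition behind supplying lexical classes to a tagger.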

See also
mootHMM
mootrain(1)
moot(1)

Tagger Evaluation

The mootEval class provides an API for (cross-)evaluation of parallel tagged files, optionally extended by prior analyses.
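As a rough illustration of what evaluating parallel tagged files involves, the sketch below compares two tag sequences position by position and reports their agreement. It does not use the mootEval API; SketchEvalResult and sketch_evaluate are hypothetical names, and prior analyses are left out for brevity.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical result record, not a libmoot type.
struct SketchEvalResult {
  std::size_t total = 0;  // number of token positions compared
  std::size_t agree = 0;  // positions where both files carry the same tag

  double accuracy() const {
    return total == 0 ? 0.0 : static_cast<double>(agree) / total;
  }
};

// Compare two parallel tag sequences: position i of `gold` and position i
// of `test` must refer to the same input token.
SketchEvalResult sketch_evaluate(const std::vector<std::string> &gold,
                                 const std::vector<std::string> &test) {
  if (gold.size() != test.size())
    throw std::invalid_argument("tag sequences are not parallel");

  SketchEvalResult res;
  for (std::size_t i = 0; i < gold.size(); ++i) {
    ++res.total;
    if (gold[i] == test[i]) ++res.agree;
  }
  return res;
}
```

For cross-evaluation, the same comparison would be repeated over each held-out portion of the data and the counts summed before computing the accuracy.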

See also
mootEval
mooteval(1)

Extendible I/O

The high-level mootTokenIO layer comprises the TokenReader and TokenWriter classes, which specify an abstract API for user-defined I/O protocols using the C++ virtual method convention. Built-in specializations include classes for native text-format and XML I/O.

The low-level mootio abstraction layer provides wrappers for several common stream flavors, including C streams (FILE*s), C++ streams, C memory buffers (char*s), and zlib-compressed file streams (gzFile).

See also
TokenReader
TokenReaderNative
TokenReaderExpat
TokenWriter
TokenWriterNative
TokenWriterExpat

Acknowledgements

Development of this package was supported by the project Kollokationen im Wörterbuch ("collocations in the dictionary") in association with the project Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS) ("digital dictionary of the German language") at the Berlin-Brandenburgische Akademie der Wissenschaften with funding from the Alexander von Humboldt Stiftung and from the Zukunftsinvestitionsprogramm of the German federal government.

I am grateful to Christiane Fellbaum, Alexander Geyken, Thomas Hanneforth, Gerald Neumann, Edmund Pohl, Alexey Sokirko, and others for offering useful insights in the course of development of this package.

Thomas Hanneforth wrote and maintains the libFSM C++ library for finite-state device operations used in the development of the class-based HMM tagger / disambiguator.

Alexander Geyken and Thomas Hanneforth developed the rule-based morphological analysis system for German which was used in the development and testing of the class-based HMM tagger / disambiguator.

More Information

See also
mootMorph: libmootm extension library
moottut(1): moot user tutorial
mootutils(1): moot command-line utilities summary
mootfiles(5): moot file formats