libmoot is a C++ library for Part-of-Speech (PoS) tagging. In addition to traditional bigram tagging routines, libmoot allows the use of user-specified a priori sets of possible analyses for each input token ("lexical classes"), which has been shown to reduce errors by up to 32% relative to traditional Hidden-Markov-Model (HMM) methods.
libmoot includes a rudimentary preprocessor for raw text, which tokenizes an input stream and eliminates most SGML markup.
The mootHMM class implements traditional HMM tagging and disambiguation, optionally extended by lexical-class probabilities, which can help when you have prior information about which tags each input token may carry.
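To illustrate how a lexical class can constrain HMM disambiguation, here is a minimal sketch of Viterbi decoding over a toy bigram HMM. It is not the mootHMM API: the names ToyHMM, logProb, and viterbi, the probability floor of 1e-9, and the use of lexical classes as a hard filter on candidate tags (rather than as lexical-class probabilities) are all assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy bigram HMM. All names are hypothetical and unrelated to the mootHMM API.
struct ToyHMM {
    std::vector<std::string> tags;                                     // tag inventory
    std::map<std::string, double> initial;                             // P(tag | sentence start)
    std::map<std::pair<std::string, std::string>, double> transition;  // key (previous, tag)
    std::map<std::pair<std::string, std::string>, double> emission;    // key (tag, word)
};

// Log-probability lookup; unseen events get a small floor (an assumption)
// instead of log(0).
template <typename Map, typename Key>
double logProb(const Map& m, const Key& k) {
    auto it = m.find(k);
    return std::log(it != m.end() && it->second > 0.0 ? it->second : 1e-9);
}

// Viterbi decoding. lexClasses[i] lists the tags permitted for words[i].
// A class that is empty, or that names no tag from the inventory, is
// treated as "no prior information": every tag is then permitted.
std::vector<std::string> viterbi(const ToyHMM& hmm,
                                 const std::vector<std::string>& words,
                                 const std::vector<std::set<std::string>>& lexClasses) {
    assert(words.size() == lexClasses.size());
    const std::size_t n = words.size(), T = hmm.tags.size();
    if (n == 0 || T == 0) return {};
    const double NEG_INF = -std::numeric_limits<double>::infinity();

    // ok[i][t] is true if tag t is a permitted analysis of word i.
    std::vector<std::vector<bool>> ok(n, std::vector<bool>(T, false));
    for (std::size_t i = 0; i < n; ++i) {
        bool any = false;
        for (std::size_t t = 0; t < T; ++t)
            if (lexClasses[i].count(hmm.tags[t]) > 0) { ok[i][t] = true; any = true; }
        if (!any) ok[i].assign(T, true);
    }

    std::vector<std::vector<double>> score(n, std::vector<double>(T, NEG_INF));
    std::vector<std::vector<std::size_t>> back(n, std::vector<std::size_t>(T, 0));

    for (std::size_t t = 0; t < T; ++t) {
        if (!ok[0][t]) continue;
        score[0][t] = logProb(hmm.initial, hmm.tags[t])
                    + logProb(hmm.emission, std::make_pair(hmm.tags[t], words[0]));
    }
    for (std::size_t i = 1; i < n; ++i) {
        for (std::size_t t = 0; t < T; ++t) {
            if (!ok[i][t]) continue;
            const double emit = logProb(hmm.emission, std::make_pair(hmm.tags[t], words[i]));
            for (std::size_t p = 0; p < T; ++p) {
                if (score[i - 1][p] == NEG_INF) continue;  // predecessor was filtered out
                const double s = score[i - 1][p] + emit
                    + logProb(hmm.transition, std::make_pair(hmm.tags[p], hmm.tags[t]));
                if (s > score[i][t]) { score[i][t] = s; back[i][t] = p; }
            }
        }
    }

    // Pick the best final tag, then follow back-pointers to recover the path.
    std::size_t best = 0;
    for (std::size_t t = 1; t < T; ++t)
        if (score[n - 1][t] > score[n - 1][best]) best = t;
    std::vector<std::string> out(n);
    for (std::size_t i = n; i-- > 0;) {
        out[i] = hmm.tags[best];
        best = back[i][best];
    }
    return out;
}
```

In this sketch, passing an empty set for a token reproduces plain bigram decoding for that position, while a one-element set forces the corresponding tag regardless of what the model alone would prefer.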
The mootEval class provides an API for (cross-)evaluation of parallel tagged files, optionally extended by prior analyses.
The high-level mootTokenIO layer comprises the TokenReader and TokenWriter classes, which provide an abstract API for defining user-defined I/O protocols using the C++ virtual method convention. Built-in specializations of these classes include classes for native text-format and XML I/O.
The low-level mootio abstraction layer provides wrappers for several common stream flavors, including C streams (FILE*s), C++ streams, C memory buffers (char*s), as well as zlib compressed file streams (gzFile).
Development of this package was supported by the project Kollokationen im Wörterbuch ("collocations in the dictionary") in association with the project Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS) ("digital dictionary of the German language of the 20th century") at the Berlin-Brandenburgische Akademie der Wissenschaften, with funding from the Alexander von Humboldt Stiftung and from the Zukunftsinvestitionsprogramm of the German federal government.
I am grateful to Christiane Fellbaum, Alexander Geyken, Thomas Hanneforth, Gerald Neumann, Edmund Pohl, Alexey Sokirko, and others for offering useful insights in the course of development of this package.
Thomas Hanneforth wrote and maintains the libFSM C++ library for finite-state device operations used in the development of the class-based HMM tagger / disambiguator.
Alexander Geyken and Thomas Hanneforth developed the rule-based morphological analysis system for German which was used in the development and testing of the class-based HMM tagger / disambiguator.