About moot/WASTE
WASTE ("Word and Sentence Tokenization Estimator") is a framework for detecting word and sentence boundaries in raw text using a Hidden Markov Model to estimate boundary placement in a stream of candidate word-like segments returned by a low-level rule-based scanner stage. Pre-built WASTE models exist for a number of languages, and additional models can be defined for various languages, genres, orthographic conventions, and/or target boundary-placement conventions with appropriate training material. WASTE is implemented as an extension to the moot ("moot Object-Oriented Tagger") C++ library for Hidden Markov Model part-of-speech tagging.
Links
- moot part-of-speech tagging HMM utility suite
- Downloads of selected pre-built WASTE models
- Online demo for live tokenization or file upload
- "I/O Format Flags" for use with moot/WASTE
- Bryan Jurish and Kay-Michael Würzner. "Word and Sentence Tokenization with Hidden Markov Models." Journal for Language Technology and Computational Linguistics, 28(2):61-83, 2013 (alternate link)
- The WASTE acronym and logo were inspired by Thomas Pynchon's The Crying of Lot 49.