About moot/WASTE
WASTE ("Word and Sentence Tokenization Estimator") is a framework for detecting word and sentence boundaries in raw text using a Hidden Markov Model to estimate boundary placement in a stream of candidate word-like segments returned by a low-level rule-based scanner stage. Pre-built WASTE models exist for a number of languages, and additional models can be defined for various languages, genres, orthographic conventions, and/or target boundary-placement conventions with appropriate training material. WASTE is implemented as an extension to the moot ("moot Object-Oriented Tagger") C++ library for Hidden Markov Model part-of-speech tagging.
Links
- moot part-of-speech tagging HMM utility suite
- Downloads of selected pre-built WASTE models
- Online demo for live tokenization or file upload
- "I/O Format Flags" for use with moot/WASTE
- Bryan Jurish and Kay-Michael Würzner. "Word and Sentence Tokenization with Hidden Markov Models." Journal for Language Technology and Computational Linguistics, 28(2):61-83, 2013 (alternate link)
- The WASTE acronym and logo were inspired by Thomas Pynchon's The Crying of Lot 49.