Mid-level scanner stage performs (optional) hyphenation normalization and text classification.

Collaboration diagram for moot::wasteLexer:

[legend]

Public Types
Embedded Types
typedef std::vector< std::vector< std::vector< std::vector< std::vector< std::vector< std::string > > > > > >	wasteTagset

typedef std::list< wasteLexerToken >	wasteLexerBuffer

Public Member Functions
Constructors etc.
	wasteLexer ()

virtual	~wasteLexer ()

Lexing functions
len	length_attr (size_t length) const

void	set_token (mootToken &token, const wasteLexerToken &lex_token)

void	buffer_token (const mootToken &stok)

void	reset (void)

low-level utilities
void	lexbuf_pop_front (void)

Public Attributes
Low-level data
wasteTagset	wl_tagset

int	wl_state

wasteLexerBuffer	wl_lexbuf

wasteLexerToken *	wl_current_tok

wasteLexerToken *	wl_head_tok

bool	wl_dehyph_mode

Lexica
wasteLexicon	wl_stopwords

wasteLexicon	wl_abbrevs

wasteLexicon	wl_conjunctions

token classification feature indices
enum	cls { stop = 0, rom = 1, alpha = 2, num = 3, dot = 4, comma = 5, colon = 6, scolon = 7, eos = 8, lbr = 9, rbr = 10, hyphen = 11, plus = 12, slash = 13, quote = 14, apos = 15, sc = 16, other = 17, n_cls = 18 }

enum	cas { non = 0, lo = 1, up = 2, cap = 3, n_cas = 4 }

enum	binary { uk = 0, kn = 1, n_binary = 2 }

enum	len { le_null = 0, le_one = 1, le_three = 2, le_five = 3, longer = 4, n_len = 5 }

static const int	n_hidden = 7

Member Typedef Documentation

◆ wasteTagset

typedef std::vector<std::vector<std::vector<std::vector<std::vector<std::vector<std::string> > > > > > moot::wasteLexer::wasteTagset

Multi-dimensional vector for constant access on feature bundles

◆ wasteLexerBuffer

typedef std::list<wasteLexerToken> moot::wasteLexer::wasteLexerBuffer

List of wasteLexerToken for buffering while dehyphenating

Member Enumeration Documentation

◆ cls

enum moot::wasteLexer::cls

typographical classes

Enumerator
stop
rom
alpha
num
dot
comma
colon
scolon
eos
lbr
rbr
hyphen
plus
slash
quote
apos
sc
other
n_cls

◆ cas

enum moot::wasteLexer::cas

letter-case

Enumerator
non
lo
up
cap
n_cas

◆ binary

enum moot::wasteLexer::binary

binary feature values

Enumerator
uk
kn
n_binary

◆ len

enum moot::wasteLexer::len

text length

Enumerator
le_null
le_one
le_three
le_five
longer
n_len

Constructor & Destructor Documentation

◆ wasteLexer()

moot::wasteLexer::wasteLexer ( )

Default constructor

◆ ~wasteLexer()

virtual moot::wasteLexer::~wasteLexer ( )

virtual

Destructor

Member Function Documentation

◆ length_attr()

len moot::wasteLexer::length_attr ( size_t length ) const

inline

Length to length attribute conversion

References mootTokenLexer::reset().

◆ set_token()

void moot::wasteLexer::set_token	(	mootToken &	token,
		const wasteLexerToken &	lex_token
	)

Set token features (token.tok_analyses) w.r.t. model features from source lex_token

◆ buffer_token()

void moot::wasteLexer::buffer_token ( const mootToken & stok )

copies stok to internal buffer. If wl_dehyph_mode is true, seeks and removes hyphenations.

◆ reset()

void moot::wasteLexer::reset ( void )

◆ lexbuf_pop_front()

void moot::wasteLexer::lexbuf_pop_front ( void )

inline

wrapper for wl_lexbuf.pop_front() which may clear wl_current_tok, wl_head_tok

Note: calling this method may break auto-magic de-hyphenation; workaround for crashes (segfault, double-free, heap corruption) on kira (g++5, ubuntu-16.04), 2016-11-18 (moocow); see wasteLexer::get_token() at wasteLexer.cc:477

Member Data Documentation

◆ n_hidden

const int moot::wasteLexer::n_hidden = 7

static

hidden features: s[01],S[01],w[01] (except s1,*,w0)

◆ wl_tagset

wasteTagset moot::wasteLexer::wl_tagset

Token feature bundles

◆ wl_state

int moot::wasteLexer::wl_state

Current state of the lexer

◆ wl_lexbuf

wasteLexerBuffer moot::wasteLexer::wl_lexbuf

Buffer for dehyphenation: list of wasteLexerToken

◆ wl_current_tok

wasteLexerToken* moot::wasteLexer::wl_current_tok

current token under construction (NULL for none), pointer into wl_lexbuf

◆ wl_head_tok

wasteLexerToken* moot::wasteLexer::wl_head_tok

head of hyphenation sequence (NULL for none), pointer into wl_lexbuf

◆ wl_dehyph_mode

bool moot::wasteLexer::wl_dehyph_mode

Dehyphenation switch

Referenced by moot::wasteLexerReader::dehyph_mode().

◆ wl_stopwords

wasteLexicon moot::wasteLexer::wl_stopwords

List of stopwords

◆ wl_abbrevs

wasteLexicon moot::wasteLexer::wl_abbrevs

List of abbreviations

◆ wl_conjunctions

wasteLexicon moot::wasteLexer::wl_conjunctions

List of conjunctions (for dehyphenating)

The documentation for this class was generated from the following file:

wasteLexer.h

Public Types

Public Member Functions

Public Attributes

token classification feature indices