Bryan Jurish / About

Summary

Bryan Jurish is a researcher at the Berlin-Brandenburg Academy of Sciences and Humanities. He received a B.A. in Philosophy and Cognitive Science from Northwestern University in 1996, where he was awarded the Daniel Bonbright Scholar award for excellence in the humanities. His subsequent study of Computational Linguistics at the Universität Potsdam led to a Diplom in 2002 and a Ph.D. (summa cum laude) in 2011. His research interests include diachronic computational linguistics, automated spelling correction for historical and non-standard text, theory and applications of weighted finite-state automata, noise-tolerant stochastic models for natural language processing tasks, and formal language models of musical structure.

Selected Projects

  • DiaColloDB ("Diachronic Collocation Database"): A suite of tools for extracting significant collocates from a diachronic text corpus, using either efficient native index structures or a DDC search engine to acquire the underlying frequency data.

  • DTA::CAB ("Cascaded Analysis Broker"): A command-line and client/server suite, written in Perl, for robust and reliable orthographic canonicalization of historical input text.

  • GFSM ("GFSM Finite State Manipulation" Suite): A C library for representation and manipulation of (weighted) finite-state machines, using GLib for low-level data structures. Includes GFSMXL, an extension library for online k-best string lookup operations in weighted finite-state transducer cascades.

  • moot ("moot Tagger"): A C++ library and program suite for highly accurate part-of-speech tagging in the presence of a strong morphological component, using ambiguity classes to improve performance for unknown words. Includes classes and programs for supervised training, tagging, model compilation, evaluation, and dynamic modelling.

  • WASTE ("Word And Sentence Tokenization Estimator"): A framework for detecting word and sentence boundaries in raw text using a Hidden Markov Model to estimate boundary placement in a stream of candidate word-like segments returned by a low-level rule-based scanner stage. Pre-built WASTE models exist for a number of languages, and additional models can be defined for various languages, genres, orthographic conventions, and/or target boundary-placement conventions with appropriate training material. WASTE is currently implemented as an extension to the moot part-of-speech tagging library. (Joint work with Kay-Michael Würzner)

  • DDC v2.x ("DDC Concordancer"): An efficient and scalable corpus indexing and retrieval engine originally written by Alexey Sokirko. The 2.x branch of DDC supports multiple quasi-independent token-level attributes, flexible HTTP-based online query-term expansion, and index fragmentation to take advantage of contemporary multi-threaded server hardware and distributed corpora.