DTA::CAB

Documentation

If you are looking for documentation on the DTA::CAB web-service, start here.

General

README.html
DTA::CAB Workshop Materials
DTA::CAB(3pm): top-level module, basic introduction
DTA::CAB::Format(3pm): top-level data format class
DTA::CAB::index: auto-generated documentation index
HTML documentation directory

Command-line utilities for local analysis & conversion

HTTP Server/Client Stuff

DTA::CAB::WebServiceHowto(3pm): user documentation for the DTA::CAB web-service
DTA::CAB::HttpProtocol(3pm): CAB HTTP protocol conventions
dta-cab-http-client.perl(1)
dta-cab-http-server.perl(1)

XML-RPC Server/Client Stuff (deprecated)

dta-cab-xmlrpc-client.perl(1)
dta-cab-xmlrpc-server.perl(1)
DTA::CAB::XmlRpcProtocol(3pm): XML-RPC communication protocol conventions

Sources

The DTA::CAB sources are available on CPAN.

Batteries Not Included: The source code distribution alone is not sufficient to set up and run a complete CAB analysis pipeline locally. You will also need various language models and additional resources which are not part of CAB itself (which aspires to be language-agnostic) and are therefore not included in the source code distribution. See Jurish (2012) and the source code documentation for more details.

Publications

I would appreciate CAB users citing its use in any related publications. As a general CAB-related reference, please cite:

Jurish, B. Finite-state Canonicalization Techniques for Historical German. PhD thesis, Universität Potsdam, 2012 (defended 2011). URN urn:nbn:de:kobv:517-opus-55789. [epub, PDF, BibTeX]

Other CAB-related publications include:

Jurish, B. "Finding canonical forms for historical German text." In A. Storrer, A. Geyken, A. Siebert and K.-M. Würzner (editors), Text Resources and Lexical Knowledge: Selected Papers from the 9th Conference on Natural Language Processing (KONVENS 2008), pages 27-37. Berlin, de Gruyter, September, 2008. ISBN 978-3-11-020735-4. (pdf:draft, bib) Also appears as Ch. 1 of Jurish (2012)
Jurish, B. "Efficient online k-best lookup in weighted finite-state cascades." In T. Hanneforth and G. Fanselow (editors), Language and Logos: Studies in Theoretical and Computational Linguistics, volume 72 of Studia grammatica. Akademie Verlag, Berlin, 2010. ISBN 978-3-05-004931-1. (pdf:draft, bib) Also appears as Ch. 2 of Jurish (2012)
Jurish, B. "Comparing canonicalizations of historical German text." In Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology (SIGMORPHON), pages 72-77, Uppsala, Sweden, 15 July 2010. (pdf, bib) Also appears as Ch. 3 of Jurish (2012)
Jurish, B. "More than words: using token context to improve canonicalization of historical German." Journal for Language Technology and Computational Linguistics, 25(1):23-40, 2010. (pdf, bib) Also appears as Ch. 4 of Jurish (2012)
Jurish, B. "Canonicalizing the Deutsches Textarchiv." In I. Hafemann (editor), Proceedings of Perspektiven einer corpusbasierten historischen Linguistik und Philologie (Berlin, Germany, 12th-13th December 2011), volume 4 of Thesaurus Linguae Aegyptiae, Berlin-Brandenburgische Akademie der Wissenschaften, 2013. [PDF, BibTeX]
Jurish, B., M. Drotschmann, & H. Ast. "Constructing a canonicalized corpus of historical German by text alignment." In P. Bennett, M. Durrell, S. Scheible, and R. J. Whitt (editors), New Methods in Historical Corpora, volume 3 of Corpus Linguistics and Interdisciplinary Perspectives on Language (CLIP), pages 221-234. Narr, Tübingen, 2013. (pdf:draft, bib)
Jurish, B., C. Thomas, & F. Wiegand. "Querying the Deutsches Textarchiv." In U. Kruschwitz, F. Hopfgartner, & C. Gurrin (editors), Proceedings of the Workshop MindTheGap 2014: Beyond Single-Shot Text Queries: Bridging the Gap(s) between Research Communities, Berlin, Germany, 4th March, 2014, pages 25-30, 2014. [PDF, BibTeX]
Jurish, B. & H. Ast. "Using an alignment-based lexicon for canonicalization of historical text." In J. Gippert & R. Gehrke (editors), Historical Corpora: Challenges and Perspectives, volume 5 of Corpus Linguistics and Interdisciplinary Perspectives on Language (CLIP), pages 197-208. Narr, Tübingen, 2015. (pdf:draft, bib)

Related Packages

GFSM: finite-state library
GFSM::XL: finite-state cascade lookup library
moot: HMM utility suite
Taxi::Mysql: flexible document indexing system with some DTA::CAB-like features
Lingua::LTS: LTS ruleset compiler/interpreter with standalone transducer lookup
unicruft: transliteration C library