Documentation
If you are looking for documentation on the DTA::CAB web-service, start here.
General
- README.html
- DTA::CAB Workshop Materials
- DTA::CAB(3pm): top-level module, basic introduction
- DTA::CAB::Format(3pm): top-level data format class
- DTA::CAB::index: auto-generated documentation index
- HTML documentation directory
Command-line utilities for local analysis & conversion
HTTP Server/Client Stuff
- DTA::CAB::WebServiceHowto(3pm): user documentation for the DTA::CAB web-service
- DTA::CAB::HttpProtocol(3pm): CAB HTTP protocol conventions
- dta-cab-http-client.perl(1)
- dta-cab-http-server.perl(1)
XML-RPC Server/Client Stuff (deprecated)
- dta-cab-xmlrpc-client.perl(1)
- dta-cab-xmlrpc-server.perl(1)
- DTA::CAB::XmlRpcProtocol(3pm): XML-RPC communication protocol conventions
Sources
... are available on CPAN.
Batteries Not Included: You should be aware that the source code distribution alone is not sufficient to set up and run a complete CAB analysis pipeline at your local site. To do that, you will also need various language models and additional resources which are not themselves part of CAB (which aspires to be language-agnostic) and are therefore not included in the source code distribution. See Jurish (2012) and the source code documentation for more details.
Publications
I would appreciate CAB users citing its use in any related publications. As a general CAB-related reference, please cite:
- Jurish, B. Finite-state Canonicalization Techniques for Historical German. PhD thesis, Universität Potsdam, 2012 (defended 2011). URN urn:nbn:de:kobv:517-opus-55789. [epub, PDF, BibTeX]
- Jurish, B. "Finding canonical forms for historical German text." In A. Storrer, A. Geyken, A. Siebert and K.-M. Würzner (editors), Text Resources and Lexical Knowledge: Selected Papers from the 9th Conference on Natural Language Processing (KONVENS 2008), pages 27-37. Berlin, de Gruyter, September, 2008. ISBN 978-3-11-020735-4. (pdf:draft, bib) Also appears as Ch. 1 of Jurish (2012)
- Jurish, B. "Efficient online k-best lookup in weighted finite-state cascades." In T. Hanneforth and G. Fanselow (editors), Language and Logos: Studies in Theoretical and Computational Linguistics, volume 72 of Studia grammatica. Akademie Verlag, Berlin, 2010. ISBN 978-3-05-004931-1. (pdf:draft, bib) Also appears as Ch. 2 of Jurish (2012)
- Jurish, B. "Comparing canonicalizations of historical German text." In Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology (SIGMORPHON), pages 72-77, Uppsala, Sweden, 15 July 2010. (pdf, bib) Also appears as Ch. 3 of Jurish (2012)
- Jurish, B. "More than words: using token context to improve canonicalization of historical German." Journal for Language Technology and Computational Linguistics, 25(1):23-40, 2010. (pdf, bib) Also appears as Ch. 4 of Jurish (2012)
- Jurish, B. "Canonicalizing the Deutsches Textarchiv." In I. Hafemann (editor), Proceedings of Perspektiven einer corpusbasierten historischen Linguistik und Philologie (Berlin, Germany, 12th-13th December 2011), volume 4 of Thesaurus Linguae Aegyptiae, Berlin-Brandenburgische Akademie der Wissenschaften, 2013. [PDF, BibTeX]
- Jurish, B., M. Drotschmann, & H. Ast. "Constructing a canonicalized corpus of historical German by text alignment." In P. Bennett, M. Durrell, S. Scheible, and R. J. Whitt (editors), New Methods in Historical Corpora, volume 3 of Corpus Linguistics and Interdisciplinary Perspectives on Language (CLIP), pages 221-234. Narr, Tübingen, 2013. (pdf:draft, bib)
- Jurish, B., C. Thomas, & F. Wiegand. "Querying the Deutsches Textarchiv." In U. Kruschwitz, F. Hopfgartner, & C. Gurrin (editors), Proceedings of the Workshop MindTheGap 2014: Beyond Single-Shot Text Queries: Bridging the Gap(s) between Research Communities, Berlin, Germany, 4th March, 2014, pages 25-30, 2014. [PDF, BibTeX]
- Jurish, B. & H. Ast. "Using an alignment-based lexicon for canonicalization of historical text." In J. Gippert & R. Gehrke (editors), Historical Corpora: Challenges and Perspectives, volume 5 of Corpus Linguistics and Interdisciplinary Perspectives on Language (CLIP), pages 197-208. Narr, Tübingen, 2015. (pdf:draft, bib)
Links
- CAB Web Service (public)
- Twiki/Software/DtaCab (internal)
- SVN dev/DTA-CAB (internal)
Related Packages
- GFSM: finite-state library
- GFSM::XL: finite-state cascade lookup library
- moot: HMM utility suite
- Taxi::Mysql: flexible document indexing system with some DTA::CAB-like features
- Lingua::LTS: LTS ruleset compiler/interpreter with standalone transducer lookup
- unicruft: transliteration C library