DTA::CAB / Workshop Materials

Introduction

CLARIN-D online modular NLP-tool orchestrator (requires login)
Example chain:
- Input: EJn zamer Elephant gillt ohngefähr zweyhundert Thaler.
- Input type: plain text
- Input language: German (de)
- Chain:
  - Berl: Plaintext converter
  - Berl: Tokenizer and Sentence Splitter
  - Berl: CAB historical text analysis
Caveat: the online viewer (TüNDRA) does not support the TCF "orthography" layer, which CAB uses to store "canonical" modern wordforms.

Search engine used by BBAW Zentrum Sprache (incl. DTA, DWDS, ZDL)
- input texts are pre-processed by DTA::CAB
- canonical modern word-form is indexed as $CanonicalToken ($v)
- canonical modern lemma is indexed as $Lemma ($l)
Example queries:
- modern Lemma "Hilfe" in Kant (DTA)
- modern Lemma "Hilfe" in Kant (D*)

online database of CAB canonicalization errors (requires login)
... used to populate exception lexicon (daily)
... used to optimize rewrite cascade weights (weekly)
... used to optimize magic constants & cutoff thresholds (manually, ca. 1x / year)

corpus frequencies database over 5-tuples (u:text, w:xlit, v:canon, p:pos, l:lemma)
Example: Lemma "Teil"
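
To illustrate the data structure, here is a minimal sketch of frequency counts keyed on the 5-tuple (u, w, v, p, l) described above. The names Token, tuple_frequencies and lemma_frequency, the part-of-speech tag, and all rows and counts are assumptions invented for illustration; they are not taken from the actual database.

```python
from collections import Counter, namedtuple

# Field names follow the 5-tuple above: u=text, w=xlit, v=canon, p=pos, l=lemma.
Token = namedtuple("Token", ["u", "w", "v", "p", "l"])


def tuple_frequencies(tokens):
    """Count how often each distinct 5-tuple occurs."""
    return Counter(tokens)


def lemma_frequency(freqs, lemma):
    """Sum the counts of all 5-tuples that share the given modern lemma."""
    return sum(count for token, count in freqs.items() if token.l == lemma)


# Invented toy rows (not real corpus data): historical spellings of "Teil".
toy_tokens = (
    [Token("Theil", "Theil", "Teil", "NN", "Teil")] * 3
    + [Token("Teil", "Teil", "Teil", "NN", "Teil")] * 2
    + [Token("Theile", "Theile", "Teile", "NN", "Teil")]
)

freqs = tuple_frequencies(toy_tokens)
print(lemma_frequency(freqs, "Teil"))  # prints 6
```

Keying on the full tuple keeps the surface spelling available, while a query by lemma still aggregates over all historical variants.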

approximate semantic similarity via k-nearest neighbors in a high dimensional space (term x document matrix)
- data sparsity reduction via canonicalization + lemmatization ("terms" = modern lemmata)
Examples:
- term->term: "Vernunft"
- term->page: "Ding"
- term->book: "Produktion"
- book->book: book=kant_rvernunft_1781
- book->term: book=marx_kapital01_1867
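
The term->term case can be sketched as a nearest-neighbor search over rows of a term x document matrix. This is an exact brute-force search with cosine similarity, whereas the service described above is approximate; the choice of cosine, the function names, and the toy counts are all assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {document: count} dicts."""
    dot = sum(u[doc] * v[doc] for doc in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_terms(matrix, query, k=2):
    """Return the k terms whose document vectors are most similar to the query term's."""
    target = matrix[query]
    scored = [
        (cosine(target, vector), term)
        for term, vector in matrix.items()
        if term != query
    ]
    # Highest similarity first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]


# Invented toy matrix: rows are modern lemmata, columns are hypothetical documents.
toy_matrix = {
    "Vernunft": {"doc_a": 3, "doc_b": 1},
    "Verstand": {"doc_a": 2, "doc_b": 1},
    "Elefant": {"doc_c": 4},
}

print(nearest_terms(toy_matrix, "Vernunft", k=1))  # prints ['Verstand']
```

Using modern lemmata as row keys is what makes the sparsity reduction work: all historical spellings of a word contribute to a single row instead of many thin ones.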

track "significant" co-occurrence preferences over time (e.g. using modern lemmata)
Example (gender bias): "Mann" vs. "Frau" (DTA)

Download this example: elephant.raw
Use the CAB web-service file upload interface
Save the output data to your computer using the TAB-separated CSV format (e.g. as elephant.tsv)
Extra credit: import the file you just saved into your favorite spreadsheet program (e.g. LibreOffice Calc, Google Sheets, etc.)
Even more extra credit: export all and only the "canonical word-forms" (3rd column) from the output
Super bonus extra credit: skip the spreadsheet GUI and use the command-line (e.g. awk, sed, or perl) to extract canonical word-forms.
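
For those who prefer a script to a one-liner, the extraction step can be sketched in Python. It assumes only what is stated above: the saved file is TAB-separated and the canonical word-form is the 3rd column. The function name and the handling of blank or short lines (skipped here) are assumptions, not documented CAB behavior.

```python
import sys


def canonical_forms(lines, column=2):
    """Yield the field at zero-based index `column` (2 = 3rd column) of each TAB-separated line."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        # Assumption: blank lines and lines with fewer than 3 fields
        # (e.g. sentence breaks or comments) carry no canonical form.
        if len(fields) > column and fields[column]:
            yield fields[column]


if __name__ == "__main__":
    # Usage: python canon.py elephant.tsv
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as handle:
            for form in canonical_forms(handle):
                print(form)
```

The script prints one canonical word-form per line, so its output can be piped into other command-line tools.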

Download this example: elephant.raw.tcf
Analyze the file using WebLicht (requires CLARIN credentials!)
- If you don't have CLARIN credentials, you can use the CAB file upload interface directly.
Download the output TCF data to your computer (e.g. as elephant.cab.tcf)
Extra credit: apply the tcf-orthswap.xsl XSL transformation as described here to swap the "tokens" and "orthography" layers, then re-load the modified file into WebLicht and do some more processing.

Download this example: elephant.tei-xml
- Alternative: you can also download TEI-XML for any DTA work, e.g. kant_aufklaerung_1784
- Caveat: beware request size limits!
Use the CAB web-service file upload interface
Save the output data to your computer using the TEI-fast format (e.g. as elephant.tei-cab.xml)
Extra credit: apply the spliced2norm.xsl XSL transformation as described here to produce a "canonicalized" variant of the source XML document.
- Alternative: analyze & normalize in a single operation using the DTAQ tool "texte normalisieren"

Download this example: elephant.teiws-xml
Use the CAB web-service file upload interface
Save the output data to your computer using the default TEIws format (e.g. as elephant.teiws-cab.xml)
Extra credit: apply the spliced2ling.xsl XSL transformation as described here to produce a TEI-ling conformant document
More extra credit: use curl from the command-line to analyze elephant.teiws-xml, as described here.
- Hint: try using the cab-curl-xpost.sh script (if you have bash installed, as all upstanding folk ought to)
- Caveat: If you're a script-friendly sort of person and wish to do something like this on a large scale, please play nicely ... or at least warn me first.
Super bonus extra credit: use curl from the command-line to analyze elephant.teiws-xml and output TEI-ling directly in a single call.
- Hint: prior to analysis, the TEIws and TEI-ling formats are indistinguishable

Find all surface forms for the lemma "Elefant"
- use CAB's expand.eqlemma analysis chain
- generate the output as a flat list of target forms using the XList format
- ... or you can cheat by using this link
Extra credit: convert the returned list to a disjunction query (logical "OR") for your favorite search engine.
Super bonus extra credit: write a plugin for your favorite search engine to transparently convert all naive bareword user queries to implicit disjunctions over the variant forms returned by the CAB web-service, analogous to DDC's Expand Cab functionality.
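
The two extra-credit steps above can be sketched in Python. The quoted "OR" syntax is an assumption, since query syntax differs between search engines, and `expand` is a hypothetical stand-in for whatever function fetches the variant list from the CAB web-service; no real CAB call is made here. The input list in the example is a placeholder, not actual CAB output.

```python
def to_disjunction(forms, operator=" OR "):
    """Join variant forms into one quoted OR-query, dropping blanks and duplicates."""
    seen = []
    for form in forms:
        form = form.strip()
        if form and form not in seen:
            seen.append(form)
    return operator.join('"{}"'.format(form) for form in seen)


def expand_query(query, expand):
    """Rewrite each bareword in `query` as a disjunction over expand(word).

    `expand` is a hypothetical callable mapping a word to a list of variant
    forms; words with no variants are kept unchanged.
    """
    parts = []
    for word in query.split():
        variants = expand(word) or [word]
        parts.append("(" + to_disjunction(variants) + ")")
    return " ".join(parts)


# Placeholder variant list, for illustration only.
print(to_disjunction(["Elefant", "Elephant"]))  # prints "Elefant" OR "Elephant"
```

Passing the expander in as a parameter keeps the query rewriting testable offline and lets the web-service call be cached or rate-limited separately.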