DTA::CAB / Workshop Materials


Use Cases

WebLicht

  • CLARIN-D online modular NLP-tool orchestrator (requires login)
  • Example chain:
    • Input: EJn zamer Elephant gillt ohngefähr zweyhundert Thaler.
    • Input type: plain text
    • Input language: German (de)
    • Chain:
      • Berl: Plaintext converter
      • Berl: Tokenizer and Sentence Splitter
      • Berl: CAB historical text analysis
  • Caveat: the online viewer (TüNDRA) does not support the TCF "orthography" layer, which CAB uses to store "canonical" modern wordforms.

Quality Assurance

Corpus Search

Error Database

Named Entity Recognition

Term Expansion

Corpus Vocabulary

  • corpus frequencies database over 5-tuples (u:text, w:xlit, v:canon, p:pos, l:lemma)
  • Example: Lemma "Teil"
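
The 5-tuple layout above can be pictured as a frequency table keyed by tuples. The following is only a minimal sketch of that idea, not the actual database: the class and function names are hypothetical, the tag "NN" is a placeholder, and the counts are invented for illustration.

```python
from collections import Counter
from typing import NamedTuple


class VocabEntry(NamedTuple):
    """One vocabulary type; field names follow the 5-tuple given above."""
    u: str  # text: surface form as it appears in the source
    w: str  # xlit: transliterated form
    v: str  # canon: canonical modern form
    p: str  # pos: part-of-speech tag
    l: str  # lemma


def lemma_profile(freqs: Counter, lemma: str) -> Counter:
    """Sum frequencies per surface form (u) over all entries with the given lemma."""
    profile = Counter()
    for entry, count in freqs.items():
        if entry.l == lemma:
            profile[entry.u] += count
    return profile


# Invented toy counts, for illustration only (not real corpus frequencies).
freqs = Counter({
    VocabEntry("Theil", "Theil", "Teil", "NN", "Teil"): 5,
    VocabEntry("Theile", "Theile", "Teile", "NN", "Teil"): 3,
    VocabEntry("Teil", "Teil", "Teil", "NN", "Teil"): 2,
    VocabEntry("Thaler", "Thaler", "Taler", "NN", "Taler"): 4,
})

# Surface forms attested for the lemma "Teil", most frequent first.
print(lemma_profile(freqs, "Teil").most_common())
```

Keying on the full tuple keeps every attested combination distinct, so the same table can be re-aggregated by any of the five fields.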

Distributional Semantics

(Diachronic) Collocation Profiling

  • track "significant" co-occurrence preferences over time (e.g. using modern lemmata)
  • Example (gender bias): "Mann" vs. "Frau" (DTA)

Exercises

Normalize a raw text file

  • Download this example: elephant.raw
  • Use the CAB web-service file upload interface
  • Save the output data to your computer using the TAB-separated CSV format (e.g. as elephant.tsv)
  • Extra credit: import the file you just saved into your favorite spreadsheet program (e.g. LibreOffice Calc, Google Sheets, etc.)
  • Even more extra credit: export all and only the "canonical word-forms" (3rd column) from the output
  • Super bonus extra credit: skip the spreadsheet GUI and use the command-line (e.g. awk, sed, or perl) to extract canonical word-forms.
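
For those who would rather script the last step in Python than in awk, sed, or perl, here is one possible sketch. It assumes that the saved file has one token per line with TAB-separated fields, that blank lines separate sentences, and that lines starting with "%%" are comments; check these assumptions against your actual output. The sample rows are invented, not real CAB output.

```python
from typing import Iterable, List


def canonical_forms(lines: Iterable[str], column: int = 2) -> List[str]:
    """Return one column (0-based; default: the 3rd) from every token line."""
    forms = []
    for line in lines:
        line = line.rstrip("\r\n")
        # Assumption: blank lines mark sentence breaks and "%%" lines are comments.
        if not line or line.startswith("%%"):
            continue
        fields = line.split("\t")
        if len(fields) > column:  # skip rows that are too short
            forms.append(fields[column])
    return forms


# Invented sample in the assumed layout: text, xlit, canon.
SAMPLE = (
    "%% invented sample rows, not real CAB output\n"
    "EJn\tEJn\tEin\n"
    "zamer\tzamer\tzahmer\n"
    "\n"
    "Elephant\tElephant\tElefant\n"
)

print("\n".join(canonical_forms(SAMPLE.splitlines())))
```

To run it on your own download, pass an open file handle instead of the sample, e.g. the result of open("elephant.tsv", encoding="utf-8").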

Normalize a TCF file using WebLicht

  • Download this example: elephant.raw.tcf
  • Analyze the file using WebLicht (requires CLARIN credentials!)
  • Download the output TCF data to your computer (e.g. as elephant.cab.tcf)
  • Extra credit: apply the tcf-orthswap.xsl XSL transformation as described here to swap the "tokens" and "orthography" layers, then re-load the modified file into WebLicht and do some more processing.
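
To see what swapping the two layers amounts to, here is a rough Python sketch of the idea. It is not the tcf-orthswap.xsl stylesheet and may differ from it in detail. It assumes the usual TCF layout: token elements with an ID attribute inside a tokens layer, and correction elements with operation and tokenIDs attributes inside an orthography layer, all in the TextCorpus namespace given below. The sample document is invented and heavily abridged.

```python
import xml.etree.ElementTree as ET

TC_NS = "http://www.dspin.de/data/textcorpus"  # assumed TCF TextCorpus namespace


def swap_tokens_and_orthography(root: ET.Element) -> int:
    """Exchange each token's text with its 'replace' correction; return the swap count."""
    ns = {"tc": TC_NS}
    tokens = {tok.get("ID"): tok for tok in root.iterfind(".//tc:tokens/tc:token", ns)}
    swapped = 0
    for corr in root.iterfind(".//tc:orthography/tc:correction", ns):
        ids = (corr.get("tokenIDs") or "").split()
        # This sketch only handles the simple case: one token, operation "replace".
        if corr.get("operation") != "replace" or len(ids) != 1 or ids[0] not in tokens:
            continue
        tok = tokens[ids[0]]
        tok.text, corr.text = corr.text, tok.text
        swapped += 1
    return swapped


# Invented, abridged sample; a real TCF file has more layers and metadata.
SAMPLE_TCF = f"""<TextCorpus xmlns="{TC_NS}" lang="de">
  <tokens>
    <token ID="w1">EJn</token>
    <token ID="w2">zamer</token>
    <token ID="w3">Elephant</token>
  </tokens>
  <orthography>
    <correction operation="replace" tokenIDs="w1">Ein</correction>
    <correction operation="replace" tokenIDs="w2">zahmer</correction>
    <correction operation="replace" tokenIDs="w3">Elefant</correction>
  </orthography>
</TextCorpus>"""

root = ET.fromstring(SAMPLE_TCF)
swap_tokens_and_orthography(root)
print([tok.text for tok in root.iter(f"{{{TC_NS}}}token")])
```

After the swap, downstream tools that only read the "tokens" layer see the modern forms, while the original spellings survive in the "orthography" layer.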

Normalize a raw TEI-XML file

Normalize a pre-tokenized TEI-XML file

  • Download this example: elephant.teiws-xml
  • Use the CAB web-service file upload interface
  • Save the output data to your computer using the default TEIws format (e.g. as elephant.teiws-cab.xml)
  • Extra credit: apply the spliced2ling.xsl XSL transformation as described here to produce a TEI-ling conformant document
  • More extra credit: use curl from the command-line to analyze elephant.teiws-xml, as described here.
    • Hint: try using the cab-curl-xpost.sh script (if you have bash installed, as all upstanding folk ought to)
    • Caveat: If you're a script-friendly sort of person and wish to do something like this on a large scale, please play nicely ... or at least warn me first.
  • Super bonus extra credit: use curl from the command-line to analyze elephant.teiws-xml and output TEI-ling directly in a single call.
    • Hint: prior to analysis, the TEIws and TEI-ling formats are indistinguishable
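
If you prefer Python to curl, a request along the following lines may serve as a starting point. This is a sketch under assumptions, not the documented interface: the query parameter names "a" (analysis chain) and "fmt" (format), the content type, the base URL, and the placeholder values in the example are all assumptions to be checked against the CAB web-service documentation. The function only builds the request; nothing is sent.

```python
import urllib.parse
import urllib.request


def build_cab_request(base_url: str, xml_bytes: bytes,
                      analyzer: str, fmt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a document to analyze."""
    # Assumption: options are passed as query parameters named "a" and "fmt".
    query = urllib.parse.urlencode({"a": analyzer, "fmt": fmt})
    return urllib.request.Request(
        url=f"{base_url}?{query}",
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},  # assumed content type
        method="POST",
    )


# Hypothetical endpoint and placeholder option values, for illustration only.
request = build_cab_request(
    "https://cab.example.org/query", b"<TEI/>", "some-chain", "some-format"
)
print(request.get_method(), request.full_url)
```

Sending the request is then a matter of passing it to urllib.request.urlopen. In keeping with the caveat above, pause between calls (e.g. with time.sleep) if you loop over many files.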

Term Expansion

  • Find all surface forms for the lemma "Elefant"
    • use CAB's expand.eqlemma analysis chain
    • generate the output as a flat list of target forms using the XList format
    • ... or you can cheat by using this link
  • Extra credit: convert the returned list to a disjunction query (logical "OR") for your favorite search engine.
  • Super bonus extra credit: write a plugin for your favorite search engine to transparently convert all naive bareword user queries to implicit disjunctions over the variant forms returned by the CAB web-service, analogous to DDC's Expand Cab functionality.
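
The conversion to a disjunction query can be sketched in a few lines of Python. Query syntax differs between search engines, so the quoting and the "OR" keyword used here are assumptions to adapt; the function name is hypothetical, and the example forms are illustrative rather than actual CAB output.

```python
from typing import Iterable


def to_disjunction(forms: Iterable[str], operator: str = "OR") -> str:
    """Join unique, non-empty forms into a parenthesized disjunction of quoted terms."""
    unique = []
    for form in forms:
        form = form.strip()
        if form and form not in unique:  # drop blanks and duplicates, keep order
            unique.append(form)
    quoted = ['"' + form.replace('"', '\\"') + '"' for form in unique]
    return "(" + f" {operator} ".join(quoted) + ")"


# Illustrative variant list, e.g. one form per line of a flat list.
variants = ["Elefant", "Elephant", "Elephanten", "Elefant", ""]
print(to_disjunction(variants))
```

Passing a different operator (for instance "|") adapts the output to engines that do not use the "OR" keyword.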