DTA::CAB / Workshop Materials
Use Cases
- CLARIN-D online modular NLP-tool orchestrator (requires login)
Example chain:
- Input: EJn zamer Elephant gillt ohngefähr zweyhundert Thaler.
- Input type: plain text
- Input language: German (de)
- Berl: Plaintext converter
- Berl: Tokenizer and Sentence Splitter
- Berl: CAB historical text analysis
Caveat: the online viewer (TüNDRA) does not support the TCF "orthography" layer,
which CAB uses to store "canonical" modern wordforms.
Search engine used by BBAW Zentrum Sprache (incl. DTA, DWDS, ZDL)
Example queries:
- User-specified selection of corpus-specific graphematic variants
- Corpus frequency database over 5-tuples (u:text, w:xlit, v:canon, p:pos, l:lemma)
- Example: Lemma "Teil"
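To make the 5-tuple idea concrete, here is a minimal sketch of counting tuples and then listing the surface forms recorded for one lemma. The token list is invented toy data, and the helper names `tuple_frequencies` and `surface_forms` are hypothetical; this is not how the actual database is implemented or queried.

```python
from collections import Counter

# Toy 5-tuples (u:text, w:xlit, v:canon, p:pos, l:lemma).
# Invented for illustration; not real corpus counts.
TOKENS = [
    ("Theil", "Theil", "Teil", "NN", "Teil"),
    ("Theil", "Theil", "Teil", "NN", "Teil"),
    ("Theile", "Theile", "Teile", "NN", "Teil"),
    ("Teil", "Teil", "Teil", "NN", "Teil"),
    ("Elephant", "Elephant", "Elefant", "NN", "Elefant"),
]


def tuple_frequencies(tokens):
    """Count how often each complete 5-tuple occurs."""
    return Counter(tokens)


def surface_forms(freqs, lemma):
    """Sum the tuple frequencies per surface form (u) for one lemma (l)."""
    forms = Counter()
    for (u, w, v, p, l), n in freqs.items():
        if l == lemma:
            forms[u] += n
    return forms


freqs = tuple_frequencies(TOKENS)
teil_forms = surface_forms(freqs, "Teil")
for form, n in teil_forms.most_common():
    print(form, n)
```

Grouping on the lemma column is what lets a single query for "Teil" surface historical spellings such as "Theil".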
Approximate semantic similarity via k-nearest neighbors in a high-dimensional space (term x document matrix)
- data sparsity reduction via canonicalization + lemmatization ("terms" = modern lemmata)
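The nearest-neighbor idea can be sketched on a tiny term x document matrix. Cosine similarity is an assumption here (the text above does not name the similarity measure), the counts are invented, and the function names are hypothetical. A production system would likely add weighting or dimensionality reduction, which this sketch omits.

```python
import math

# Toy term x document count matrix: one row per term (modern lemma),
# one column per document. All numbers are invented.
MATRIX = {
    "Elefant": [3, 0, 1, 0],
    "Löwe": [2, 0, 1, 0],
    "Taler": [0, 4, 0, 1],
    "Gulden": [0, 3, 0, 2],
}


def cosine(a, b):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(term, matrix, k=1):
    """Return the k terms whose document vectors are most similar to `term`."""
    target = matrix[term]
    scored = [
        (cosine(target, vec), other)
        for other, vec in matrix.items()
        if other != term
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


print(nearest("Elefant", MATRIX))
print(nearest("Taler", MATRIX))
```

Because the rows are modern lemmata, spelling variants such as "Elephant" and "Elefant" contribute to the same vector instead of splitting their counts.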
- track "significant" co-occurrence preferences over time (e.g. using modern lemmata)
Example (gender bias):
"Mann" vs. "Frau" (DTA)
Normalize a raw text file
- Download this example: elephant.raw
- Use the CAB web-service file upload interface
- Save the output data to your computer using the TAB-separated CSV format
(e.g. as elephant.tsv)
Extra credit: import the file you just saved into your favorite spreadsheet program (e.g. LibreOffice Calc, Google Sheets, etc.)
Even more extra credit: export all and only the "canonical word-forms" (3rd column) from the output
Super bonus extra credit: skip the spreadsheet GUI and use the command line (e.g. awk, sed, or perl)
to extract the canonical word-forms.
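For those who prefer a script to a one-liner, the column extraction can be sketched in Python as below. It assumes the saved file is plain tab-separated text with no header row and with the canonical form in the 3rd column, as described above. The sample rows are hand-written stand-ins, not actual CAB output, and `canonical_forms` is a hypothetical helper name.

```python
import csv
import io


def canonical_forms(lines, column=2):
    """Yield the value in `column` (0-based, so 2 is the 3rd column) of each row.

    Blank lines and rows with too few columns are skipped.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) > column:
            yield row[column]


# Hand-written sample in the assumed layout (text, xlit, canon).
SAMPLE = (
    "EJn\tEJn\tEin\n"
    "zamer\tzamer\tzahmer\n"
    "\n"
    "Elephant\tElephant\tElefant\n"
)

forms = list(canonical_forms(io.StringIO(SAMPLE)))
print(" ".join(forms))
```

To run it on the saved file, pass an open file handle instead of the sample, for example `open("elephant.tsv", encoding="utf-8", newline="")`. If the download turns out to include a header row, drop the first yielded value.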
Normalize a TCF file using WebLicht
- Download this example: elephant.raw.tcf
- Analyze the file using WebLicht (requires CLARIN credentials!)
- Download the output TCF data to your computer (e.g. as elephant.cab.tcf)
Extra credit:
apply the tcf-orthswap.xsl XSL transformation as described
to swap the "tokens" and "orthography" layers, then re-load the modified file into WebLicht
and do some more processing.
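As a quick way to inspect what CAB added without another WebLicht round trip, the sketch below pairs each token with its canonical form from the "orthography" layer. It assumes that tokens appear as `token` elements with an `ID` attribute and that corrections appear as `correction` elements with a whitespace-separated `tokenIDs` attribute. Those names reflect one reading of TCF and should be checked against the downloaded file. The embedded sample is a hand-written fragment, not real service output, and elements are matched by local name so the code does not depend on exact namespace URIs.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def canonical_tokens(tcf_xml):
    """Return (surface, canonical) pairs; uncorrected tokens map to themselves."""
    root = ET.fromstring(tcf_xml)
    tokens = []       # (token ID, surface form) in document order
    corrections = {}  # token ID -> canonical form
    for el in root.iter():
        name = local(el.tag)
        if name == "token":
            tokens.append((el.get("ID"), el.text or ""))
        elif name == "correction":
            for tid in (el.get("tokenIDs") or "").split():
                corrections[tid] = el.text or ""
    return [(surface, corrections.get(tid, surface)) for tid, surface in tokens]


# Hand-written fragment in the assumed layout.
SAMPLE = """<D-Spin xmlns="http://www.dspin.de/data">
<TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
<tokens>
<token ID="w1">EJn</token><token ID="w2">Elephant</token><token ID="w3">gillt</token>
</tokens>
<orthography>
<correction operation="replace" tokenIDs="w1">Ein</correction>
<correction operation="replace" tokenIDs="w2">Elefant</correction>
</orthography>
</TextCorpus>
</D-Spin>"""

pairs = canonical_tokens(SAMPLE)
for surface, canon in pairs:
    print(surface, canon, sep="\t")
```

For the real file, read it as bytes and pass the result in, for example `canonical_tokens(open("elephant.cab.tcf", "rb").read())`.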
Term Expansion
Find all surface forms for the lemma "Elefant"
- use CAB's expand.eqlemma analysis chain
- generate the output as a flat list of target forms using the XList format
- ... or you can cheat by using this link
Extra credit:
convert the returned list to a disjunction query (logical "OR") for your favorite search engine.
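One possible shape for that conversion is sketched below. It assumes the returned list has one form per line and that the target search engine accepts double-quoted terms joined by `OR` inside parentheses. Both are assumptions, so adjust the operator and quoting to your engine's query syntax. `to_disjunction` is a hypothetical helper name.

```python
def to_disjunction(forms, operator=" OR "):
    """Join unique, non-empty forms into one parenthesized disjunction."""
    unique = []
    for form in forms:
        form = form.strip()
        if form and form not in unique:
            unique.append(form)
    if not unique:
        return ""
    # Quote each form and escape any embedded double quotes.
    quoted = ['"' + f.replace('"', '\\"') + '"' for f in unique]
    return "(" + operator.join(quoted) + ")"


# Stand-in for the lines of a saved list; not actual service output.
lines = ["Elefant\n", "Elephant\n", "\n", "Elefant\n"]
query = to_disjunction(lines)
print(query)
```

Passing a different `operator`, such as `" | "`, covers engines that spell disjunction another way.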
Super bonus extra credit:
write a plugin for your favorite search engine to transparently convert all naive bareword user
queries to implicit disjunctions over the variant forms returned by the CAB web-service,
analogous to DDC's Expand Cab
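The core of such a plugin is a query rewriter, sketched here under simplifying assumptions: queries are plain barewords with no quoted phrases, and `AND`, `OR`, and `NOT` are the only operators to leave untouched. The `expander` argument is a hypothetical callable that stands in for a call to the CAB web-service. The lookup table below is invented so that the example runs offline.

```python
import re

OPERATORS = {"AND", "OR", "NOT"}


def expand_query(query, expander):
    """Rewrite each bareword as a disjunction over the forms from `expander`."""

    def rewrite(match):
        word = match.group(0)
        if word in OPERATORS:
            return word
        variants = [word]
        for variant in expander(word):
            if variant not in variants:
                variants.append(variant)
        if len(variants) == 1:
            return word
        return "(" + " OR ".join(variants) + ")"

    return re.sub(r"\w+", rewrite, query)


# Invented lookup table standing in for the web-service response.
FAKE_VARIANTS = {"Elefant": ["Elefant", "Elephant", "Elephanten"]}


def fake_expander(word):
    return FAKE_VARIANTS.get(word, [])


rewritten = expand_query("Elefant AND Thaler", fake_expander)
print(rewritten)
```

Keeping the expander injectable means the rewriter can be tested without network access and the real service call can be swapped in later.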