DTA::CAB / Workshop Materials
Use Cases
- CLARIN-D online modular NLP-tool orchestrator (requires login)
- Example chain:
- Input: EJn zamer Elephant gillt ohngefähr zweyhundert Thaler.
- Input type: plain text
- Input language: German (de)
- Chain:
- Berl: Plaintext converter
- Berl: Tokenizer and Sentence Splitter
- Berl: CAB historical text analysis
- Caveat: the online viewer (TüNDRA) does not support the TCF "orthography" layer,
which CAB uses to store "canonical" modern wordforms.
- Search engine used by BBAW Zentrum Sprache (incl. DTA, DWDS, ZDL)
- Example queries:
- User-specified selection of corpus-specific graphematic variants
- Examples:
- corpus frequencies database over 5-tuples (u:text, w:xlit, v:canon, p:pos, l:lemma)
- Example: Lemma "Teil"
- approximate semantic similarity via k-nearest neighbors in a high-dimensional space (term x document matrix)
- data sparsity reduction via canonicalization + lemmatization ("terms" = modern lemmata)
- Examples:
- track "significant" co-occurrence preferences over time (e.g. using modern lemmata)
- Example (gender bias): "Mann" vs. "Frau" (DTA)
Exercises
Normalize a raw text file
- Download this example: elephant.raw
- Use the CAB web-service file upload interface
- Save the output data to your computer using the TAB-separated CSV format (e.g. as elephant.tsv)
- Extra credit: import the file you just saved into your favorite spreadsheet program (e.g. LibreOffice Calc, Google Sheets, etc.)
- Even more extra credit: export all and only the "canonical word-forms" (3rd column) from the output
- Super bonus extra credit: skip the spreadsheet GUI and use the command line (e.g. awk, sed, or perl)
to extract canonical word-forms.
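For the command-line variant of this exercise, a minimal Python sketch of the column extraction might look as follows. It assumes that the saved file has one token per line with TAB-separated fields, that the canonical word-form is the 3rd field (as stated above), and that blank lines and lines starting with "%%" carry no token data. The last point, the function name canonical_forms, and the sample rows are assumptions made for illustration; the sample is hand-made and is not actual CAB output.

```python
def canonical_forms(tsv_text, column=2):
    """Collect one field (0-based index) from TAB-separated token lines."""
    forms = []
    for line in tsv_text.splitlines():
        # Assumption: blank lines separate sentences, "%%" lines are comments.
        if not line.strip() or line.startswith("%%"):
            continue
        fields = line.split("\t")
        if len(fields) > column:
            forms.append(fields[column])
    return forms


# Hand-made sample rows for illustration only (not real CAB output).
sample = (
    "%% hypothetical sample\n"
    "EJn\tEJn\tEin\n"
    "zamer\tzamer\tzahmer\n"
    "\n"
    "Elephant\tElephant\tElefant\n"
)

print(" ".join(canonical_forms(sample)))
```

To run this on the file from the exercise, read elephant.tsv with open("elephant.tsv", encoding="utf-8").read() and pass the result to canonical_forms.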
Normalize a TCF file using WebLicht
- Download this example: elephant.raw.tcf
- Analyze the file using WebLicht (requires CLARIN credentials!)
- Download the output TCF data to your computer (e.g. as elephant.cab.tcf)
- Extra credit: apply the tcf-orthswap.xsl XSL transformation as described here
to swap the "tokens" and "orthography" layers, then reload the modified file into WebLicht
and do some more processing.
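To see what the "orthography" layer contributes before swapping it, the following Python sketch reads a TCF document and returns the token sequence with canonical forms substituted wherever a correction exists. It is not a replacement for tcf-orthswap.xsl. The namespace URI, the element names (tokens, token, orthography, correction), and the attribute names (ID, tokenIDs) reflect an assumed TCF 0.4 layout and should be checked against your own elephant.cab.tcf; the embedded sample document is hand-made.

```python
import xml.etree.ElementTree as ET

# Assumption: TextCorpus elements live in this namespace (TCF 0.4).
TC_NS = {"tc": "http://www.dspin.de/data/textcorpus"}


def canonical_tokens(tcf_xml):
    """Return token texts, replaced by orthography corrections where present."""
    root = ET.fromstring(tcf_xml)
    corrections = {}
    for corr in root.iterfind(".//tc:orthography/tc:correction", TC_NS):
        # Simplification: a correction spanning several tokens is copied to each.
        for token_id in corr.get("tokenIDs", "").split():
            corrections[token_id] = corr.text or ""
    return [
        corrections.get(token.get("ID"), token.text or "")
        for token in root.iterfind(".//tc:tokens/tc:token", TC_NS)
    ]


# Hand-made sample document for illustration only (not real WebLicht output).
sample = """<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
  <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
    <tokens>
      <token ID="w1">EJn</token>
      <token ID="w2">Elephant</token>
      <token ID="w3">.</token>
    </tokens>
    <orthography>
      <correction operation="replace" tokenIDs="w1">Ein</correction>
      <correction operation="replace" tokenIDs="w2">Elefant</correction>
    </orthography>
  </TextCorpus>
</D-Spin>"""

print(canonical_tokens(sample))
```

Tokens without a correction (here the final period) keep their original text, which matches the idea that the layer stores only the forms that differ.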
Term Expansion
- Find all surface forms for the lemma "Elefant"
- use CAB's expand.eqlemma analysis chain
- generate the output as a flat list of target forms using the XList format
- ... or you can cheat by using this link
- Extra credit: convert the returned list to a disjunction query (logical "OR") for your favorite search engine.
- Super bonus extra credit: write a plugin for your favorite search engine to transparently convert all naive bareword user
queries to implicit disjunctions over the variant forms returned by the CAB web-service,
analogous to DDC's Expand Cab functionality.
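The "Extra credit" conversion above can be sketched in Python as follows. The sketch assumes that the returned flat list contains one surface form per line, and that the target search engine accepts quoted terms joined by an uppercase OR inside parentheses; both are assumptions, so adjust the operator and quoting to your engine's query syntax. The function names and the sample list are hypothetical and hand-made, not actual CAB output.

```python
def parse_flat_list(text):
    """Split a one-form-per-line list into unique forms, keeping first-seen order."""
    forms = []
    seen = set()
    for line in text.splitlines():
        form = line.strip()
        if form and form not in seen:
            seen.add(form)
            forms.append(form)
    return forms


def or_query(forms, operator=" OR "):
    """Join forms into a parenthesized disjunction of double-quoted terms."""
    if not forms:
        return ""

    def quote(form):
        # Escape backslashes and double quotes before wrapping the term.
        return '"' + form.replace("\\", "\\\\").replace('"', '\\"') + '"'

    return "(" + operator.join(quote(form) for form in forms) + ")"


# Hand-made sample list for illustration only (not real CAB output).
sample = "Elefant\nElephant\nElefanten\nElephant\n"

print(or_query(parse_flat_list(sample)))
```

A plugin along the lines of the "Super bonus" task would call the same two steps for each bareword in a user query and substitute the resulting disjunction for the original term.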