This README was last updated for the DTA EvalCorpus v0.03, 2016-09-13.
The DTA EvalCorpus contains token-aligned training+evaluation data for canonicalization/normalization of historical German text. See the references below under "ATTRIBUTION" for details on the corpus construction.
Bryan Jurish, Henriette Ast, Marko Drotschmann, and Christian Thomas.
This corpus was created by semi-automatic alignment of a Deutsches Textarchiv (DTA) text with a contemporary edition drawn from Project Gutenberg and/or the TextGrid Digital Library.
The Deutsches Textarchiv text sources are distributed under the terms of the Creative Commons Attribution-NonCommercial 3.0 Unported License, see http://creativecommons.org/licenses/by-nc/3.0/.
Contemporary text sources from Project Gutenberg are distributed under the terms of the Project Gutenberg License, see http://www.gutenberg.org/wiki/Gutenberg:The_Project_Gutenberg_License
Contemporary text sources from the TextGrid Digital Library were provided by TextGrid from the data stock of TextGrid's Digital Library, www.editura.de, and are published under the Creative Commons "by" license version 3.0, see http://creativecommons.org/licenses/by/3.0/ and http://textgrid.de/en/digitale-bibliothek
This corpus itself is distributed under the terms of the Creative Commons Attribution-NonCommercial 3.0 Unported License, see http://creativecommons.org/licenses/by-nc/3.0/
If you make use of this corpus in your research, we ask that you cite one or both of the following articles in any associated publications:
Jurish, B., M. Drotschmann, & H. Ast. "Constructing a canonicalized corpus of historical German by text alignment." In P. Bennett, M. Durrell, S. Scheible, and R. J. Whitt (editors), New Methods in Historical Corpora, volume 3 of Corpus Linguistics and Interdisciplinary Perspectives on Language (CLIP), pages 221-234. Narr, Tübingen, 2013.
Jurish, B. & H. Ast. "Using an alignment-based lexicon for canonicalization of historical text." In J. Gippert & R. Gehrke (editors), Historical Corpora: Challenges and Perspectives, volume 5 of Corpus Linguistics and Interdisciplinary Perspectives on Language (CLIP), pages 197-208. Narr, Tübingen, 2015.
The corpus is distributed in multi-file XML format, where each file corresponds to a single DTA volume.
Bibliographic metadata are not included in the DTA EvalCorpus distribution. Metadata for the DTA source of a corpus file FILE.xml can be retrieved from the DTA website by accessing the URL http://www.deutschestextarchiv.de/book/show/FILE/, e.g. http://www.deutschestextarchiv.de/book/show/kant_aufklaerung_1784/. Additional metadata formats are also available.
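As a minimal illustration of the URL convention above (a sketch only, in Python; the helper name is ours and not part of the distribution), a corpus file's basename can be mapped to its DTA metadata page as follows:

from pathlib import Path

def dta_metadata_url(corpus_file):
    # FILE for the corpus file FILE.xml: strip any directory and the .xml extension
    base = Path(corpus_file).stem
    return "http://www.deutschestextarchiv.de/book/show/%s/" % base

print(dta_metadata_url("kant_aufklaerung_1784.xml"))
# -> http://www.deutschestextarchiv.de/book/show/kant_aufklaerung_1784/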
The corpus is encoded for the most part as a "flat" XML structure, using only the following element XPaths:
Document root.
Document body.
A single sentence or sentence-like unit in the DTA source text, as predicted by a (now obsolete) heuristic tokenizer.
A single token in the DTA source text, as predicted by a (now obsolete) heuristic tokenizer.
A join-subtoken; see "Splitting and Joining".
Document nodes (/doc) may contain the following attributes:
Basename of the DTA source file, i.e. FILE for the source file FILE.xml.
"Document OK", present and true iff document was manually judged safe for inclusion in the corpus.
Sentence nodes (//s) may contain the following attributes:
"Sentence OK", present and true iff the sentence is deemed a valid sentence-like unit. On by default.
"Sentence bad", present and true iff the parent //s
node was manually marked as a non-sentence-like unit in the course of corpus editing (e.g. due to tokenizer errors).
Token nodes (//w) may contain the following attributes:
Token text as appearing in the DTA edition.
Token text as appearing in the contemporary edition, or manually assigned canonical contemporary cognate.
Present and true iff the token or the canonicalization pair it represents has been manually or heuristically judged unsuitable for inclusion in training material, e.g. because of tokenizer errors, extinct lexemes, or "suspicious" constructions like hyphenated compounds.
Coarse lexical class for this token, one of the set:
"normal" word or punctuation; assigned by default unless manually overridden.
Indicates a string of multiple source tokens to be canonicalized into a single target token. See "Splitting and Joining".
Indicates a single source token to be canonicalized into multiple target tokens. See "Splitting and Joining".
A proper name, e.g. a person or place name.
An error of graphical origin, e.g. a printing-, OCR-, or transcription error.
An encoding error in the DTA source corpus.
Foreign-language material.
Non-standard pseudo-phonetic dialect in both the DTA and contemporary editions.
An abbreviation.
Indicates a mis-inflected word or a word missing inflection.
An extinct lexeme without any contemporary cognate.
Editorial license in the contemporary edition (translation vs. standardization).
Work in progress.
Absent iff the canonicalization pair (@old -> @new) was heuristically marked as "suspicious" and not manually re-confirmed.
Present and true iff the type was manually flagged as requiring expert review.
Present and true iff the canonicalization pair (@old -> @new) was manually edited during the token-level review phase.
Present and true iff the canonicalization pair type (@old -> @new) represents an automatic alignment which was rejected during the type-wise confirmation phase.
Present and true iff the canonicalization pair type (@old -> @new) has not been manually verified.
"Word ok" (dynamic flag): present and true iff ancestor document, sentence, and all sentence are valid.
In some cases, token boundaries did not map 1:1 from the source DTA text onto the contemporary edition. This corpus also encodes 1:n and n:1 mappings of adjacent tokens in conjunction with the JOIN and SPLIT token classes (//s/w/@class).
A SPLIT token is a single source token which is best canonicalized into multiple adjacent target tokens, e.g. "zweymal" -> "zwei mal". In this case, the token's @class attribute will be the string SPLIT, and its @new attribute will be a space-separated list of canonical cognate targets, as in:
<w class="SPLIT" new="zwei mal" old="zweymal">
A JOIN token is an adjacent string of source tokens best canonicalized as a single canonical cognate target. In this case, the original source tokens will be embedded in a single pseudo-token whose @old attribute will be a space-separated list of the source tokens' text and whose @new attribute is the manually assigned canonical cognate, as in:
<w class="JOIN" new="vorderhand" old="vor der Hand">
<w class="JOIN" new="vor" old="vor"/>
<w class="JOIN" new="der" old="der"/>
<w class="JOIN" new="Hand" old="Hand"/>
</w>
For both SPLIT and JOIN tokens, corpus annotators were asked to map each individual sub-token to a compositionally plausible contemporary equivalent wherever possible.
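As a concrete but deliberately simplified sketch of how SPLIT and JOIN tokens could be expanded into aligned source/target token lists (aligned_pair() is a hypothetical helper, not part of the distribution's own tooling; the two test elements are the examples shown above):

import xml.etree.ElementTree as ET

def aligned_pair(w):
    """Expand one //s/w element into aligned (source_tokens, target_tokens)."""
    cls = w.get("class")
    old, new = w.get("old", ""), w.get("new", "")
    if cls == "SPLIT":
        return [old], new.split()      # 1:n -- @new is a space-separated target list
    if cls == "JOIN":
        return old.split(), [new]      # n:1 -- @old is a space-separated source list;
                                       # the embedded //s/w/w subtokens carry the
                                       # per-token alignments if needed
    return [old], [new]                # ordinary 1:1 pair

split_w = ET.fromstring('<w class="SPLIT" new="zwei mal" old="zweymal"/>')
join_w = ET.fromstring('<w class="JOIN" new="vorderhand" old="vor der Hand">'
                       '<w class="JOIN" new="vor" old="vor"/>'
                       '<w class="JOIN" new="der" old="der"/>'
                       '<w class="JOIN" new="Hand" old="Hand"/></w>')
print(aligned_pair(split_w))   # (['zweymal'], ['zwei', 'mal'])
print(aligned_pair(join_w))    # (['vor', 'der', 'Hand'], ['vorderhand'])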
Additional administrative information such as full edit history and optional free-form editorial comments for each corpus unit (document, sentence, token) is not included in the corpus distribution.
The "prototype corpus" described in Jurish, Drotschmann, & Ast (2013) is included as a proper subset of this distribution. The works in this corpus subset were all aligned with contemporary editions from Project Gutenberg during the alignment phase. The basenames of the corresponding 13 distribution files are:
brentano_kasperl_1838
busch_max_1865
goethe_iphigenie_1787
goethe_lehrjahre01_1795
goethe_lehrjahre02_1795
goethe_lehrjahre03_1795
goethe_lehrjahre04_1796
goethe_torquato_1790
kant_aufklaerung_1784
lessing_menschengeschlecht_1780
schiller_kabale_1784
spyri_heidi_1880
storm_immensee_1852
Bryan Jurish <jurish@bbaw.de>