README for DTA EvalCorpus

This README was last updated for the DTA EvalCorpus v0.03, 2016-09-13.

DESCRIPTION

The DTA EvalCorpus contains token-aligned training+evaluation data for canonicalization/normalization of historical German text. See the references below under "ATTRIBUTION" for details on the corpus construction.

CONTRIBUTORS

Bryan Jurish, Henriette Ast, Marko Drotschmann, and Christian Thomas.

LICENSING

This corpus was created by semi-automatic alignment of a Deutsches Textarchiv (DTA) text with a contemporary edition drawn from Project Gutenberg and/or the TextGrid Digital Library.

The Deutsches Textarchiv text sources are distributed under the terms of the Creative Commons Attribution-NonCommercial 3.0 Unported License, see http://creativecommons.org/licenses/by-nc/3.0/.

Contemporary text sources from Project Gutenberg are distributed under the terms of the Project Gutenberg License, see http://www.gutenberg.org/wiki/Gutenberg:The_Project_Gutenberg_License

Contemporary text sources from the TextGrid Digital Library were provided by TextGrid from the data stock of TextGrid's Digital Library, www.editura.de, and are published under the Creative Commons "by" license version 3.0, see http://creativecommons.org/licenses/by/3.0/ and http://textgrid.de/en/digitale-bibliothek

This corpus itself is distributed under the terms of the Creative Commons Attribution-NonCommercial 3.0 Unported License, see http://creativecommons.org/licenses/by-nc/3.0/

ATTRIBUTION

If you make use of this corpus in your research, we ask that you cite one or both of the following articles in any associated publications:

CORPUS FORMAT

The corpus is distributed in multi-file XML format, where each file corresponds to a single DTA volume.

Metadata

Bibliographic metadata are not included in the DTA EvalCorpus distribution. Metadata for the DTA source of a corpus file FILE.xml can be retrieved from the DTA website by accessing the URL http://www.deutschestextarchiv.de/book/show/FILE/, e.g. http://www.deutschestextarchiv.de/book/show/kant_aufklaerung_1784/. Additional metadata formats are also available, including:

TEI header

www.deutschestextarchiv.de/api/tei_header/FILE

CMDI

www.deutschestextarchiv.de/api/cmdi/FILE

Dublin Core

www.deutschestextarchiv.de/api/oai_dc/FILE
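The URL patterns above can be filled in mechanically from a file's basename. The following is a minimal sketch only: the function name metadata_urls and the dictionary keys are hypothetical, and the http:// scheme for the three API paths is an assumption carried over from the book/show URL, since the patterns above are given without a scheme.

```python
def metadata_urls(basename):
    """Build the DTA metadata URLs for a corpus file basename (FILE for FILE.xml)."""
    # Assumption: the API paths use the same http:// scheme as the book/show URL.
    host = "http://www.deutschestextarchiv.de"
    return {
        "html": f"{host}/book/show/{basename}/",
        "tei_header": f"{host}/api/tei_header/{basename}",
        "cmdi": f"{host}/api/cmdi/{basename}",
        "oai_dc": f"{host}/api/oai_dc/{basename}",
    }


urls = metadata_urls("kant_aufklaerung_1784")
```

The sketch only constructs the strings; it does not fetch anything.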

Markup

The corpus is encoded for the most part as a "flat" XML structure, using only the following element XPaths:

/doc

Document root.

/doc/body

Document body.

/doc/body/s

A single sentence or sentence-like unit in the DTA source text, as predicted by a (now obsolete) heuristic tokenizer.

/doc/body/s/w

A single token in the DTA source text, as predicted by a (now obsolete) heuristic tokenizer.

/doc/body/s/w/w

A join-subtoken; see "Splitting and Joining".
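Because the structure is flat, a corpus file can be walked with any standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a small invented sample. The sample document, its basename "example_1800", the flag value "1", and the helper name iter_sentences are illustrative assumptions and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Invented sample following the element XPaths listed above.
SAMPLE = """<doc base="example_1800" dok="1">
  <body>
    <s>
      <w old="zweymal" new="zwei mal" class="SPLIT"/>
      <w old="." new="." class="LEX"/>
    </s>
  </body>
</doc>"""


def iter_sentences(root):
    """Yield one list of (old, new) pairs per /doc/body/s, top-level tokens only."""
    for s in root.findall("./body/s"):
        # "./w" matches only direct children, so join-subtokens (w/w) are skipped.
        yield [(w.get("old"), w.get("new")) for w in s.findall("./w")]


root = ET.fromstring(SAMPLE)  # for a real file: ET.parse(path).getroot()
sentences = list(iter_sentences(root))
```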

Document Attributes

Document nodes (/doc) may contain the following attributes:

@base

Basename of the DTA source file, i.e. FILE for the source file FILE.xml.

@dok

"Document OK", present and true iff document was manually judged safe for inclusion in the corpus.

Sentence Attributes

Sentence nodes (//s) may contain the following attributes:

@sok

"Sentence OK", present and true iff the sentence is deemed a valid sentence-like unit. On by default.

@sbad

"Sentence bad", present and true iff the parent //s node was manually marked as a non-sentence-like unit in the course of corpus editing (e.g. due to tokenizer errors).

Token Attributes

Token nodes (//w) may contain the following attributes:

@old

Token text as appearing in the DTA edition.

@new

Token text as appearing in the contemporary edition, or manually assigned canonical contemporary cognate.

@bad

Present and true iff the token or the canonicalization pair it represents has been manually or heuristically judged unsuitable for inclusion in training material, e.g. because of tokenizer errors, extinct lexemes, or "suspicious" constructions like hyphenated compounds.

@class

Coarse lexical class for this token, one of the set:

LEX

"normal" word or punctuation; assigned by default unless manually overridden.

JOIN

Indicates a string of multiple source tokens to be canonicalized into a single target token. See "Splitting and Joining".

SPLIT

Indicates a single source token to be canonicalized into multiple target tokens. See "Splitting and Joining".

NAME

A proper name, e.g. a person or place name.

GRAPH

An error of graphical origin, e.g. a printing-, OCR-, or transcription error.

BUG

An encoding error in the DTA source corpus.

FM

Foreign-language material.

DIAL

Non-standard pseudo-phonetic dialect in both the DTA and contemporary editions.

ABBR

An abbreviation.

INFL

Indicates a mis-inflected word or a word missing inflection.

GONE

An extinct lexeme without any contemporary cognate.

XEDIT

Editorial license in the contemporary edition (translation vs. standardization).

WIP

Work in progress.

@pok

Absent iff the canonicalization pair (@old -> @new) was heuristically marked as "suspicious" and not manually re-confirmed.

@review

Present and true iff the canonicalization pair type (@old -> @new) was manually flagged as requiring expert review.

@seen

Present and true iff the canonicalization pair (@old -> @new) was manually edited during the token-level review phase.

@unaligned

Present and true iff the canonicalization pair type (@old -> @new) represents an automatic alignment which was rejected during the type-wise confirmation phase.

@unverified

Present and true iff the canonicalization pair type (@old -> @new) has not been manually verified.

@wok

"Word ok" (dynamic flag): present and true iff the ancestor document, the ancestor sentence, and all tokens of that sentence are valid.

Splitting and Joining

In some cases, token boundaries did not map 1:1 from the source DTA text onto the contemporary edition. This corpus also encodes 1:n and n:1 mappings of adjacent tokens in conjunction with the JOIN and SPLIT token classes //s/w/@class.

A SPLIT token is a single source token which is best canonicalized into multiple adjacent target tokens, e.g. "zweymal" -> "zwei mal". In this case, the token's @class attribute will be the string SPLIT, and its @new attribute will be a space-separated list of canonical cognate targets, as in:

 <w class="SPLIT" new="zwei mal" old="zweymal">

A JOIN token is an adjacent string of source tokens best canonicalized as a single canonical cognate target. In this case, the original source tokens will be embedded in a single pseudo-token whose @old attribute will be a space-separated list of the source tokens' text and whose @new attribute is the manually assigned canonical cognate, as in:

 <w class="JOIN" new="vorderhand" old="vor der Hand">
   <w class="JOIN" new="vor" old="vor"/>
   <w class="JOIN" new="der" old="der"/>
   <w class="JOIN" new="Hand" old="Hand"/>
 </w>

For both SPLIT and JOIN tokens, corpus annotators were asked to map each individual sub-token to a compositionally plausible contemporary equivalent wherever possible.
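The 1:n and n:1 mappings described above can be read back into explicit source and target token lists. This is a minimal sketch using the two examples from this section. The function name alignment is hypothetical, and treating a missing @class as LEX is an assumption.

```python
import xml.etree.ElementTree as ET


def alignment(w):
    """Return (source_tokens, target_tokens) for one top-level //s/w node."""
    cls = w.get("class", "LEX")  # assumption: a missing @class means LEX
    if cls == "JOIN":
        # n:1 mapping: the source tokens are the nested join-subtokens.
        return [sub.get("old") for sub in w.findall("./w")], [w.get("new")]
    if cls == "SPLIT":
        # 1:n mapping: @new holds a space-separated list of targets.
        return [w.get("old")], w.get("new").split()
    return [w.get("old")], [w.get("new")]


split_tok = ET.fromstring('<w class="SPLIT" new="zwei mal" old="zweymal"/>')
join_tok = ET.fromstring(
    '<w class="JOIN" new="vorderhand" old="vor der Hand">'
    '<w class="JOIN" new="vor" old="vor"/>'
    '<w class="JOIN" new="der" old="der"/>'
    '<w class="JOIN" new="Hand" old="Hand"/>'
    "</w>"
)
split_pair = alignment(split_tok)
join_pair = alignment(join_tok)
```

Reading the JOIN sources from the sub-tokens rather than splitting @old on spaces keeps the sketch independent of how the pseudo-token's @old string was assembled.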

Administrivia

Additional administrative information such as full edit history and optional free-form editorial comments for each corpus unit (document, sentence, token) is not included in the corpus distribution.
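The sentence and token flags described under "Sentence Attributes" and "Token Attributes" can be combined to pull candidate training pairs out of a corpus file. The sketch below is one possible reading, not a prescribed procedure. The sample document is invented, and the helper names is_set and training_pairs are hypothetical. The README does not specify how a "true" flag value is serialized, so the truth test is an assumption, as is treating a missing @class as LEX.

```python
import xml.etree.ElementTree as ET

# Invented sample; the flag value "1" is an assumption.
SAMPLE = """<doc base="example_1800" dok="1">
  <body>
    <s>
      <w old="Theil" new="Teil" class="LEX"/>
      <w old="xyz" new="xyz" class="GRAPH" bad="1"/>
    </s>
    <s sbad="1">
      <w old="seyn" new="sein"/>
    </s>
  </body>
</doc>"""


def is_set(node, name):
    """Assumption: a flag is true unless absent, empty, "0", or "false"."""
    return node.get(name, "").strip().lower() not in ("", "0", "false")


def training_pairs(root):
    """Collect (old, new, class) for top-level tokens that are not flagged
    @bad and whose sentence is not flagged @sbad."""
    pairs = []
    for s in root.findall("./body/s"):
        if is_set(s, "sbad"):
            continue
        for w in s.findall("./w"):
            if is_set(w, "bad"):
                continue
            pairs.append((w.get("old"), w.get("new"), w.get("class", "LEX")))
    return pairs


pairs = training_pairs(ET.fromstring(SAMPLE))
```

The sketch recomputes validity from @sbad and @bad rather than relying on the dynamic @wok flag, since it is not stated whether @wok is always present in the distributed files.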

PROTOTYPE SUBCORPUS

The "prototype corpus" described in Jurish, Drotschmann, & Ast (2013) is included as a proper subset of this distribution. The works in this corpus subset were all aligned with contemporary editions from Project Gutenberg during the alignment phase. The basenames of the corresponding 13 distribution files are:

 brentano_kasperl_1838
 busch_max_1865
 goethe_iphigenie_1787
 goethe_lehrjahre01_1795
 goethe_lehrjahre02_1795
 goethe_lehrjahre03_1795
 goethe_lehrjahre04_1796
 goethe_torquato_1790
 kant_aufklaerung_1784
 lessing_menschengeschlecht_1780
 schiller_kabale_1784
 spyri_heidi_1880
 storm_immensee_1852
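To restrict processing to the prototype subcorpus, the basenames above can be matched against file names. This is a minimal sketch: the directory name "corpus" and the file "some_other_1850.xml" are hypothetical, and the helper name is_prototype is illustrative.

```python
from pathlib import Path

# The 13 basenames listed above.
PROTOTYPE_BASENAMES = frozenset({
    "brentano_kasperl_1838",
    "busch_max_1865",
    "goethe_iphigenie_1787",
    "goethe_lehrjahre01_1795",
    "goethe_lehrjahre02_1795",
    "goethe_lehrjahre03_1795",
    "goethe_lehrjahre04_1796",
    "goethe_torquato_1790",
    "kant_aufklaerung_1784",
    "lessing_menschengeschlecht_1780",
    "schiller_kabale_1784",
    "spyri_heidi_1880",
    "storm_immensee_1852",
})


def is_prototype(path):
    """True if the file's basename (FILE for FILE.xml) is in the prototype subcorpus."""
    return Path(path).stem in PROTOTYPE_BASENAMES


# Hypothetical paths for illustration.
files = ["corpus/kant_aufklaerung_1784.xml", "corpus/some_other_1850.xml"]
prototype_files = [f for f in files if is_prototype(f)]
```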

CONTACT

Bryan Jurish <jurish@bbaw.de>