dtatw-files.perl - file formats used by dta-tokwrap utilities
FILENAME (STATUS) DESCRIPTION
*.xml (input) input XML file in DTA "base-format"
*.chr.xml (input) common convention for input files
*.char.xml (input) another common convention for input files
*.cx (temp) character index (CSV,TAB-separated)
*.sx (temp) structure index (XML)
*.tx (temp) text index (UTF-8 text)
*.bx0 (temp) preliminary "block index" (XML)
*.bx (temp) block index (CSV,TAB-separated)
*.txt (temp) serialized text (UTF-8 text)
*.t (temp) tokenizer output (.tt, TAB-separated)
*.cpx (temp) character+page index (CSV,TAB-separated)
*.wpx (temp) word+page index (CSV,TAB-separated)
*.t.xml (output) master serial XML output (XML)
*.s.xml (output) sentence-level standoff (XML)
*.w.xml (output) token-level standoff (XML)
*.a.xml (output) token-analysis-level standoff (XML)
*.u.xml (output) extended serial XML output (XML)
*.cw.xml (output) base-format + tokens (XML)
*.cws.xml (output) base-format + tokens + sentences (XML)
This manual describes the file formats currently used by the dta-tokwrap utilities.
Alias(es): *.chr.xml, *.char.xml
Input XML file in DTA "base-format" (UTF-8 encoded XML with one <c> element per character):
input documents MUST be encoded in UTF-8,
all text nodes to be tokenized should be descendants of a <c> element which is itself a descendant of a <text> element (XPath //text//c//text()),
each input document should contain exactly one such <c> element for each logical character which may be passed to the tokenizer,
no <c> element may be a descendant of another <c> element, and
each <c> element should have a valid xml:id attribute.
Example:
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:dta="http://www.deutsches-textarchiv.de/ns/1.0">
<!-- ... -->
<text>
<!-- ... -->
<c xml:id="c1"> </c>
<c xml:id="c2">U</c>
<c xml:id="c3">e</c>
<c xml:id="c4">b</c>
<c xml:id="c5">e</c>
<c xml:id="c6">r</c>
<c xml:id="c7"> </c>
<c xml:id="c8">d</c>
<c xml:id="c9">i</c>
<c xml:id="c10">e</c>
<c xml:id="c11"> </c>
<!-- ... -->
</text>
<!-- ... -->
</TEI>
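Two of the constraints listed above (no nested <c> elements, and an xml:id on every <c>) are easy to check mechanically. The following is a minimal sketch using only the Python standard library; the function name check_base_format is hypothetical and not part of dta-tokwrap, and the check for duplicate IDs is an added assumption about what "valid" means here.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
C_TAG = "{%s}c" % TEI_NS

def check_base_format(xml_bytes):
    """Return a list of problems found in a base-format document (hypothetical helper)."""
    problems = []
    seen_ids = set()

    def walk(elem, inside_c):
        if elem.tag == C_TAG:
            if inside_c:
                problems.append("nested <c> element")
            cid = elem.get(XML_ID)
            if cid is None:
                problems.append("<c> element without xml:id")
            elif cid in seen_ids:
                problems.append("duplicate xml:id: %s" % cid)
            else:
                seen_ids.add(cid)
        for child in elem:
            walk(child, inside_c or elem.tag == C_TAG)

    walk(ET.fromstring(xml_bytes), False)
    return problems
```

An empty result only means that these particular checks passed; it does not establish that a document is acceptable to the dta-tokwrap utilities.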
Character index file (TAB-separated text) as created by dtatw-mkindex. Used for translating between byte offsets and xml:ids.
Example:
%% <c>-element index generated by ../src/dtatw-mkindex
%% Package: dta-tokwrap version 0.04 / svn+ssh://odo.dwds.de/home/svn/dev/dta-tokwrap/trunk @ 2445:2447
%% Command-line: ../src/dtatw-mkindex 'xmlsrc/ex1.xml' 'ex1.cx' 'ex1.sx' 'ex1.tx'
%%======================================================================
%% $ID$ $XML_OFFSET$ $XML_LENGTH$ $TXT_OFFSET$ $TXT_LEN$ $TEXT$
c1 276 20 0 1
c2 382 20 1 1 U
c3 402 20 2 1 e
c4 422 20 3 1 b
c5 442 20 4 1 e
c6 462 20 5 1 r
c7 482 20 6 1
c8 502 20 7 1 d
c9 522 20 8 1 i
c10 542 21 9 1 e
c11 563 21 10 1
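As an illustration of the offset-to-ID translation, the sketch below reads *.cx records and looks up the xml:id covering a given raw-text byte offset. It assumes that fields are separated by single TAB characters, that lines beginning with %% are comments, that the text field needs no unescaping, and that records appear in increasing TXT_OFFSET order. The names CxRecord, parse_cx and cid_at are hypothetical.

```python
import bisect
from typing import NamedTuple

class CxRecord(NamedTuple):
    cid: str
    xml_off: int
    xml_len: int
    txt_off: int
    txt_len: int
    text: str

def parse_cx(lines):
    """Parse *.cx lines into records, skipping %% comments and blank lines."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("%%"):
            continue
        fields = line.split("\t")
        cid, xoff, xlen, toff, tlen = fields[:5]
        text = fields[5] if len(fields) > 5 else ""
        records.append(CxRecord(cid, int(xoff), int(xlen), int(toff), int(tlen), text))
    return records

def cid_at(records, tx_offset):
    """Return the ID of the record covering tx_offset, or None (assumes sorted records)."""
    starts = [r.txt_off for r in records]
    i = bisect.bisect_right(starts, tx_offset) - 1
    if i >= 0:
        r = records[i]
        if r.txt_off <= tx_offset < r.txt_off + r.txt_len:
            return r.cid
    return None
```

For repeated lookups the list of start offsets would be worth computing once rather than on every call.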
Structure index (XML) as created by dtatw-mkindex. All elements matching the XPath //text//c|//text//lb have been removed and replaced by placeholder <c> elements, one for each contiguous block of original <c> and <lb> elements. The placeholder elements have the form:
<c n="XOFF XLEN TOFF TLEN"/>
where XOFF,XLEN are the byte offset and length in the source XML file (*.xml) and TOFF,TLEN are the byte offset and length in the raw text index file (*.tx).
Example:
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:dta="http://www.deutsches-textarchiv.de/ns/1.0">
<!-- ... -->
<text>
<titlePage>
<c n="338 11 1 0"/>
<docTitle>
<c n="349 10 1 0"/>
<titlePart type="main">
<c n="359 23 1 0"/>
<c n="382 1666 1 82"/>
</titlePart>
<c n="2048 12 83 0"/>
<c n="2060 5 83 1"/>
<!-- ... -->
</titlePage>
</text>
<!-- ... -->
</TEI>
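The placeholder attribute is straightforward to decode. The sketch below collects the four integers from every placeholder <c n="..."/> element in a *.sx document; it matches elements by local name so that it works with or without the TEI namespace. The function name placeholder_spans is hypothetical.

```python
import xml.etree.ElementTree as ET

def placeholder_spans(sx_bytes):
    """Return (XOFF, XLEN, TOFF, TLEN) tuples for all placeholder <c> elements."""
    spans = []
    for elem in ET.fromstring(sx_bytes).iter():
        local_name = elem.tag.rsplit("}", 1)[-1]  # drop any "{namespace}" prefix
        n = elem.get("n")
        if local_name == "c" and n is not None:
            xoff, xlen, toff, tlen = (int(v) for v in n.split())
            spans.append((xoff, xlen, toff, tlen))
    return spans
```

Given the bytes of the source XML file, a span's original markup could then be recovered as source[xoff:xoff + xlen], since the offsets are byte offsets rather than character offsets.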
Raw, unserialized text index (UTF-8 text) as created by dtatw-mkindex. Results from concatenating all //text//c//text() nodes from the source document, and inserting newlines for //text//lb elements.
Example:
Ueber die Beeinflussung
einfacher psychischer Vorgänge
durch einige Arzneimittel.
Experimentelle Untersuchungen
von
Dr. Emil Kraepelin,
Professor der Psychiatrie in Heidelberg.
Mit einer Curventafel.
Jena,
Verlag von Gustav Fischer.
1892.
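The concatenation rule described above can be approximated in a few lines. This is only a sketch of the rule as stated, not a reimplementation of dtatw-mkindex, and it makes no claim to reproduce that program's output byte for byte; the function name raw_text is hypothetical.

```python
import xml.etree.ElementTree as ET

def raw_text(xml_bytes):
    """Concatenate //text//c//text() and emit a newline for each //text//lb."""
    out = []

    def walk(elem, in_text):
        local_name = elem.tag.rsplit("}", 1)[-1]  # ignore namespaces
        if in_text and local_name == "c":
            out.append("".join(elem.itertext()))
            return  # <c> elements are not nested in base-format
        if in_text and local_name == "lb":
            out.append("\n")
        for child in elem:
            walk(child, in_text or local_name == "text")

    walk(ET.fromstring(xml_bytes), False)
    return "".join(out)
```

Encoding the result as UTF-8 gives the byte sequence against which offsets would be measured.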
Preliminary "block index" (XML) as created by "dta-tokwrap.perl -t mkbx0". Generated from the *.sx file by inserting zero or more "hints" of one of the following forms:
<s/> <!-- sentence-break hint -->
<w/> <!-- token-break hint -->
<lb/> <!-- line-break hint -->
Zero or more output elements may also be assigned a dta.tw.key attribute, which should be some unique key identifying the logical block or segment with which any text descended from that element should be sorted during serialization (this is how we get seg elements to clump together). dta.tw.key attributes are inherited by default.
Also note that namespaces have been forcibly removed from the XML structure.
Example:
<?xml version="1.0" encoding="UTF-8"?>
<TEI dta.tw.key="TEI.id2369102" _xmlns="http://www.tei-c.org/ns/1.0" xmlns_dta="http://www.deutsches-textarchiv.de/ns/1.0">
<!-- ... -->
<text>
<titlePage>
<s/>
<c n="338 11 1 0"/>
<docTitle>
<c n="349 10 1 0"/>
<titlePart type="main">
<s/>
<c n="359 23 1 0"/>
<c n="382 1666 1 82"/>
<s/>
</titlePart>
<c n="2048 12 83 0"/>
<c n="2060 5 83 1"/>
<s/>
</titlePage>
</text>
<!-- ... -->
</TEI>
Block index (TAB-separated text) as created by "dta-tokwrap.perl -t mkbx". Used for translating between serialized-text (.txt) byte offsets and raw-text (.tx) byte offsets, which in turn gets us to c/@xml:ids. Still with me? Good.
Example:
%% XML block list file generated by DTA::TokWrap::Document::saveBxFile() (DTA::TokWrap version 0.04)
%% Original source file: ./xmlsrc/ex1.xml
%%======================================================================
%% $KEY$ $ELT$ $XML_OFFSET$ $XML_LENGTH$ $TX_OFFSET$ $TX_LEN$ $TXT_OFFSET$ $TXT_LEN$
__ROOT__ __ROOT__ 0 0 0 0 0 0
TEI.id2406247 s 176 0 0 0 0 6
TEI.id2406247 s 176 0 0 0 6 6
TEI.id2406247 s 215 0 0 0 12 6
TEI.id2406247 s 227 0 0 0 18 6
TEI.id2406247 s 258 0 0 0 24 6
TEI.id2406247 c 270 26 0 1 30 1
TEI.id2406247 s 270 0 0 0 31 6
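The translation from *.txt offsets to *.tx offsets might look roughly like the sketch below. It assumes TAB-separated fields in the column order of the header line, and it assumes that a block whose TX_LEN equals its TXT_LEN was copied byte for byte, while any other block (for example the rows with TX_LEN 0 in the sample) has no raw-text counterpart. Both assumptions are inferred from the example, not confirmed by the source. The names BxBlock, parse_bx and txt_to_tx are hypothetical.

```python
from typing import NamedTuple

class BxBlock(NamedTuple):
    key: str
    elt: str
    xml_off: int
    xml_len: int
    tx_off: int
    tx_len: int
    txt_off: int
    txt_len: int

def parse_bx(lines):
    """Parse *.bx lines into blocks, skipping %% comments and blank lines."""
    blocks = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("%%"):
            continue
        fields = line.split("\t")
        blocks.append(BxBlock(fields[0], fields[1], *(int(v) for v in fields[2:8])))
    return blocks

def txt_to_tx(blocks, txt_offset):
    """Map a *.txt byte offset to a *.tx byte offset, or None if there is none."""
    for b in blocks:
        if b.txt_off <= txt_offset < b.txt_off + b.txt_len:
            if b.tx_len != b.txt_len:
                return None  # assumed to be inserted material such as a hint
            return b.tx_off + (txt_offset - b.txt_off)
    return None
```

The resulting *.tx offset can then be resolved to a character ID through the *.cx file.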
Serialized text (UTF-8 text) as created by "dta-tokwrap.perl -t mktxt", possibly containing tokenizer "hints", to be passed to the underlying tokenizer.
The precise form taken by the hints in this file depends on many things, notably the options --strong-hints, --weak-hints, and --no-hints to dta-tokwrap.perl. You should ensure that your tokenizer is prepared to deal with whatever flavor of hints you are passing it (in particular, don't use the dwds_tomasotath tokenizer together with the --strong-hints option, unless you want it to return a lot of ($, WB, $) "tokens").
Example:
$SB$
Ueber die Beeinflussung
einfacher psychischer Vorgänge
durch einige Arzneimittel.
$SB$
$SB$
Experimentelle Untersuchungen
$SB$
Tokenizer output (.tt, TAB-separated UTF-8 text). The first non-text field should contain "TXTOFF TXTLEN" pairs, where TXTOFF and TXTLEN are byte-offset and -length in the *.txt file. These data are required for recovery of c element IDs. See mootfiles(5) for details on the file format.
Example:
%% raw tokenizer output generated by ../src/dtatw-tokenize-dummy (dta-tokwrap version 0.04)
Ueber 49 5
die 55 3
Beeinflussung 59 13
einfacher 73 9
psychischer 83 11
Vorgänge 95 9
durch 105 5
einige 111 6
Arzneimittel 118 12
. 130 1 $.
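A reader for this format could be sketched as follows. The sketch assumes TAB-separated fields, %% comment lines, and blank lines as sentence separators (the latter as in the moot file conventions; mootfiles(5) is the authority here). The names Token and parse_tt are hypothetical.

```python
from typing import List, NamedTuple

class Token(NamedTuple):
    text: str
    txt_off: int
    txt_len: int
    analyses: List[str]

def parse_tt(lines):
    """Parse tokenizer output into a list of sentences, each a list of Tokens."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("%%"):
            continue  # comment
        if not line:
            if current:  # blank line: assumed sentence boundary
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        off, length = (int(v) for v in fields[1].split())
        current.append(Token(fields[0], off, length, fields[2:]))
    if current:
        sentences.append(current)
    return sentences
```

Note that the lengths are byte lengths: "Vorgänge" in the example above has 8 characters but occupies 9 bytes in UTF-8, which matches its "95 9" entry.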
Character+pagebreak index (CSV, TAB-separated). Used in generation of *.u.xml files.
Example:
%% <(^c$)>+<pb> index generated by ../scripts/dtatw-mkpx.perl
%%======================================================================
%%$X_ID $PB_I $PB_N $PB_FACS $X_XPATH
c1 0 NULL NULL /TEI[1]/text[1]/c[1]
c2 7 NULL NULL /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/c[1]
c3 7 NULL NULL /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/c[2]
c4 7 NULL NULL /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/c[3]
c5 7 NULL NULL /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/c[4]
c6 7 NULL NULL /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/c[5]
c7 7 NULL NULL /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/c[6]
c8 7 NULL NULL /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/c[7]
Token+pagebreak index (CSV, TAB-separated). Used in generation of *.u.xml files. Format is same as *.cpx, but IDs are token-ids.
Example:
%% <(^w$)>+<pb> index generated by ../scripts/dtatw-mkpx.perl
%%======================================================================
%%$X_ID $PB_I $PB_N $PB_FACS $X_XPATH
w1 7 NULL NULL /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/w[1]
w2 7 NULL NULL /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/w[2]
w3 7 NULL NULL /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/w[3]
w4 7 NULL NULL /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/w[4]
w5 7 NULL NULL /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/w[5]
w6 7 NULL NULL /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/w[6]
w7 7 NULL NULL /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/w[7]
w8 7 NULL NULL /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/w[8]
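Since *.cpx and *.wpx share one layout, a single reader covers both. The sketch below assumes TAB-separated fields in the column order of the header line and treats the literal string NULL as "no value"; that interpretation of NULL is a guess based on the examples. The function name parse_px is hypothetical.

```python
def parse_px(lines):
    """Map each ID in a *.cpx or *.wpx file to its page-break data and XPath."""
    def value(field):
        return None if field == "NULL" else field  # assumption: NULL marks a missing value

    index = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("%%"):
            continue
        xid, pb_i, pb_n, pb_facs, xpath = line.split("\t", 4)
        index[xid] = {
            "pb_i": int(pb_i),
            "pb_n": value(pb_n),
            "pb_facs": value(pb_facs),
            "xpath": xpath,
        }
    return index
```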
Master XML-ified tokenizer output (XML). XPaths:
/*/s : sentence
/*/s/w : token: <w @xml:id b="TXTOFF TXTLEN" t="TEXT" c="C_IDS">...</w>
//w/a : token analysis: <a>ANALYSIS_TEXT</a>
//w//* : (additional analysis data, inserted e.g. by DTA::CAB utilities)
//w/@xml:id : token id (unique within document, counted in serialized order)
//w/@b : byte-offset and length of token in tokenizer input *.txt
//w/@t : token text as output by tokenizer
//w/@c : space-separated list of //c/@xml:id for token characters
This format can also be passed directly to and from the DTA::CAB(3pm) analysis suite using the DTA::CAB::Format::XmlNative(3pm) formatter class.
Example:
<?xml version="1.0" encoding="UTF-8"?>
<sentences xml:base="ex1.xml">
<s xml:id="s1">
<w xml:id="w1" b="49 5" t="Ueber" c="c2 c3 c4 c5 c6"/>
<w xml:id="w2" b="55 3" t="die" c="c8 c9 c10"/>
<w xml:id="w3" b="59 13" t="Beeinflussung" c="c12 c13 c14 c15 c16 c17 c18 c19 c20 c21 c22 c23 c24"/>
<w xml:id="w4" b="73 9" t="einfacher" c="c25 c26 c27 c28 c29 c30 c31 c32 c33"/>
<w xml:id="w5" b="83 11" t="psychischer" c="c35 c36 c37 c38 c39 c40 c41 c42 c43 c44 c45"/>
<w xml:id="w6" b="95 9" t="Vorgänge" c="c47 c48 c49 c50 c51 c52 c53 c54"/>
<w xml:id="w7" b="105 5" t="durch" c="c55 c56 c57 c58 c59"/>
<w xml:id="w8" b="111 6" t="einige" c="c61 c62 c63 c64 c65 c66"/>
<w xml:id="w9" b="118 12" t="Arzneimittel" c="c68 c69 c70 c71 c72 c73 c74 c75 c76 c77 c78 c79"/>
<w xml:id="w10" b="130 1" t="." c="c80">
<a>$.</a>
</w>
</s>
<!-- ... -->
</sentences>
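For consumers that only need the token data, the XPaths listed above translate directly into a small reader. The sketch below uses the Python standard library and returns plain dictionaries; the function name read_txml and the dictionary keys are hypothetical, and additional analysis data under //w//* is ignored.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def read_txml(xml_bytes):
    """Return a list of (sentence_id, tokens) pairs from a *.t.xml document."""
    sentences = []
    for s in ET.fromstring(xml_bytes).findall("s"):        # /*/s
        tokens = []
        for w in s.findall("w"):                           # /*/s/w
            off, length = (int(v) for v in w.get("b").split())
            tokens.append({
                "id": w.get(XML_ID),
                "text": w.get("t"),
                "txt_off": off,                            # byte offset in *.txt
                "txt_len": length,                         # byte length in *.txt
                "c_ids": w.get("c", "").split(),
                "analyses": [a.text for a in w.findall("a")],  # //w/a
            })
        sentences.append((s.get(XML_ID), tokens))
    return sentences
```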
Sentence-level standoff XML. DEPRECATED in favor of *.t.xml, *.u.xml.
Example:
<?xml version="1.0" encoding="UTF-8"?>
<sentences xml:base="ex1.w.xml">
<s xml:id="s1">
<w ref="#w1"/>
<w ref="#w2"/>
<w ref="#w3"/>
<w ref="#w4"/>
<w ref="#w5"/>
<w ref="#w6"/>
<w ref="#w7"/>
<w ref="#w8"/>
<w ref="#w9"/>
<w ref="#w10"/>
</s>
<!-- ... -->
</sentences>
Token-level standoff XML. DEPRECATED in favor of *.t.xml, *.u.xml.
Example:
<?xml version="1.0" encoding="UTF-8"?>
<tokens xml:base="ex1.xml">
<w xml:id="w1" t="Ueber">
<c ref="#c2"/>
<c ref="#c3"/>
<c ref="#c4"/>
<c ref="#c5"/>
<c ref="#c6"/>
</w>
<w xml:id="w2" t="die">
<c ref="#c8"/>
<c ref="#c9"/>
<c ref="#c10"/>
</w>
<w xml:id="w3" t="Beeinflussung">
<c ref="#c12"/>
<c ref="#c13"/>
<c ref="#c14"/>
<c ref="#c15"/>
<c ref="#c16"/>
<c ref="#c17"/>
<c ref="#c18"/>
<c ref="#c19"/>
<c ref="#c20"/>
<c ref="#c21"/>
<c ref="#c22"/>
<c ref="#c23"/>
<c ref="#c24"/>
</w>
<!-- ... -->
</tokens>
Token-analysis-level standoff XML. Currently contains only analyses supplied by the tokenizer. DEPRECATED in favor of *.t.xml, *.u.xml.
Example:
<?xml version="1.0" encoding="UTF-8"?>
<analyses xml:base="ex1.w.xml">
<a ref="#w10">$.</a>
<a ref="#w14">$ABBR</a>
<a ref="#w17">$,</a>
<a ref="#w23">$.</a>
<a ref="#w27">$.</a>
<a ref="#w29">$,</a>
<a ref="#w34">$.</a>
<a ref="#w35">$CARDPUNCT</a>
<!-- ... -->
</analyses>
Extended serialized XML format, based on *.t.xml with additional XPaths:
//s/@xp : common source-XML XPath prefix for all sentence tokens
//w/@xp : XPath suffix (of ../@xp) for token
//w/@t0 : tokenizer input text (including e.g. newlines) if different from @t
//w/@u : unicruft approximation of @t, if not equal to @t
//w/@u0 : unicruft approximation of @t0, if not equal to @u
//w/@pb : index of last //pb before onset of //w
//w/@cs : character spans: "CID+LEN CID+LEN ... CID+LEN"; replaces @c
... and removed XPaths:
//w/@c : removed in favor of //w/@cs
//w/@b : removed in favor of //w/@cs, //w/@t0
Example:
<?xml version="1.0" encoding="UTF-8"?>
<sentences xml:base="ex1a.xml">
<s xml:id="s1" xp="/TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]">
<w xml:id="w1" t="Ueber" pb="7" xp="-/c[1]" cs="c2+5"/>
<w xml:id="w2" t="die" pb="7" xp="-/c[7]" cs="c8+3"/>
<w xml:id="w3" t="Beeinflussung" pb="7" xp="-/c[11]" cs="c12+13"/>
<w xml:id="w4" t="einfacher" pb="7" xp="-/c[24]" cs="c25+9"/>
<w xml:id="w5" t="psychischer" pb="7" xp="-/c[34]" cs="c35+11"/>
<w xml:id="w6" t="Vorgänge" pb="7" xp="-/c[46]" cs="c47+8"/>
<w xml:id="w7" t="durch" pb="7" xp="-/c[54]" cs="c55+5"/>
<w xml:id="w8" t="einige" pb="7" xp="-/c[60]" cs="c61+6"/>
<w xml:id="w9" t="Arzneimittel" pb="7" xp="-/c[67]" cs="c68+12"/>
<w xml:id="w10" t="." pb="7" xp="-/c[79]" cs="c80+1">
<a>$.</a>
</w>
</s>
</sentences>
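Comparing the examples, cs="c2+5" in this format corresponds to c="c2 c3 c4 c5 c6" in *.t.xml, so LEN appears to count characters rather than bytes. Under the further assumption that character IDs consist of a fixed prefix followed by consecutive integers (true of the examples here, but not guaranteed by anything stated above), a span list can be expanded as in this sketch. The function name expand_cs is hypothetical.

```python
import re

def expand_cs(cs):
    """Expand "CID+LEN CID+LEN ..." into a flat list of character IDs.

    Assumes IDs look like PREFIX + integer and are numbered consecutively.
    """
    ids = []
    for span in cs.split():
        m = re.fullmatch(r"(.*?)(\d+)\+(\d+)", span)
        if m is None:
            raise ValueError("unrecognized character span: %r" % span)
        prefix, start, length = m.group(1), int(m.group(2)), int(m.group(3))
        ids.extend("%s%d" % (prefix, start + i) for i in range(length))
    return ids
```

If the IDs in a given document are not consecutive, the expansion should instead walk the *.cx file in order, starting from CID and taking LEN records.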
Base-format XML file with tokens encoded as <w> elements, as output by dtatw-add-w.perl.
Example:
<?xml version="1.0"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:dta="http://www.deutsches-textarchiv.de/ns/1.0">
<!-- ... -->
<text>
<!-- ... -->
<titlePart type="main">
<w xml:id="w1">
<c xml:id="c2">U</c>
<c xml:id="c3">e</c>
<c xml:id="c4">b</c>
<c xml:id="c5">e</c>
<c xml:id="c6">r</c>
</w>
<c xml:id="c7"> </c>
<w xml:id="w2">
<c xml:id="c8">d</c>
<c xml:id="c9">i</c>
<c xml:id="c10">e</c>
</w>
<c xml:id="c11"> </c>
<!-- ... -->
<w xml:id="w10">
<c xml:id="c80">.</c>
</w>
</titlePart>
<!-- ... -->
</text>
<!-- ... -->
</TEI>
Base-format XML file with tokens and sentences encoded as <w> and <s> elements respectively, as output by dtatw-add-s.perl.
Example:
<?xml version="1.0"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:dta="http://www.deutsches-textarchiv.de/ns/1.0">
<!-- ... -->
<text>
<!-- ... -->
<titlePart type="main">
<s xml:id="s1">
<w xml:id="w1">
<c xml:id="c2">U</c>
<c xml:id="c3">e</c>
<c xml:id="c4">b</c>
<c xml:id="c5">e</c>
<c xml:id="c6">r</c>
</w>
<c xml:id="c7"> </c>
<w xml:id="w2">
<c xml:id="c8">d</c>
<c xml:id="c9">i</c>
<c xml:id="c10">e</c>
</w>
<c xml:id="c11"> </c>
<!-- ... -->
<w xml:id="w10">
<c xml:id="c80">.</c>
</w>
</s>
</titlePart>
<!-- ... -->
</text>
<!-- ... -->
</TEI>
dtatw-add-c.perl(1), dtatw-add-w.perl(1), dtatw-add-s.perl(1), dta-tokwrap.perl(1), dtatw-txml2uxml.perl(1), DTA::TokWrap::Intro(3pm), ...
Bryan Jurish <jurish@bbaw.de>