NAME

DTA::CAB::Format::TEI - Datum parser|formatter: TEI-XML using DTA::TokWrap

SYNOPSIS

 ##========================================================================
 ## PRELIMINARIES
 
 use DTA::CAB::Format::TEI;
 
 ##========================================================================
 ## Constructors etc.
 
 $fmt = CLASS_OR_OBJ->new(%args);
 $fmt->DESTROY();
 
 ##========================================================================
 ## Methods: Generic
 
 $dir = $fmt->tmpdir();
 $tmpdir = $fmt->mktmpdir();
 $fmt = $fmt->rmtmpdir();
 $txmlfmt = $fmt->txmlfmt();
 $class = $fmt->txmlclass();
 $tw = $fmt->tw();
 
 ##========================================================================
 ## Methods: Input: Generic API
 
 $fmt = $fmt->close();
 $fmt = $fmt->fromString(\$string);
 $fmt = $fmt->fromFile($filename_or_handle);
 $fmt = $fmt->fromFh($handle);
 $doc = $fmt->parseDocument();
 
 ##========================================================================
 ## Methods: Output: MIME & HTTP stuff
 
 $short = $fmt->shortName();
 $ext = $fmt->defaultExtension();
 
 ##========================================================================
 ## Methods: Output: output selection
 
 $fmt = $fmt->flush();
 $fmt = $fmt->toString(\$str);
 $fmt_or_undef = $fmt->toFile($filename, $formatLevel);
 $fmt_or_undef = $fmt->toFh($fh,$formatLevel);
 
 ##========================================================================
 ## Methods: Output: Generic API
 
 $fmt = $fmt->putDocument($doc);
 

DESCRIPTION

Globals

Variable: @ISA

DTA::CAB::Format::TEI inherits from DTA::CAB::Format::XmlTokWrap.

Variable: $TXML_CLASS_DEFAULT

Default parser/formatter class for *.t.xml files; by default DTA::CAB::Format::XmlTokWrap. The alternative DTA::CAB::Format::XmlTokWrapFast is ca. 2x faster, but doesn't support all token attributes.

Constructors etc.

new
 $fmt = CLASS_OR_OBJ->new(%args);

object structure: HASH ref

    {
     ##-- new in TEI
     tmpdir => $dir,                         ##-- temporary directory for this object (default: new)
     keeptmp => $bool,                       ##-- if true, temporary directory is not removed by rmtmpdir()
     teilog => 'off',                        ##-- tei format debug log level
     twlog => 'off',                         ##-- DTA::TokWrap debug log level (also consider specifying e.g. -lo=twLevel=TRACE on the command-line)
     addc => $bool_or_guess,                 ##-- (input) whether to add //c elements (slow no-op if already present; default=0)
     spliceback => $bool,                    ##-- (output) if true (default), return .cws.cab.xml ; otherwise just .cab.t.xml [requires doc 'teibufr' attribute]
     keeptext => $bool,                      ##-- (input) if true (default), include 'textbufr' element for extracted TEI text
     keepc => $bool,                         ##-- (output) whether to include //c elements in spliceback-mode output (default=0)
     tw => $tw,                              ##-- underlying DTA::TokWrap object
     twopen => \%opts,                       ##-- options for $tw->open()
     teibufr => \$buf,                       ##-- raw tei+c buffer, for spliceback mode
     textbufr => \$buf,                      ##-- raw text buffer, for keeptext mode
     txmlfmt   => $fmt,                      ##-- classname or object for parsing tokwrap *.t.xml files (default: DTA::CAB::Format::XmlTokWrap)
     txmlopts  => \%opts,                    ##-- options for *.t.xml sub-formatter (clobbers %$fmt options)
     'att.linguistic' => $bool,              ##-- use TEI att.linguistic features? (forces txmlfmt, txmlopts, twopts)
     ##
     ##-- input: inherited from XmlNative
     xdoc => $xdoc,                          ##-- XML::LibXML::Document
     xprs => $xprs,                          ##-- XML::LibXML parser
     ##
     ##-- output: new
     #outfile => $filename,                   ##-- final output file (flushed with File::Copy::copy)
     ##
     ##-- output: inherited from XmlTokWrap
     arrayEltKeys => \%akey2ekey,            ##-- maps array keys to element keys for output
     arrayImplicitKeys => \%akey2undef,      ##-- pseudo-hash of array keys NOT mapped to explicit elements
     key2xml => \%key2xml,                   ##-- maps keys to XML-safe names
     xml2key => \%xml2key,                   ##-- maps xml keys to internal keys
     ##
     ##-- output: inherited from XmlNative
     #encoding => $inputEncoding,             ##-- default: UTF-8; applies to output only!
     level => $level,                        ##-- output formatting level (default=0)
     ##
     ##-- common: safety
     safe => $bool,                          ##-- if true (default), no "unsafe" token data will be generated (_xmlnod,etc.)
    }

DESTROY
 $fmt->DESTROY();

destructor implicitly calls $fmt->rmtmpdir()
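
The constructor options listed above can be combined as in the following sketch. Option names and documented defaults come from the object structure table; the particular values chosen are illustrative only. Construction is guarded so that the script also runs where the DTA::CAB distribution is not installed.

```perl
use strict;
use warnings;

# Illustrative option set; keys are those documented for new().
my %args = (
  keeptmp    => 0,   # let rmtmpdir() remove the temporary directory
  addc       => 0,   # do not add //c elements on input (documented default)
  spliceback => 1,   # splice analyses back into the TEI (documented default)
  keepc      => 0,   # omit //c elements from spliceback output (documented default)
  txmlfmt    => 'DTA::CAB::Format::XmlTokWrapFast',  # faster *.t.xml sub-format
);

# Construct the format object only if the module can be loaded.
my $fmt;
if (eval { require DTA::CAB::Format::TEI; 1 }) {
  $fmt = DTA::CAB::Format::TEI->new(%args);
  print "short name: ", $fmt->shortName(), "\n";
} else {
  print "DTA::CAB::Format::TEI is not installed; skipping construction\n";
}
```

Note that XmlTokWrapFast trades completeness for speed: as stated under $TXML_CLASS_DEFAULT, it does not support all token attributes.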

Methods: Generic

tmpdir
 $dir = $fmt->tmpdir();

get/generate name of temporary directory, ensures $fmt->{tmpdir} is set

mktmpdir
 $tmpdir = $fmt->mktmpdir();

ensures $fmt->tmpdir() exists

rmtmpdir
 $fmt = $fmt->rmtmpdir();

removes $fmt->{tmpdir} unless $fmt->{keeptmp} is true
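
The interplay of tmpdir(), mktmpdir(), rmtmpdir() and the keeptmp flag can be illustrated with a small stand-in class. This is a sketch of the documented semantics only, not the module's implementation: the package name My::TmpDirOwner and the directory naming scheme are assumptions made for the example.

```perl
use strict;
use warnings;
use File::Path ();
use File::Spec ();

package My::TmpDirOwner;   # hypothetical stand-in, NOT DTA::CAB::Format::TEI

my $counter = 0;

sub new {
  my ($class, %args) = @_;
  return bless { tmpdir => undef, keeptmp => 0, %args }, $class;
}

# Get or generate the directory name; ensures {tmpdir} is set.
# The predictable name is for illustration only; real code should
# prefer File::Temp for safe name generation.
sub tmpdir {
  my $self = shift;
  $self->{tmpdir} //= File::Spec->catdir(
    File::Spec->tmpdir(), sprintf('tei_sketch_%d_%d', $$, ++$counter));
  return $self->{tmpdir};
}

# Ensure that the directory named by tmpdir() exists.
sub mktmpdir {
  my $self = shift;
  my $dir  = $self->tmpdir();
  File::Path::make_path($dir) unless -d $dir;
  return $dir;
}

# Remove the directory unless keeptmp is true; returns the object.
sub rmtmpdir {
  my $self = shift;
  my $dir  = $self->{tmpdir};
  File::Path::remove_tree($dir)
    if defined($dir) && -d $dir && !$self->{keeptmp};
  return $self;
}

# The destructor implicitly cleans up, as documented for DESTROY.
sub DESTROY { my $self = shift; $self->rmtmpdir(); }

package main;

my $obj     = My::TmpDirOwner->new();
my $dir     = $obj->mktmpdir();
my $created = -d $dir ? 1 : 0;
$obj->rmtmpdir();
my $removed = -d $dir ? 0 : 1;

my $keeper = My::TmpDirOwner->new(keeptmp => 1);
my $kdir   = $keeper->mktmpdir();
$keeper->rmtmpdir();                 # no-op because keeptmp is true
my $kept   = -d $kdir ? 1 : 0;
File::Path::remove_tree($kdir);      # manual cleanup for the example

print "created=$created removed=$removed kept=$kept\n";
```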

txmlfmt
 $txmlfmt = $fmt->txmlfmt();

gets cached $fmt->{txmlfmt} or creates it

txmlclass
 $class = $fmt->txmlclass();

(undocumented)

tw
 $tw = $fmt->tw();

returns DTA::TokWrap object for $fmt; calls $fmt->tmpdir()

Methods: Input: Generic API

close
 $fmt = $fmt->close();

close current input source, if any

fromString
 $fmt = $fmt->fromString(\$string);

select input from string $string

fromFile
 $fmt = $fmt->fromFile($filename_or_handle);

calls $fmt->fromFh()

fromFh
 $fmt = $fmt->fromFh($handle);

just calls $fmt->fromString()
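
Since fromFile() and fromFh() both end up in fromString(), input from a handle is presumably read into memory in full before parsing. A minimal sketch of such a slurp step follows; the helper slurp_fh() is hypothetical and not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of DTA::CAB): read everything from $fh
# and return a reference to the buffer, i.e. the kind of argument
# that fromString(\$string) expects.
sub slurp_fh {
  my ($fh) = @_;
  local $/ = undef;                 # slurp mode: read to end-of-file
  my $buf = <$fh>;
  $buf = '' unless defined $buf;    # empty input yields an empty buffer
  return \$buf;
}

# Demonstrate with an in-memory filehandle.
my $tei = "<TEI><text>Wie oede!<lb/></text></TEI>\n";
open(my $fh, '<', \$tei) or die "cannot open in-memory handle: $!";
my $bufref = slurp_fh($fh);
close($fh);

print length($$bufref), " bytes buffered\n";
```

For very large TEI files this means memory use grows with the input size, independent of whether a filename, a handle, or a string was given.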

parseDocument
 $doc = $fmt->parseDocument();

parses buffered XML::LibXML::Document; local override inserts $doc->{teibufr}, $doc->{textbufr} attributes for spliceback mode

Methods: Output: MIME & HTTP stuff

shortName
 $short = $fmt->shortName();

returns "official" short name for this format; override returns "tei".

defaultExtension
 $ext = $fmt->defaultExtension();

returns default filename extension for this format; override returns ".tei.xml".

Methods: Output: output selection

flush
 $fmt = $fmt->flush();

flush any buffered output to selected output source; override calls $fmt->buf2fh(\$fmt->{outbuf}, $fmt->{fh})

toString
 $fmt = $fmt->toString(\$str);
 $fmt = $fmt->toString(\$str,$formatLevel)

select output to byte-string; override reverts to DTA::CAB::Format::toString()

toFile
 $fmt_or_undef = $fmt->toFile($filename, $formatLevel);

select output to $filename; override reverts to DTA::CAB::Format::toFile().

toFh
 $fmt_or_undef = $fmt->toFh($fh,$formatLevel);

select output to filehandle $fh; override reverts to DTA::CAB::Format::toFh()

Methods: Output: Generic API

putDocument
 $fmt = $fmt->putDocument($doc);

override respects local 'keepc' and 'spliceback' flags
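
Because the input and output selectors return the format object, calls can be chained. The following sketch performs a tokenize-and-write round trip. It assumes that the DTA::CAB and DTA::TokWrap distributions and their tokenizer are installed; the helper name tei_roundtrip() is hypothetical. No CAB analyzer is run here, so the result should contain //s and //w elements but no moot or xlit analyses.

```perl
use strict;
use warnings;

# Minimal TEI input, following the format accepted by this module.
my $tei = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<TEI>
  <text>
    <fw>Running headers are ignored</fw>
    Wie oede!<lb/>
  </text>
</TEI>
XML

# Hypothetical helper: parse $teistr and write it back out as a string.
# Returns undef if the module is unavailable or processing fails.
sub tei_roundtrip {
  my ($teistr, %opts) = @_;
  return undef unless eval { require DTA::CAB::Format::TEI; 1 };

  my $out = '';
  my $ok  = eval {
    my $ifmt = DTA::CAB::Format::TEI->new(%opts);
    my $ofmt = DTA::CAB::Format::TEI->new(%opts);

    # parseDocument() stores teibufr/textbufr in $doc for spliceback mode
    my $doc = $ifmt->fromString(\$teistr)->parseDocument();

    # (an analyzer would normally process $doc at this point)

    $ofmt->toString(\$out)->putDocument($doc)->flush();
    1;
  };
  warn "round trip failed: $@" unless $ok;
  return $ok ? $out : undef;
}

my $result = tei_roundtrip($tei, spliceback => 1, keepc => 0);
print defined($result) ? $result : "round trip not available here\n";
```

With spliceback set to a false value, the documented behavior is to emit the .cab.t.xml form rather than TEI with the tokens spliced back in.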

EXAMPLE

An example input file in the format accepted by this module is:

 <?xml version="1.0" encoding="UTF-8"?>
 <TEI>
   <text>
     <fw>Running headers are ignored</fw>
     Wie oede!<lb/>
   </text>
 </TEI>

An example output file in the format returned by this module is:

 <?xml version="1.0" encoding="UTF-8"?>
 <TEI>
   <text>
     <fw>Running headers are ignored</fw>
     <s lang="de">
       <w msafe="1" t="wie" errid="ec" hasmorph="1" exlex="wie" lang="de">
         <moot word="wie" lemma="wie" tag="PWAV"/>
         <xlit isLatinExt="1" isLatin1="1" latin1Text="wie"/>
       </w>
       <w msafe="0" t="oede">
         <moot tag="ADJD" lemma="öde" word="öde"/>
         <xlit isLatinExt="1" isLatin1="1" latin1Text="oede"/>
       </w>
       <w exlex="!" errid="ec" t="!" msafe="1">
         <xlit latin1Text="!" isLatin1="1" isLatinExt="1"/>
         <moot word="!" tag="$." lemma="!"/>
       </w>
     </s>
     <lb/>
   </text>
 </TEI>

Any //s or //w elements in the input will be IGNORED, and the input will be (re-)tokenized. Output files are themselves parseable by DTA::CAB::Format::TEIws.

att.linguistic Example

An example output file in the format returned by this module with the att.linguistic option set to a true value is:

 <?xml version="1.0" encoding="UTF-8"?>
 <TEI>
   <text>
     <fw>Running headers are ignored</fw>
     <s xml:id="s1">
       <w xml:id="w1" lemma="wie" pos="PWAV" norm="Wie">Wie</w>
       <w xml:id="w2" lemma="öde" pos="ADJD" norm="öde" join="right">oede</w>
       <w xml:id="w3" lemma="!" pos="$." norm="!" join="left">!</w>
     </s>
     <lb/>
   </text>
 </TEI>
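
Output in att.linguistic mode keeps each analysis in attributes of the //w element, so it can be read with any XML parser. The sketch below uses XML::LibXML (already a dependency of this module) to collect the surface form and selected attributes of each token from the example above; the helper name w_rows() is hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;

# att.linguistic output copied from the example above (raw UTF-8 bytes).
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<TEI>
  <text>
    <fw>Running headers are ignored</fw>
    <s xml:id="s1">
      <w xml:id="w1" lemma="wie" pos="PWAV" norm="Wie">Wie</w>
      <w xml:id="w2" lemma="öde" pos="ADJD" norm="öde" join="right">oede</w>
      <w xml:id="w3" lemma="!" pos="$." norm="!" join="left">!</w>
    </s>
    <lb/>
  </text>
</TEI>
XML

# Hypothetical helper: return one hash ref per //w element.
# Attributes missing on an element are returned as undef.
sub w_rows {
  my ($xmlstr) = @_;
  my $doc = XML::LibXML->load_xml(string => $xmlstr);
  my @rows;
  for my $w ($doc->findnodes('//w')) {
    my %row = (text => $w->textContent);
    $row{$_} = $w->getAttribute($_) for qw(lemma pos norm join);
    push @rows, \%row;
  }
  return @rows;
}

my @rows = w_rows($xml);
binmode(STDOUT, ':encoding(UTF-8)');
for my $r (@rows) {
  printf "%s\t%s\t%s\n", $r->{text}, $r->{pos}, $r->{lemma};
}
```

The XPath expression //w matches here because the example carries no namespace declaration; TEI documents that declare the TEI namespace would need a namespace-aware query instead.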

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2011-2019 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.24.1 or, at your option, any later version of Perl 5 you may have available.

SEE ALSO

dta-cab-analyze.perl(1), dta-cab-convert.perl(1), dta-cab-http-server.perl(1), dta-cab-http-client.perl(1), dta-cab-xmlrpc-server.perl(1), dta-cab-xmlrpc-client.perl(1), DTA::CAB::Server(3pm), DTA::CAB::Client(3pm), DTA::CAB::Format(3pm), DTA::CAB(3pm), perl(1), ...