NAME

DTA::CAB::Format::TCF - Datum parser|formatter: CLARIN-D TCF (selected features only)

SYNOPSIS

 ##========================================================================
 ## PRELIMINARIES
 
 use DTA::CAB::Format::TCF;
 
 ##========================================================================
 ## Constructors etc.
 
 $fmt = CLASS_OR_OBJ->new(%args);
 
 ##========================================================================
 ## Methods: Input: Generic API
 
 $doc = $fmt->parseDocument();
 
 ##========================================================================
 ## Methods: Output: MIME & HTTP stuff
 
 $short = $fmt->shortName();
 $type = $fmt->mimeType();
 $ext = $fmt->defaultExtension();
 
 ##========================================================================
 ## Methods: Output: output selection
 
 $fmt = $fmt->flush();
 
 ##========================================================================
 ## Methods: Output: Generic API
 
 $fmt = $fmt->putDocument($doc);

DESCRIPTION

Globals

Variable: @ISA: DTA::CAB::Format::TCF inherits from DTA::CAB::Format::XmlCommon.

Constructors etc.

new

 $fmt = CLASS_OR_OBJ->new(%args);

object structure: HASH ref

    {
     ##-- new in TCF
     tcfbufr => \$buf,                       ##-- raw TCF buffer, for spliceback mode
     textbufr => \$text,                     ##-- raw text buffer, for spliceback mode
     tcflog  => $level,                      ##-- debugging log-level (default: 'off')
     spliceback => $bool,                    ##-- (output) if true (default), splice data back into 'tcfbufr' if available; otherwise create new TCF doc
     tcflayers => $tcf_layer_names,          ##-- layer names to include, space-separated list; known='tei text tokens sentences postags lemmas orthography'
     tcftagset => $tagset,                   ##-- tagset name for POStags element (default='stts')
     logsplice => $level,                    ##-- log level for spliceback messages (default: 'none')
     trimtext => $bool,                      ##-- if true (default), WASTE tokenizer hints will be trimmed from the 'text' layer
     ##-- input: inherited from XmlCommon
     xdoc => $xdoc,                          ##-- XML::LibXML::Document
     xprs => $xprs,                          ##-- XML::LibXML parser
     ##-- output: inherited from XmlCommon
     level => $level,                        ##-- output formatting level (OVERRIDE: default=1)
     output => [$how,$arg]                   ##-- either ['fh',$fh], ['file',$filename], or ['str',\$buf]
    }
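
The 'tcflayers' option above is a plain space-separated string, so a misspelled layer name is easy to miss. The following is a minimal sketch of how a caller might check such a value before constructing a formatter. The helper unknown_tcf_layers() is hypothetical and not part of DTA::CAB; the list of known names is copied from the option comment above, and the final new() call is shown only as a comment because it assumes the module is installed.

```perl
use strict;
use warnings;

# Layer names listed as "known" in the option table above.
my %KNOWN_LAYER = map { $_ => 1 }
    qw(tei text tokens sentences postags lemmas orthography);

# Hypothetical helper (not part of DTA::CAB): split a 'tcflayers' value
# on whitespace and return every name that is not a known layer.
sub unknown_tcf_layers {
    my ($layers) = @_;
    return grep { !$KNOWN_LAYER{$_} } split ' ', $layers;
}

# Example constructor arguments, using only keys documented above.
my %args = (
    tcflayers  => 'tokens sentences lemmas postags',
    tcftagset  => 'stts',
    spliceback => 0,
);

my @bad = unknown_tcf_layers($args{tcflayers});
die "unknown TCF layer(s): @bad\n" if @bad;

# With the module installed, the checked options would be passed on as:
#   my $fmt = DTA::CAB::Format::TCF->new(%args);
print "tcflayers ok: $args{tcflayers}\n";
```

Checking the string up front keeps the error close to the caller; how the module itself treats unknown layer names is not specified here.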

Methods: Input: Generic API

parseDocument

 $doc = $fmt->parseDocument();

parses the buffered XML::LibXML::Document in $fmt->{xdoc} and returns the result as $doc.

Methods: Output: MIME & HTTP stuff

shortName

 $short = $fmt->shortName();

returns "official" short name for this format; override returns "tcf".

mimeType

 $type = $fmt->mimeType();

returns the MIME type for this format; override returns "text/xml".

defaultExtension

 $ext = $fmt->defaultExtension();

returns default filename extension for this format; override returns ".tcf.xml".

Methods: Output: output selection

flush

 $fmt = $fmt->flush();

flush any buffered output to selected output source

Methods: Output: Generic API

putDocument

 $fmt = $fmt->putDocument($doc);

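override respects the local 'spliceback' and 'tcflayers' options: if 'spliceback' is true and 'tcfbufr' is available, data is spliced back into the buffered TCF document; otherwise a new TCF document is created. Only the layers named in 'tcflayers' are included.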

EXAMPLE

An example file in the format accepted/generated by this module is:

 <?xml version="1.0" encoding="UTF-8"?>
 <D-Spin xmlns="http://www.dspin.de/data" version="0.4">
  <MetaData xmlns="http://www.dspin.de/data/metadata"/>
  <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
    <text>wie oede!</text>
    <tokens>
      <token ID="w1">wie</token>
      <token ID="w2">oede</token>
      <token ID="w3">!</token>
    </tokens>
    <sentences>
      <sentence ID="s1" tokenIDs="w1 w2 w3"/>
    </sentences>
    <lemmas>
      <lemma tokenIDs="w1">wie</lemma>
      <lemma tokenIDs="w2">öde</lemma>
      <lemma tokenIDs="w3">!</lemma>
    </lemmas>
    <POStags tagset="stts">
      <tag tokenIDs="w1">PWAV</tag>
      <tag tokenIDs="w2">ADJD</tag>
      <tag tokenIDs="w3">$.</tag>
    </POStags>
    <orthography>
      <correction tokenIDs="w2" operation="replace">öde</correction>
    </orthography>
  </TextCorpus>
 </D-Spin>

If the input contains a 'text' layer but no 'tokens' or 'sentences' layers, the 'text' layer will be tokenized using the DTA::CAB::Format::Raw class.
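
The annotation layers in the example file refer to tokens through the 'ID' and 'tokenIDs' attributes. The sketch below shows one way to join the 'tokens', 'lemmas', and 'POStags' layers per token using XML::LibXML directly. It does not use DTA::CAB::Format::TCF and is not a description of that module's internals. The function name read_tcf_tokens() and the 'tc' namespace prefix are illustrative assumptions, and the embedded document is an abbreviated copy of the example above, with "öde" written as a character reference to keep the source ASCII-only.

```perl
use strict;
use warnings;
use XML::LibXML;

# Abbreviated copy of the example document (no sentences or orthography).
my $tcf = <<'EOT';
<?xml version="1.0" encoding="UTF-8"?>
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
 <MetaData xmlns="http://www.dspin.de/data/metadata"/>
 <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
  <text>wie oede!</text>
  <tokens>
   <token ID="w1">wie</token>
   <token ID="w2">oede</token>
   <token ID="w3">!</token>
  </tokens>
  <lemmas>
   <lemma tokenIDs="w1">wie</lemma>
   <lemma tokenIDs="w2">&#xF6;de</lemma>
   <lemma tokenIDs="w3">!</lemma>
  </lemmas>
  <POStags tagset="stts">
   <tag tokenIDs="w1">PWAV</tag>
   <tag tokenIDs="w2">ADJD</tag>
   <tag tokenIDs="w3">$.</tag>
  </POStags>
 </TextCorpus>
</D-Spin>
EOT

# Illustrative helper: returns one hash per token, in document order,
# with keys 'id', 'text', and (where annotated) 'lemma' and 'pos'.
sub read_tcf_tokens {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(tc => 'http://www.dspin.de/data/textcorpus');

    my (%tok, @order);
    for my $node ($xpc->findnodes('//tc:tokens/tc:token')) {
        my $id = $node->getAttribute('ID');
        push @order, $id;
        $tok{$id} = { id => $id, text => $node->textContent };
    }

    for my $layer ([lemma => '//tc:lemmas/tc:lemma'],
                   [pos   => '//tc:POStags/tc:tag']) {
        my ($key, $xpath) = @$layer;
        for my $node ($xpc->findnodes($xpath)) {
            # 'tokenIDs' is a space-separated list of token IDs
            for my $id (split ' ', $node->getAttribute('tokenIDs')) {
                $tok{$id}{$key} = $node->textContent if exists $tok{$id};
            }
        }
    }
    return map { $tok{$_} } @order;
}

binmode(STDOUT, ':encoding(UTF-8)');
for my $t (read_tcf_tokens($tcf)) {
    printf "%s\t%s\t%s\t%s\n",
        $t->{id}, $t->{text}, $t->{lemma} // '', $t->{pos} // '';
}
```

Each layer lives in the 'http://www.dspin.de/data/textcorpus' namespace, so the XPath expressions need a registered prefix; unprefixed names would match nothing.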

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.24.1 or, at your option, any later version of Perl 5 you may have available.