DTA::CAB::Format::TCF - Datum parser|formatter: CLARIN-D TCF (selected features only)
##========================================================================
## PRELIMINARIES
use DTA::CAB::Format::TCF;
##========================================================================
## Constructors etc.
$fmt = CLASS_OR_OBJ->new(%args);
##========================================================================
## Methods: Input: Generic API
$doc = $fmt->parseDocument();
##========================================================================
## Methods: Output: MIME & HTTP stuff
$short = $fmt->shortName();
$type = $fmt->mimeType();
$ext = $fmt->defaultExtension();
##========================================================================
## Methods: Output: output selection
$fmt = $fmt->flush();
##========================================================================
## Methods: Output: Generic API
$fmt = $fmt->putDocument($doc);
DTA::CAB::Format::TCF inherits from DTA::CAB::Format::XmlCommon.
$fmt = CLASS_OR_OBJ->new(%args);
object structure: HASH ref
{
##-- new in TCF
tcfbufr => \$buf, ##-- raw TCF buffer, for spliceback mode
textbufr => \$text, ##-- raw text buffer, for spliceback mode
tcflog => $level, ##-- debugging log-level (default: 'off')
spliceback => $bool, ##-- (output) if true (default), splice data back into 'tcfbufr' if available; otherwise create new TCF doc
tcflayers => $tcf_layer_names, ##-- layer names to include, space-separated list; known='tei text tokens sentences postags lemmas orthography'
tcftagset => $tagset, ##-- tagset name for POStags element (default='stts')
logsplice => $level, ##-- log level for spliceback messages (default:'none')
trimtext => $bool, ##-- if true (default), WASTE tokenizer hints will be trimmed from 'text' layer
##-- input: inherited from XmlCommon
xdoc => $xdoc, ##-- XML::LibXML::Document
xprs => $xprs, ##-- XML::LibXML parser
##-- output: inherited from XmlCommon
level => $level, ##-- output formatting level (OVERRIDE: default=1)
output => [$how,$arg] ##-- either ['fh',$fh], ['file',$filename], or ['str',\$buf]
}
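A minimal sketch of a constructor call, using only the option keys listed above. The option values are illustrative choices rather than recommendations, and the example assumes that DTA::CAB may not be installed, so the module is loaded inside an eval and skipped if unavailable.

```perl
use strict;
use warnings;

##-- constructor options: keys as documented above, values chosen for illustration
my %args = (
  spliceback => 0,                                   ##-- always create a new TCF doc
  tcflayers  => 'tokens sentences postags lemmas',   ##-- space-separated layer names
  tcftagset  => 'stts',                              ##-- documented default
  trimtext   => 1,                                   ##-- documented default
);

##-- instantiate only if the module can be loaded
if (eval { require DTA::CAB::Format::TCF; 1 }) {
  my $fmt = DTA::CAB::Format::TCF->new(%args);
  print "short name: ", $fmt->shortName(), "\n";         ##-- documented as "tcf"
  print "MIME type:  ", $fmt->mimeType(), "\n";          ##-- documented as text/xml
  print "extension:  ", $fmt->defaultExtension(), "\n";  ##-- documented as ".tcf.xml"
} else {
  warn "DTA::CAB::Format::TCF not available; skipping instantiation\n";
}
```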
$doc = $fmt->parseDocument();
parses the buffered XML::LibXML::Document in $fmt->{xdoc} and returns the resulting document
$short = $fmt->shortName();
returns "official" short name for this format; override returns "tcf".
$type = $fmt->mimeType();
returns MIME type for this format; override returns "text/xml".
$ext = $fmt->defaultExtension();
returns default filename extension for this format; override returns ".tcf.xml".
$fmt = $fmt->flush();
flushes any buffered output to the selected output destination (see the 'output' key above)
$fmt = $fmt->putDocument($doc);
override respects the local 'spliceback' and 'tcflayers' options
An example file in the format accepted/generated by this module is:
<?xml version="1.0" encoding="UTF-8"?>
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
<MetaData xmlns="http://www.dspin.de/data/metadata"/>
<TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
<text>wie oede!</text>
<tokens>
<token ID="w1">wie</token>
<token ID="w2">oede</token>
<token ID="w3">!</token>
</tokens>
<sentences>
<sentence ID="s1" tokenIDs="w1 w2 w3"/>
</sentences>
<lemmas>
<lemma tokenIDs="w1">wie</lemma>
<lemma tokenIDs="w2">öde</lemma>
<lemma tokenIDs="w3">!</lemma>
</lemmas>
<POStags tagset="stts">
<tag tokenIDs="w1">PWAV</tag>
<tag tokenIDs="w2">ADJD</tag>
<tag tokenIDs="w3">$.</tag>
</POStags>
<orthography>
<correction tokenIDs="w2" operation="replace">öde</correction>
</orthography>
</TextCorpus>
</D-Spin>
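The layers in the example refer to tokens by ID, so the per-token annotations can be joined on the 'ID' and 'tokenIDs' attributes. The following sketch does this with XML::LibXML directly, independently of this module. It uses an abridged copy of the example in which "ö" is written as the character reference &#xF6; to keep the script ASCII-only, and it assumes that each 'tokenIDs' attribute names exactly one token, as in the example.

```perl
use strict;
use warnings;
use XML::LibXML;

##-- abridged copy of the example document
my $tcf = <<'EOXML';
<?xml version="1.0" encoding="UTF-8"?>
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
 <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
  <tokens>
   <token ID="w1">wie</token>
   <token ID="w2">oede</token>
   <token ID="w3">!</token>
  </tokens>
  <lemmas>
   <lemma tokenIDs="w1">wie</lemma>
   <lemma tokenIDs="w2">&#xF6;de</lemma>
   <lemma tokenIDs="w3">!</lemma>
  </lemmas>
  <POStags tagset="stts">
   <tag tokenIDs="w1">PWAV</tag>
   <tag tokenIDs="w2">ADJD</tag>
   <tag tokenIDs="w3">$.</tag>
  </POStags>
 </TextCorpus>
</D-Spin>
EOXML

##-- parse, and bind a prefix to the TextCorpus namespace for XPath queries
my $xdoc = XML::LibXML->load_xml(string => $tcf);
my $xpc  = XML::LibXML::XPathContext->new($xdoc);
$xpc->registerNs(tc => 'http://www.dspin.de/data/textcorpus');

##-- index lemmas and tags by token ID
my (%lemma, %tag);
$lemma{$_->getAttribute('tokenIDs')} = $_->textContent
  foreach $xpc->findnodes('//tc:lemmas/tc:lemma');
$tag{$_->getAttribute('tokenIDs')} = $_->textContent
  foreach $xpc->findnodes('//tc:POStags/tc:tag');

##-- one row per token: [ID, surface form, lemma, tag]
my @rows = map {
  my $id = $_->getAttribute('ID');
  [$id, $_->textContent, $lemma{$id}, $tag{$id}]
} $xpc->findnodes('//tc:tokens/tc:token');

binmode(STDOUT, ':encoding(UTF-8)');
print join("\t", @$_), "\n" foreach @rows;
```

The XPath expressions need the registered prefix because every layer element is in the TextCorpus default namespace; unprefixed names such as '//token' would match nothing.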
If the input contains a 'text' layer but no 'tokens' or 'sentences' layers, the 'text' layer will be tokenized using the DTA::CAB::Format::Raw class.
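The condition above can be checked on a raw document before parsing. The sketch below is not this module's implementation: needs_tokenization() is a hypothetical helper, and it reads the condition as "a 'text' layer is present and neither a 'tokens' nor a 'sentences' layer is", which is an assumption about the intended meaning.

```perl
use strict;
use warnings;
use XML::LibXML;

##-- hypothetical helper: true if $tcf has a 'text' layer but no 'tokens' or 'sentences' layer
sub needs_tokenization {
  my $tcf  = shift;
  my $xdoc = XML::LibXML->load_xml(string => $tcf);
  my $xpc  = XML::LibXML::XPathContext->new($xdoc);
  $xpc->registerNs(tc => 'http://www.dspin.de/data/textcorpus');

  ##-- record which of the three layers are present
  my %has;
  foreach my $layer (qw(text tokens sentences)) {
    my @nodes = $xpc->findnodes("//tc:TextCorpus/tc:$layer");
    $has{$layer} = @nodes ? 1 : 0;
  }
  return ($has{text} && !$has{tokens} && !$has{sentences}) ? 1 : 0;
}

##-- two small test documents sharing the same wrapper
my $head = '<D-Spin xmlns="http://www.dspin.de/data" version="0.4">'
         . '<TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">';
my $tail = '</TextCorpus></D-Spin>';
my $text_only = $head . '<text>wie oede!</text>' . $tail;
my $tokenized = $head . '<text>wie oede!</text>'
              . '<tokens><token ID="w1">wie</token></tokens>' . $tail;

foreach my $doc ($text_only, $tokenized) {
  print needs_tokenization($doc) ? "tokenize 'text' layer\n" : "use existing layers\n";
}
```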
Bryan Jurish <moocow@cpan.org>
Copyright (C) 2015-2019 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.24.1 or, at your option, any later version of Perl 5 you may have available.
dta-cab-analyze.perl(1), dta-cab-convert.perl(1), dta-cab-http-server.perl(1), dta-cab-http-client.perl(1), dta-cab-xmlrpc-server.perl(1), dta-cab-xmlrpc-client.perl(1), DTA::CAB::Server(3pm), DTA::CAB::Client(3pm), DTA::CAB::Format(3pm), DTA::CAB(3pm), perl(1), ...