DTA::CAB::Format - Base class for DTA::CAB::Datum I/O
use DTA::CAB::Format;
##========================================================================
## Constructors etc.
$fmt = $CLASS_OR_OBJ->new(%args);
$fmt = $CLASS->newFormat($class_or_class_suffix, %opts);
$fmt = $CLASS->newReader(%opts);
$fmt = $CLASS->newWriter(%opts);
##========================================================================
## Methods: Global Format Registry
\%classReg_or_undef = $CLASS_OR_OBJ->registerFormat(%classRegOptions);
\%classReg_or_undef = $CLASS_OR_OBJ->guessFilenameFormat($filename);
$readerClass_or_undef = $CLASS_OR_OBJ->fileReaderClass($filename);
$writerClass_or_undef = $CLASS_OR_OBJ->fileWriterClass($filename);
$class_or_undef = $CLASS_OR_OBJ->shortReaderClass($shortname);
$class_or_undef = $CLASS_OR_OBJ->shortWriterClass($shortname);
$registered_or_undef = $CLASS_OR_OBJ->short2reg($shortname);
$registered_or_undef = $CLASS_OR_OBJ->base2reg($basename);
##========================================================================
## Methods: Persistence
@keys = $class_or_obj->noSaveKeys();
##========================================================================
## Methods: MIME
$short = $fmt->shortName();
$type = $fmt->mimeType();
$ext = $fmt->defaultExtension();
##========================================================================
## Methods: Input
$fmt = $fmt->close();
$fmt = $fmt->fromString(\$string);
$fmt = $fmt->fromFile($filename);
$fmt = $fmt->fromFh($fh);
$doc = $fmt->parseDocument();
$doc = $fmt->parseString(\$str);
$doc = $fmt->parseFile($filename);
$doc = $fmt->parseFh($fh);
$doc = $fmt->forceDocument($reference);
##========================================================================
## Methods: Output
$lvl = $fmt->formatLevel();
$fmt = $fmt->flush();
$fmt_or_undef = $fmt->toString(\$str, $formatLevel);
$fmt_or_undef = $fmt->toFile($filename_or_handle, $formatLevel);
$fmt_or_undef = $fmt->toFh($fh, $formatLevel);
$fmt = $fmt->putDocument($doc);
$fmt = $fmt->putDocumentRaw($doc);
DTA::CAB::Format is an abstract base class and API specification for objects implementing an I/O format for the DTA::CAB::Datum subhierarchy in general, and for DTA::CAB::Document objects in particular.
Each I/O format (subclass) has a characteristic abstract `base class' as well as optional `reader' and `writer' subclasses which perform the actual I/O (although in the current implementation, all reader/writer classes are identical with their respective base classes). Individual formats may be invoked either directly by their respective classes (SUBCLASS->new(), etc.), or by means of the global DTA::CAB::Format::Registry object $REG ("registerFormat", "newFormat", "newReader", "newWriter", etc.).
See "SUBCLASSES" for a list of common built-in formats and their registry data.
DTA::CAB::Format inherits from DTA::CAB::Persistent and DTA::CAB::Logger.
Default class returned by "newFormat"() if no known class is specified.
Default global format registry, a DTA::CAB::Format::Registry object used by "registerFormat", "newFormat", etc.
$fmt = CLASS_OR_OBJ->new(%args);
Constructor.
%args, %$fmt:
##-- DTA::CAB::Format: common
##
##-- DTA::CAB::Format: input parsing
#(none)
##
##-- DTA::CAB::Format: output formatting
level => $formatLevel, ##-- formatting level, where applicable
outbuf => $stringBuffer, ##-- output buffer, where applicable
$fmt = CLASS->newFormat($class_or_class_suffix, %opts);
Wrapper for "new"() which allows short class suffixes to be passed in as format names.
$fmt = CLASS->newReader(%opts);
Wrapper for DTA::CAB::Format::Registry::newReader which accepts %opts:
class => $class, ##-- classname or DTA::CAB::Format:: suffix
file => $filename, ##-- attempt to guess format from filename
$fmt = CLASS->newWriter(%opts);
Wrapper for DTA::CAB::Format::Registry::newWriter which accepts %opts:
class => $class, ##-- classname or DTA::CAB::Format:: suffix
file => $filename, ##-- attempt to guess format from filename
The global format registry lives in the package variable $REG. The following methods are backwards-compatible wrappers for method calls to this registry object.
\%registered = $CLASS_OR_OBJ->registerFormat(%opts);
Registers a new format subclass; wrapper for DTA::CAB::Format::Registry::register().
\%registered_or_undef = $CLASS_OR_OBJ->guessFilenameFormat($filename);
Returns registration record for most recently registered format subclass whose filenameRegex matches $filename. Wrapper for DTA::CAB::Format::Registry::guessFilenameFormat().
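The "most recently registered match wins" rule can be sketched in plain Perl without loading DTA::CAB at all. This is only an illustration under stated assumptions, not the registry's actual code: the helpers register_format() and guess_filename_format() and the class My::Format::JSONish are hypothetical, and only the record keys name, short and filenameRegex are taken from this document.

```perl
use strict;
use warnings;

## Hypothetical stand-in for the registry: registration records, oldest first.
my @registry;

## register_format(%opts): store a registration record and return it
sub register_format {
  my %opts = @_;
  push @registry, {%opts};
  return $registry[-1];
}

## guess_filename_format($filename): newest record whose filenameRegex matches
sub guess_filename_format {
  my ($filename) = @_;
  foreach my $reg (reverse @registry) {
    next if !defined $reg->{filenameRegex};
    return $reg if $filename =~ $reg->{filenameRegex};
  }
  return;  ##-- no match: undef in scalar context
}

register_format(name=>'DTA::CAB::Format::JSON', short=>'json', filenameRegex=>qr/\.(?i:json|jsn)$/);
register_format(name=>'DTA::CAB::Format::YAML', short=>'yaml', filenameRegex=>qr/\.(?i:yaml|yml)$/);
## hypothetical later registration which also claims *.json
register_format(name=>'My::Format::JSONish', short=>'jsonish', filenameRegex=>qr/\.(?i:json)$/);

my $reg  = guess_filename_format('corpus.JSON');
my $name = $reg ? $reg->{name} : '(none)';
print "$name\n";
```

Because the list is scanned newest-first, the later registration shadows the earlier one for *.json, while *.jsn files still resolve to the earlier record.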
$readerClass_or_undef = $CLASS_OR_OBJ->fileReaderClass($filename);
Attempts to guess reader class name from $filename. Wrapper for DTA::CAB::Format::Registry::fileReaderClass().
$writerClass_or_undef = $CLASS_OR_OBJ->fileWriterClass($filename);
Attempts to guess writer class name from $filename. Wrapper for DTA::CAB::Format::Registry::fileWriterClass().
$registered_or_undef = $CLASS_OR_OBJ->short2reg($shortname);
Gets the most recent subclass registry HASH ref for the short class name $shortname. Wrapper for DTA::CAB::Format::Registry::short2reg().
$registered_or_undef = $CLASS_OR_OBJ->base2reg($basename);
Gets the most recent subclass registry HASH ref for the class basename $basename. Wrapper for DTA::CAB::Format::Registry::base2reg().
@keys = $class_or_obj->noSaveKeys();
Returns list of keys not to be saved. This implementation ignores the key outbuf, which is used by many writer subclasses.
$short = $fmt->shortName();
Get short name for $fmt. Default just returns lower-cased DTA::CAB::Format:: class suffix. Short names are all lower-case by default.
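A minimal sketch of the documented default, assuming it amounts to stripping the DTA::CAB::Format:: prefix from the class name and lower-casing the rest. The helper default_short_name() is hypothetical and is not part of the module.

```perl
use strict;
use warnings;

## default_short_name($class_or_obj): lower-cased DTA::CAB::Format:: class suffix
sub default_short_name {
  my ($class_or_obj) = @_;
  my $class = ref($class_or_obj) || $class_or_obj;  ##-- accept object or class name
  (my $suffix = $class) =~ s/^DTA::CAB::Format:://;
  return lc($suffix);
}

foreach my $class (qw(DTA::CAB::Format::JSON DTA::CAB::Format::XmlTokWrapFast)) {
  my $short = default_short_name($class);
  print "$class => $short\n";
}
```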
$type = $fmt->mimeType();
Returns MIME type for $fmt. Default returns 'text/plain'.
$ext = $fmt->defaultExtension();
Returns default filename extension for $fmt (default='.cab').
$fmt = $fmt->close();
$fmt = $fmt->close($savetmp);
Close current input source, if any. Default implementation calls $fmt->{tmpfh}->close() iff available and $savetmp is false (default). Always deletes @$fmt{qw(fh doc)}.
$fmt = $fmt->fromString(\$string);
Select input from the string $string. Default implementation calls $fmt->fromFh($fmt->{tmpfh}=$new_fh).
$fmt = $fmt->fromFile($filename);
Select input from file $filename. Default implementation calls $fmt->fromFh($fmt->{tmpfh}=$new_fh).
$fmt = $fmt->fromFh($fh);
Select input from open filehandle $fh. Default implementation just calls $fmt->close(1) and sets $fmt->{fh}=$fh.
$fmt = $fmt->fromFh_str($handle);
Alternate fromFh() implementation which slurps contents of $handle and calls $fmt->fromString(\$str).
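The two directions described above (string to filehandle for fromString(), filehandle to string for fromFh_str()) can both be illustrated with core Perl. The sketch below assumes an in-memory filehandle opened on a scalar reference. Whether the default implementation creates its temporary handle in exactly this way is not stated here, so treat it as one plausible technique rather than the module's code.

```perl
use strict;
use warnings;

my $string = "first line\nsecond line\n";

## string -> filehandle: open a read-only in-memory handle on the string
open(my $fh, '<', \$string) or die "cannot open in-memory handle: $!";
my @lines = <$fh>;   ##-- line-wise reading works as for any other handle
close($fh);

## filehandle -> string: slurp everything the handle has to offer
open($fh, '<', \$string) or die "cannot open in-memory handle: $!";
my $buf = do { local $/; <$fh> };   ##-- undefined $/ means "read to EOF"
close($fh);

printf "%d lines read, slurped buffer %s the original string\n",
  scalar(@lines), ($buf eq $string ? 'equals' : 'differs from');
```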
$doc = $fmt->parseDocument();
Parse document from currently selected input source.
$doc = $fmt->parseString($str);
Wrapper for $fmt->fromString($str)->parseDocument().
$doc = $fmt->parseFile($filename_or_fh);
Wrapper for $fmt->fromFile($filename_or_fh)->parseDocument().
$doc = $fmt->parseFh($fh);
Wrapper for $fmt->fromFh($fh)->parseDocument().
$doc = $fmt->forceDocument($reference);
Attempt to tweak $reference into a DTA::CAB::Document. This is a slightly more in-depth version of DTA::CAB::Datum::toDocument(). Current supported $reference forms are:
returned literally
returns a new document with a single sentence $reference.
returns a new document with a single token $reference.
returns a new document with a single token whose 'text' key is $reference.
returns a bless()ed $reference as a DTA::CAB::Document.
returns a new document with the single sentence $reference
returns a new document with the single token $reference
returns a new document with a single sentence whose 'tokens' field is set to $reference.
will cause a warning to be emitted and $reference to be returned as-is.
$lvl = $fmt->formatLevel();
$fmt = $fmt->formatLevel($level)
Get/set output formatting level.
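The two call forms differ in their return value: the getter returns the level, the setter returns the format object so that calls can be chained. A minimal sketch of such a combined accessor follows, using a hypothetical stand-in class My::Sketch::Leveled. Only the level key is taken from the constructor arguments documented above.

```perl
package My::Sketch::Leveled;
use strict;
use warnings;

sub new {
  my ($class, %args) = @_;
  return bless { level => 0, %args }, $class;
}

## $lvl = $fmt->formatLevel();  $fmt = $fmt->formatLevel($level);
sub formatLevel {
  my $fmt = shift;
  return $fmt->{level} if !@_;   ##-- getter: no argument given
  $fmt->{level} = shift;         ##-- setter: store the new level
  return $fmt;                   ##-- ... and return the object for chaining
}

package main;

my $fmt = My::Sketch::Leveled->new();
my $lvl = $fmt->formatLevel(2)->formatLevel();
print "level=$lvl\n";
```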
$fmt = $fmt->flush();
Flush any buffered output to selected output source. Default implementation deletes $fmt->{outbuf} and calls $fmt->{fh}->flush() if available.
$fmt = $fmt->toString(\$str);
$fmt = $fmt->toString(\$str,$formatLevel)
Select output to byte-string $str. Default implementation just wraps $fmt->toFh($fmt->{tmpfh}=$new_fh, $level).
$fmt_or_undef = $fmt->toString_buf(\$str)
Alternate toString() implementation which sets $str=$fmt->{outbuf}.
$fmt_or_undef = $fmt->toFile($filename_or_handle, $formatLevel);
Select output to named file $filename. Default implementation just wraps $fmt->toFh($fmt->{tmpfh}=$new_fh, $level).
$fmt_or_undef = $fmt->toFh($fh,$formatLevel);
Select output to an open filehandle $fh. Default implementation just calls $fmt->formatLevel($level) and sets $fmt->{fh}=$fh.
$fmt = $fmt->putToken($tok);
Append a token to the selected output sink.
Should be non-destructive for $tok.
No default implementation, but default implementations of other methods assume output is concatenated onto $fmt->{outbuf}.
$fmt = $fmt->putTokenRaw($tok)
Copy-by-reference version of "putToken". Default implementation just calls $fmt->putToken($tok).
$fmt = $fmt->putSentence($sent)
Append a sentence to the selected output sink.
Should be non-destructive for $sent.
Default implementation just calls $fmt->putToken() for each token in $sent and appends one additional "\n" to $fmt->{outbuf}.
$fmt = $fmt->putSentenceRaw($sent)
Copy-by-reference version of "putSentence". Default implementation just calls "putSentence".
$fmt = $fmt->putDocument($doc);
Append document contents to the selected output sink.
Should be non-destructive for $doc.
Default implementation just calls $fmt->putSentence() for each sentence in $doc.
$fmt = $fmt->putDocumentRaw($doc);
Copy-by-reference version of "putDocument".
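How the put*() defaults build on each other can be sketched with a small stand-alone class. This is a hypothetical mock (My::Sketch::Writer), not a real DTA::CAB::Format subclass. It assumes that sentences live under a document's body key (an assumption, not stated in this document), that tokens live under a sentence's tokens key, and that each token has a text key. putToken() here simply writes one token text per line.

```perl
package My::Sketch::Writer;
use strict;
use warnings;

sub new {
  my ($class, %args) = @_;
  return bless { outbuf => '', %args }, $class;
}

## putToken: format-specific; here: token text plus newline onto outbuf
sub putToken {
  my ($fmt, $tok) = @_;
  $fmt->{outbuf} .= $tok->{text} . "\n";
  return $fmt;
}

## putSentence: every token, then one additional "\n"
sub putSentence {
  my ($fmt, $sent) = @_;
  $fmt->putToken($_) foreach (@{ $sent->{tokens} });
  $fmt->{outbuf} .= "\n";
  return $fmt;
}

## putDocument: every sentence ('body' key is an assumption of this sketch)
sub putDocument {
  my ($fmt, $doc) = @_;
  $fmt->putSentence($_) foreach (@{ $doc->{body} });
  return $fmt;
}

## toString_buf: copy the accumulated buffer into the caller's string
sub toString_buf {
  my ($fmt, $strref) = @_;
  $$strref = $fmt->{outbuf};
  return $fmt;
}

package main;

my $doc = { body => [
  { tokens => [ { text => 'Hello' }, { text => 'world' } ] },
  { tokens => [ { text => 'Bye' } ] },
] };
my $out = '';
My::Sketch::Writer->new()->putDocument($doc)->toString_buf(\$out);
print $out;
```

Since every method returns the format object, calls chain naturally, and none of the methods modifies $doc, in line with the non-destructiveness requirement above.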
The following formats are provided by the default distribution. In some cases, external dependencies are also required which may not be available on all systems.
Just a convenience package: load all built-in DTA::CAB::Format subclasses.
Formatter for runtime term expansion, for use e.g. with DDC Cab Expander. Registered as:
name=>__PACKAGE__, short=>'xl', filenameRegex=>qr/\.(?i:xl|xlist|l|lst)$/
Datum parser|formatter for "vertical" text conforming to the CONLL-U format, with optional special handling for additional MISC fields, including json=JSON for embedding DTA::CAB::Format::TJ CAB-token structure. Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:conllu|conll[_-]u|cab[\.-]connlu|cab[\.-]conll[\.-]u)$/
Aliases: conllu conll-u cab-conllu cab-conll-u
Abstract datum parser|formatter for JSON I/O. Transparently wraps one of the DTA::CAB::Format::JSON::XS or DTA::CAB::Format::JSON::Syck classes, depending on the availability of the underlying Perl modules (JSON::XS and JSON::Syck, respectively). If you have the JSON::XS module installed, this module provides the fastest I/O of all available human-readable format classes. Registered as:
name=>__PACKAGE__, short=>'json', filenameRegex=>qr/\.(?i:json|jsn)$/
Formatter for runtime term lemmatization, for use e.g. with DDC Cab Expander. By default, returns all lemmata for function word input tokens (whose tag matches the regex /^(?:[CKP\$]|A[PR]|V[AM])/), otherwise only the "best" lemma. Registered as:
(name=>__PACKAGE__, short=>$_, filenameRegex=>qr/\.(?i:ll|llist|lemmas|lemmata)/)
foreach (qw(LemmaList llist ll lemma))
A variant which returns all known lemmata for each input token is registered as:
(name=>__PACKAGE__, short=>$_, opts=>{cctagre=>''})
foreach (qw(LemmaListAll LemmasAll llist-all ll-all lla lemmas lemmata))
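To see which tags the function-word regex quoted above for the default lemma-list format selects, it can be tried on a few sample tags. The tags below are illustrative STTS-style examples chosen for this sketch (the document does not name the tagset), and is_function_word_tag() is a hypothetical helper.

```perl
use strict;
use warnings;

## the function-word tag regex quoted in the format description
my $function_word_re = qr/^(?:[CKP\$]|A[PR]|V[AM])/;

sub is_function_word_tag {
  my ($tag) = @_;
  return ($tag =~ $function_word_re) ? 1 : 0;
}

## sample tags (assumed STTS-style): articles, adpositions, conjunctions, ...
foreach my $tag (qw(ART APPR KON PPER VAFIN VMFIN $. NN VVFIN ADJA)) {
  my $what = is_function_word_tag($tag) ? 'all lemmata' : 'best lemma only';
  printf "%-6s %s\n", $tag, $what;
}
```

Note the escaped \$ inside the character class: without the backslash, Perl would try to interpolate a variable there.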
Null-op parser/formatter for debugging and testing purposes. Registered as:
name=>__PACKAGE__
Datum parser|formatter: perl code via Data::Dumper, eval(). Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:prl|pl|perl|dump)$/
Abstract format for reading raw untokenized text and writing a simple flat list of canonical forms; wraps DTA::CAB::Format::Raw::Waste by default. Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:raw)$/
Input-only format for reading raw untokenized text and analyzing it over HTTP using a remote WASTE FastCGI interface, registered as:
name=>__PACKAGE__, short=>'raw-http', filenameRegex=>qr/\.(?i:raw-http|txt-http)$/
Input-only format for reading raw untokenized text and analyzing it using simple pure-perl heuristics. Registered as:
name=>__PACKAGE__, short=>'raw-perl', filenameRegex=>qr/\.(?i:raw-perl|txt-perl)$/
Input-only format for reading raw untokenized text and analyzing it using the Moot::Waste module, registered as:
name=>__PACKAGE__, short=>'raw-waste', filenameRegex=>qr/\.(?i:raw-waste|txt-waste)$/
Binary datum parser|formatter using the Storable module. Very fast, but neither human-readable nor easily portable beyond Perl. Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:sto|bin)$/
Datum parser|formatter for SynCoPe named entity recognizer -tab_input mode. Registered as:
name=>__PACKAGE__, short=>'syncope-csv', filenameRegex=>qr/\.(?i:syn(?:cope)?[-\.](?:csv|tsv|tab)|)$/
Datum parser|formatter for CLARIN-D TCF XML. Handles annotation layers tokens, sentences, orthography, postags, and lemmas. Registered as:
(name=>__PACKAGE__, filenameRegex=>qr/\.(?i:(?:tcf[\.\-_]?xml)|(?:tcf))$/)
(name=>__PACKAGE__, short=>$_, opts=>{tcflayers=>'tokens sentences orthography'}) foreach (qw(tcf-orth tcf-web))
(name=>__PACKAGE__, short=>$_, opts=>{tcflayers=>'tokens sentences orthography postags lemmas'}) foreach (qw(tcf tcf-xml tcfxml full-tcf xtcf))
Datum parser|formatter: for raw un-tokenized TEI XML (with or without //c elements) using DTA::TokWrap. Any //s or //w elements in the input will be IGNORED and input will be (re-)tokenized. Output files are themselves parseable by DTA::CAB::Format::TEIws. Registered as:
(name=>__PACKAGE__, filenameRegex=>qr/\.(?i:(?:c|chr|txt|tei(?:[\.\-_]?p[45])?)[\.\-_]xml|xml)$/)
(name=>__PACKAGE__, short=>$_) foreach (qw(chr-xml c-xml cxml tei-xml teixml tei xml))
By default, this module uses DTA::CAB::Format::XmlTokWrap to format the low-level document data, and splices the result back into the original TEI document. The following additional aliases are provided for using the DTA::CAB::Format::XmlTokWrapFast module to format the low-level flat token data (faster but not as flexible as the default):
(name=>__PACKAGE__, short=>$_, opts=>{txmlfmt=>'DTA::CAB::Format::XmlTokWrapFast'})
foreach (qw(fast-tei-xml ftei-xml fteixml ftei))
Additionally, the following aliases are provided for using the DTA::CAB::Format::XmlLing module to format the low-level flat token data using TEI att.linguistic conventions:
(name=>__PACKAGE__, short=>$_, opts=>{'att.linguistic'=>1})
foreach (qw(ling-tei-xml ltei-xml lteixml ltei tei-ling tei+ling teiling))
Datum parser|formatter: for TEI XML pre-tokenized into (possibly fragmented) //w and //s elements, as output by DTA::TokWrap. Registered as:
(name=>__PACKAGE__, filenameRegex=>qr/\.(?i:(?:spliced|tei[\.\-\+]?ws?|wst?)[\.\-]xml)$/)
(name=>__PACKAGE__, short=>$_) foreach (qw(tei-ws tei+ws tei+w tei-w teiw wst-xml wstxml teiws-xml));
By default, this module uses DTA::CAB::Format::XmlTokWrap to format the low-level document data, and splices the result back into the original TEI document. The following aliases are provided for using the DTA::CAB::Format::XmlLing module to format the low-level flat token data using TEI att.linguistic conventions:
(name=>__PACKAGE__, short=>$_, opts=>{'att.linguistic'=>1})
foreach (qw(lteiws teilws teiwsl ltei-ws ltei+ws tei+w ltei-w lteiw lwst-xml lwstxml lteiws-xml),
qw(ling-tei-ws tei+ling+ws tei+ws+ling teiws-ling-xml teiws+ling-xml))
Datum parser|formatter: verbose human-readable text. Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:txt|text|cab\-txt|cab\-text)$/
Datum parser|formatter: "vertical" text, one token per line, with a single TAB-separated attribute field encoding token data as JSON. Registered as:
(name=>__PACKAGE__, filenameRegex=>qr/\.(?i:tj|tjson|cab\-tj|cab\-tjson)$/);
Datum parser|formatter: "vertical" text, one token per line, TAB-separated attribute fields with conventional attribute-name prefixes. Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:t|tt|ttt|cab\-t|cab\-tt|cab\-ttt)$/
Abstract datum parser|formatter for YAML I/O. Transparently wraps one of the DTA::CAB::Format::YAML::XS, DTA::CAB::Format::YAML::Syck, or DTA::CAB::Format::YAML::Lite classes, depending on the availability of the underlying Perl modules (YAML::XS, YAML::Syck, and YAML::Lite, respectively). Registered as:
name=>__PACKAGE__, short=>'yaml', filenameRegex=>qr/\.(?i:yaml|yml)$/
Datum parser|formatter: XML: abstract base class.
Datum parser|formatter: minimalistic flat TokWrap-like XML using only TEI att.linguistic attributes. Based on DTA::CAB::Format::XmlTokWrapFast, the XmlLing parser reads and writes only IDs and the TEI att.linguistic attributes (http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.linguistic.html). Registered as:
(name=>__PACKAGE__, filenameRegex=>qr/(?:\.(?i:(?:ling|l[tuws])(?:\.?)xml))$/)
(name=>__PACKAGE__, short=>$_) foreach (qw(ltxml lxml ling-xml lt-xml ltwxml ltw-xml))
Datum parser|formatter: XML (native). Nearly compatible with .t.xml files as created by dta-tokwrap.perl(1). Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:xml\-native|xml\-dta\-cab|(?:dta[\-\._]cab[\-\._]xml)|xml)$/
and aliased as:
name=>__PACKAGE__, short=>'xml'
Datum parser|formatter: XML (perl-like). Not really recommended. Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:xml(?:\-?)perl|perl(?:[\-\.]?)xml)$/
Datum parser|formatter: XML-RPC data structures using RPC::XML. Much too bloated to be of any real practical use. Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:xml(?:\-?)rpc|rpc(?:[\-\.]?)xml)$/
Datum parser|formatter(s): XML as read/written by DTA::TokWrap. Registered as:
(name=>__PACKAGE__, filenameRegex=>qr/\.(?i:[tuws]\.?xml)$/)
(name=>__PACKAGE__, short=>$_) foreach (qw(txml t-xml twxml tw-xml))
Datum parser|formatter(s): XML as read/written by DTA::TokWrap. Unlike the XmlTokWrap format, the XmlTokWrapFast class does not read and/or write the full document structure, but rather restricts itself to a finite hard-coded subset of the most commonly used document-, sentence-, and token-level attributes. The input parser uses the expat-based XML::Parser module, which usually results in much faster and memory-friendlier document parsing than offered by the XmlTokWrap class. Registered as:
(name=>__PACKAGE__, filenameRegex=>qr/(?:\.(?i:f[tuws](?:\.?)xml))$/);
(name=>__PACKAGE__, short=>$_) foreach (qw(ftxml ft-xml ftwxml ftw-xml))
Bryan Jurish <moocow@cpan.org>
Copyright (C) 2009-2020 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.24.1 or, at your option, any later version of Perl 5 you may have available.