dtatw-sanitize-header.perl - make DDC/DTA-friendly TEI-headers
dtatw-sanitize-header.perl [OPTIONS] XML_HEADER_FILE
General Options:
-help # this help message
-verbose LEVEL # set verbosity level (0<=LEVEL<=2)
-quiet # alias for -verbose=0
-dta , -foreign # do/don't warn about strict DTA header compliance (default=do)
-max-bibl-length LEN # trim bibl fields to maximum length LEN (default=256)
Auxiliary DB Options: # optional BASENAME-keyed JSON-metadata Berkeley DB
-aux-db DBFILE # read auxiliary DB from DBFILE (default=none)
-aux-xpath XPATH # append <idno type="KEY"> elements to XPATH (default='fileDesc[@n="ddc-aux"]')
XPath Options:
-xpath ATTR=XPATH # prepend XPATH for attribute ATTR
-default ATTR=VAL # default values (for textClass* attributes)
I/O Options:
-blanks , -noblanks # do/don't keep 'ignorable' whitespace in XML_HEADER_FILE file (default=don't)
-base BASENAME # use BASENAME to auto-compute field names (default=basename(XML_HEADER_FILE))
-output FILE # specify output file (default='-' (STDOUT))
-help
Display a brief usage summary and exit.
-verbose LEVEL
Set verbosity level; values for LEVEL are:
0: silent
1: warnings only
2: warnings and progress messages
-quiet
Alias for -verbose=0.
-base BASENAME
Set basename for generated header fields; default is the basename (non-directory portion) of XML_HEADER_FILE up to but not including the first dot (".") character, if any. In default -dta mode, everything after the first dot character in BASENAME will be truncated even if you specify this option; in -foreign mode, dots in basenames passed in via this option are allowed.
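The basename rule above can be sketched in a few lines. This is an illustration in Python rather than the script's own Perl; the function name `default_basename` and its parameters are hypothetical and only mirror the behaviour described for -base, -dta and -foreign.

```python
import os
from typing import Optional


def default_basename(header_path: str,
                     dta_mode: bool = True,
                     override: Optional[str] = None) -> str:
    """Hypothetical helper mirroring the documented BASENAME rules."""
    if override is None:
        # Default: non-directory portion of the file name, up to the first dot.
        return os.path.basename(header_path).split(".", 1)[0]
    # An explicit BASENAME is truncated at the first dot in DTA mode only.
    return override.split(".", 1)[0] if dta_mode else override


print(default_basename("/data/goethe_faust_1808.TEI-P5.xml"))
print(default_basename("x.xml", dta_mode=False, override="foo.bar"))
```

The file name `goethe_faust_1808.TEI-P5.xml` is an invented example, not a path the script requires.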
-dta , -nodta
Do/don't run with DTA-specific heuristics and attempt to enforce DTA-header compliance (default: do).
-foreign
Alias for -nodta.
-max-bibl-length LEN
Trim sanitized bibl fields to a maximum length of LEN characters (default=256).
You can optionally use a BASENAME-keyed JSON-metadata Berkeley DB file to automatically insert additional metadata fields into an existing header.
-aux-db DBFILE
Apply auxiliary metadata from Berkeley DB file DBFILE (default=none). Keys of DBFILE should be BASENAMEs as parsed from XML_HEADER_FILE or passed in via the -base option, and the associated values should be flat JSON objects whose keys are the names of metadata attributes for BASENAME and whose values are the values of those metadata attributes.
-aux-xpath XPATH
Append <idno type="KEY">VAL</idno> elements to XPATH (default='fileDesc[@n="ddc-aux"]') for auxiliary metadata attributes.
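To make the shape of the auxiliary data concrete, the following Python sketch turns one flat JSON object (the value stored under a BASENAME key) into <idno type="KEY">VAL</idno> children of a target element. It does not read a Berkeley DB file; the function name `append_aux_idnos` and the sample keys `genre` and `century` are assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET


def append_aux_idnos(parent: ET.Element, json_value: str) -> None:
    """Append one <idno type="KEY">VAL</idno> child per key of a flat JSON object."""
    meta = json.loads(json_value)
    for key, val in meta.items():
        idno = ET.SubElement(parent, "idno", {"type": key})
        idno.text = str(val)


# Stand-in for the default target element fileDesc[@n="ddc-aux"].
file_desc = ET.Element("fileDesc", {"n": "ddc-aux"})
# Hypothetical DB value for some BASENAME.
append_aux_idnos(file_desc, '{"genre": "Roman", "century": 18}')
print(ET.tostring(file_desc, encoding="unicode"))
```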
You can optionally specify source XPaths to override the defaults with the -xpath option.
-xpath ATTR=XPATH
Prepend XPATH to the builtin list of source XPaths for the attribute ATTR. Known attributes: author title date bibl shelfmark library dirname dtaid timestamp availability avail textClassDTA textClassDWDS textClassCorpus.
-default ATTR=VAL
Default value for attribute ATTR. Only used for textClass* attributes.
-blanks , -noblanks
Do/don't retain all whitespace in input file (default=don't).
-output FILE
Write output to FILE; default="-" (standard output).
Format output at libxml level LEVEL (default=1).
dtatw-sanitize-header.perl applies some parsing and encoding heuristics to a TEI-XML header file XML_HEADER_FILE in an attempt to ensure compliance with DTA/D* header conventions for subsequent DDC indexing. For each supported metadata attribute, a corresponding header record is first sought by means of a first-match-wins XPath list. If no existing header record is found, a default (possibly empty) value is heuristically assigned, and the resulting value is inserted into the header at a conventional XPath location.
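The first-match-wins lookup described above can be sketched as follows. This is a simplified Python illustration, not the script's Perl implementation: it assumes a header without XML namespaces, uses the limited XPath subset of `xml.etree.ElementTree`, and treats empty matches as "not found", which is an assumption. The helper name `first_match` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Sequence


def first_match(header: ET.Element, xpaths: Sequence[str], default: str = "") -> str:
    """Return the text of the first XPath that yields a non-empty node."""
    for xp in xpaths:
        node = header.find(xp)
        if node is not None and (node.text or "").strip():
            return node.text.strip()
    # No source XPath matched: fall back to the (possibly empty) default.
    return default


HEADER = ET.fromstring(
    "<teiHeader><fileDesc><titleStmt>"
    "<author>Goethe, Johann Wolfgang von</author>"
    "</titleStmt></fileDesc></teiHeader>"
)
AUTHOR_XPATHS = [
    'fileDesc/titleStmt/author[@n="ddc"]',  # canonical target, absent here
    "fileDesc/titleStmt/author",            # direct, un-formatted
]
print(first_match(HEADER, AUTHOR_XPATHS))
```

A -xpath ATTR=XPATH option would correspond to putting one more entry at the front of the list before the loop runs.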
The metadata attributes currently supported are listed below. Source XPaths in the list are specified relative to the root <teiHeader> element, and unless otherwise noted, the first source XPath listed is also the target XPath, which is guaranteed to exist in the output header on successful script completion.
See http://kaskade.dwds.de/dstar/doc/README.html#bibliographic_metadata_attributes for details on D* metadata attribute conventions.
author
XPath(s):
fileDesc/titleStmt/author[@n="ddc"] ##-- ddc: canonical target (formatted)
fileDesc/titleStmt/author ##-- new (direct, un-formatted)
fileDesc/sourceDesc/biblFull/titleStmt/author ##-- new (sourceDesc, un-formatted)
fileDesc/titleStmt/editor[string(@corresp)!="#DTACorpusPublisher"] ##-- new (direct, un-formatted)
fileDesc/sourceDesc/biblFull/titleStmt/editor[string(@corresp)!="#DTACorpusPublisher"] ##-- new (sourceDesc, un-formatted)
fileDesc/sourceDesc/listPerson[@type="searchNames"]/person/persName ##-- old
Heuristically parses and formats persName, surname, forename, and genName elements to a human-readable string. In DTA mode, defaults to the first component of the "_"-separated BASENAME.
title
XPath(s):
fileDesc/titleStmt/title[@type="main" or @type="sub" or @type="vol"] ##-- DTA-mode only
fileDesc/titleStmt/title[@type="ddc"] ##-- ddc: canonical target (formatted)
fileDesc/titleStmt/title[not(@type)]
sourceDesc[@id="orig"]/biblFull/titleStmt/title
sourceDesc[@id="scan"]/biblFull/titleStmt/title
sourceDesc[not(@id)]/biblFull/titleStmt/title
In DTA mode, heuristically parses and formats @type="main", @type="sub", @type="vol" elements to a human-readable string, and defaults to the second component of the "_"-separated BASENAME.
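The DTA-mode fallbacks for author and title both derive from the "_"-separated BASENAME. A minimal Python sketch, assuming the raw components are used as-is (the script may format them further); the function name `basename_defaults` and the sample basename are hypothetical.

```python
def basename_defaults(basename: str) -> dict:
    """Derive fallback author/title values from a "_"-separated BASENAME."""
    parts = basename.split("_")
    return {
        "author": parts[0],                             # first component
        "title": parts[1] if len(parts) > 1 else "",    # second component, if any
    }


print(basename_defaults("goethe_faust_1808"))
```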
date
XPath(s):
fileDesc/sourceDesc[@n="ddc"]/biblFull/publicationStmt/date[@type="pub"] ##-- ddc: canonical target
fileDesc/sourceDesc[@n="scan"]/biblFull/publicationStmt/date ##-- old:publDate
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="creation"]/supplied
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="creation"]
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="publication"]/supplied ##-- new:date (published, supplied)
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="publication"] ##-- new:date (published)
fileDesc/sourceDesc/biblFull/publicationStmt/date/supplied ##-- new:date (generic, supplied)
fileDesc/sourceDesc/biblFull/publicationStmt/date ##-- new:date (generic)
Heuristically trims everything but digits and hyphens from the extracted date-string. In DTA mode, defaults to the final component of the "_"-separated BASENAME.
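The date clean-up and its DTA-mode fallback can be illustrated in Python as below. The regular expression keeps only ASCII digits and hyphens, as described; the function names `sanitize_date` and `default_date` are hypothetical, and the sample inputs are invented.

```python
import re


def sanitize_date(raw: str) -> str:
    """Drop every character that is not an ASCII digit or a hyphen."""
    return re.sub(r"[^0-9-]", "", raw)


def default_date(basename: str) -> str:
    """DTA-mode fallback: final component of the "_"-separated BASENAME."""
    return basename.split("_")[-1]


print(sanitize_date("[ca. 1808]"))
print(default_date("goethe_faust_1808"))
```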
date of first publication
XPath(s):
fileDesc/sourceDesc[@n="ddc"]/biblFull/publicationStmt/date[@type="first"] ##-- ddc: canonical target
fileDesc/sourceDesc[@n="orig"]/biblFull/publicationStmt/date ##-- old: publDate
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="creation"]/supplied
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="creation"]
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="firstPublication"]/supplied ##-- new:date (first, supplied)
fileDesc/sourceDesc/biblFull/publicationStmt/date[@type="firstPublication"] ##-- new:date (first)
fileDesc/sourceDesc/biblFull/publicationStmt/date/supplied ##-- new:date (generic, supplied)
fileDesc/sourceDesc/biblFull/publicationStmt/date ##-- new:date (generic)
Heuristically trims everything but digits and hyphens from the extracted date-string. Defaults to the publication date (see above).
bibl
XPath(s):
fileDesc/sourceDesc[@n="ddc"]/bibl ##-- ddc:canonical target
fileDesc/sourceDesc[@n="orig"]/bibl ##-- old:firstBibl, target
fileDesc/sourceDesc[@n="scan"]/bibl ##-- old:publBibl
fileDesc/sourceDesc/bibl ##-- new|old:generic
Heuristically generated from author, title, and date if not set. Ensures that the first 2 XPaths are set in the output file.
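As a rough illustration of generating a bibl value from author, title, and date and then applying the -max-bibl-length limit, consider the Python sketch below. The "AUTHOR: TITLE. DATE" template is purely an assumption, since the actual format is not specified here, and plain truncation is likewise assumed; `make_bibl` is a hypothetical name.

```python
def make_bibl(author: str, title: str, date: str, max_len: int = 256) -> str:
    """Build an illustrative bibl string and trim it to max_len characters."""
    bibl = f"{author}: {title}. {date}"  # assumed template, not the script's
    return bibl[:max_len]                # assumed: simple truncation


print(make_bibl("Goethe, Johann Wolfgang von", "Faust", "1808"))
```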
shelfmark
XPath(s):
fileDesc/sourceDesc[@n="ddc"]/msDesc/msIdentifier/idno/idno[@type="shelfmark"] ##-- ddc: canonical target
fileDesc/sourceDesc[@n="ddc"]/msDesc/msIdentifier/idno[@type="shelfmark"] ##-- -2013-08-04
fileDesc/sourceDesc/msDesc/msIdentifier/idno/idno[@type="shelfmark"]
fileDesc/sourceDesc/msDesc/msIdentifier/idno[@type="shelfmark"] ##-- new (>=2012-07)
fileDesc/sourceDesc/biblFull/notesStmt/note[@type="location"]/ident[@type="shelfmark"] ##-- old (<2012-07)
library
XPath(s):
fileDesc/sourceDesc[@n="ddc"]/msDesc/msIdentifier/repository ##-- ddc: canonical target
fileDesc/sourceDesc/msDesc/msIdentifier/repository ##-- new
fileDesc/sourceDesc/biblFull/notesStmt/note[@type="location"]/name[@type="repository"] ##-- old
dirname
XPath(s):
fileDesc/publicationStmt[@n="ddc"]/idno[@type="basename"] ##-- new: canonical target
fileDesc/publicationStmt/idno/idno[@type="DTADirName"] ##-- (>=2013-09-04)
fileDesc/publicationStmt/idno[@type="DTADirName"] ##-- (>=2013-09-04)
fileDesc/publicationStmt/idno[@type="DTADIRNAME"] ##-- new (>=2012-07)
fileDesc/publicationStmt/idno[@type="DTADIR"] ##-- old (<2012-07)
Heuristically set to BASENAME if not found.
dtaid
XPath(s):
fileDesc/publicationStmt[@n="ddc"]/idno[@type="dtaid"] ##-- ddc: canonical target
fileDesc/publicationStmt/idno/idno[@type="DTAID"]
fileDesc/publicationStmt/idno[@type="DTAID"]
Defaults to "0" (zero) if unset.
timestamp
XPath(s):
fileDesc/publicationStmt/date[@type="ddc-timestamp"] ##-- ddc: canonical target
fileDesc/publicationStmt/date ##-- DTA mode only
If not set, defaults to the last modification time of XML_HEADER_FILE or, failing that, the current time.
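The timestamp fallback can be sketched in Python as follows. Two points are assumptions rather than documented behaviour: that the current time is used when the file's modification time cannot be read, and that the value is written as an ISO 8601 UTC string. The function name `default_timestamp` is hypothetical.

```python
import os
import time
from datetime import datetime, timezone


def default_timestamp(header_path: str) -> str:
    """Use the file's mtime if readable, else the current time (assumed rule)."""
    try:
        mtime = os.path.getmtime(header_path)
    except OSError:
        mtime = time.time()
    # Assumed output format: ISO 8601 in UTC.
    return datetime.fromtimestamp(mtime, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(default_timestamp("/nonexistent/header.xml"))
```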
availability
XPath(s):
fileDesc/publicationStmt/availability[@type="ddc"]
fileDesc/publicationStmt/availability
Defaults to "-" if unset.
avail
XPath(s):
fileDesc/publicationStmt/availability[@type="ddc_dwds"]
fileDesc/publicationStmt/availability/@n
Defaults to "-" if unset.
textClassDWDS
Source XPath(s):
profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dwds1main"]
profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dwds1sub"]
profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dwds2main"]
profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dwds2sub"]
profileDesc/textClass/keywords/term ##-- dwds keywords
Target XPath:
profileDesc/textClass/classCode[@scheme="ddcTextClassDWDS"]
textClassDTA
Source XPath(s):
profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dtamain"]
profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#dtasub"]
Target XPath:
profileDesc/textClass/classCode[@scheme="ddcTextClassDTA"]
textClassCorpus
Source XPath(s):
profileDesc/textClass/classCode[@scheme="http://www.deutschestextarchiv.de/doku/klassifikation#DTACorpus"]
Target XPath:
profileDesc/textClass/classCode[@scheme="ddcTextClassCorpus"]
Bryan Jurish <jurish@bbaw.de>