Taxi::Mysql::Loader - extendable full-text index using mysql: document loader


NAME

Taxi::Mysql::Loader - extendable full-text index using mysql: document loader



SYNOPSIS

 ##========================================================================
 ## PRELIMINARIES
 use Taxi::Mysql::Loader;
 ##========================================================================
 ## Constructors etc.
 $ldr = $CLASS_OR_OBJ->new(%args);
 $ldr = $ldr->clearData();
 $ldr = $ldr->clearDocumentData();
 ##========================================================================
 ## API: high-level: document parsing
 $ldr_or_undef = $ldr->parseString($srcXmlString);
 $ldr_or_undef = $ldr->parseFile($srcXmlFilename_or_fh);
 $ldr_or_undef = $ldr->parseDocument($srcDoc);
 $ldr = $ldr->prepare();
 $ldr = $ldr->finish();
 ##========================================================================
 ## API: high-level: document upload
 $ldr_or_undef = $ldr->parseAndUpload(@xml_filenames);
 ##========================================================================
 ## API: XML Parsing: Objects
 $parser = $ldr->parser();
 $xslt   = $ldr->xslt();
 $str    = $ldr->xslStr();
 $doc    = $ldr->xslDoc();
 $style  = $ldr->xslStyle();
 ##========================================================================
 ## API: XML Parsing: default stylesheet
 $xsl_fragment = $CLASS_OR_OBJ->xsl_ns_fragment();
 $xsl_str      = $ldr->defaultXslStr();
 ##========================================================================
 ## API: XML Parsing: XSL Functions
 \&closure = $ldr->xsl_func_filename();
 \&closure = $ldr->xsl_func_parseRow($tabName,$rowKey,%colName2Value);
 \&closure = $ldr->xsl_func_tolower();
 ##========================================================================
 ## API: Reference expansion
 $ldr = $ldr->expandData();
 ##========================================================================
 ## API: text file output
 $filename = $ldr->tableDataFilename($tabName);
 $ldr = $ldr->unlinkDataFiles();
 $ldr = $ldr->unlinkTableDataFile($tabName);
 $ldr = $ldr->truncateDataFiles();
 $ldr = $ldr->truncateTableDataFile($tabName);
 $ldr = $ldr->writeData();
 $ldr = $ldr->appendData();
 $ldr = $ldr->appendDocumentData();
 $ldr = $ldr->flushDocumentData();
 $ldr = $ldr->appendTableData($tabName);
 ##========================================================================
 ## API: upload to server
 %loadDataArgs = $ldr->loadDataArgs(%user_args);
 $bool         = $ldr->uploadDataFiles(%user_loadDataArgs);



DESCRIPTION

Taxi::Mysql::Loader is a class for parsing index-relevant information from an input corpus of XML documents, performing any preprocessing required on a set of generated text files, and uploading generated text files to a backend MySQL server.

Globals etc.

Variable: @ISA

Taxi::Mysql::Loader inherits from Taxi::Mysql::Base.

Constructors etc.

new
 $ldr = $CLASS_OR_OBJ->new(%args);

Object structure / recognized %args:

   {
    ##-- Source index
    index  => $index,         ##-- Taxi::Mysql object being loaded
    ##-- Text file I/O
    data_dir => $text_dir,    ##-- directory to save text files (default='.')
    data_ext => $extension,   ##-- text file extension (default='.dat')
    data_enc => $encoding,    ##-- data file encoding (default=$index->{dbEncoding})
    ##-- Document parsing
    parser => $xml_libxml,    ##-- see $ldr->parser()
    xslt   => $xml_libxslt,   ##-- see $ldr->xslt()
    xsl_style => $xsl_style,  ##-- XSL stylesheet (see $ldr->xslStyle())
    xsl_doc => $xsl_doc,      ##-- XSL doc (see $ldr->xslDoc())
    xsl_str => $xsl_str,      ##-- XSL source string (see $ldr->xslStr())
    ##-- dynamic data
    xsl_filename_value => $filename, ##-- for the XSL Perl.Taxi.Mysql.Loader:filename() function
    ##-- Parsed data
    data    => { $tableName=>\%tableRows, ... }, ##-- parsed tables
    maxid   => { $tableName=>$maxId, ... },      ##-- maximum numeric Ids for each table
   }

clearData
 $ldr = $ldr->clearData();

Clears all parsed data from the object.

clearDocumentData
 $ldr = $ldr->clearDocumentData();

Clears any document-local data from the object.

API: high-level: document parsing

parseString
 $ldr_or_undef = $ldr->parseString($srcXmlString);
 $ldr_or_undef = $ldr->parseString($srcXmlString, $srcName);

Parse an XML source document from a perl string. Calls parseDocument().

parseFile
 $ldr_or_undef = $ldr->parseFile($srcXmlFilename_or_fh);
 $ldr_or_undef = $ldr->parseFile($srcXmlFilename_or_fh, $srcName);

Parse an XML source document from a named file or perl filehandle. Calls parseDocument().

parseDocument
 $ldr_or_undef = $ldr->parseDocument($srcDoc);
 $ldr_or_undef = $ldr->parseDocument($srcDoc, $srcName);

Parse an XML source document from an in-memory XML::LibXML::Document object.

prepare
 $ldr = $ldr->prepare();

User hook to prepare loader for parsing documents. Default implementation does nothing.

finish
 $ldr = $ldr->finish();

Finish writing all data files and perform any post-processing required on the generated data. Default implementation calls the appendData(), clearData(), and analyzeDataFiles() methods.

API: high-level: document upload

parseAndUpload
 $ldr_or_undef = $ldr->parseAndUpload(@xml_filenames);

High-level method to parse and upload all files specified in @xml_filenames.
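The following is a minimal sketch of how such a parse-and-upload pass could be sequenced by hand using only the methods documented here. The helper name load_corpus() is hypothetical, and the order of calls (prepare, parseFile per file, finish, uploadDataFiles) is an assumption about a sensible workflow, not a description of what parseAndUpload() does internally.

```perl
use strict;
use warnings;

## Hypothetical helper, not part of Taxi::Mysql::Loader: drives any object
## that implements the documented loader methods.
sub load_corpus {
  my ($ldr, @xml_filenames) = @_;
  $ldr->prepare() or return undef;
  foreach my $file (@xml_filenames) {
    ##-- stop at the first file that fails to parse
    $ldr->parseFile($file) or return undef;
  }
  ##-- write remaining data files, then push them to the server
  $ldr->finish() or return undef;
  $ldr->uploadDataFiles() or return undef;
  return $ldr;
}

## Possible usage, assuming $index is an existing Taxi::Mysql object:
##   my $ldr = Taxi::Mysql::Loader->new(index => $index, data_dir => '.');
##   load_corpus($ldr, @ARGV) or die "load failed";
```

Stopping at the first failed file is a design choice of this sketch; a real driver might prefer to log the failure and continue.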

API: XML Parsing: Objects

parser
 $parser = $ldr->parser();

Underlying XML::LibXML object (parser): $ldr->{parser} or new object.

xslt
 $xslt = $ldr->xslt();

Underlying XML::LibXSLT object: $ldr->{xslt} or new object.

xslStr
 $str = $ldr->xslStr();

XSL Stylesheet string to be used for document parsing: $ldr->{xsl_str} or auto-generated string.

xslDoc
 $doc = $ldr->xslDoc();

XML::LibXML::Document object representing the XSL Stylesheet to be used for document parsing: $ldr->{xsl_doc} or $ldr->parser->parse_string($ldr->xslStr()).

xslStyle
 $style = $ldr->xslStyle();

XML::LibXSLT::Stylesheet object representing the stylesheet to be used for document parsing: $ldr->{xsl_style} or $ldr->xslt->parse_stylesheet($ldr->xslDoc()).
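The accessors above share one pattern: return a user-supplied object if present, otherwise build it from the next accessor down the chain. The sketch below illustrates that fallback chain with a hypothetical class My::LazyChain and plain hashes in place of real XML::LibXML objects. Caching the built value back into the object is an assumption of this sketch; the text above only says "or new object".

```perl
use strict;
use warnings;

package My::LazyChain;  ##-- hypothetical demo class, not part of Taxi::Mysql

sub new {
  my ($class, %args) = @_;
  return bless { builds => 0, %args }, $class;
}

## Stand-in for stylesheet generation; counts how often it runs
sub defaultXslStr {
  my $self = shift;
  $self->{builds}++;
  return '<stylesheet/>';
}

## xslStr(): user-supplied string, else the generated default
sub xslStr {
  my $self = shift;
  return $self->{xsl_str} //= $self->defaultXslStr();
}

## xslDoc(): user-supplied object, else one built from xslStr()
sub xslDoc {
  my $self = shift;
  return $self->{xsl_doc} //= { parsed_from => $self->xslStr() };
}

package main;

my $obj  = My::LazyChain->new();
my $doc1 = $obj->xslDoc();  ##-- builds string and document
my $doc2 = $obj->xslDoc();  ##-- returns the cached document
```

With this layout, a caller can override any single stage (for example by passing xsl_str to the constructor) and the later stages pick it up without further configuration.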

API: XML Parsing: default stylesheet

xsl_ns_fragment
 $xsl_fragment = $CLASS_OR_OBJ->xsl_ns_fragment();

Returns the namespace fragment for the auto-generated stylesheet. An overriding implementation should include the string returned by the default implementation; otherwise the generated stylesheet is likely to break.

defaultXslStr
 $xsl_str = $ldr->defaultXslStr();

Generates and returns an XSL stylesheet string for parsing input documents. The default stylesheet is generated based on the 'xpath' keys of all Taxi::Mysql::Table objects in the {tables} hash of the underlying index.

API: XML Parsing: XSL Functions

xsl_func_filename
 \&closure = $ldr->xsl_func_filename();

Returns a closure suitable for binding into the XSL namespace, which should return the name of the current input source. Default version just returns $ldr->{xsl_filename_value}.
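A closure is useful here because the source name changes from document to document while the bound function stays the same. The sketch below uses a plain hash in place of the loader object and a hypothetical helper make_filename_func(); it assumes the value is looked up at call time rather than when the closure is created.

```perl
use strict;
use warnings;

## Hypothetical stand-in for the loader object
my %ldr = (xsl_filename_value => undef);

## Returns a closure that reads the current value on every call
sub make_filename_func {
  my $ldr = shift;
  return sub { return $ldr->{xsl_filename_value}; };
}

my $func = make_filename_func(\%ldr);

$ldr{xsl_filename_value} = 'a.xml';
my $first = $func->();   ##-- sees the value set just above

$ldr{xsl_filename_value} = 'b.xml';
my $second = $func->();  ##-- same closure, updated value
```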

xsl_func_parseRow
 \&closure = $ldr->xsl_func_parseRow($tabName,$rowKey,%colName2Value);

Returns a closure suitable for binding into the XSL namespace, which should perform whatever actions are necessary to enqueue a row from $tabName with unique ID $rowKey and attributes %colName2Value.

The default version gets the numeric ID for $rowKey, inserting a new row for $rowKey into $ldr->{data}{$tabName} if none is present already.

Only primary keys are assigned here; references are not expanded (see expandData()).
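The ID assignment described above can be sketched as follows. The helper make_parse_row(), the 'id' key inside each row, and the use of {maxid} as a running counter are assumptions made for illustration; only the {data} and {maxid} hash names come from the object structure documented for new().

```perl
use strict;
use warnings;

## Hypothetical stand-in for the loader's {data} and {maxid} hashes
my %ldr = (data => {}, maxid => {});

## Returns a closure that enqueues rows, assigning a numeric ID on first sight
sub make_parse_row {
  my $ldr = shift;
  return sub {
    my ($tabName, $rowKey, %colName2Value) = @_;
    my $rows = $ldr->{data}{$tabName} //= {};
    if (!exists $rows->{$rowKey}) {
      ##-- new key: take the next free ID for this table
      my $id = ++$ldr->{maxid}{$tabName};
      $rows->{$rowKey} = { id => $id, %colName2Value };
    }
    return $rows->{$rowKey}{id};
  };
}

my $parseRow = make_parse_row(\%ldr);
my $id1 = $parseRow->('author', 'Goethe',   name => 'Goethe');
my $id2 = $parseRow->('author', 'Schiller', name => 'Schiller');
my $id3 = $parseRow->('author', 'Goethe',   name => 'Goethe');  ##-- seen before
```

In this sketch a repeated key returns the ID it was given the first time, so each distinct $rowKey maps to exactly one row per table.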

API: Reference expansion

expandData
 $ldr = $ldr->expandData();

Expands 'ref' column values in $ldr->{data} from string-values to numeric ID-values, in preparation for flushing to text file(s).
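As a rough illustration of this expansion step, the sketch below replaces string keys in reference columns with the numeric ID of the row they point to. The %refs schema hash, the 'id' key, and the helper expand_refs() are hypothetical; the real method derives this information from the index's table definitions, whose layout is not described here.

```perl
use strict;
use warnings;

## Hypothetical parsed data: 'book' rows refer to 'author' rows by string key
my %data = (
  author => { 'Goethe' => { id => 1 }, 'Schiller' => { id => 2 } },
  book   => {
    'Faust'   => { id => 1, author => 'Goethe' },
    'Unknown' => { id => 2, author => 'Nobody' },
  },
);

## Hypothetical schema: table => { reference column => target table }
my %refs = (book => { author => 'author' });

sub expand_refs {
  my ($data, $refs) = @_;
  foreach my $tabName (sort keys %$refs) {
    my $rows = $data->{$tabName} or next;
    foreach my $row (values %$rows) {
      foreach my $col (sort keys %{ $refs->{$tabName} }) {
        my $key = $row->{$col};
        next if !defined $key;
        my $target = $data->{ $refs->{$tabName}{$col} } || {};
        ##-- unknown keys become undef instead of keeping the string
        $row->{$col} = exists $target->{$key} ? $target->{$key}{id} : undef;
      }
    }
  }
  return $data;
}

expand_refs(\%data, \%refs);
```

Mapping unresolved references to undef is a choice of this sketch, not documented behaviour of expandData().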

API: text file output

tableDataFilename
 $filename = $ldr->tableDataFilename($tabName);

Returns name of the text file for storing data for $tabName, based on loader arguments.

unlinkDataFiles
 $ldr = $ldr->unlinkDataFiles();

Cleanup method: removes all table data (text) files.

unlinkTableDataFile
 $ldr = $ldr->unlinkTableDataFile($tabName);

Cleanup: removes table data file for $tabName.

truncateDataFiles
 $ldr = $ldr->truncateDataFiles();

Preparation: truncates all table data files.

truncateTableDataFile
 $ldr = $ldr->truncateTableDataFile($tabName);

Preparation: truncates table data file for $tabName.

writeData
 $ldr = $ldr->writeData();

Wrapper for truncateDataFiles() and appendData(). Really only useful if everything you need to parse and load fits nicely into memory.
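The difference between truncating and appending can be sketched with plain file handles. Everything below is illustrative: the helpers table_file() and write_table(), the DATA_DIR/TABLE.EXT naming scheme, and the tab-separated row format are assumptions, and only the data_dir, data_ext, and data key names are taken from the constructor arguments.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

## Hypothetical loader state; key names mirror the documented %args
my %ldr = (
  data_dir => tempdir(CLEANUP => 1),
  data_ext => '.dat',
  data     => { doc => { 'a.xml' => { id => 1 }, 'b.xml' => { id => 2 } } },
);

## Assumed naming scheme: DATA_DIR/TABLE.EXT
sub table_file {
  my ($ldr, $tabName) = @_;
  return File::Spec->catfile($ldr->{data_dir}, $tabName . $ldr->{data_ext});
}

## $mode is '>' to truncate or '>>' to append; one tab-separated row per line
sub write_table {
  my ($ldr, $tabName, $mode) = @_;
  open(my $fh, $mode, table_file($ldr, $tabName)) or die "open failed: $!";
  my $rows = $ldr->{data}{$tabName};
  foreach my $key (sort keys %$rows) {
    print {$fh} join("\t", $rows->{$key}{id}, $key), "\n";
  }
  close($fh) or die "close failed: $!";
  return 1;
}

write_table(\%ldr, 'doc', '>');   ##-- truncate, then write the first batch

##-- drop the written rows from memory and parse more, then append
$ldr{data}{doc} = { 'c.xml' => { id => 3 } };
write_table(\%ldr, 'doc', '>>');
```

Appending in batches keeps memory use bounded, which is why a single truncate-and-write pass only suits corpora that fit in memory.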

appendData
 $ldr = $ldr->appendData();

Append the contents of $ldr->{data} for all tables to the respective text files.

appendDocumentData
 $ldr = $ldr->appendDocumentData();

Like appendData(), but appends only document-local data (data for non-delayed tables).

flushDocumentData
 $ldr = $ldr->flushDocumentData();

Appends and flushes document-local data.

appendTableData
 $ldr = $ldr->appendTableData($tabName);

Appends data for a single table.

API: upload to server

loadDataArgs
 %loadDataArgs = $ldr->loadDataArgs(%user_args);

Compatibility hack for loadData() variants in other Taxi::Mysql classes.

uploadDataFiles
 $bool = $ldr->uploadDataFiles(%user_loadDataArgs);

Uploads current data files to backend server.



ACKNOWLEDGEMENTS

Perl by Larry Wall.



AUTHOR

Bryan Jurish <moocow@ling.uni-potsdam.de>



COPYRIGHT AND LICENSE

Copyright (C) 2006 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.7 or, at your option, any later version of Perl 5 you may have available.



SEE ALSO

perl(1), Taxi::Mysql(3perl), Taxi::Mysql::Table(3perl).

