CConcIndexator Class Reference

#include <ConcIndexator.h>

Detailed Description

CConcIndexator is the central class of DDC technology. Most of its members come from its two parent classes: CStringIndexator (indexing tokens and their properties) and CHitBorders (indexing corpus divisions). This class also contains a list of corpus files and some indexing and querying options.


Member Enumeration Documentation

enum DDCIndexTypeEnum contains index types. Each index type determines DDC indices and break collections.

Enumerator:
DWDS_Index 

A type for corpora without per-word annotations; for example, the input can be plain text. DDC always builds a token index and a file break collection for this index type. Optionally, DDC can build a "Thes" index, a "Morph" index and a sentence break collection.

MorphXML_Index 

A type for XML texts whose words carry predefined, explicitly written annotations. DDC always builds a token index and a "MorphPattern" index. It also creates a file break collection and a sentence break collection.

Free_Index 

This index type is free-form, so it must be defined in the options file (fields "Indices" and "HitBorders"). The corpus should consist of XML files with a bibliographical header and a body (text). The text is written in CWB format (http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CWBTutorial/cwb-tutorial.pdf). DDC changes the original CWB format in one respect: instead of the line breaks that delimit records in the input file, DDC uses a special tag, CConcCommon.h::PredefinedTableLineTag, because line breaks are not preserved by the XML parser.
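
To illustrate the record format described above, here is a minimal sketch of splitting a text body into records on a tag and then into tab-separated columns. The tag value `<l/>`, the helper names `SplitOn` and `ParseTable`, and the use of tabs as column separators are assumptions made for this example; the real value of PredefinedTableLineTag and the actual parsing code are not shown on this page.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for CConcCommon.h::PredefinedTableLineTag.
const std::string kLineTag = "<l/>";

// Split text on every occurrence of delim (delim must not be empty).
std::vector<std::string> SplitOn(const std::string& text, const std::string& delim) {
    std::vector<std::string> parts;
    if (delim.empty()) {
        parts.push_back(text);
        return parts;
    }
    std::size_t start = 0;
    while (true) {
        std::size_t pos = text.find(delim, start);
        if (pos == std::string::npos) {
            parts.push_back(text.substr(start));
            break;
        }
        parts.push_back(text.substr(start, pos - start));
        start = pos + delim.size();
    }
    return parts;
}

// One row per record, one column per token property; empty records are skipped.
std::vector<std::vector<std::string>> ParseTable(const std::string& body) {
    std::vector<std::vector<std::string>> rows;
    for (const std::string& record : SplitOn(body, kLineTag)) {
        if (record.empty()) continue;
        rows.push_back(SplitOn(record, "\t"));
    }
    return rows;
}
```

With this sketch, a body such as `Das\tART\tdie<l/>Haus\tNN\tHaus<l/>` yields two rows of three columns each.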


Constructor & Destructor Documentation

CConcIndexator::CConcIndexator (  ) 

References InitDefaultOptions().

CConcIndexator::~CConcIndexator (  ) 

Member Function Documentation

bool CConcIndexator::IndexTextOrHtmlFile ( CGraphmatFile piGraphmat,
string  FileName,
const char *  pFileBuffer,
const CDwdsThesaurus pDwdsThesaurus,
CTokenNo NewCorpusEndTokenNo,
string &  strError 
) [private]
bool CConcIndexator::IndexMorphXml ( string  FileName,
const char *  pFileBuffer,
CTokenNo NewCorpusEndTokenNo,
string &  strError 
) [private]
bool CConcIndexator::IndexTable ( string  FileName,
const char *  pFileBuffer,
CTokenNo NewCorpusEndTokenNo,
string &  strError 
) [private]
bool CConcIndexator::IndexOneTableTextArea ( const string &  Text,
const CPageNumber StartPageFromHeader,
size_t &  page_breaks_count,
CTokenNo NewCorpusEndTokenNo,
string &  strError 
) [private]
void CConcIndexator::AssertHasPath (  )  const [private]

References ErrorMessage(), errUnknownPath, and CStringIndexator::m_Path.

Referenced by DestroyIndex(), and LoadProject().

string CConcIndexator::GetBiblIndexFileName (  )  const [private]
string CConcIndexator::GetBiblFileName (  )  const [private]
string CConcIndexator::SaveOptionsToString (  )  const [private]
bool CConcIndexator::LoadOptionsFromString ( string  Options  )  [private]

loads options from a string

References DWDS_Index, ErrorMessage(), Format(), Free_Index, FreeBiblAttribOptionFieldName, CHitBorders::GetBorderIndicesString(), CHitBorders::GetBreakCollectionIndexByName(), CConcXml::GetFreeBibiAttributesDescr(), CStringIndexator::GetIndexByName(), GetIndexTypeStr(), CStringIndexator::GetIndicesString(), GetLanguageByString(), GetStringByLanguage(), CConcXml::GetTextAreasDescr(), LoadFileToString(), m_bArchiveIndex, m_bCaseSensitive, m_bDisableDefaultQueryLexicalExpansion, m_bDwdsCorpusInterface, m_bEmptyLineIsSentenceDelim, m_bGutenbergInterface, m_Bibl, CIndexSetForLoadingStage::m_BigramBorder, m_bIndexChunks, m_bIndexMorphPatterns, m_bIndexPunctuation, m_bNoContextOperator, m_bOutputBibliographyOfHits, m_bQueryOnlyFiles, m_bResumeOnIndexErrors, m_bShowNumberOfRelevantDocuments, m_bUseDwdsThesaurus, m_bUseIndention, m_bUseParagraphTagToDivide, m_bUserMaxTokenCountInOnePeriod, m_HtmlHighlighting, m_IndexType, CStringIndexator::m_Indices, m_IndicesToShow, m_InternetPathPrefix, m_InterpDelimiter, m_Language, m_LeftKwicContextSize, m_LocalPathPrefix, CIndexSetForLoadingStage::m_MaxBigramWindowSize, CStringIndexator::m_MaxRegExpExpansionSize, m_NearRank, m_NumberOfKwicLinesInSnippets, m_PcreCharacterTables, m_PositionRank, m_RightKwicContextSize, m_TextHighlighting, m_TfIdfRank, m_TokenDelimiter, m_UserMaxTokenCountInOnePeriod, m_Utf8, morphEnglish, morphUnknown, MorphXML_Index, CHighlightTags::ReadFromString(), ReadIndexTypeFromStr(), CHitBorders::RegisterBorderIndices(), CStringIndexator::RegisterChunkIndex(), CConcXml::RegisterFreeBiblAttributes(), CStringIndexator::RegisterStringIndices(), CConcXml::RegisterTextAreas(), RmlMakeLower(), RmlPcreMakeTables(), TextAreaOptionFieldName, Trim(), unescapeCString(), and StringTokenizer::val().

Referenced by CreateAsUnion(), and LoadSourceFilesAndOptions().

bool CConcIndexator::IsDWDSToken ( const CGraphmatFile piGraphmat,
long  GraLine 
) const [private]

graphematical definition of a token for DWDS_Index

References CUnitHolder::HasDescr(), IsDigit(), IsSentenceEnd(), IsWord(), m_bIndexPunctuation, and OPun.

Referenced by IndexTextOrHtmlFile().

bool CConcIndexator::HasEqualOptions ( const CConcIndexator X  )  const [private]

checks whether X has the same options

References SaveOptionsToString().

Referenced by CreateAsUnion().
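
Because the only listed reference is SaveOptionsToString(), the check presumably serializes both option sets and compares the resulting strings. The sketch below shows that idea with a hypothetical two-field `OptionsSketch` struct; the field names and the serialized layout are assumptions, not the real option file format.

```cpp
#include <sstream>
#include <string>

// Hypothetical, heavily reduced option set used only for illustration.
struct OptionsSketch {
    bool caseSensitive = false;
    int leftKwicContextSize = 5;

    // Serialize every option in a fixed order so equal options give equal strings.
    std::string SaveToString() const {
        std::ostringstream out;
        out << "CaseSensitive " << (caseSensitive ? 1 : 0) << "\n"
            << "LeftKwicContextSize " << leftKwicContextSize << "\n";
        return out.str();
    }

    // Two option sets are equal when their serialized forms are identical.
    bool HasEqualOptions(const OptionsSketch& other) const {
        return SaveToString() == other.SaveToString();
    }
};
```

Comparing serialized forms keeps the equality check in step with the save routine: an option that is saved is automatically compared.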

const char * CConcIndexator::GetIndexTypeStr (  )  const [private]

return a string representation of index type

References DWDS_Index, Free_Index, m_IndexType, and MorphXML_Index.

Referenced by LoadOptionsFromString(), and SaveOptionsToString().

bool CConcIndexator::ReadIndexTypeFromStr ( const string &  s  )  [private]

read the index type from a string

References m_IndexType.

Referenced by LoadOptionsFromString().
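
GetIndexTypeStr() and ReadIndexTypeFromStr() form a round trip between the enum and its textual form. A minimal sketch of such a pair follows; the free functions `IndexTypeToStr` and `IndexTypeFromStr` are hypothetical, and the string spellings are assumed to match the enumerator names, which this page does not confirm.

```cpp
#include <initializer_list>
#include <string>

enum DDCIndexTypeEnum { DWDS_Index, MorphXML_Index, Free_Index };

// Assumed spellings: the enumerator names themselves.
const char* IndexTypeToStr(DDCIndexTypeEnum t) {
    switch (t) {
        case DWDS_Index:     return "DWDS_Index";
        case MorphXML_Index: return "MorphXML_Index";
        case Free_Index:     return "Free_Index";
    }
    return "";
}

// Returns false and leaves out untouched when s names no known index type.
bool IndexTypeFromStr(const std::string& s, DDCIndexTypeEnum& out) {
    for (DDCIndexTypeEnum t : {DWDS_Index, MorphXML_Index, Free_Index}) {
        if (s == IndexTypeToStr(t)) {
            out = t;
            return true;
        }
    }
    return false;
}
```

Deriving the parser from the printer means the two cannot drift apart when a new index type is added.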

bool CConcIndexator::LoadXmlFile ( string  FileName,
const char *  pFileBuffer,
CGraphmatFile piGraphmat,
CBibliography Bibl,
string &  strError 
) [private]
bool CConcIndexator::LoadFileIntoGraphan ( string  FileName,
const char *  pFileBuffer,
CGraphmatFile piGraphmat,
CBibliography Bibl,
string &  strError 
) [private]
void CConcIndexator::InitDefaultOptions (  )  [private]
RML_RE::Options CConcIndexator::GetRegexOptions (  )  const [inline]

References m_PcreCharacterTables, and m_Utf8.

Referenced by CQueryTokenNode::BuildRegExp(), and CQueryParser::ParseQueryOperators().

bool CConcIndexator::IsDwdsCorpusInterface (  )  const [inline]

return true, if DDC outputs results in DWDS format

References m_bDwdsCorpusInterface.

Referenced by CConcHolder::ShowBibliographyForTextOrHtml().

bool CConcIndexator::IsGutenbergInterface (  )  const [inline]

return true, if DDC outputs results in Gutenberg project format

References m_bGutenbergInterface.

Referenced by CConcHolder::ShowBibliographyForTextOrHtml().

bool CConcIndexator::HasContextOperator (  )  const [inline]

return true, if query context operator (Cntxt) is switched off

References m_bNoContextOperator.

Referenced by CQueryParser::ParseQuery().

bool CConcIndexator::UseDwdsThesaurus (  )  const [inline]

return true, if DWDS thesaurus is enabled (index "Thes")

References m_bUseDwdsThesaurus.

Referenced by CConcIndexatorInvoker::BuildIndex().

bool CConcIndexator::OutputBibliographyOfHits (  )  const [inline]

return true, if DDC should output bibliographical information for hits instead of corpus file names

References m_bOutputBibliographyOfHits.

Referenced by CConcHolder::GenerateOneHitString(), and CConcHolder::GenerateOneHitStringJson().

string CConcIndexator::GetHtmlReference ( size_t  posFile  )  const

get an HTML formatted reference to a corpus file

References Format(), m_CorpusFiles, m_InternetPathPrefix, and m_LocalPathPrefix.

Referenced by CConcHolder::AddFileReference().

string CConcIndexator::GetShortFilename ( size_t  posFile  )  const

get a reference to a corpus file without the common left prefix

References m_CommonFilePrefix, and m_CorpusFiles.

Referenced by CConcHolder::AddFileReference(), and CConcHolder::ShowBibliographyForTextOrHtml().
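
The references to m_CommonFilePrefix and m_CorpusFiles suggest that the short name is the stored path with the shared leading part removed. The sketch below shows one way to compute a common prefix and strip it; `CommonPrefix` and `ShortFilename` are hypothetical free functions, and how DDC actually derives m_CommonFilePrefix is not documented here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Longest prefix shared by every path in files (empty if the list is empty).
std::string CommonPrefix(const std::vector<std::string>& files) {
    if (files.empty()) return "";
    std::string prefix = files[0];
    for (const std::string& f : files) {
        std::size_t n = 0;
        while (n < prefix.size() && n < f.size() && prefix[n] == f[n]) ++n;
        prefix.resize(n);
    }
    return prefix;
}

// Path of file number posFile without the common left prefix.
// at() throws std::out_of_range for an invalid posFile.
std::string ShortFilename(const std::vector<std::string>& files,
                          const std::string& prefix,
                          std::size_t posFile) {
    return files.at(posFile).substr(prefix.size());
}
```

For the list `/corpus/a/x.xml`, `/corpus/b/y.xml` the common prefix is `/corpus/`, so file 1 is shown as `b/y.xml`.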

string CConcIndexator::GetFileNameForCorpusFileNames (  )  const

get file name for storing corpus file names

References CStringIndexator::m_Path, and MakeFName().

Referenced by CreateAsUnion(), DestroyIndex(), CConcIndexatorInvoker::FinalizeIndex(), LoadCorpusFiles(), and SaveCorpusFileList().

string CConcIndexator::GetFileNameForMaskedFiles (  )  const

get file name for masked files

References CStringIndexator::m_Path, and MakeFName().

Referenced by DestroyIndex(), and LoadMaskedFiles().

vector< string > CConcIndexator::GetTokenFields ( const COutputToken tok  ) 

parse a delimited token into fields by splitting on m_InterpDelimiter

References m_IndicesToShow, m_InterpDelimiter, COutputToken::m_InterpStr, and COutputToken::m_TokenStr.

Referenced by CConcHolder::BuildJsonContextString().

void CConcIndexator::InitGraphanProperties ( CGraphmatFile piGraphmat  )  const
bool CConcIndexator::WasIndexed (  )  const

true, when the corpus index was stored to the disk

References CStringIndexator::GetSearchPeriodsFileName(), and CStringIndexator::m_Path.

bool CConcIndexator::LoadSourceFilesAndOptions ( string  FileName  ) 
bool CConcIndexator::LoadCorpusFiles (  ) 

load list of corpus files (*.con)

References DDCVersion, ErrorMessage(), Format(), GetFileNameForCorpusFileNames(), m_CommonFilePrefix, m_CorpusFiles, CStringIndexator::m_Path, and Trim().

Referenced by LoadProject().

bool CConcIndexator::LoadMaskedFiles (  ) 

load list of masked (deleted) corpus files

References concord_daemon_log(), Format(), GetFileNameForMaskedFiles(), m_CorpusFiles, m_MaskedFiles, and Trim().

Referenced by LoadProject().

bool CConcIndexator::SaveOptions ( string  FileName  )  const

saves options to option file (*.opt)

References MakeFName(), and SaveOptionsToString().

Referenced by CreateAsUnion().

bool CConcIndexator::SaveCorpusFileList (  )  const

saves corpus file list (*._con)

References DDCVersion, GetFileNameForCorpusFileNames(), and m_CorpusFiles.

Referenced by CreateAsUnion(), and CConcIndexatorInvoker::FinalizeIndex().

bool CConcIndexator::LoadProject ( string  FileName  ) 
bool CConcIndexator::StartIndexing (  ) 

begins indexing

References m_Bibl, CStringIndexator::m_Path, CConcXml::Start(), CHitBorders::StartIndexing(), and CStringIndexator::StartIndexing().

Referenced by CConcIndexatorInvoker::BuildIndex().

bool CConcIndexator::DestroyIndex (  ) 
bool CConcIndexator::NormalEndIndexing (  ) 

finishes indexing (normal way)

References CConcXml::FinalSaveBibliography(), and m_Bibl.

Referenced by CConcIndexatorInvoker::FinalizeIndex().

bool CConcIndexator::TerminateIndexing (  ) 

terminates indexing (for exceptions)

References CHitBorders::BordersEndIndexing(), CConcXml::ExitWithoutSave(), m_Bibl, CStringIndexator::m_Path, and CStringIndexator::TerminateIndexing().

Referenced by CConcIndexatorInvoker::BuildIndex().
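
StartIndexing(), IndexOneFile(), NormalEndIndexing() and TerminateIndexing() together describe an indexing lifecycle: start, index each file, then either finish normally or terminate after a failure. The sketch below mocks that flow with a hypothetical `IndexerSketch` type and a single driver function. In DDC the calls are spread over CConcIndexatorInvoker::BuildIndex() and CConcIndexatorInvoker::FinalizeIndex(), so this single-function form is a simplification, not the actual control flow.

```cpp
#include <string>
#include <vector>

// Hypothetical mock that only records which lifecycle steps were called.
struct IndexerSketch {
    std::vector<std::string> log;

    bool StartIndexing() { log.push_back("start"); return true; }

    bool IndexOneFile(const std::string& name, std::string& strError) {
        if (name.empty()) {
            strError = "empty file name";
            return false;
        }
        log.push_back("index " + name);
        return true;
    }

    bool NormalEndIndexing() { log.push_back("end"); return true; }
    bool TerminateIndexing() { log.push_back("terminate"); return true; }
};

// Index all files; on the first failure, terminate instead of finishing normally.
bool BuildIndexSketch(IndexerSketch& indexer,
                      const std::vector<std::string>& files,
                      std::string& strError) {
    if (!indexer.StartIndexing()) return false;
    for (const std::string& f : files) {
        if (!indexer.IndexOneFile(f, strError)) {
            indexer.TerminateIndexing();
            return false;
        }
    }
    return indexer.NormalEndIndexing();
}
```

The sketch stops at the first error; the m_bResumeOnIndexErrors option listed among the references of LoadOptionsFromString() suggests the real indexer can be configured to continue instead.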

bool CConcIndexator::IndexOneFile ( CGraphmatFile piGraphmat,
string  FileName,
const char *  pFileBuffer,
const CDwdsThesaurus pDwdsThesaurus,
CTokenNo CorpusEndTokenNo,
string &  strError 
)

index one file according to m_IndexType

References DWDS_Index, Free_Index, IndexMorphXml(), IndexTable(), IndexTextOrHtmlFile(), m_IndexType, and MorphXML_Index.

Referenced by CConcIndexatorInvoker::BuildIndex().
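
The reference list suggests a plain dispatch on m_IndexType. The sketch below returns the name of the routine each index type would select; the pairing is inferred from the function names and the enum descriptions (plain text for DWDS_Index, annotated XML for MorphXML_Index, CWB-style tables for Free_Index) and is an assumption, as is the helper name `HandlerFor`.

```cpp
#include <string>

enum DDCIndexTypeEnum { DWDS_Index, MorphXML_Index, Free_Index };

// Name of the private indexing routine assumed to handle each index type.
std::string HandlerFor(DDCIndexTypeEnum type) {
    switch (type) {
        case DWDS_Index:     return "IndexTextOrHtmlFile";
        case MorphXML_Index: return "IndexMorphXml";
        case Free_Index:     return "IndexTable";
    }
    return "";  // unknown index type
}
```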

void CConcIndexator::CalculateSearchPeriods ( DWORD  MaxTokenCountInOnePeriod  ) 

finds all subcorpora

References CHitBorders::GetCorpusEndTokenNo(), CHitBorders::GetFileBreaks(), CHitBorders::GetFileStartTokenNo(), m_CorpusFiles, and CStringIndexator::m_SearchPeriods.

Referenced by CreateAsUnion(), and CConcIndexatorInvoker::FinalizeIndex().
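
One plausible reading of "finds all subcorpora" is that consecutive corpus files are grouped into search periods whose token count stays within MaxTokenCountInOnePeriod, with period boundaries falling on file breaks. The sketch below implements that reading; the `Period` struct, the function name `SplitIntoPeriods` and the grouping rule are all assumptions, not the documented algorithm.

```cpp
#include <cstddef>
#include <vector>

// Half-open token range [startToken, endToken) covered by one subcorpus.
struct Period {
    std::size_t startToken;
    std::size_t endToken;
};

// fileEnds[i] is the token number one past the last token of file i.
// Files are never split, so a single oversized file forms its own period.
std::vector<Period> SplitIntoPeriods(const std::vector<std::size_t>& fileEnds,
                                     std::size_t maxTokens) {
    std::vector<Period> periods;
    std::size_t start = 0;    // first token of the period being built
    std::size_t lastEnd = 0;  // end of the last file already placed in it
    for (std::size_t end : fileEnds) {
        // Close the current period if adding this file would exceed the limit.
        if (lastEnd > start && end - start > maxTokens) {
            periods.push_back({start, lastEnd});
            start = lastEnd;
        }
        lastEnd = end;
    }
    if (lastEnd > start) periods.push_back({start, lastEnd});
    return periods;
}
```

For file ends 40, 90, 130, 300 and a limit of 100 tokens, this produces the periods [0, 90), [90, 130) and [130, 300); the last one exceeds the limit because it consists of a single 170-token file.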

bool CConcIndexator::CreateAsUnion ( const CConcIndexator _X1,
const CConcIndexator _X2 
)
bool CConcIndexator::CreateMorphIndex (  ) 
DWORD CConcIndexator::GetMaxTokenCountInOnePeriod (  )  const

returns the size of one subcorpus

References m_bUserMaxTokenCountInOnePeriod, and m_UserMaxTokenCountInOnePeriod.

Referenced by CConcIndexatorInvoker::BuildIndex(), CreateAsUnion(), and CreateMorphIndex().

string CConcIndexator::GetIndexItemSetByVectorString ( const vector< string > &  TokenProperties,
bool  bRegexp 
)

return a string representation of a set of token properties (in the format which is used in the index)

References MorphAnnotationsDelimRegExp.

Referenced by CQueryTokenNode::CreateMorphAnnotationPattern(), CreateMorphIndex(), and IndexMorphXml().


Member Data Documentation

a table of character properties for regular expressions which depend on CConcIndexator::m_Language

Referenced by GetRegexOptions(), and LoadOptionsFromString().

Enables using "<p>" tag as a paragraph delimiter.

Referenced by InitDefaultOptions(), InitGraphanProperties(), LoadOptionsFromString(), and SaveOptionsToString().

if m_bEmptyLineIsSentenceDelim is on, every empty line in the input file is considered to be the end of the sentence.

Referenced by InitDefaultOptions(), InitGraphanProperties(), LoadOptionsFromString(), and SaveOptionsToString().

if m_bUseIndention is on, the program tries to find paragraphs using indentation

Referenced by InitDefaultOptions(), InitGraphanProperties(), LoadOptionsFromString(), and SaveOptionsToString().

if m_bDwdsCorpusInterface is on, the program outputs results in DWDS format

Referenced by InitDefaultOptions(), IsDwdsCorpusInterface(), LoadOptionsFromString(), and SaveOptionsToString().

if m_bGutenbergInterface is on, the program outputs results in the format of the Gutenberg project

Referenced by InitDefaultOptions(), IsGutenbergInterface(), LoadOptionsFromString(), and SaveOptionsToString().

should we switch off the context operator (Cntxt) due to copyright

Referenced by HasContextOperator(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().

The maximal number of occurrences in one subcorpus (defined by user).

Referenced by GetMaxTokenCountInOnePeriod(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().

Enables indexing and querying using DWDS Thesaurus.

Referenced by IndexTextOrHtmlFile(), InitDefaultOptions(), LoadOptionsFromString(), SaveOptionsToString(), and UseDwdsThesaurus().

Should we show bibliography of the hits instead of filename.

Referenced by InitDefaultOptions(), LoadOptionsFromString(), OutputBibliographyOfHits(), and SaveOptionsToString().

Enables indexing all punctuation marks.

Referenced by InitDefaultOptions(), IsDWDSToken(), LoadOptionsFromString(), and SaveOptionsToString().

Enables the index of morph patterns.

Referenced by CreateMorphIndex(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().

Enables indexing and querying using chunks.

Referenced by IndexOneTableTextArea(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().

if true, then DDC always calculates the number of documents in which at least one hit is found

Referenced by CConcHolder::GetAllHits(), InitDefaultOptions(), LoadOptionsFromString(), SaveOptionsToString(), and CConcHolder::SimpleQuery().

prohibits sentence break collection under DWDS_Index or MorphXML_Index

Referenced by IndexTextOrHtmlFile(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().

specifies that the index should be archived under DWDS_Index or MorphXML_Index

Referenced by InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().

masked (deleted) corpus files

Referenced by DestroyIndex(), CConcHolder::GetAllHits(), and LoadMaskedFiles().

if true, then no default lexical expansion of query words occurs

Referenced by CQueryTokenNode::CreateTokenPattern(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().

the size of the left context of the highlighted words in document search

Referenced by CConcHolder::GetFileSnippets(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().

the size of the right context of the highlighted words in document search

Referenced by CConcHolder::GetFileSnippets(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().

the maximal number of kwic lines in file snippets

Referenced by CConcHolder::GetFileSnippets(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().

the parameter for Position ranking

Referenced by InitDefaultOptions(), CConcHolder::InitLessByRank(), LoadOptionsFromString(), and SaveOptionsToString().

delimiter to use between token index fields in output

Referenced by CConcHolder::GenerateHitStrings(), GetTokenFields(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().

delimiter to use between tokens in output

Referenced by CConcHolder::GenerateOneHitString(), CConcHolder::GetContext(), InitDefaultOptions(), and LoadOptionsFromString().

