#include <ConcIndexator.h>
CConcIndexator is the central class of DDC technology. Most of its slots come from its two parent classes: CStringIndexator (indexing tokens and their properties) and CHitBorders (indexing corpus divisions). This class also contains a list of corpus files and some indexing and querying options.
enum CConcIndexator::DDCIndexTypeEnum [private] |
enum DDCIndexTypeEnum contains index types. Each index type determines DDC indices and break collections.
DWDS_Index |
A type for corpora without per-word annotations. For example, the input can be plain text. DDC always builds a token index and a file break collection for this index type. Optionally, DDC can also build a "Thes" index, a "Morph" index and a sentence collection. |
MorphXML_Index |
A type for XML texts whose words carry predefined, written annotations. DDC always builds a token index and a "MorphPattern" index. It also creates a file break collection and a sentence break collection. |
Free_Index |
This index type is free, so it should be defined in the options file (fields "Indices" and "HitBorders"). The corpus should consist of XML files with a bibliographical header and a body (text). The text is written in CWB format (http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CWBTutorial/cwb-tutorial.pdf). DDC changes the original CWB format in one way: instead of the line breaks that delimit records in the input file, DDC uses a special tag, CConcCommon.h::PredefinedTableLineTag. This is done because line breaks are not preserved by the XML parser. |
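The change to the CWB format described for Free_Index can be illustrated with a minimal sketch. It assumes a record delimiter tag of `<l/>`, which is purely hypothetical: the actual value of PredefinedTableLineTag is defined in CConcCommon.h and is not shown in this reference. The function name `ToDdcTableText` is also invented for illustration.

```cpp
#include <string>

// Hypothetical stand-in for PredefinedTableLineTag; the real value is
// defined in CConcCommon.h and may differ.
const std::string kLineTag = "<l/>";

// Convert CWB vertical text (one record per line) into the tag-delimited
// form: every line break is replaced by the record tag, so the record
// boundaries survive an XML parser that does not preserve line breaks.
std::string ToDdcTableText(const std::string& cwbText)
{
    std::string out;
    out.reserve(cwbText.size());
    for (char c : cwbText) {
        if (c == '\n')
            out += kLineTag;   // record boundary
        else
            out += c;          // field content, including tabs
    }
    return out;
}
```

With these assumptions, a two-record input such as `"Das\tART\nHaus\tNN\n"` becomes a single line in which each record ends with the tag.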
CConcIndexator::CConcIndexator | ( | ) |
CConcIndexator::~CConcIndexator | ( | ) |
bool CConcIndexator::IndexTextOrHtmlFile | ( | CGraphmatFile * | piGraphmat, | |
string | FileName, | |||
const char * | pFileBuffer, | |||
const CDwdsThesaurus * | pDwdsThesaurus, | |||
CTokenNo & | NewCorpusEndTokenNo, | |||
string & | strError | |||
) | [private] |
References CHitBorders::AddBreakByIndex(), CHitBorders::AddBreakByName(), CConcXml::AddIndexItem(), CHitBorders::AddPageBreak(), Format(), CUnitHolder::FreeTable(), CDwdsThesaurus::GetAllThesInterpetations(), CHitBorders::GetBreakCollectionIndexByName(), CUnitHolder::GetPageNumber(), CConcXml::GetTextAreasCount(), CUnitHolder::GetTokensCount(), CUnitHolder::GetUnits(), CUnitHolder::GetUppercaseToken(), globalFieldDelimeter, CUnitHolder::HasDescr(), CStringIndexator::IndexOneToken(), IsDWDSToken(), IsSentenceEnd(), LoadFileIntoGraphan(), m_Bibl, m_bQueryOnlyFiles, m_bUseDwdsThesaurus, CPageNumber::m_PageNumber, CPageNumber::m_StartTokenNo, OUp, OUpLw, PredefinedTextAreaBreakName, and CStringIndexator::ProcessBigramBorders().
Referenced by IndexOneFile().
bool CConcIndexator::IndexMorphXml | ( | string | FileName, | |
const char * | pFileBuffer, | |||
CTokenNo & | NewCorpusEndTokenNo, | |||
string & | strError | |||
) | [private] |
References CHitBorders::AddBreakByName(), CConcXml::AddIndexItem(), CHitBorders::AddPageBreak(), CXmlMorphAnnot::GetAsSetOfProperties(), GetIndexItemSetByVectorString(), globalFieldDelimeter, CStringIndexator::IndexOneToken(), CXmlToken::m_Annots, m_Bibl, CXmlToken::m_bLastInSentence, CXmlMorphAnnot::m_Lemma, CPageNumber::m_PageNumber, CPageNumber::m_StartTokenNo, CXmlToken::m_Type, CXmlToken::m_WordStr, MorphAnnotationsDelim, CConcXml::ReadMorphXmlFileIntoGraTable(), UnknownPageNumber, and CBibliography::WriteToString().
Referenced by IndexOneFile().
bool CConcIndexator::IndexTable | ( | string | FileName, | |
const char * | pFileBuffer, | |||
CTokenNo & | NewCorpusEndTokenNo, | |||
string & | strError | |||
) | [private] |
References CConcXml::AddIndexItem(), GetCWBFormattedStringRecursive(), CConcXml::GetTextAreaElements(), IndexOneTableTextArea(), CConcXml::LoadXmlAndReadBibliography(), m_Bibl, CPageNumber::m_PageNumber, CBibliography::m_StartPageInfo, and CPageNumber::m_StartTokenNo.
Referenced by IndexOneFile().
bool CConcIndexator::IndexOneTableTextArea | ( | const string & | Text, | |
const CPageNumber & | StartPageFromHeader, | |||
size_t & | page_breaks_count, | |||
CTokenNo & | NewCorpusEndTokenNo, | |||
string & | strError | |||
) | [private] |
References CHitBorders::AddBreakByIndex(), CHitBorders::AddPageBreak(), CHitBorders::EndTextAreaBorders(), Format(), CHitBorders::GetBreakCollectionIndexByName(), CStringIndexator::IndexOneToken(), CIndexSetForLoadingStage::InsertToInputLoadIndex(), CHitBorders::IsRegisteredBreak(), m_bIndexChunks, CPageNumber::m_PageNumber, CStringIndexator::m_pChunkIndex, CPageNumber::m_StartTokenNo, CExpc::m_strCause, CStringIndexator::ProcessBigramBorders(), CHitBorders::StartTextAreaBorders(), and Trim().
Referenced by IndexTable().
void CConcIndexator::AssertHasPath | ( | ) | const [private] |
References ErrorMessage(), errUnknownPath, and CStringIndexator::m_Path.
Referenced by DestroyIndex(), and LoadProject().
string CConcIndexator::GetBiblIndexFileName | ( | ) | const [private] |
string CConcIndexator::GetBiblFileName | ( | ) | const [private] |
string CConcIndexator::SaveOptionsToString | ( | ) | const [private] |
saves options to a string
References DefaultKwicContextSize, Format(), CHitBorders::GetBorderIndicesString(), CHitBorders::GetBreakCollectionShortName(), CConcXml::GetFreeBibiAttributesDescr(), GetIndexTypeStr(), CStringIndexator::GetIndicesString(), GetStringByLanguage(), m_bArchiveIndex, m_bCaseSensitive, m_bDisableDefaultQueryLexicalExpansion, m_bDwdsCorpusInterface, m_bEmptyLineIsSentenceDelim, m_bGutenbergInterface, m_Bibl, m_bIndexChunks, m_bIndexMorphPatterns, m_bIndexPunctuation, m_bNoContextOperator, m_bOutputBibliographyOfHits, m_bQueryOnlyFiles, m_bResumeOnIndexErrors, m_bShowNumberOfRelevantDocuments, m_bUseDwdsThesaurus, m_bUseIndention, m_bUseParagraphTagToDivide, m_bUserMaxTokenCountInOnePeriod, CHighlightTags::m_bWasReadFromString, m_HtmlHighlighting, CStringIndexator::m_Indices, m_IndicesToShow, m_InternetPathPrefix, m_InterpDelimiter, m_Language, m_LeftKwicContextSize, m_LocalPathPrefix, CStringIndexator::m_MaxRegExpExpansionSize, m_NearRank, m_NumberOfKwicLinesInSnippets, m_PositionRank, m_RightKwicContextSize, m_TextHighlighting, m_TfIdfRank, m_UserMaxTokenCountInOnePeriod, morphUnknown, and CHighlightTags::ToString().
Referenced by CreateAsUnion(), HasEqualOptions(), and SaveOptions().
bool CConcIndexator::LoadOptionsFromString | ( | string | Options | ) | [private] |
loads options from a string
References DWDS_Index, ErrorMessage(), Format(), Free_Index, FreeBiblAttribOptionFieldName, CHitBorders::GetBorderIndicesString(), CHitBorders::GetBreakCollectionIndexByName(), CConcXml::GetFreeBibiAttributesDescr(), CStringIndexator::GetIndexByName(), GetIndexTypeStr(), CStringIndexator::GetIndicesString(), GetLanguageByString(), GetStringByLanguage(), CConcXml::GetTextAreasDescr(), LoadFileToString(), m_bArchiveIndex, m_bCaseSensitive, m_bDisableDefaultQueryLexicalExpansion, m_bDwdsCorpusInterface, m_bEmptyLineIsSentenceDelim, m_bGutenbergInterface, m_Bibl, CIndexSetForLoadingStage::m_BigramBorder, m_bIndexChunks, m_bIndexMorphPatterns, m_bIndexPunctuation, m_bNoContextOperator, m_bOutputBibliographyOfHits, m_bQueryOnlyFiles, m_bResumeOnIndexErrors, m_bShowNumberOfRelevantDocuments, m_bUseDwdsThesaurus, m_bUseIndention, m_bUseParagraphTagToDivide, m_bUserMaxTokenCountInOnePeriod, m_HtmlHighlighting, m_IndexType, CStringIndexator::m_Indices, m_IndicesToShow, m_InternetPathPrefix, m_InterpDelimiter, m_Language, m_LeftKwicContextSize, m_LocalPathPrefix, CIndexSetForLoadingStage::m_MaxBigramWindowSize, CStringIndexator::m_MaxRegExpExpansionSize, m_NearRank, m_NumberOfKwicLinesInSnippets, m_PcreCharacterTables, m_PositionRank, m_RightKwicContextSize, m_TextHighlighting, m_TfIdfRank, m_TokenDelimiter, m_UserMaxTokenCountInOnePeriod, m_Utf8, morphEnglish, morphUnknown, MorphXML_Index, CHighlightTags::ReadFromString(), ReadIndexTypeFromStr(), CHitBorders::RegisterBorderIndices(), CStringIndexator::RegisterChunkIndex(), CConcXml::RegisterFreeBiblAttributes(), CStringIndexator::RegisterStringIndices(), CConcXml::RegisterTextAreas(), RmlMakeLower(), RmlPcreMakeTables(), TextAreaOptionFieldName, Trim(), unescapeCString(), and StringTokenizer::val().
Referenced by CreateAsUnion(), and LoadSourceFilesAndOptions().
bool CConcIndexator::IsDWDSToken | ( | const CGraphmatFile * | piGraphmat, | |
long | GraLine | |||
) | const [private] |
graphematical definition of a token for DWDS_Index
References CUnitHolder::HasDescr(), IsDigit(), IsSentenceEnd(), IsWord(), m_bIndexPunctuation, and OPun.
Referenced by IndexTextOrHtmlFile().
bool CConcIndexator::HasEqualOptions | ( | const CConcIndexator & | X | ) | const [private] |
checks whether X has the same options
References SaveOptionsToString().
Referenced by CreateAsUnion().
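Since HasEqualOptions() references only SaveOptionsToString(), the comparison is presumably done on the serialized form. The following is a sketch of that idea, not the actual implementation: the `Options` struct, its three fields and the "Name value" line format are all assumptions chosen for illustration.

```cpp
#include <sstream>
#include <string>

// Hypothetical, much reduced option set; the real class has many more fields.
struct Options {
    bool caseSensitive = false;
    bool indexPunctuation = false;
    int leftKwicContextSize = 0;

    // Serialize every option in a fixed order, one "Name value" pair per line.
    std::string SaveOptionsToString() const
    {
        std::ostringstream out;
        out << "CaseSensitive " << caseSensitive << '\n'
            << "IndexPunctuation " << indexPunctuation << '\n'
            << "LeftKwicContextSize " << leftKwicContextSize << '\n';
        return out.str();
    }

    // Two option sets are equal when their serialized forms match.
    bool HasEqualOptions(const Options& x) const
    {
        return SaveOptionsToString() == x.SaveOptionsToString();
    }
};
```

Comparing serialized strings keeps the equality check in step with the serializer: any option that is written out is automatically part of the comparison.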
const char * CConcIndexator::GetIndexTypeStr | ( | ) | const [private] |
return a string representation of index type
References DWDS_Index, Free_Index, m_IndexType, and MorphXML_Index.
Referenced by LoadOptionsFromString(), and SaveOptionsToString().
bool CConcIndexator::ReadIndexTypeFromStr | ( | const string & | s | ) | [private] |
read the index type from a string
References m_IndexType.
Referenced by LoadOptionsFromString().
bool CConcIndexator::LoadXmlFile | ( | string | FileName, | |
const char * | pFileBuffer, | |||
CGraphmatFile * | piGraphmat, | |||
CBibliography & | Bibl, | |||
string & | strError | |||
) | [private] |
References ErrorMessage(), Format(), CGraphmatFile::GetLastError(), CConcXml::GetTextAreaElements(), CConcXml::GetTextAreasCount(), GetTextFromXMLRecursive(), CGraphmatFile::LoadStringToGraphan(), CConcXml::LoadXmlAndReadBibliography(), m_Bibl, CBibliography::m_StartPageInfo, and UnknownPageNumber.
Referenced by LoadFileIntoGraphan().
bool CConcIndexator::LoadFileIntoGraphan | ( | string | FileName, | |
const char * | pFileBuffer, | |||
CGraphmatFile * | piGraphmat, | |||
CBibliography & | Bibl, | |||
string & | strError | |||
) | [private] |
References CBibliography::CleanBibliography(), ErrorMessage(), Format(), CGraphmatFile::GetLastError(), IsXmlFile(), CGraphmatFile::LoadStringToGraphan(), LoadXmlFile(), m_Bibl, CExpc::m_strCause, and CConcXml::SetFreeBiblAttribsEmpty().
Referenced by IndexTextOrHtmlFile().
void CConcIndexator::InitDefaultOptions | ( | ) | [private] |
References DefaultKwicContextSize, m_bArchiveIndex, m_bCaseSensitive, m_bDisableDefaultQueryLexicalExpansion, m_bDwdsCorpusInterface, m_bEmptyLineIsSentenceDelim, m_bGutenbergInterface, m_bIndexChunks, m_bIndexMorphPatterns, m_bIndexPunctuation, m_bNoContextOperator, m_bOutputBibliographyOfHits, m_bQueryOnlyFiles, m_bResumeOnIndexErrors, m_bShowNumberOfRelevantDocuments, m_bUseDwdsThesaurus, m_bUseIndention, m_bUseParagraphTagToDivide, m_bUserMaxTokenCountInOnePeriod, CHighlightTags::m_FirstCloser, CHighlightTags::m_FirstOpener, m_HtmlHighlighting, m_IndexType, m_InterpDelimiter, m_Language, m_LeftKwicContextSize, m_NearRank, m_NumberOfKwicLinesInSnippets, CStringIndexator::m_Path, m_PositionRank, CHighlightTags::m_RestCloser, CHighlightTags::m_RestOpener, m_RightKwicContextSize, m_TextHighlighting, m_TfIdfRank, m_TokenDelimiter, m_UserMaxTokenCountInOnePeriod, and m_Utf8.
Referenced by CConcIndexator().
RML_RE::Options CConcIndexator::GetRegexOptions | ( | ) | const [inline] |
References m_PcreCharacterTables, and m_Utf8.
Referenced by CQueryTokenNode::BuildRegExp(), and CQueryParser::ParseQueryOperators().
bool CConcIndexator::IsDwdsCorpusInterface | ( | ) | const [inline] |
return true, if DDC outputs results in DWDS format
References m_bDwdsCorpusInterface.
Referenced by CConcHolder::ShowBibliographyForTextOrHtml().
bool CConcIndexator::IsGutenbergInterface | ( | ) | const [inline] |
return true, if DDC outputs results in Gutenberg project format
References m_bGutenbergInterface.
Referenced by CConcHolder::ShowBibliographyForTextOrHtml().
bool CConcIndexator::HasContextOperator | ( | ) | const [inline] |
return true, if query context operator (Cntxt) is switched off
References m_bNoContextOperator.
Referenced by CQueryParser::ParseQuery().
bool CConcIndexator::UseDwdsThesaurus | ( | ) | const [inline] |
return true, if DWDS thesaurus is enabled (index "Thes")
References m_bUseDwdsThesaurus.
Referenced by CConcIndexatorInvoker::BuildIndex().
bool CConcIndexator::OutputBibliographyOfHits | ( | ) | const [inline] |
return true, if DDC should output bibliographical information for hits instead of corpus file names
References m_bOutputBibliographyOfHits.
Referenced by CConcHolder::GenerateOneHitString(), and CConcHolder::GenerateOneHitStringJson().
string CConcIndexator::GetHtmlReference | ( | size_t | posFile | ) | const |
get an HTML formatted reference to a corpus file
References Format(), m_CorpusFiles, m_InternetPathPrefix, and m_LocalPathPrefix.
Referenced by CConcHolder::AddFileReference().
string CConcIndexator::GetShortFilename | ( | size_t | posFile | ) | const |
get a reference to a corpus file without the common left prefix
References m_CommonFilePrefix, and m_CorpusFiles.
Referenced by CConcHolder::AddFileReference(), and CConcHolder::ShowBibliographyForTextOrHtml().
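To illustrate what "without the common left prefix" could mean, here is a small sketch. It assumes that the common prefix is the longest string shared by the beginning of all corpus file names; how DDC actually computes m_CommonFilePrefix is not documented here, and the function names `CommonPrefix` and `ShortFilename` are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Longest prefix shared by all file names (empty if the list is empty).
std::string CommonPrefix(const std::vector<std::string>& files)
{
    if (files.empty())
        return "";
    std::string prefix = files.front();
    for (const std::string& f : files) {
        std::size_t n = 0;
        while (n < prefix.size() && n < f.size() && prefix[n] == f[n])
            ++n;
        prefix.resize(n);  // shrink to the part shared with f
    }
    return prefix;
}

// Strip the prefix from a file name; names that do not start with the
// prefix are returned unchanged.
std::string ShortFilename(const std::string& file, const std::string& prefix)
{
    if (file.compare(0, prefix.size(), prefix) == 0)
        return file.substr(prefix.size());
    return file;
}
```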
string CConcIndexator::GetFileNameForCorpusFileNames | ( | ) | const |
get file name for storing corpus file names
References CStringIndexator::m_Path, and MakeFName().
Referenced by CreateAsUnion(), DestroyIndex(), CConcIndexatorInvoker::FinalizeIndex(), LoadCorpusFiles(), and SaveCorpusFileList().
string CConcIndexator::GetFileNameForMaskedFiles | ( | ) | const |
get file name for masked files
References CStringIndexator::m_Path, and MakeFName().
Referenced by DestroyIndex(), and LoadMaskedFiles().
vector< string > CConcIndexator::GetTokenFields | ( | const COutputToken & | tok | ) |
parse a delimited token into fields by splitting on m_InterpDelimiter
References m_IndicesToShow, m_InterpDelimiter, COutputToken::m_InterpStr, and COutputToken::m_TokenStr.
Referenced by CConcHolder::BuildJsonContextString().
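A minimal sketch of splitting a token into fields follows. The `OutputToken` struct is a simplified stand-in for COutputToken, the delimiter is assumed to be a single character, and placing the token string before the interpretation fields is a guess; the role of m_IndicesToShow is left out entirely.

```cpp
#include <string>
#include <vector>

// Simplified stand-in for COutputToken (hypothetical layout).
struct OutputToken {
    std::string tokenStr;   // surface form of the token
    std::string interpStr;  // delimited interpretation fields
};

// Return the token string followed by the interpretation fields obtained
// by splitting interpStr on delim. Empty fields are kept.
std::vector<std::string> TokenFields(const OutputToken& tok, char delim)
{
    std::vector<std::string> fields;
    fields.push_back(tok.tokenStr);
    if (tok.interpStr.empty())
        return fields;
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type pos = tok.interpStr.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(tok.interpStr.substr(start));
            return fields;
        }
        fields.push_back(tok.interpStr.substr(start, pos - start));
        start = pos + 1;
    }
}
```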
void CConcIndexator::InitGraphanProperties | ( | CGraphmatFile * | piGraphmat | ) | const |
initializes graphematics using current options
References CGraphmatFile::m_bConvertRussianJo2Je, m_bEmptyLineIsSentenceDelim, CGraphmatFile::m_bEmptyLineIsSentenceDelim, CGraphmatFile::m_bFilterUnprintableSymbols, m_bUseIndention, CGraphmatFile::m_bUseIndention, m_bUseParagraphTagToDivide, CGraphmatFile::m_bUseParagraphTagToDivide, m_Language, and CUnitHolder::m_Language.
Referenced by CConcIndexatorInvoker::BuildIndex().
bool CConcIndexator::WasIndexed | ( | ) | const |
true, when the corpus index was stored to the disk
References CStringIndexator::GetSearchPeriodsFileName(), and CStringIndexator::m_Path.
bool CConcIndexator::LoadSourceFilesAndOptions | ( | string | FileName | ) |
loads the list of source files and parses the option file (*.opt)
References concord_daemon_log(), Format(), LoadFileToString(), LoadOptionsFromString(), m_bUserMaxTokenCountInOnePeriod, m_InternetPathPrefix, m_LocalPathPrefix, CStringIndexator::m_Path, MakeFName(), CSourceFileHolder::ReadSourceFileList(), and CStringIndexator::SetPath().
Referenced by CConcIndexatorInvoker::BuildIndex(), and LoadProject().
bool CConcIndexator::LoadCorpusFiles | ( | ) |
load list of corpus files (*.con)
References DDCVersion, ErrorMessage(), Format(), GetFileNameForCorpusFileNames(), m_CommonFilePrefix, m_CorpusFiles, CStringIndexator::m_Path, and Trim().
Referenced by LoadProject().
bool CConcIndexator::LoadMaskedFiles | ( | ) |
load list of masked (deleted) corpus files
References concord_daemon_log(), Format(), GetFileNameForMaskedFiles(), m_CorpusFiles, m_MaskedFiles, and Trim().
Referenced by LoadProject().
bool CConcIndexator::SaveOptions | ( | string | FileName | ) | const |
saves options to option file (*.opt)
References MakeFName(), and SaveOptionsToString().
Referenced by CreateAsUnion().
bool CConcIndexator::SaveCorpusFileList | ( | ) | const |
saves corpus file list (*._con)
References DDCVersion, GetFileNameForCorpusFileNames(), and m_CorpusFiles.
Referenced by CreateAsUnion(), and CConcIndexatorInvoker::FinalizeIndex().
bool CConcIndexator::LoadProject | ( | string | FileName | ) |
loads everything
References AssertHasPath(), ErrorMessage(), Format(), CHitBorders::GetFileBreaks(), CConcXml::LoadBibl(), LoadCorpusFiles(), CHitBorders::LoadHitBorders(), LoadMaskedFiles(), LoadSourceFilesAndOptions(), m_Bibl, m_CorpusFiles, CStringIndexator::m_Path, and CStringIndexator::ReadIndicesFromTheDisk().
Referenced by CConcIndexatorInvoker::BuildOnlyMorphIndex(), CDDCCorpusListenHost::LoadHolder(), and CConcordance::LoadProject().
bool CConcIndexator::StartIndexing | ( | ) |
begins indexing
References m_Bibl, CStringIndexator::m_Path, CConcXml::Start(), CHitBorders::StartIndexing(), and CStringIndexator::StartIndexing().
Referenced by CConcIndexatorInvoker::BuildIndex().
bool CConcIndexator::DestroyIndex | ( | ) |
destroy all index files
References AssertHasPath(), CStringIndexator::DestroyIndices(), FileExists(), GetFileNameForCorpusFileNames(), GetFileNameForMaskedFiles(), CStringIndexator::GetSearchPeriodsFileName(), m_CorpusFiles, m_MaskedFiles, CStringIndexator::m_Path, and CHitBorders::RemoveHitBordersFileAndClear().
Referenced by CConcIndexatorInvoker::BuildIndex().
bool CConcIndexator::NormalEndIndexing | ( | ) |
finishes indexing (normal way)
References CConcXml::FinalSaveBibliography(), and m_Bibl.
Referenced by CConcIndexatorInvoker::FinalizeIndex().
bool CConcIndexator::TerminateIndexing | ( | ) |
terminates indexing (for exceptions)
References CHitBorders::BordersEndIndexing(), CConcXml::ExitWithoutSave(), m_Bibl, CStringIndexator::m_Path, and CStringIndexator::TerminateIndexing().
Referenced by CConcIndexatorInvoker::BuildIndex().
bool CConcIndexator::IndexOneFile | ( | CGraphmatFile * | piGraphmat, | |
string | FileName, | |||
const char * | pFileBuffer, | |||
const CDwdsThesaurus * | pDwdsThesaurus, | |||
CTokenNo & | CorpusEndTokenNo, | |||
string & | strError | |||
) |
index one file according to m_IndexType
References DWDS_Index, Free_Index, IndexMorphXml(), IndexTable(), IndexTextOrHtmlFile(), m_IndexType, and MorphXML_Index.
Referenced by CConcIndexatorInvoker::BuildIndex().
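The dispatch on m_IndexType can be pictured as a simple switch. The sketch below only returns the name of the private method that would handle a file; the pairing of each index type with a method is inferred from the method names and reference lists on this page and is not confirmed by the source. `HandlerFor` is a hypothetical helper.

```cpp
#include <string>

// Same enumerators as CConcIndexator::DDCIndexTypeEnum, redeclared here so
// the sketch is self-contained.
enum DDCIndexTypeEnum { DWDS_Index, MorphXML_Index, Free_Index };

// Name of the indexing routine assumed to handle each index type.
std::string HandlerFor(DDCIndexTypeEnum type)
{
    switch (type) {
        case DWDS_Index:     return "IndexTextOrHtmlFile";  // plain text or HTML
        case MorphXML_Index: return "IndexMorphXml";        // annotated XML
        case Free_Index:     return "IndexTable";           // CWB-style table
    }
    return "";  // unreachable for valid enumerators
}
```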
void CConcIndexator::CalculateSearchPeriods | ( | DWORD | MaxTokenCountInOnePeriod | ) |
finds all subcorpora
References CHitBorders::GetCorpusEndTokenNo(), CHitBorders::GetFileBreaks(), CHitBorders::GetFileStartTokenNo(), m_CorpusFiles, and CStringIndexator::m_SearchPeriods.
Referenced by CreateAsUnion(), and CConcIndexatorInvoker::FinalizeIndex().
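One way to picture the division into subcorpora (search periods) is a greedy grouping of consecutive files under a token limit. This is only a sketch under assumptions: the page does not say how DDC chooses the period boundaries, and the function name, parameters and the rule that a period always ends on a file boundary are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// fileStarts: start token number of each corpus file, in ascending order.
// corpusEnd:  token number one past the last token of the corpus.
// maxTokens:  upper bound on tokens per period (may be exceeded only by a
//             single file that is itself larger than the bound).
// Returns the exclusive end token number of each period.
std::vector<std::uint32_t> CalculatePeriods(
    const std::vector<std::uint32_t>& fileStarts,
    std::uint32_t corpusEnd,
    std::uint32_t maxTokens)
{
    std::vector<std::uint32_t> periodEnds;
    if (fileStarts.empty())
        return periodEnds;
    std::uint32_t periodStart = fileStarts.front();
    for (std::size_t i = 0; i < fileStarts.size(); ++i) {
        // File i ends where file i+1 starts, or at the end of the corpus.
        std::uint32_t fileEnd =
            (i + 1 < fileStarts.size()) ? fileStarts[i + 1] : corpusEnd;
        // Close the current period before file i if adding the file would
        // exceed the limit and the period already contains a file.
        if (fileEnd - periodStart > maxTokens && fileStarts[i] > periodStart) {
            periodEnds.push_back(fileStarts[i]);
            periodStart = fileStarts[i];
        }
    }
    periodEnds.push_back(corpusEnd);  // the last period ends with the corpus
    return periodEnds;
}
```

Under these assumptions, files starting at tokens 0, 40, 90 and 120 in a corpus of 200 tokens with a limit of 100 fall into three periods ending at tokens 90, 120 and 200.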
bool CConcIndexator::CreateAsUnion | ( | const CConcIndexator & | _X1, | |
const CConcIndexator & | _X2 | |||
) |
creates a new concordance as the union of two concordances
References CSourceFileHolder::AddSourceFilesFrom(), CalculateSearchPeriods(), CStringIndexator::ClearStringIndices(), CSourceFileHolder::DeleteAllSourceFiles(), CStringIndexator::FinalSaveAllIndices(), CHitBorders::GetCorpusEndTokenNo(), GetFileNameForCorpusFileNames(), CStringIndexator::GetIndicesString(), GetMaxTokenCountInOnePeriod(), HasEqualOptions(), LoadOptionsFromString(), m_Bibl, m_CorpusFiles, CStringIndexator::m_Indices, CStringIndexator::m_Path, CExpc::m_strCause, SaveCorpusFileList(), SaveOptions(), SaveOptionsToString(), CSourceFileHolder::SaveSourceFileList(), CConcXml::SetPath(), CConcXml::UniteBibliography(), and CHitBorders::UniteBorders().
bool CConcIndexator::CreateMorphIndex | ( | ) |
creates morphology index
References CIndexSetForLoadingStage::AddInputLoadIndexToMemoryLoadIndex(), CIndexSetForLoadingStage::AddMemoryLoadIndexToMainLoadIndex(), CheckLanguage(), CIndexSetForLoadingStage::CreateTempFiles(), CStringIndexSet::DestroyIndexSet(), ErrorMessage(), GetGramInfosFromWord(), CStringIndexator::GetIndexByName(), GetIndexItemSetByVectorString(), CStringIndexSet::GetIndexItemStr(), GetMaxTokenCountInOnePeriod(), CIndexSetForLoadingStage::InsertToInputLoadIndex(), is_upper_alpha(), CIndexSetForQueryingStage::LoadIndexSet(), m_bIndexMorphPatterns, CIndexSetForQueryingStage::m_Index, m_Language, CStringIndexator::m_Path, CIndexSetForQueryingStage::ReadAllOccurrences(), CIndexSetForLoadingStage::SaveMemoryLoadIndex(), CIndexSetForLoadingStage::SortInputAndMemoryIndices(), and CStringIndexSet::WriteToFile().
Referenced by CConcIndexatorInvoker::BuildOnlyMorphIndex(), and CConcIndexatorInvoker::FinalizeIndex().
DWORD CConcIndexator::GetMaxTokenCountInOnePeriod | ( | ) | const |
returns the size of one subcorpus
References m_bUserMaxTokenCountInOnePeriod, and m_UserMaxTokenCountInOnePeriod.
Referenced by CConcIndexatorInvoker::BuildIndex(), CreateAsUnion(), and CreateMorphIndex().
string CConcIndexator::GetIndexItemSetByVectorString | ( | const vector< string > & | TokenProperties, | |
bool | bRegexp | |||
) |
return a string representation of a set of token properties (in the format which is used in the index)
References MorphAnnotationsDelimRegExp.
Referenced by CQueryTokenNode::CreateMorphAnnotationPattern(), CreateMorphIndex(), and IndexMorphXml().
vector<BYTE> CConcIndexator::m_PcreCharacterTables [private] |
a table of character properties for regular expressions which depend on CConcIndexator::m_Language
Referenced by GetRegexOptions(), and LoadOptionsFromString().
bool CConcIndexator::m_bUseParagraphTagToDivide [private] |
Enables using "<p>" tag as a paragraph delimiter.
Referenced by InitDefaultOptions(), InitGraphanProperties(), LoadOptionsFromString(), and SaveOptionsToString().
bool CConcIndexator::m_bEmptyLineIsSentenceDelim [private] |
if m_bEmptyLineIsSentenceDelim is on, every empty line in the input file is considered to be the end of the sentence.
Referenced by InitDefaultOptions(), InitGraphanProperties(), LoadOptionsFromString(), and SaveOptionsToString().
bool CConcIndexator::m_bUseIndention [private] |
if m_bUseIndention is on, the program tries to find paragraphs using indentions
Referenced by InitDefaultOptions(), InitGraphanProperties(), LoadOptionsFromString(), and SaveOptionsToString().
bool CConcIndexator::m_bDwdsCorpusInterface [private] |
if m_bDwdsCorpusInterface is on, the program outputs results in DWDS format
Referenced by InitDefaultOptions(), IsDwdsCorpusInterface(), LoadOptionsFromString(), and SaveOptionsToString().
bool CConcIndexator::m_bGutenbergInterface [private] |
if m_bGutenbergInterface is on, the program outputs results in the format of the Gutenberg project
Referenced by InitDefaultOptions(), IsGutenbergInterface(), LoadOptionsFromString(), and SaveOptionsToString().
bool CConcIndexator::m_bNoContextOperator [private] |
should we switch off the context operator (Cntxt) due to copyright
Referenced by HasContextOperator(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().
The maximal number of occurrences in one subcorpus (defined by user).
Referenced by GetMaxTokenCountInOnePeriod(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().
bool CConcIndexator::m_bUserMaxTokenCountInOnePeriod [private] |
bool CConcIndexator::m_bUseDwdsThesaurus [private] |
Enables indexing and querying using DWDS Thesaurus.
Referenced by IndexTextOrHtmlFile(), InitDefaultOptions(), LoadOptionsFromString(), SaveOptionsToString(), and UseDwdsThesaurus().
bool CConcIndexator::m_bOutputBibliographyOfHits [private] |
Should we show bibliography of the hits instead of filename.
Referenced by InitDefaultOptions(), LoadOptionsFromString(), OutputBibliographyOfHits(), and SaveOptionsToString().
bool CConcIndexator::m_bIndexPunctuation [private] |
Enables indexing all punctuation marks.
Referenced by InitDefaultOptions(), IsDWDSToken(), LoadOptionsFromString(), and SaveOptionsToString().
DDCIndexTypeEnum CConcIndexator::m_IndexType [private] |
the type of index
Referenced by GetIndexTypeStr(), IndexOneFile(), InitDefaultOptions(), LoadOptionsFromString(), and ReadIndexTypeFromStr().
string CConcIndexator::m_InternetPathPrefix [private] |
Referenced by GetHtmlReference(), LoadOptionsFromString(), LoadSourceFilesAndOptions(), and SaveOptionsToString().
string CConcIndexator::m_LocalPathPrefix [private] |
Referenced by GetHtmlReference(), LoadOptionsFromString(), LoadSourceFilesAndOptions(), and SaveOptionsToString().
string CConcIndexator::m_CommonFilePrefix [private] |
Referenced by GetShortFilename(), and LoadCorpusFiles().
the language of the corpus
Referenced by CQueryTokenNode::CreateFileList(), CreateMorphIndex(), CQueryTokenNode::CreateThesPattern(), CQueryTokenNode::CreateTokenPattern(), InitDefaultOptions(), InitGraphanProperties(), LoadOptionsFromString(), and SaveOptionsToString().
Enables the index of morph patterns.
Referenced by CreateMorphIndex(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().
Enables indexing and querying using chunks.
Referenced by IndexOneTableTextArea(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().
if true, then the default search is case sensitive
Referenced by CQueryTokenNode::CreateFileList(), CQueryTokenNode::CreateTokenPattern(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().
if true, then DDC always calculates the number of documents where at least one hit is found
Referenced by CConcHolder::GetAllHits(), InitDefaultOptions(), LoadOptionsFromString(), SaveOptionsToString(), and CConcHolder::SimpleQuery().
prohibits sentence break collection under DWDS_Index or MorphXML_Index
Referenced by IndexTextOrHtmlFile(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().
sets that index should be archived under DWDS_Index or MorphXML_Index
Referenced by InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().
if true, CConcIndexatorInvoker skips source documents with errors
Referenced by CConcIndexatorInvoker::BuildIndex(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().
vector<string> CConcIndexator::m_CorpusFiles |
Referenced by CConcIndexatorInvoker::BuildIndex(), CalculateSearchPeriods(), CreateAsUnion(), DestroyIndex(), CConcHolder::GenerateOneHitStringJson(), CConcHolder::GetAllHits(), GetHtmlReference(), GetShortFilename(), LoadCorpusFiles(), LoadMaskedFiles(), LoadProject(), SaveCorpusFileList(), CConcHolder::ShowBibliographyForTable(), and CConcHolder::ShowBibliographyForTextOrHtml().
set<size_t> CConcIndexator::m_MaskedFiles |
masked (deleted) corpus files
Referenced by DestroyIndex(), CConcHolder::GetAllHits(), and LoadMaskedFiles().
a member which holds an index for bibliographical information
Referenced by CQueryNode::ConvertOccurrencesToHits(), CQueryNode::ConvertOccurrencesToHitsForPatterns(), CreateAsUnion(), CConcHolder::GenerateOneHitStringJson(), IndexMorphXml(), IndexTable(), IndexTextOrHtmlFile(), CConcHolder::InitLessByRank(), CConcHolder::InitOrderIDForHits(), LoadFileIntoGraphan(), LoadOptionsFromString(), LoadProject(), LoadXmlFile(), NormalEndIndexing(), CQueryParser::ParseQuery(), CQueryParser::ParseQueryOperators(), SaveOptionsToString(), CConcHolder::ShowBibliographyForTable(), CConcHolder::ShowBibliographyForTextOrHtml(), StartIndexing(), and TerminateIndexing().
highlighting tags for CConcHolder::m_ResultFormat == DDC_ResultHTML
Referenced by CConcHolder::GenerateOneHitString(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().
highlighting delimiters for CConcHolder::m_ResultFormat == DDC_ResultTEXT
Referenced by CConcHolder::GenerateOneHitString(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().
if true, then no default lexical expansion of query words occurs
Referenced by CQueryTokenNode::CreateTokenPattern(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().
the size of the left context of the highlighted words in document search
Referenced by CConcHolder::GetFileSnippets(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().
the size of the right context of the highlighted words in document search
Referenced by CConcHolder::GetFileSnippets(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().
the maximal number of kwic lines in file snippets
Referenced by CConcHolder::GetFileSnippets(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().
double CConcIndexator::m_TfIdfRank |
the parameter for TfIdf ranking
Referenced by InitDefaultOptions(), CConcHolder::InitLessByRank(), LoadOptionsFromString(), and SaveOptionsToString().
double CConcIndexator::m_NearRank |
the parameter for Near ranking
Referenced by InitDefaultOptions(), CConcHolder::InitLessByRank(), LoadOptionsFromString(), and SaveOptionsToString().
the parameter for Position ranking
Referenced by InitDefaultOptions(), CConcHolder::InitLessByRank(), LoadOptionsFromString(), and SaveOptionsToString().
delimiter to use between token index fields in output
Referenced by CConcHolder::GenerateHitStrings(), GetTokenFields(), InitDefaultOptions(), LoadOptionsFromString(), and SaveOptionsToString().
delimiter to use between tokens in output
Referenced by CConcHolder::GenerateOneHitString(), CConcHolder::GetContext(), InitDefaultOptions(), and LoadOptionsFromString().
whether to assume indexed data is utf8 encoded (default=no)
Referenced by CConcHolder::BuildJsonContextString(), CConcHolder::GenerateOneHitStringJson(), GetRegexOptions(), InitDefaultOptions(), and LoadOptionsFromString().
vector<size_t> CConcIndexator::m_IndicesToShow |
indices to show for Free_Index
Referenced by CConcHolder::BuildJsonContextString(), CConcHolder::GenerateHitStrings(), CConcHolder::GenerateOneHitStringJson(), CConcHolder::GetContext(), CConcHolder::GetContextJson(), GetTokenFields(), LoadOptionsFromString(), and SaveOptionsToString().