CLucene - a full-featured, C++ search engine
API Documentation
#include <IndexWriter.h>
Public Member Functions | |
~IndexWriter () | |
LUCENE_STATIC_CONSTANT (int32_t, DEFAULT_MAX_FIELD_LENGTH=10000) | |
The Java implementation of Lucene silently truncates any tokenized field if the number of tokens exceeds a certain threshold. | |
LUCENE_STATIC_CONSTANT (int32_t, FIELD_TRUNC_POLICY__WARN=-1) | |
int32_t | getMaxFieldLength () const |
void | setMaxFieldLength (int32_t val) |
LUCENE_STATIC_CONSTANT (int32_t, DEFAULT_MAX_BUFFERED_DOCS=10) | |
Default value is 10. | |
void | setMaxBufferedDocs (int32_t val) |
Determines the minimal number of documents required before the buffered in-memory documents are merged and a new Segment is created. | |
int32_t | getMaxBufferedDocs () |
LUCENE_STATIC_CONSTANT (int64_t, WRITE_LOCK_TIMEOUT=1000) | |
Default value for the write lock timeout (1,000). | |
void | setWriteLockTimeout (int64_t writeLockTimeout) |
Sets the maximum time to wait for a write lock (in milliseconds). | |
int64_t | getWriteLockTimeout () |
LUCENE_STATIC_CONSTANT (int64_t, COMMIT_LOCK_TIMEOUT=10000) | |
Default value for the commit lock timeout (10,000). | |
void | setCommitLockTimeout (int64_t commitLockTimeout) |
Sets the maximum time to wait for a commit lock (in milliseconds). | |
int64_t | getCommitLockTimeout () |
LUCENE_STATIC_CONSTANT (int32_t, DEFAULT_MERGE_FACTOR=10) | |
Default value is 10. | |
int32_t | getMergeFactor () const |
void | setMergeFactor (int32_t val) |
LUCENE_STATIC_CONSTANT (int32_t, DEFAULT_TERM_INDEX_INTERVAL=128) | |
Expert: The fraction of terms in the "dictionary" which should be stored in RAM. | |
void | setTermIndexInterval (int32_t interval) |
Expert: Set the interval between indexed terms. | |
int32_t | getTermIndexInterval () |
Expert: Return the interval between indexed terms. | |
int32_t | getMinMergeDocs () const |
Determines the minimal number of documents required before the buffered in-memory documents are merged and a new Segment is created. | |
void | setMinMergeDocs (int32_t val) |
LUCENE_STATIC_CONSTANT (int32_t, DEFAULT_MAX_MERGE_DOCS=0x7FFFFFFFL) | |
Determines the largest number of documents ever merged by addDocument(). | |
int32_t | getMaxMergeDocs () const |
Determines the largest number of documents ever merged by addDocument(). | |
void | setMaxMergeDocs (int32_t val) |
IndexWriter (const char *path, lucene::analysis::Analyzer *a, const bool create, const bool closeDir=true) | |
Constructs an IndexWriter for the index in path . | |
IndexWriter (lucene::store::Directory *d, lucene::analysis::Analyzer *a, const bool create, const bool closeDir=false) | |
Constructs an IndexWriter for the index in d . | |
void | close () |
Flushes all changes to an index, closes all associated files, and closes the directory that the index is stored in. | |
int32_t | docCount () |
Returns the number of documents currently in this index. | |
void | addDocument (lucene::document::Document *doc, lucene::analysis::Analyzer *analyzer=NULL) |
Adds a document to this index, using the provided analyzer instead of the value of getAnalyzer(). | |
void | optimize () |
Merges all segments together into a single segment, optimizing an index for search. | |
void | addIndexes (lucene::store::Directory **dirs) |
Merges all segments from an array of indices into this index. | |
void | addIndexes (IndexReader **readers) |
Merges the provided indexes into this index. | |
lucene::store::Directory * | getDirectory () |
Returns the directory this index resides in. | |
bool | getUseCompoundFile () |
Get the current setting of whether to use the compound file format. | |
void | setUseCompoundFile (bool value) |
Setting to turn on usage of a compound file. | |
void | setSimilarity (lucene::search::Similarity *similarity) |
Expert: Set the Similarity implementation used by this IndexWriter. | |
lucene::search::Similarity * | getSimilarity () |
Expert: Return the Similarity implementation used by this IndexWriter. | |
lucene::analysis::Analyzer * | getAnalyzer () |
Returns the analyzer used by this index. | |
Data Fields | |
SegmentInfos * | segmentInfos |
Static Public Attributes | |
static const char * | WRITE_LOCK_NAME |
static const char * | COMMIT_LOCK_NAME |
Friends | |
class | LockWith2 |
class | LockWithCFS |
The third argument to the constructor determines whether a new index is created, or whether an existing index is opened for the addition of new documents.
In either case, documents are added with the addDocument method. When finished adding documents, close should be called.
If an index will not have more documents added for a while and optimal search performance is desired, then the optimize method should be called before the index is closed.
Opening an IndexWriter creates a lock file for the directory in use. Trying to open another IndexWriter on the same directory will lead to an IOException. The IOException is also thrown if an IndexReader on the same directory is used to delete documents from the index.
lucene::index::IndexWriter::~IndexWriter | ( | ) |
lucene::index::IndexWriter::IndexWriter | ( | const char * | path, | |
lucene::analysis::Analyzer * | a, | |||
const bool | create, | |||
const bool | closeDir = true | |||
) |
Constructs an IndexWriter for the index in path. Text will be analyzed with a. If create is true, then a new, empty index will be created in path, replacing the index already there, if any.
path | the path to the index directory | |
a | the analyzer to use | |
create | true to create the index or overwrite the existing one; false to append to the existing index |
IOException | if the directory cannot be read/written to, or if it does not exist, and create is false |
lucene::index::IndexWriter::IndexWriter | ( | lucene::store::Directory * | d, | |
lucene::analysis::Analyzer * | a, | |||
const bool | create, | |||
const bool | closeDir = false | |||
) |
Constructs an IndexWriter for the index in d. Text will be analyzed with a. If create is true, then a new, empty index will be created in d, replacing the index already there, if any.
lucene::index::IndexWriter::LUCENE_STATIC_CONSTANT | ( | int32_t | , | |
DEFAULT_MAX_FIELD_LENGTH | = 10000 | |||
) |
The Java implementation of Lucene silently truncates any tokenized field if the number of tokens exceeds a certain threshold.
Although that threshold is adjustable, it is easy for the client programmer to be unaware that such a threshold exists, and to become its unwitting victim. CLucene implements a less insidious truncation policy. Up to DEFAULT_MAX_FIELD_LENGTH tokens, CLucene behaves just as JLucene does. If the number of tokens exceeds that threshold without any indication of a truncation preference by the client programmer, CLucene raises an exception, prompting the client programmer to explicitly set a truncation policy by adjusting maxFieldLength.
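The policy described above can be modelled in isolation. The following is a minimal sketch, not CLucene's actual implementation: the helper name tokensToIndex and the use of std::length_error are hypothetical, and it assumes that a maxFieldLength equal to FIELD_TRUNC_POLICY__WARN means the client programmer has not stated a truncation preference.

```cpp
#include <cstdint>
#include <stdexcept>

// Values copied from the constants documented on this page.
constexpr std::int32_t DEFAULT_MAX_FIELD_LENGTH = 10000;
constexpr std::int32_t FIELD_TRUNC_POLICY__WARN = -1;

// Hypothetical helper: returns how many of a field's tokenCount tokens
// would be indexed under the given maxFieldLength.
// Assumption: maxFieldLength == FIELD_TRUNC_POLICY__WARN means that no
// truncation policy has been chosen by the caller.
std::int32_t tokensToIndex(std::int32_t tokenCount, std::int32_t maxFieldLength) {
    if (maxFieldLength == FIELD_TRUNC_POLICY__WARN) {
        if (tokenCount > DEFAULT_MAX_FIELD_LENGTH) {
            // No preference stated and the threshold is exceeded: fail loudly.
            throw std::length_error(
                "field exceeds DEFAULT_MAX_FIELD_LENGTH; set maxFieldLength explicitly");
        }
        return tokenCount;  // below the threshold nothing is dropped
    }
    // An explicit limit was set: truncate silently at that limit.
    return tokenCount < maxFieldLength ? tokenCount : maxFieldLength;
}
```

Under this model, a caller who expects long fields would call setMaxFieldLength with an explicit limit before adding documents, which turns the exception into a deliberate truncation.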
lucene::index::IndexWriter::LUCENE_STATIC_CONSTANT | ( | int32_t | , | |
FIELD_TRUNC_POLICY__WARN | = -1 | |||
) |
int32_t lucene::index::IndexWriter::getMaxFieldLength | ( | ) | const [inline] |
void lucene::index::IndexWriter::setMaxFieldLength | ( | int32_t | val | ) | [inline] |
lucene::index::IndexWriter::LUCENE_STATIC_CONSTANT | ( | int32_t | , | |
DEFAULT_MAX_BUFFERED_DOCS | = 10 | |||
) |
Default value is 10.
Change using setMaxBufferedDocs(int32_t).
void lucene::index::IndexWriter::setMaxBufferedDocs | ( | int32_t | val | ) | [inline] |
Determines the minimal number of documents required before the buffered in-memory documents are merged and a new Segment is created.
Since Documents are merged in a RAMDirectory, a larger value gives faster indexing. At the same time, mergeFactor limits the number of files open in an FSDirectory.
The default value is DEFAULT_MAX_BUFFERED_DOCS.
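To get a feel for how this setting trades memory for fewer segment writes, the buffering can be simulated with a simple counter. This is a simplified model under stated assumptions, not CLucene code: the names FlushCounts and simulateBuffering are hypothetical, and the model assumes a segment is written exactly when maxBufferedDocs documents have accumulated, ignoring any later segment merges.

```cpp
#include <cstdint>

// Result of the simulation (hypothetical type, for illustration only).
struct FlushCounts {
    std::int64_t segmentsWritten;    // segments created from full buffers
    std::int64_t docsStillBuffered;  // documents still waiting in memory
};

// Simplified model: every time maxBufferedDocs documents have been
// buffered, they are written out as one new segment.
// Assumes maxBufferedDocs >= 1.
FlushCounts simulateBuffering(std::int64_t docsAdded, std::int32_t maxBufferedDocs) {
    FlushCounts counts{0, 0};
    for (std::int64_t i = 0; i < docsAdded; ++i) {
        ++counts.docsStillBuffered;
        if (counts.docsStillBuffered >= maxBufferedDocs) {
            ++counts.segmentsWritten;
            counts.docsStillBuffered = 0;
        }
    }
    return counts;
}
```

In this model, adding 1,000 documents with the default of 10 writes 100 segments, while a value of 100 writes only 10, at the cost of holding more documents in memory at once.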
int32_t lucene::index::IndexWriter::getMaxBufferedDocs | ( | ) | [inline] |
lucene::index::IndexWriter::LUCENE_STATIC_CONSTANT | ( | int64_t | , | |
WRITE_LOCK_TIMEOUT | = 1000 | |||
) |
Default value for the write lock timeout (1,000).
void lucene::index::IndexWriter::setWriteLockTimeout | ( | int64_t | writeLockTimeout | ) | [inline] |
Sets the maximum time to wait for a write lock (in milliseconds).
int64_t lucene::index::IndexWriter::getWriteLockTimeout | ( | ) | [inline] |
lucene::index::IndexWriter::LUCENE_STATIC_CONSTANT | ( | int64_t | , | |
COMMIT_LOCK_TIMEOUT | = 10000 | |||
) |
Default value for the commit lock timeout (10,000).
void lucene::index::IndexWriter::setCommitLockTimeout | ( | int64_t | commitLockTimeout | ) | [inline] |
Sets the maximum time to wait for a commit lock (in milliseconds).
int64_t lucene::index::IndexWriter::getCommitLockTimeout | ( | ) | [inline] |
lucene::index::IndexWriter::LUCENE_STATIC_CONSTANT | ( | int32_t | , | |
DEFAULT_MERGE_FACTOR | = 10 | |||
) |
Default value is 10.
Change using setMergeFactor(int32_t).
int32_t lucene::index::IndexWriter::getMergeFactor | ( | ) | const [inline] |
void lucene::index::IndexWriter::setMergeFactor | ( | int32_t | val | ) | [inline] |
lucene::index::IndexWriter::LUCENE_STATIC_CONSTANT | ( | int32_t | , | |
DEFAULT_TERM_INDEX_INTERVAL | = 128 | |||
) |
Expert: The fraction of terms in the "dictionary" which should be stored in RAM.
Smaller values use more memory, but make searching slightly faster, while larger values use less memory and make searching slightly slower. Searching is typically not dominated by dictionary lookup, so tweaking this is rarely useful.
void lucene::index::IndexWriter::setTermIndexInterval | ( | int32_t | interval | ) | [inline] |
Expert: Set the interval between indexed terms.
Large values cause less memory to be used by an IndexReader, but slow down random access to terms. Small values cause more memory to be used by an IndexReader, and speed up random access to terms.
This parameter determines the amount of computation required per query term, regardless of the number of documents that contain that term. In particular, it is the maximum number of other terms that must be scanned before a term is located and its frequency and position information may be processed. In a large index with user-entered query terms, query processing time is likely to be dominated not by term lookup but rather by the processing of frequency and positional data. In a small index or when many uncommon query terms are generated (e.g., by wildcard queries) term lookup may become a dominant cost.
In particular, numUniqueTerms/interval terms are read into memory by an IndexReader, and, on average, interval/2 terms must be scanned for each random term access.
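The two quantities above are easy to compute for a candidate interval before changing it. The sketch below only restates that arithmetic; TermIndexCost and estimateTermIndexCost are hypothetical names, not part of CLucene.

```cpp
#include <cstdint>

// Estimated cost of a given term index interval (hypothetical type).
struct TermIndexCost {
    std::int64_t termsInMemory;  // numUniqueTerms / interval
    double avgTermsScanned;      // interval / 2
};

// Applies the two formulas from the text. Assumes interval >= 1.
TermIndexCost estimateTermIndexCost(std::int64_t numUniqueTerms, std::int32_t interval) {
    return TermIndexCost{numUniqueTerms / interval, interval / 2.0};
}
```

For an index with 1,000,000 unique terms, the default interval of 128 keeps about 7,812 terms in memory and scans 64 terms on average per lookup; an interval of 32 keeps 31,250 terms in memory and scans 16.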
int32_t lucene::index::IndexWriter::getTermIndexInterval | ( | ) | [inline] |
Expert: Return the interval between indexed terms.
int32_t lucene::index::IndexWriter::getMinMergeDocs | ( | ) | const [inline] |
Determines the minimal number of documents required before the buffered in-memory documents are merged and a new Segment is created.
Since Documents are merged in a RAMDirectory, a larger value gives faster indexing. At the same time, mergeFactor limits the number of files open in an FSDirectory.
The default value is 10.
void lucene::index::IndexWriter::setMinMergeDocs | ( | int32_t | val | ) | [inline] |
lucene::index::IndexWriter::LUCENE_STATIC_CONSTANT | ( | int32_t | , | |
DEFAULT_MAX_MERGE_DOCS | = 0x7FFFFFFFL | |||
) |
Determines the largest number of documents ever merged by addDocument().
Small values (e.g., less than 10,000) are best for interactive indexing, as this limits the length of pauses while indexing to a few seconds. Larger values are best for batched indexing and speedier searches.
The default value is DEFAULT_MAX_MERGE_DOCS.
int32_t lucene::index::IndexWriter::getMaxMergeDocs | ( | ) | const [inline] |
Determines the largest number of documents ever merged by addDocument().
Small values (e.g., less than 10,000) are best for interactive indexing, as this limits the length of pauses while indexing to a few seconds. Larger values are best for batched indexing and speedier searches.
The default value is DEFAULT_MAX_MERGE_DOCS.
void lucene::index::IndexWriter::setMaxMergeDocs | ( | int32_t | val | ) | [inline] |
void lucene::index::IndexWriter::close | ( | ) |
int32_t lucene::index::IndexWriter::docCount | ( | ) |
Returns the number of documents currently in this index.
void lucene::index::IndexWriter::addDocument | ( | lucene::document::Document * | doc, | |
lucene::analysis::Analyzer * | analyzer = NULL | |||
) |
Adds a document to this index, using the provided analyzer instead of the value of getAnalyzer().
If a field of the document contains more terms than the limit set with setMaxFieldLength(int32_t), the remaining terms are discarded.
void lucene::index::IndexWriter::optimize | ( | ) |
void lucene::index::IndexWriter::addIndexes | ( | lucene::store::Directory ** | dirs | ) |
Merges all segments from an array of indices into this index.
This may be used to parallelize batch indexing. A large document collection can be broken into sub-collections. Each sub-collection can be indexed in parallel, on a different thread, process or machine. The complete index can then be created by merging sub-collection indices with this method.
After this completes, the index is optimized.
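The split-then-merge pattern described above can be illustrated with a toy inverted index. This is a sketch of the idea only and does not use CLucene: ToyIndex, buildIndex and mergeSubIndexes are hypothetical names, the sub-indexes are built sequentially here although each call is independent and could run on its own thread, process or machine, and the renumbering of document ids during the merge is an assumption of this model.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy stand-in for an index: term -> ids of the documents containing it.
struct ToyIndex {
    int docCount = 0;
    std::map<std::string, std::vector<int>> postings;
};

// Builds an index over one sub-collection; each document is a list of terms.
// Calls for different sub-collections share no state.
ToyIndex buildIndex(const std::vector<std::vector<std::string>>& docs) {
    ToyIndex index;
    for (const auto& doc : docs) {
        const int docId = index.docCount++;
        for (const auto& term : doc) {
            auto& ids = index.postings[term];
            // Record each document at most once per term.
            if (ids.empty() || ids.back() != docId) ids.push_back(docId);
        }
    }
    return index;
}

// Appends every sub-index to target, shifting its document ids past the
// documents already present so that ids stay unique after the merge.
void mergeSubIndexes(ToyIndex& target, const std::vector<ToyIndex>& parts) {
    for (const auto& part : parts) {
        const int base = target.docCount;
        for (const auto& entry : part.postings) {
            auto& ids = target.postings[entry.first];
            for (int id : entry.second) ids.push_back(id + base);
        }
        target.docCount += part.docCount;
    }
}
```

With CLucene itself, the analogous steps would be one IndexWriter per sub-collection followed by a single addIndexes call on the writer for the combined index.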
void lucene::index::IndexWriter::addIndexes | ( | IndexReader ** | readers | ) |
lucene::store::Directory* lucene::index::IndexWriter::getDirectory | ( | ) | [inline] |
Returns the directory this index resides in.
bool lucene::index::IndexWriter::getUseCompoundFile | ( | ) | [inline] |
Get the current setting of whether to use the compound file format.
Note that this just returns the value you set with setUseCompoundFile(bool) or the default. You cannot use this to query the status of an existing index.
void lucene::index::IndexWriter::setUseCompoundFile | ( | bool | value | ) | [inline] |
Setting to turn on usage of a compound file.
When on, multiple files for each segment are merged into a single file once the segment creation is finished. This is done regardless of what directory is in use.
void lucene::index::IndexWriter::setSimilarity | ( | lucene::search::Similarity * | similarity | ) | [inline] |
Expert: Set the Similarity implementation used by this IndexWriter.
lucene::search::Similarity* lucene::index::IndexWriter::getSimilarity | ( | ) | [inline] |
Expert: Return the Similarity implementation used by this IndexWriter.
This defaults to the current value of Similarity::getDefault().
lucene::analysis::Analyzer* lucene::index::IndexWriter::getAnalyzer | ( | ) | [inline] |
Returns the analyzer used by this index.
friend class LockWith2 [friend] |
friend class LockWithCFS [friend] |
SegmentInfos* lucene::index::IndexWriter::segmentInfos |
const char* lucene::index::IndexWriter::WRITE_LOCK_NAME [static] |
const char* lucene::index::IndexWriter::COMMIT_LOCK_NAME [static] |