CLucene - a full-featured, C++ search engine
API Documentation


lucene::analysis::Token Class Reference

A Token is an occurrence of a term from the text of a field. More...

#include <AnalysisHeader.h>


Public Member Functions

 Token ()
 ~Token ()
 Token (const TCHAR *text, const int32_t start, const int32_t end, const TCHAR *typ=defaultType)
 Constructs a Token with the given text, start and end offsets, and type.
void set (const TCHAR *text, const int32_t start, const int32_t end, const TCHAR *typ=defaultType)
size_t bufferLength ()
void growBuffer (size_t size)
void setPositionIncrement (int32_t posIncr)
 Set the position increment.
int32_t getPositionIncrement () const
const TCHAR * termText () const
size_t termTextLength ()
void resetTermTextLen ()
void setText (const TCHAR *txt)
int32_t startOffset () const
 Returns this Token's starting offset, the position of the first character corresponding to this token in the source text.
void setStartOffset (int32_t val)
int32_t endOffset () const
 Returns this Token's ending offset, one greater than the position of the last character corresponding to this token in the source text.
void setEndOffset (int32_t val)
const TCHAR * type () const
 Returns this Token's lexical type. Defaults to "word".
void setType (const TCHAR *val)
 Sets this Token's lexical type. The pointer is stored as given rather than copied.
TCHAR * toString () const

Data Fields

TCHAR * _termText
 The text of the term.
int32_t _termTextLen
 The length of _termText. Internal use only.

Static Public Attributes

static const TCHAR * defaultType


Detailed Description

A Token is an occurrence of a term from the text of a field.

It consists of a term's text, the start and end offset of the term in the text of the field, and a type string.

The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC (KeyWord In Context) display.

The type is an interned string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example, an end-of-sentence marker token might be given the type "eos". The default token type is "word".
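
As an illustration of how the offsets let an application map tokens back to the source text, here is a minimal, self-contained sketch. It does not use CLucene: SimpleToken and highlight() are hypothetical names invented for this example, and std::string stands in for TCHAR buffers. It assumes the tokens are sorted by start offset and do not overlap.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for lucene::analysis::Token; not the CLucene class.
struct SimpleToken {
    std::string text;   // term text (possibly altered by filters)
    int32_t start;      // offset of the first character in the source
    int32_t end;        // one past the last character in the source
    std::string type;   // lexical type, "word" by default
};

// Wrap each token's source span in [ ] using only the offsets.
// Assumes tokens are sorted by start offset and do not overlap.
std::string highlight(const std::string& source,
                      const std::vector<SimpleToken>& hits) {
    std::string out;
    int32_t cursor = 0;
    for (const SimpleToken& t : hits) {
        assert(cursor <= t.start && t.start <= t.end &&
               t.end <= static_cast<int32_t>(source.size()));
        out.append(source, cursor, t.start - cursor);  // text before the hit
        out += '[';
        out.append(source, t.start, t.end - t.start);  // the hit itself
        out += ']';
        cursor = t.end;
    }
    out.append(source, cursor, std::string::npos);     // trailing text
    return out;
}
```

With the source text "The quick foxes" and a single token {"fox", 10, 15, "word"}, highlight() brackets the original word "foxes", even though the term text was stemmed to "fox".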


Constructor & Destructor Documentation

lucene::analysis::Token::Token (  ) 

lucene::analysis::Token::~Token (  ) 

lucene::analysis::Token::Token ( const TCHAR *  text,
const int32_t  start,
const int32_t  end,
const TCHAR *  typ = defaultType 
)

Constructs a Token with the given text, start and end offsets, and type.


Member Function Documentation

void lucene::analysis::Token::set ( const TCHAR *  text,
const int32_t  start,
const int32_t  end,
const TCHAR *  typ = defaultType 
)

size_t lucene::analysis::Token::bufferLength (  )  [inline]

void lucene::analysis::Token::growBuffer ( size_t  size  ) 

void lucene::analysis::Token::setPositionIncrement ( int32_t  posIncr  ) 

Set the position increment.

This determines the position of this token relative to the previous Token in a TokenStream, used in phrase searching.

The default value is 1.

Some common uses for this are:

  • Set it to zero to put multiple terms in the same position. This is useful if, e.g., a word has multiple stems. Searches for phrases including either stem will match. In this case, all but the first stem's increment should be set to zero: the increment of the first instance should be one. Repeating a token with an increment of zero can also be used to boost the scores of matches on that token.

  • Set it to values greater than one to inhibit exact phrase matches. If, for example, one does not want phrases to match across removed stop words, then one could build a stop word filter that removes stop words and also sets the increment to the number of stop words removed before each non-stop word. Then exact phrase queries will only match when the terms occur with no intervening stop words.
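
The effect of the two uses above can be sketched by turning a sequence of increments into absolute positions. This is a standalone illustration, not CLucene code: absolutePositions() is a hypothetical helper, and the starting value of -1 is an assumption chosen so that a first token with the default increment of 1 lands at position 0.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of CLucene: turns per-token position
// increments into absolute positions.
std::vector<int32_t> absolutePositions(const std::vector<int32_t>& increments) {
    std::vector<int32_t> positions;
    int32_t pos = -1;  // assumption: a first increment of 1 yields position 0
    for (int32_t incr : increments) {
        assert(incr >= 0);
        pos += incr;   // an increment of 0 keeps the previous position
        positions.push_back(pos);
    }
    return positions;
}
```

For the increments {1, 1, 0, 2}, the third token shares position 1 with the second (for example, a second stem of the same word), and the fourth token sits at position 3, leaving a gap at position 2 where a stop word was removed. An exact phrase query for the second and fourth terms would therefore not match.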

int32_t lucene::analysis::Token::getPositionIncrement (  )  const

const TCHAR* lucene::analysis::Token::termText (  )  const

size_t lucene::analysis::Token::termTextLength (  ) 

void lucene::analysis::Token::resetTermTextLen (  ) 

void lucene::analysis::Token::setText ( const TCHAR *  txt  ) 

int32_t lucene::analysis::Token::startOffset (  )  const [inline]

Returns this Token's starting offset, the position of the first character corresponding to this token in the source text.

Note that the difference between endOffset() and startOffset() may not be equal to termTextLength(), as the term text may have been altered by a stemmer or some other filter.
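
A small sketch of why the two lengths can differ. It is independent of CLucene: OffsetToken, stripIng() and sourceSpanLength() are hypothetical names, and the "stemmer" is a toy that only strips a trailing "ing". The filter changes the term text but leaves the offsets alone, so they still describe the span in the source.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a token; not the CLucene class.
struct OffsetToken {
    std::string text;
    int32_t start;  // first character in the source
    int32_t end;    // one past the last character in the source
};

// Toy "stemmer": strips a trailing "ing" from the term text only.
// Offsets are left untouched so they still describe the source span.
OffsetToken stripIng(OffsetToken t) {
    const std::string suffix = "ing";
    if (t.text.size() > suffix.size() &&
        t.text.compare(t.text.size() - suffix.size(), suffix.size(), suffix) == 0) {
        t.text.erase(t.text.size() - suffix.size());
    }
    return t;
}

// Length of the source span the token covers.
int32_t sourceSpanLength(const OffsetToken& t) {
    assert(t.start <= t.end);
    return t.end - t.start;
}
```

For a token {"walking", 4, 11}, stripIng() yields the term text "walk" (4 characters), while the source span still covers 7 characters.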

void lucene::analysis::Token::setStartOffset ( int32_t  val  )  [inline]

int32_t lucene::analysis::Token::endOffset (  )  const [inline]

Returns this Token's ending offset, one greater than the position of the last character corresponding to this token in the source text.

void lucene::analysis::Token::setEndOffset ( int32_t  val  )  [inline]

const TCHAR* lucene::analysis::Token::type (  )  const [inline]

Returns this Token's lexical type. Defaults to "word".

The returned pointer refers to the Token's own type string, not a copy.

void lucene::analysis::Token::setType ( const TCHAR *  val  )  [inline]

Sets this Token's lexical type. The pointer is stored as given rather than copied.

TCHAR* lucene::analysis::Token::toString (  )  const


Field Documentation

TCHAR* lucene::analysis::Token::_termText

The text of the term.

int32_t lucene::analysis::Token::_termTextLen

The length of _termText. Internal use only.

const TCHAR* lucene::analysis::Token::defaultType [static]


The documentation for this class was generated from the following file:

clucene.sourceforge.net