lucene::analysis::Token Class Reference

A Token is an occurrence of a term from the text of a field.

#include <AnalysisHeader.h>

Public Member Functions

Token ()

~Token ()

Token (const TCHAR *text, const int32_t start, const int32_t end, const TCHAR *typ=defaultType)

Constructs a Token with the given text, start and end offsets, and type.

void set (const TCHAR *text, const int32_t start, const int32_t end, const TCHAR *typ=defaultType)

size_t bufferLength ()

void growBuffer (size_t size)

void setPositionIncrement (int32_t posIncr)

Set the position increment.

int32_t getPositionIncrement () const

const TCHAR * termText () const

size_t termTextLength ()

void resetTermTextLen ()

void setText (const TCHAR *txt)

int32_t startOffset () const

Returns this Token's starting offset, the position of the first character corresponding to this token in the source text.

void setStartOffset (int32_t val)

int32_t endOffset () const

Returns this Token's ending offset, one greater than the position of the last character corresponding to this token in the source text.

void setEndOffset (int32_t val)

const TCHAR * type () const

Returns this Token's lexical type. Defaults to "word".

void setType (const TCHAR *val)

Sets this Token's lexical type.

TCHAR * toString () const

Data Fields

TCHAR * _termText

The text of the term.

int32_t _termTextLen

The length of termText. For internal use only.

Static Public Attributes

static const TCHAR * defaultType

Detailed Description

A Token is an occurrence of a term from the text of a field.

It consists of a term's text, the start and end offset of the term in the text of the field, and a type string.

The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC (KeyWord In Context) display, etc.
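The highlighting use case can be sketched as follows. This is a minimal illustration only: `OffsetSpan` and `highlight` are hypothetical names, not part of CLucene, and `std::string` stands in for TCHAR text. It assumes the end offset is one greater than the position of the last character, as documented for endOffset().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a token's offsets; not CLucene's Token class.
struct OffsetSpan {
    int32_t start;  // position of the first character of the token
    int32_t end;    // one greater than the position of the last character
};

// Returns a copy of source with the characters covered by span wrapped in [ ].
std::string highlight(const std::string& source, const OffsetSpan& span) {
    assert(span.start >= 0 && span.start <= span.end);
    assert(static_cast<std::size_t>(span.end) <= source.size());
    std::string out = source.substr(0, span.start);
    out += '[';
    out += source.substr(span.start, span.end - span.start);
    out += ']';
    out += source.substr(span.end);
    return out;
}
```

For the source text "the quick fox", a span with start 4 and end 9 covers "quick", so the sketch produces "the [quick] fox".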

The type is an interned string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example, an end-of-sentence marker token might be implemented with type "eos". The default token type is "word".

Constructor & Destructor Documentation

lucene::analysis::Token::Token ( )

lucene::analysis::Token::~Token ( )

lucene::analysis::Token::Token	(	const TCHAR *	text,
		const int32_t	start,
		const int32_t	end,
		const TCHAR *	typ = `defaultType`
	)

Constructs a Token with the given text, start and end offsets, and type.

Member Function Documentation

void lucene::analysis::Token::set	(	const TCHAR *	text,
		const int32_t	start,
		const int32_t	end,
		const TCHAR *	typ = `defaultType`
	)

size_t lucene::analysis::Token::bufferLength ( ) [inline]

void lucene::analysis::Token::growBuffer ( size_t size )

void lucene::analysis::Token::setPositionIncrement ( int32_t posIncr )

Set the position increment.

This determines the position of this token relative to the previous Token in a TokenStream, used in phrase searching.

The default value is 1.

Some common uses for this are:

Set it to zero to put multiple terms in the same position. This is useful if, e.g., a word has multiple stems. Searches for phrases including either stem will match. In this case, all but the first stem's increment should be set to zero: the increment of the first instance should be one. Repeating a token with an increment of zero can also be used to boost the scores of matches on that token.

Set it to values greater than one to inhibit exact phrase matches. If, for example, one does not want phrases to match across removed stop words, then one could build a stop word filter that removes stop words and also sets the increment to the number of stop words removed before each non-stop word. Then exact phrase queries will only match when the terms occur with no intervening stop words.
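The effect of the increments described above can be sketched by accumulating them into absolute positions. This is an illustration under stated assumptions, not CLucene's implementation: `SketchToken` and `absolutePositions` are hypothetical names, and the sketch assumes the first token with the default increment of 1 lands at position 0.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in holding only the fields this sketch needs.
struct SketchToken {
    std::string text;
    int32_t positionIncrement;  // 1 by default, 0 to share the previous position
};

// Accumulates increments into absolute positions. Starting from -1 is an
// assumption that makes a leading increment of 1 yield position 0.
std::vector<int32_t> absolutePositions(const std::vector<SketchToken>& tokens) {
    std::vector<int32_t> positions;
    int32_t position = -1;
    for (const SketchToken& t : tokens) {
        assert(t.positionIncrement >= 0);
        position += t.positionIncrement;
        positions.push_back(position);
    }
    return positions;
}
```

With the tokens ("fast", 1), ("quick", 0), ("fox", 2), the first two share position 0, so a phrase search can match either one there. "fox" lands at position 2, leaving a gap at position 1 where a stop word was removed, which prevents an exact phrase match across the gap.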

int32_t lucene::analysis::Token::getPositionIncrement ( ) const

const TCHAR* lucene::analysis::Token::termText ( ) const

size_t lucene::analysis::Token::termTextLength ( )

void lucene::analysis::Token::resetTermTextLen ( )

void lucene::analysis::Token::setText ( const TCHAR * txt )

int32_t lucene::analysis::Token::startOffset ( ) const [inline]

Returns this Token's starting offset, the position of the first character corresponding to this token in the source text.

Note that the difference between endOffset() and startOffset() may not be equal to termTextLength(), as the term text may have been altered by a stemmer or some other filter.
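The note above can be illustrated with a small check. `StemSketch` and `textMatchesSpan` are hypothetical names used only for this sketch; they mirror the accessors' meaning but are not CLucene's Token.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in; not CLucene's Token class.
struct StemSketch {
    std::string termText;  // possibly altered by a stemmer or other filter
    int32_t startOffset;   // position of the first source character
    int32_t endOffset;     // one greater than the last source character
};

// True when the term text is exactly as long as the source span it came from.
bool textMatchesSpan(const StemSketch& t) {
    assert(t.startOffset >= 0 && t.startOffset <= t.endOffset);
    return static_cast<int32_t>(t.termText.size()) == t.endOffset - t.startOffset;
}
```

If the source word "running" at offsets 4 to 11 is stemmed to "run", the offsets still span seven characters while the term text has three, so the check returns false. Code that slices the source text should therefore rely on the offsets rather than on the term text length.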

void lucene::analysis::Token::setStartOffset ( int32_t val ) [inline]

int32_t lucene::analysis::Token::endOffset ( ) const [inline]

Returns this Token's ending offset, one greater than the position of the last character corresponding to this token in the source text.

void lucene::analysis::Token::setEndOffset ( int32_t val ) [inline]

const TCHAR* lucene::analysis::Token::type ( ) const [inline]

Returns this Token's lexical type. Defaults to "word".

Returns a reference to the type string, not a copy.

void lucene::analysis::Token::setType ( const TCHAR * val ) [inline]

Sets this Token's lexical type.

TCHAR* lucene::analysis::Token::toString ( ) const

Field Documentation

TCHAR* lucene::analysis::Token::_termText

The text of the term.

int32_t lucene::analysis::Token::_termTextLen

The length of termText. For internal use only.

const TCHAR* lucene::analysis::Token::defaultType [static]

The documentation for this class was generated from the following file:

AnalysisHeader.h

