ca.uottawa.balie
Class Tokenizer

java.lang.Object
  extended by ca.uottawa.balie.Tokenizer

public class Tokenizer
extends java.lang.Object

The tokenizer takes a text as input and extracts a TokenList. It uses all the service functions, language-dependent routines, the sentence boundary detection (SBD) algorithm and POS tagging. The Tokenizer subsumes and extends the "StringTokenizer", a Java built-in utility.
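
For comparison, here is a minimal sketch of the built-in java.util.StringTokenizer that the description refers to. The class name StringTokenizerBaseline and the sample sentence are made up for illustration. The sketch shows only the baseline behaviour of the JDK utility, not the output of this Tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class StringTokenizerBaseline {
    // Splits on StringTokenizer's default delimiters
    // (space, tab, newline, carriage return, form feed).
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Punctuation stays attached to the neighbouring word, and the
        // built-in utility reports no sentence boundaries or POS tags.
        System.out.println(split("Hello, world. Second sentence here."));
    }
}
```

The built-in utility yields plain strings only. Sentence boundary detection and POS tagging are among the services this class is documented to add on top of it.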

Author:
nadeaud

Constructor Summary
Tokenizer(java.lang.String pi_Language, boolean pi_DetectSentenceBoundaries)
          Construct a tokenizer by specifying the language and some options.
 
Method Summary
 TokenList GetTokenList()
          Gets the TokenList.
 java.lang.String Language()
          Gets the tokenizer working language.
 void Reset()
          Reset the Tokenizer in order to process another text.
 void Reset(java.lang.String pi_Language)
          Reset the Tokenizer in order to process another text, in a different language.
 int SentenceCount()
          Gets the number of sentences found.
 void SetCanonizerRules(int pi_Rules)
          Set the rules the Canonizer will operate with.
 int TokenCount()
          Gets the number of tokens found in the text.
 void Tokenize(java.lang.String pi_SourceText)
          Tokenize a text, turning it into a TokenList.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Tokenizer

public Tokenizer(java.lang.String pi_Language,
                 boolean pi_DetectSentenceBoundaries)
Construct a tokenizer by specifying the language and some options.

Parameters:
pi_Language - The language this Tokenizer will accept
pi_DetectSentenceBoundaries - Whether or not sentence boundaries must be detected

Method Detail

Tokenize

public void Tokenize(java.lang.String pi_SourceText)
Tokenize a text, turning it into a TokenList.

Parameters:
pi_SourceText - text to tokenize

GetTokenList

public TokenList GetTokenList()
Gets the TokenList.

Returns:
TokenList

TokenCount

public int TokenCount()
Gets the number of tokens found in the text.

Returns:
Number of tokens

SentenceCount

public int SentenceCount()
Gets the number of sentences found.

Returns:
Number of sentences.

SetCanonizerRules

public void SetCanonizerRules(int pi_Rules)
Set the rules the Canonizer will operate with. The default value is set at construction time and is something like: Canonizer.RULE_LOWERCASE | Canonizer.RULE_REMOVE_INTERNAL_PUNCT | Canonizer.RULE_EXPAND_LIGATURES | Canonizer.RULE_NORMALIZE_PUNCT

Parameters:
pi_Rules - Disjunction (combined with the | operator) of the required rules
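
The rules are combined with the | operator, which suggests that each rule is a distinct bit in an int. The sketch below illustrates that pattern under this assumption. The numeric values, the class name CanonizerRulesSketch and the helper isEnabled are hypothetical; only the constant names come from the description above.

```java
class CanonizerRulesSketch {
    // Hypothetical values: the real constants belong to Canonizer and
    // their numeric values are not given in this documentation.
    static final int RULE_LOWERCASE = 1;
    static final int RULE_REMOVE_INTERNAL_PUNCT = 1 << 1;
    static final int RULE_EXPAND_LIGATURES = 1 << 2;
    static final int RULE_NORMALIZE_PUNCT = 1 << 3;

    // Tests whether a single rule bit is present in a combined rule set.
    static boolean isEnabled(int rules, int flag) {
        return (rules & flag) != 0;
    }

    public static void main(String[] args) {
        // Build a rule set by OR-ing the wanted flags together.
        int rules = RULE_LOWERCASE | RULE_NORMALIZE_PUNCT;
        System.out.println(isEnabled(rules, RULE_LOWERCASE));
        System.out.println(isEnabled(rules, RULE_EXPAND_LIGATURES));
    }
}
```

With the real class, the combined int would presumably be passed straight to SetCanonizerRules(int).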

Reset

public void Reset()
Reset the Tokenizer in order to process another text. Can be called on a tokenizer even if no text has been processed.


Reset

public void Reset(java.lang.String pi_Language)
Reset the Tokenizer in order to process another text, in a different language. Can be called on a tokenizer even if no text has been processed.

Parameters:
pi_Language - New language the tokenizer will accept
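
The intended call order (tokenize, read the counts, reset, tokenize again) can be sketched without the library on the classpath. StandInTokenizer below is a hypothetical stand-in that copies only the documented method names and splits on whitespace. It is not this class's implementation. The language identifiers "en" and "fr" are also assumptions, as is the choice to let tokens accumulate until Reset is called.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in mirroring the documented method names only.
class StandInTokenizer {
    private String language;
    private final List<String> tokens = new ArrayList<>();

    StandInTokenizer(String pi_Language) {
        this.language = pi_Language;
    }

    // Whitespace split, used here purely as a placeholder.
    void Tokenize(String pi_SourceText) {
        for (String t : pi_SourceText.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
    }

    int TokenCount() { return tokens.size(); }

    String Language() { return language; }

    // Clears state so that another text can be processed.
    void Reset() { tokens.clear(); }

    // Clears state and switches the working language.
    void Reset(String pi_Language) {
        tokens.clear();
        language = pi_Language;
    }
}

class ResetUsageSketch {
    public static void main(String[] args) {
        StandInTokenizer tok = new StandInTokenizer("en");
        tok.Reset(); // allowed even though no text has been processed
        tok.Tokenize("one two three");
        System.out.println(tok.TokenCount());

        tok.Reset("fr"); // same instance, next text, different language
        tok.Tokenize("un deux");
        System.out.println(tok.Language() + " " + tok.TokenCount());
    }
}
```

Reusing one instance through Reset avoids constructing a new tokenizer for every text.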

Language

public java.lang.String Language()
Gets the tokenizer working language.

Returns:
the language this tokenizer can handle