java.lang.Object
  ca.uottawa.balie.Tokenizer
public class Tokenizer
The Tokenizer takes a text as input and extracts a TokenList from it. It uses all service functions, language-dependent routines, the sentence boundary detection (SBD) algorithm, and POS tagging. The Tokenizer subsumes and extends "StringTokenizer", a built-in Java utility.
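For comparison, here is a minimal sketch of the baseline that the description refers to, the JDK's `java.util.StringTokenizer`. The class name `StringTokenizerBaseline` and the helper `split` are illustrative names, not part of Balie. The sketch only shows what delimiter-based splitting produces; it makes no claim about how the Balie Tokenizer is implemented internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class StringTokenizerBaseline {
    // Splits text on the given delimiter characters using the JDK's StringTokenizer.
    static List<String> split(String text, String delimiters) {
        StringTokenizer st = new StringTokenizer(text, delimiters);
        List<String> tokens = new ArrayList<>();
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = split("Dr. Smith arrived. He sat down.", " ");
        // Prints: 6 tokens: [Dr., Smith, arrived., He, sat, down.]
        System.out.println(tokens.size() + " tokens: " + tokens);
    }
}
```

Punctuation stays attached to the words and no sentence boundaries are reported, which is the kind of gap that sentence boundary detection and language-dependent routines are meant to cover.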
Constructor Summary | |
---|---|
Tokenizer(java.lang.String pi_Language,
boolean pi_DetectSentenceBoundaries)
Constructs a tokenizer by specifying the language and some options. |
Method Summary | |
---|---|
TokenList |
GetTokenList()
Gets the TokenList. |
java.lang.String |
Language()
Gets the tokenizer working language. |
void |
Reset()
Reset the Tokenizer in order to process another text. |
void |
Reset(java.lang.String pi_Language)
Reset the Tokenizer in order to process another text, in a different language. |
int |
SentenceCount()
Gets the number of sentences found. |
void |
SetCanonizerRules(int pi_Rules)
Sets the rules the Canonizer will operate with. |
int |
TokenCount()
Gets the number of tokens found in the text. |
void |
Tokenize(java.lang.String pi_SourceText)
Tokenizes a text, turning it into a TokenList. |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
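The method summary suggests a call sequence: construct, `Tokenize`, read the counts, then `Reset` before the next text. The sketch below illustrates that sequence with a hypothetical stand-in class, `StubTokenizer`, so that it runs without the Balie library. The stand-in copies only the documented method names; its whitespace splitting and the language strings `"English"` and `"French"` are assumptions made for illustration, not Balie behavior or accepted values.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizerLifecycleSketch {
    // Hypothetical stand-in that mirrors the documented method names only.
    // It splits on whitespace; the real Tokenizer does considerably more.
    static class StubTokenizer {
        private String language;
        private final List<String> tokens = new ArrayList<>();

        StubTokenizer(String pi_Language) {
            this.language = pi_Language;
        }

        void Tokenize(String pi_SourceText) {
            for (String t : pi_SourceText.trim().split("\\s+")) {
                if (!t.isEmpty()) {
                    tokens.add(t);
                }
            }
        }

        int TokenCount() { return tokens.size(); }

        String Language() { return language; }

        // Clears state so another text can be processed.
        void Reset() { tokens.clear(); }

        // Clears state and switches the working language.
        void Reset(String pi_Language) {
            tokens.clear();
            language = pi_Language;
        }
    }

    public static void main(String[] args) {
        StubTokenizer tokenizer = new StubTokenizer("English");
        tokenizer.Tokenize("The quick brown fox.");
        System.out.println(tokenizer.TokenCount());

        // Reset before processing a second text in another language.
        tokenizer.Reset("French");
        tokenizer.Tokenize("Le renard brun.");
        System.out.println(tokenizer.Language() + " " + tokenizer.TokenCount());
    }
}
```

With the real library, the stand-in would be replaced by `ca.uottawa.balie.Tokenizer`, whose constructor also takes the `pi_DetectSentenceBoundaries` flag.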
Constructor Detail |
---|
public Tokenizer(java.lang.String pi_Language, boolean pi_DetectSentenceBoundaries)
Parameters:
pi_Language - The language this Tokenizer will accept
pi_DetectSentenceBoundaries - Whether or not sentences must be detected
Method Detail |
---|
public void Tokenize(java.lang.String pi_SourceText)
Tokenizes a text, turning it into a TokenList.
Parameters:
pi_SourceText - text to tokenize
public TokenList GetTokenList()
public int TokenCount()
public int SentenceCount()
public void SetCanonizerRules(int pi_Rules)
Sets the rules the Canonizer will operate with.
The default value is set at construction time and is something like:
Canonizer.RULE_LOWERCASE | Canonizer.RULE_REMOVE_INTERNAL_PUNCT | Canonizer.RULE_EXPAND_LIGATURES | Canonizer.RULE_NORMALIZE_PUNCT
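Because the rules are combined with `|`, they presumably behave as bit flags. The following is a minimal sketch of composing and inspecting such a rule set. The numeric values below are hypothetical (the real `Canonizer` constants may differ and are not listed on this page), and `isEnabled` and `without` are illustrative helpers, not Balie methods.

```java
public class CanonizerRulesSketch {
    // Hypothetical bit values: the real Canonizer constants may differ.
    static final int RULE_LOWERCASE = 1;
    static final int RULE_REMOVE_INTERNAL_PUNCT = 1 << 1;
    static final int RULE_EXPAND_LIGATURES = 1 << 2;
    static final int RULE_NORMALIZE_PUNCT = 1 << 3;

    // True if the given rule bit is set in the rule mask.
    static boolean isEnabled(int rules, int rule) {
        return (rules & rule) != 0;
    }

    // Returns the rule mask with the given rule bit cleared.
    static int without(int rules, int rule) {
        return rules & ~rule;
    }

    public static void main(String[] args) {
        int defaults = RULE_LOWERCASE | RULE_REMOVE_INTERNAL_PUNCT
                | RULE_EXPAND_LIGATURES | RULE_NORMALIZE_PUNCT;

        // Keep everything except lowercasing.
        int custom = without(defaults, RULE_LOWERCASE);

        System.out.println(isEnabled(custom, RULE_LOWERCASE));        // false
        System.out.println(isEnabled(custom, RULE_EXPAND_LIGATURES)); // true
    }
}
```

Under this assumption, a mask built this way would be the value passed as `pi_Rules` to `SetCanonizerRules`.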
Parameters:
pi_Rules - Disjunction (bitwise OR) of the required rules
public void Reset()
public void Reset(java.lang.String pi_Language)
Parameters:
pi_Language - New language the tokenizer will accept
public java.lang.String Language()