ca.uottawa.balie
Class TokenList

java.lang.Object
  extended by ca.uottawa.balie.TokenList
All Implemented Interfaces:
java.io.Serializable

public class TokenList
extends java.lang.Object
implements java.io.Serializable

A list of Tokens representing a text. Provides methods for manipulating the list and for exporting it as XML.

Author:
nadeaud
See Also:
Serialized Form

Constructor Summary
TokenList(boolean pi_DetectSentenceBoundaries, NamedEntityTypeEnumI[] pi_Types)
          Construct an empty TokenList.
 
Method Summary
 boolean Add(Token pi_Token, SentenceBoundariesRecognition pi_SBR, WekaLearner pi_SBRModel)
          Add a token at the end of the TokenList.
 boolean equals(java.lang.Object pi_Obj)
           
 Token Get(int pi_Index)
          Gets the token at the given index.
 int getSentenceCount()
          Gets the number of sentences found.
 java.util.Hashtable HashAccess()
          Gets the index-to-token map.
 int hashCode()
           
 TokenListIterator Iterator()
          Gets an iterator for the tokenList
 void MapNewNETypes(NamedEntityTypeEnumI[] pi_Mapping)
          Map new NE types.
 NamedEntityTypeEnumI[] NETagSet()
          Get the current NE tag set
 java.lang.String SentenceText(int pi_Index, boolean pi_Canonic, boolean pi_PrintNewLines)
          Gets the text version of the sentence at the given index.
 void SetEntityType(int pi_Index, NamedEntityType pi_Type)
          Set the type of an entity (deep copy)
 void SetPOS(int pi_Index, int pi_POS)
          Sets the Part-of-speech of the token at the given index.
 int Size()
          Gets the size (number of tokens) of the TokenList.
 java.util.Hashtable<java.lang.String,java.lang.Double> TermFrequencyTable()
          Gets the TF table.
 java.lang.String TokenRangeText(int pi_Start, int pi_Stop, boolean pi_Canonic, boolean pi_PrintNewLines, boolean pi_TagEntities, boolean pi_AddAlias, boolean pi_AddExplanation, boolean pi_EscapeXML)
          Gets the String representation of a range of tokens in the TokenList.
 java.lang.StringBuffer ToXML()
          Gets the tokenlist in XML format
 java.util.ArrayList<java.lang.String> WordList()
          Get the (ordered) list of words in this tokenlist
 
Methods inherited from class java.lang.Object
getClass, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TokenList

public TokenList(boolean pi_DetectSentenceBoundaries,
                 NamedEntityTypeEnumI[] pi_Types)
Constructs an empty TokenList, ready to be built up incrementally.

Parameters:
pi_DetectSentenceBoundaries - True if sentence boundaries must be detected
pi_Types - The named entity types to use, presumably the initial NE tag set (see NETagSet)

Method Detail

Add

public boolean Add(Token pi_Token,
                   SentenceBoundariesRecognition pi_SBR,
                   WekaLearner pi_SBRModel)
Add a token at the end of the TokenList.

Parameters:
pi_Token - A new token
pi_SBR - The SBR object
pi_SBRModel - The learned SBR model
Returns:
True if the previous token (current-1) was a sentence break

Size

public int Size()
Gets the size (number of tokens) of the TokenList.

Returns:
The number of tokens in the TokenList

Get

public Token Get(int pi_Index)
Gets the token at the given index.

Parameters:
pi_Index - Index of the token to get.
Returns:
A token
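
A minimal sketch of index-based traversal with Size() and Get(int). The Balie classes are not reproduced here, so the example uses a hypothetical stand-in interface (IndexedTokens, with String in place of Token) that mirrors only these two signatures. Zero-based indexing is an assumption; this page does not state it.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenTraversalSketch {

    /** Hypothetical stand-in mirroring Size() and Get(int); String replaces Token. */
    public interface IndexedTokens {
        int Size();
        String Get(int pi_Index);
    }

    /** Collects every token by walking indices 0 .. Size()-1 (zero-based indexing is assumed). */
    public static List<String> collect(IndexedTokens tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.Size(); i++) {
            out.add(tokens.Get(i));
        }
        return out;
    }

    /** Builds an array-backed stand-in list, for demonstration only. */
    public static IndexedTokens of(String... words) {
        return new IndexedTokens() {
            public int Size() { return words.length; }
            public String Get(int pi_Index) { return words[pi_Index]; }
        };
    }

    public static void main(String[] args) {
        System.out.println(collect(of("Balie", "tokenizes", "text", ".")));
    }
}
```

With the real class, the same loop shape would apply to a TokenList instance, and Iterator() offers an alternative traversal through TokenListIterator.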

equals

public boolean equals(java.lang.Object pi_Obj)
Overrides:
equals in class java.lang.Object

hashCode

public int hashCode()
Overrides:
hashCode in class java.lang.Object

SentenceText

public java.lang.String SentenceText(int pi_Index,
                                     boolean pi_Canonic,
                                     boolean pi_PrintNewLines)
Gets the text version of the sentence at the given index.

Parameters:
pi_Index - Index of the sentence to get (in number of sentences)
pi_Canonic - True if the text must be returned in its canonical version
pi_PrintNewLines - Print \n characters
Returns:
The text of a sentence (String)

TokenRangeText

public java.lang.String TokenRangeText(int pi_Start,
                                       int pi_Stop,
                                       boolean pi_Canonic,
                                       boolean pi_PrintNewLines,
                                       boolean pi_TagEntities,
                                       boolean pi_AddAlias,
                                       boolean pi_AddExplanation,
                                       boolean pi_EscapeXML)
Gets the String representation of a range of tokens in the TokenList.

Parameters:
pi_Start - Start token number (inclusive)
pi_Stop - end token number (exclusive)
pi_Canonic - print in canonical (lowercased, etc) form
pi_PrintNewLines - print \n characters
pi_TagEntities - add XML tags around named entities
pi_AddAlias - add alias network information to the XML tag
pi_AddExplanation - add explanation information to the XML tag
pi_EscapeXML - escape XML reserved characters in the text (so that output is valid XML)
Returns:
textual representation of the tokenlist
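
The half-open range (pi_Start inclusive, pi_Stop exclusive) and the pi_EscapeXML flag can be illustrated with a small stand-alone sketch. This is not the Balie implementation: the class and method names are hypothetical, tokens are plain Strings, joining with single spaces is an assumption, and lowercasing only approximates the canonical form.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class TokenRangeSketch {

    /** Escapes the five XML reserved characters; '&' goes first so entities are not double-escaped. */
    public static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    /**
     * Joins the tokens in [start, stop) with single spaces.
     * Lowercasing stands in for the canonical form (an assumption).
     */
    public static String rangeText(List<String> tokens, int start, int stop,
                                   boolean canonic, boolean escape) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < stop; i++) {
            String t = tokens.get(i);
            if (canonic) t = t.toLowerCase(Locale.ROOT);
            if (escape) t = escapeXml(t);
            if (sb.length() > 0) sb.append(' ');
            sb.append(t);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("AT&T", "hired", "Bob", "<today>");
        // Tokens 0, 1 and 2 are included; token 3 is excluded.
        System.out.println(rangeText(tokens, 0, 3, true, true));
    }
}
```

Because the stop index is exclusive, a call with start equal to stop yields an empty string, and consecutive ranges such as [0, 3) and [3, 4) cover the list without overlap.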

MapNewNETypes

public void MapNewNETypes(NamedEntityTypeEnumI[] pi_Mapping)
Maps new NE types, overriding the current NE tag set.

Parameters:
pi_Mapping - The new NE types that replace the current NE tag set

NETagSet

public NamedEntityTypeEnumI[] NETagSet()
Get the current NE tag set

Returns:
The current NE tag set

getSentenceCount

public int getSentenceCount()
Gets the number of sentences found.

Returns:
Number of sentences.

TermFrequencyTable

public java.util.Hashtable<java.lang.String,java.lang.Double> TermFrequencyTable()
Gets the term frequency (TF) table: a lookup that maps each word to its frequency in the text.

Returns:
Hashtable mapping each word to its frequency
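
As an illustration of the word-to-frequency lookup, here is a minimal stand-alone sketch. It assumes that "frequency" means a raw occurrence count stored as a Double; this page does not say whether Balie normalizes the values. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Hashtable;
import java.util.List;

public class TermFrequencySketch {

    /** Counts occurrences of each word; counts are kept as Double to match the documented value type. */
    public static Hashtable<String, Double> termFrequency(List<String> words) {
        Hashtable<String, Double> tf = new Hashtable<>();
        for (String w : words) {
            // Insert 1.0 for a new word, otherwise add 1.0 to the existing count.
            tf.merge(w, 1.0, Double::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("the", "cat", "saw", "the", "dog");
        Hashtable<String, Double> tf = termFrequency(words);
        System.out.println("the -> " + tf.get("the"));
    }
}
```

With the real class, the input would come from the TokenList itself (for example, the words returned by WordList()), and a lookup for a word absent from the text would return null, as with any Hashtable.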

HashAccess

public java.util.Hashtable HashAccess()
Gets the index-to-token map.

Returns:
A map from token position to token

WordList

public java.util.ArrayList<java.lang.String> WordList()
Get the (ordered) list of words in this tokenlist

Returns:
list of words

SetPOS

public void SetPOS(int pi_Index,
                   int pi_POS)
Sets the Part-of-speech of the token at the given index.

Parameters:
pi_Index - Index of the token to update
pi_POS - Part-of-speech of this token (see TokenConsts for the enumeration)
See Also:
TokenConsts

SetEntityType

public void SetEntityType(int pi_Index,
                          NamedEntityType pi_Type)
Set the type of an entity (deep copy)

Parameters:
pi_Index - index of the entity
pi_Type - type to set

ToXML

public java.lang.StringBuffer ToXML()
Gets the tokenlist in XML format

Returns:
an XML StringBuffer

Iterator

public TokenListIterator Iterator()
Gets an iterator for the tokenList

Returns:
the iterator (type TokenListIterator)
See Also:
TokenListIterator