java.lang.Object
  jml.topics.Corpus
public class Corpus
A class that models a corpus. Term indices always start from 0 and are
used to index elements in a 2D integer array. Term IDs always start
from 1 and are used in a Vector of termID sequences.
Field Summary | |
---|---|
private java.util.Vector<java.util.Vector<java.lang.Integer>> | corpus: A Vector of termID sequences. |
java.util.ArrayList<java.util.TreeMap<java.lang.Integer,java.lang.Integer>> | docTermCountArray: An ArrayList of TreeMap storing the doc-term-count matrix. |
int[][] | documents: 2D integer array carrying the doc-term count matrix. |
static int | IdxStart: The starting index for LDA_Blei input data. |
int | nDoc: Number of documents in the corpus. |
int | nTerm: Vocabulary size. |
Constructor Summary |
---|
Corpus(): Constructor for the class Corpus. |
Method Summary | |
---|---|
void | clearCorpus(): Clear corpus for class Corpus. |
void | clearDocTermCountArray(): Clear docTermCountArray. |
static int[][] | corpus2Documents(java.util.Vector<java.util.Vector<java.lang.Integer>> corpus): Convert a Vector of termID sequences into a 2D doc-term count array. |
static org.apache.commons.math.linear.RealMatrix | documents2Matrix(int[][] documents): Convert a 2D doc-term count array into a matrix. |
int[][] | getDocuments(): Get the documents. |
static int | getVocabularySize(int[][] documents): Get the vocabulary size. |
void | readCorpusFromDocTermCountArray(java.util.ArrayList<java.util.TreeMap<java.lang.Integer,java.lang.Integer>> docTermCountArray): Load corpus and documents from an ArrayList<TreeMap<Integer, Integer>> instance. |
void | readCorpusFromDocTermCountFile(java.lang.String docTermCountFilePath): Load corpus and documents from a text file located at String docTermCountFilePath. |
void | readCorpusFromLDAInputFile(java.lang.String LDAInputDataFilePath): Load corpus and documents from an LDAInput file. |
void | readCorpusFromMatrix(org.apache.commons.math.linear.RealMatrix X): Load corpus and documents from a RealMatrix instance. |
static void | setLDATermIndexStart(int IdxStart): Set the starting term index for the LDA input file. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static int IdxStart
The starting index for LDA_Blei input data.

private java.util.Vector<java.util.Vector<java.lang.Integer>> corpus
A Vector of termID sequences. Each element of the vector is the sequence
of termIDs (starting from 1) of one document. Each termID represents the corresponding
term in the vocabulary. For example, if a term occurs ten times in a document,
the sequence for that document contains ten copies of that term's termID.

public java.util.ArrayList<java.util.TreeMap<java.lang.Integer,java.lang.Integer>> docTermCountArray
An ArrayList of TreeMap storing the doc-term-count matrix.
Each TreeMap maps a termID to its observed count.

public int[][] documents
2D integer array carrying the doc-term count matrix.

public int nTerm
Vocabulary size.

public int nDoc
Number of documents in the corpus.
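
The relationship between docTermCountArray and corpus described above can be illustrated with a minimal sketch. It is not the jml.topics.Corpus implementation: the class name TermIdSequenceSketch and the helper toTermIdSequences are hypothetical, and emitting termIDs in ascending order is an assumption (it simply follows TreeMap iteration order).

```java
import java.util.ArrayList;
import java.util.Map;
import java.util.TreeMap;
import java.util.Vector;

// Illustrative sketch only; not the jml.topics.Corpus implementation.
final class TermIdSequenceSketch {

    // Hypothetical helper: expand each termID -> count mapping into a
    // sequence that repeats the termID once per occurrence.
    static Vector<Vector<Integer>> toTermIdSequences(
            ArrayList<TreeMap<Integer, Integer>> docTermCountArray) {
        Vector<Vector<Integer>> corpus = new Vector<>();
        for (TreeMap<Integer, Integer> counts : docTermCountArray) {
            Vector<Integer> sequence = new Vector<>();
            for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
                for (int k = 0; k < entry.getValue(); k++) {
                    sequence.add(entry.getKey());
                }
            }
            corpus.add(sequence);
        }
        return corpus;
    }

    public static void main(String[] args) {
        TreeMap<Integer, Integer> doc = new TreeMap<>();
        doc.put(1, 2); // termID 1 occurs twice
        doc.put(3, 1); // termID 3 occurs once
        ArrayList<TreeMap<Integer, Integer>> counts = new ArrayList<>();
        counts.add(doc);
        System.out.println(toTermIdSequences(counts)); // prints [[1, 1, 3]]
    }
}
```

A term with count ten would appear ten times in the resulting sequence, matching the example in the corpus field description.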
Constructor Detail |
---|
public Corpus()
Constructor for the class Corpus.
Method Detail |
---|
public void clearCorpus()
Clear corpus for class Corpus.

public void clearDocTermCountArray()
Clear docTermCountArray.

public int[][] getDocuments()
Get the documents.

public void readCorpusFromLDAInputFile(java.lang.String LDAInputDataFilePath)
Load corpus and documents from an LDAInput file.
Parameters:
LDAInputDataFilePath - The path of the LDAInput file.

public void readCorpusFromDocTermCountFile(java.lang.String docTermCountFilePath)
Load corpus and documents from a text file located at String docTermCountFilePath.
Parameters:
docTermCountFilePath - A String specifying the location of the text file holding doc-term-count matrix data.

public void readCorpusFromDocTermCountArray(java.util.ArrayList<java.util.TreeMap<java.lang.Integer,java.lang.Integer>> docTermCountArray)
Load corpus and documents from an ArrayList<TreeMap<Integer, Integer>> instance.
Each element of the ArrayList is a doc-term count mapping.
Parameters:
docTermCountArray - An ArrayList<TreeMap<Integer, Integer>> instance; each element of the ArrayList records the doc-term count mapping for the corresponding document.

public void readCorpusFromMatrix(org.apache.commons.math.linear.RealMatrix X)
Load corpus and documents from a RealMatrix instance.
Parameters:
X - a matrix in which each column is the term count vector of one document, with X(i, j) being the number of occurrences of the i-th vocabulary term in the j-th document

public static int[][] corpus2Documents(java.util.Vector<java.util.Vector<java.lang.Integer>> corpus)
Convert a Vector of termID sequences into a 2D doc-term count array. Term IDs always start from 1.
Parameters:
corpus - a Vector of termID sequences
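
A minimal sketch of the conversion from 1-based termIDs to a 2D array follows. It assumes that documents[m][n] holds the 0-based term index of the n-th word of the m-th document, as the getVocabularySize parameter description states; this page does not show the actual array layout used by corpus2Documents. The class name Corpus2DocumentsSketch and the method toDocuments are hypothetical stand-ins.

```java
import java.util.Arrays;
import java.util.Vector;

// Illustrative sketch only; not the jml.topics.Corpus implementation.
final class Corpus2DocumentsSketch {

    // Hypothetical stand-in: one row per document, one entry per word.
    static int[][] toDocuments(Vector<Vector<Integer>> corpus) {
        int[][] documents = new int[corpus.size()][];
        for (int m = 0; m < corpus.size(); m++) {
            Vector<Integer> sequence = corpus.get(m);
            documents[m] = new int[sequence.size()];
            for (int n = 0; n < sequence.size(); n++) {
                // Assumption: termID (starts at 1) maps to term index (starts at 0).
                documents[m][n] = sequence.get(n) - 1;
            }
        }
        return documents;
    }

    public static void main(String[] args) {
        Vector<Vector<Integer>> corpus = new Vector<>();
        corpus.add(new Vector<>(Arrays.asList(1, 1, 3)));
        corpus.add(new Vector<>(Arrays.asList(2)));
        System.out.println(Arrays.deepToString(toDocuments(corpus))); // prints [[0, 0, 2], [1]]
    }
}
```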

public static org.apache.commons.math.linear.RealMatrix documents2Matrix(int[][] documents)
Convert a 2D doc-term count array into a matrix.
Parameters:
documents - a 2D integer array carrying the doc-term count matrix
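
The following sketch shows one way such a matrix could be built. It rests on several assumptions: documents[m][n] is taken to be a 0-based term index (as described for getVocabularySize), the result is assumed to use the term-by-document layout described for the X parameter of readCorpusFromMatrix, and a plain double[][] stands in for RealMatrix so the example needs no external library. The class name, the method toTermDocCounts, and the explicit nTerm parameter are hypothetical.

```java
import java.util.Arrays;

// Illustrative sketch only; not the jml.topics.Corpus implementation.
final class Documents2MatrixSketch {

    // Hypothetical stand-in: x[i][j] counts how often term index i
    // occurs in document j. nTerm is the vocabulary size.
    static double[][] toTermDocCounts(int[][] documents, int nTerm) {
        double[][] x = new double[nTerm][documents.length];
        for (int j = 0; j < documents.length; j++) {
            for (int termIndex : documents[j]) {
                x[termIndex][j] += 1.0;
            }
        }
        return x;
    }

    public static void main(String[] args) {
        int[][] documents = {{0, 0, 2}, {1}};
        double[][] x = toTermDocCounts(documents, 3);
        System.out.println(Arrays.deepToString(x)); // prints [[2.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    }
}
```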

public static int getVocabularySize(int[][] documents)
Get the vocabulary size.
Parameters:
documents - a 2D integer array where documents[m][n] is the term index in the vocabulary for the n-th word of the m-th document. Indices always start from 0.
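
Given 0-based term indices, a vocabulary size can be derived as the largest index plus one. The sketch below assumes that definition; the page does not state how getVocabularySize actually computes its result, and the class name VocabularySizeSketch and method vocabularySize are hypothetical.

```java
// Illustrative sketch only; not the jml.topics.Corpus implementation.
final class VocabularySizeSketch {

    // Hypothetical stand-in: largest 0-based term index plus one,
    // or 0 when there are no words at all.
    static int vocabularySize(int[][] documents) {
        int maxIndex = -1;
        for (int[] document : documents) {
            for (int termIndex : document) {
                if (termIndex > maxIndex) {
                    maxIndex = termIndex;
                }
            }
        }
        return maxIndex + 1;
    }

    public static void main(String[] args) {
        int[][] documents = {{0, 0, 2}, {1}};
        System.out.println(vocabularySize(documents)); // prints 3
    }
}
```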

public static void setLDATermIndexStart(int IdxStart)
Set the starting term index for the LDA input file.
Parameters:
IdxStart -