java.lang.Object
  jml.topics.LdaGibbsSampler
public class LdaGibbsSampler
Gibbs sampler for estimating the best assignments of topics for words and documents in a corpus. The algorithm is introduced in Tom Griffiths' paper "Gibbs sampling in the generative model of Latent Dirichlet Allocation" (2002).
Field Summary | |
---|---|
(package private) double | alpha - Dirichlet parameter (document--topic associations) |
(package private) double | beta - Dirichlet parameter (topic--term associations) |
private static int | BURN_IN - burn-in period |
(package private) Corpus | corpus |
private static int | dispcol |
(package private) int[][] | documents - document data (word lists); documents[m][n] is the term index in the vocabulary for the n-th word of the m-th document |
private static int | ITERATIONS - max iterations |
(package private) int | K - number of topics |
(package private) LDAOptions | LDAOptions |
(package private) static java.text.NumberFormat | lnf |
(package private) int[][] | nd - nd[i][j] number of words in document i assigned to topic j. |
(package private) int[] | ndsum - ndsum[i] total number of words in document i. |
(package private) int | numstats - size of statistics |
(package private) int[][] | nw - nw[i][j] number of instances of term i assigned to topic j. |
(package private) int[] | nwsum - nwsum[j] total number of words assigned to topic j. |
(package private) double[][] | phisum - cumulative statistics of phi |
private static int | SAMPLE_LAG - sample lag (if -1 only one sample taken) |
(package private) static java.lang.String[] | shades |
(package private) double[][] | thetasum - cumulative statistics of theta |
private static int | THIN_INTERVAL - sampling lag (?) |
(package private) int | V - vocabulary size |
(package private) int[][] | z - topic assignments for each word. |
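The count fields nw, nd, nwsum and ndsum are all functions of documents and z. The sketch below shows one way they could be rebuilt from those two arrays, assuming the index conventions stated in the field descriptions. The class `CountSketch` is hypothetical and is not part of jml.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of jml: rebuilds the count arrays from documents and z. */
final class CountSketch {
    final int[][] nw;   // nw[t][k]: instances of term t assigned to topic k
    final int[][] nd;   // nd[m][k]: words in document m assigned to topic k
    final int[] nwsum;  // nwsum[k]: total words assigned to topic k
    final int[] ndsum;  // ndsum[m]: total words in document m

    CountSketch(int[][] documents, int[][] z, int V, int K) {
        int M = documents.length;
        nw = new int[V][K];
        nd = new int[M][K];
        nwsum = new int[K];
        ndsum = new int[M];
        for (int m = 0; m < M; m++) {
            for (int n = 0; n < documents[m].length; n++) {
                int term = documents[m][n];
                int topic = z[m][n];
                nw[term][topic]++;
                nd[m][topic]++;
                nwsum[topic]++;
            }
            ndsum[m] = documents[m].length;
        }
    }

    public static void main(String[] args) {
        int[][] documents = {{0, 1, 1}, {2, 0}};
        int[][] z = {{0, 1, 1}, {0, 0}};
        CountSketch c = new CountSketch(documents, z, 3, 2);
        System.out.println("nd    = " + Arrays.deepToString(c.nd));
        System.out.println("nwsum = " + Arrays.toString(c.nwsum));
    }
}
```

Keeping these counts in sync with z is what lets a sampler evaluate the full conditional in constant time per topic instead of rescanning the corpus.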
Constructor Summary | |
---|---|
LdaGibbsSampler() | |
LdaGibbsSampler(int[][] documents, int V) | Initialize the Gibbs sampler with data. |
LdaGibbsSampler(LDAOptions LDAOptions) | |
Method Summary | |
---|---|
void | configure(int iterations, int burnIn, int thinInterval, int sampleLag) - Configure the Gibbs sampler. |
void | configure(LDAOptions LDAOptions) |
double[][] | getPhi() - Retrieve estimated topic--word associations. |
double[][] | getTheta() - Retrieve estimated document--topic associations. |
void | gibbs(int K, double alpha, double beta) - Main method: select an initial state, then repeat the sampling sweep a large number of times. |
static double[] | hist(double[] data, int fmax) - Print table of multinomial data |
int[][] | initialState(int K) - Initialization: must start with an assignment of observations to topics. Many alternatives are possible; I chose to perform random assignments with equal probabilities. |
static void | main(java.lang.String[] args) - Driver with example data. |
void | readCorpusFromDocTermCountArray(java.util.ArrayList<java.util.TreeMap<java.lang.Integer,java.lang.Integer>> docTermCountArray) - Load corpus and documents from an ArrayList<TreeMap<Integer, Integer>> instance. |
void | readCorpusFromDocTermCountFile(java.lang.String docTermCountFilePath) - Load corpus and documents from a text file located at String docTermCountFilePath. |
void | readCorpusFromLDAInputFile(java.lang.String LDAInputDataFilePath) - Load corpus and documents from an LDAInput file. |
void | readCorpusFromMatrix(org.apache.commons.math.linear.RealMatrix X) - Load corpus and documents from a RealMatrix instance. |
void | run() |
static void | run(Corpus corpus, LDAOptions LDAOptions) |
void | run(LDAOptions LDAOptions) |
private int | sampleFullConditional(int m, int n) - Sample a topic z_i from the full conditional distribution: $p(z_i = j \mid z_{-i}, w) = \frac{n_{-i,j}^{(w_i)} + \beta}{n_{-i,j}^{(\cdot)} + W\beta} \cdot \frac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i,\cdot}^{(d_i)} + K\alpha}$, where W is the vocabulary size (the field V). |
static java.lang.String | shadeDouble(double d, double max) - Create a string representation whose gray value appears as an indicator of magnitude. |
private void | updateParams() - Add to the statistics the values of theta and phi for the current state. |
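The full conditional given for sampleFullConditional can be evaluated directly from the count arrays. The following is a minimal sketch of one such draw, not the jml source. It assumes W in the formula is the vocabulary size V, that counts are laid out as nw[term][topic] and nd[document][topic], and that the draw uses a cumulative (inverse-CDF) scheme. The class name `FullConditionalSketch` and the explicit parameter list are hypothetical.

```java
import java.util.Random;

/** Hypothetical sketch, not the jml source: one draw from the full conditional. */
final class FullConditionalSketch {
    /** Resamples the topic of word n in document m and updates the counts in place. */
    static int sample(int m, int n, int[][] documents, int[][] z,
                      int[][] nw, int[][] nd, int[] nwsum, int[] ndsum,
                      double alpha, double beta, Random rng) {
        int K = nwsum.length;
        int V = nw.length;
        int term = documents[m][n];

        // Remove the current assignment so the counts exclude this word ("-i" in the formula).
        int old = z[m][n];
        nw[term][old]--;
        nd[m][old]--;
        nwsum[old]--;
        ndsum[m]--;

        // Cumulative unnormalized probabilities over the K topics.
        double[] p = new double[K];
        double total = 0.0;
        for (int k = 0; k < K; k++) {
            double pk = (nw[term][k] + beta) / (nwsum[k] + V * beta)
                      * (nd[m][k] + alpha) / (ndsum[m] + K * alpha);
            total += pk;
            p[k] = total;
        }

        // Inverse-CDF draw: pick the first topic whose cumulative mass exceeds u.
        double u = rng.nextDouble() * total;
        int topic = 0;
        while (topic < K - 1 && u >= p[topic]) {
            topic++;
        }

        // Record the new assignment in the counts and in z.
        nw[term][topic]++;
        nd[m][topic]++;
        nwsum[topic]++;
        ndsum[m]++;
        z[m][n] = topic;
        return topic;
    }
}
```

Because the probabilities are only needed up to a constant, the sketch scales the uniform draw by the running total rather than normalizing p.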
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
Corpus corpus
LDAOptions LDAOptions
int[][] documents
int V
int K
double alpha
double beta
int[][] z
int[][] nw
int[][] nd
int[] nwsum
int[] ndsum
double[][] thetasum
double[][] phisum
int numstats
private static int THIN_INTERVAL
private static int BURN_IN
private static int ITERATIONS
private static int SAMPLE_LAG
private static int dispcol
static java.lang.String[] shades
static java.text.NumberFormat lnf
Constructor Detail |
---|
public LdaGibbsSampler(int[][] documents, int V)
documents - a 2D integer array where documents[m][n] is the term index in the vocabulary for the n-th word of the m-th document. Indices always start from 0.
V - vocabulary size
public LdaGibbsSampler(LDAOptions LDAOptions)
public LdaGibbsSampler()
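As an illustration of the documents layout expected by the two-argument constructor, the sketch below maps tokenized text to 0-based term indices. The class `DocumentIndexSketch` and its members are hypothetical helpers and are not part of jml. How tokens are produced is left open.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of jml: maps tokenized text to term indices. */
final class DocumentIndexSketch {
    /** Term -> index, in order of first appearance; indices start at 0. */
    final Map<String, Integer> vocabulary = new LinkedHashMap<>();

    int[][] index(String[][] tokenizedDocs) {
        int[][] documents = new int[tokenizedDocs.length][];
        for (int m = 0; m < tokenizedDocs.length; m++) {
            String[] tokens = tokenizedDocs[m];
            documents[m] = new int[tokens.length];
            for (int n = 0; n < tokens.length; n++) {
                Integer id = vocabulary.get(tokens[n]);
                if (id == null) {
                    // Unseen term: assign the next free index.
                    id = vocabulary.size();
                    vocabulary.put(tokens[n], id);
                }
                documents[m][n] = id;
            }
        }
        return documents;
    }
}
```

The returned array and `vocabulary.size()` would then play the roles of documents and V in `LdaGibbsSampler(int[][] documents, int V)`.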
Method Detail |
---|
public static void main(java.lang.String[] args)
Driver with example data.
args -
public void readCorpusFromDocTermCountArray(java.util.ArrayList<java.util.TreeMap<java.lang.Integer,java.lang.Integer>> docTermCountArray)
Load corpus and documents from an ArrayList<TreeMap<Integer, Integer>> instance. Each element of the ArrayList is a doc-term count mapping.
docTermCountArray - an ArrayList<TreeMap<Integer, Integer>> instance; each element of the ArrayList records the doc-term count mapping for the corresponding document.
public void readCorpusFromLDAInputFile(java.lang.String LDAInputDataFilePath)
Load corpus and documents from an LDAInput file.
LDAInputDataFilePath - the path of the LDAInput file.
public void readCorpusFromDocTermCountFile(java.lang.String docTermCountFilePath)
Load corpus and documents from a text file located at String docTermCountFilePath.
docTermCountFilePath - a String specifying the location of the text file holding doc-term-count matrix data.
public void readCorpusFromMatrix(org.apache.commons.math.linear.RealMatrix X)
Load corpus and documents from a RealMatrix instance.
X - a matrix in which each column is the term count vector for a document, with X(i, j) being the number of occurrences of the i-th vocabulary term in the j-th document
public int[][] initialState(int K)
K - number of topics
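The method summary describes the initialization as random assignments with equal probabilities. A minimal sketch of that idea follows. It assumes the assignments have the same shape as documents, and it does not reproduce the count bookkeeping that the real method presumably also performs. The class and method names are hypothetical.

```java
import java.util.Random;

/** Hypothetical sketch, not the jml source: uniform random topic assignments. */
final class InitialStateSketch {
    static int[][] randomAssignments(int[][] documents, int K, Random rng) {
        int[][] z = new int[documents.length][];
        for (int m = 0; m < documents.length; m++) {
            z[m] = new int[documents[m].length];
            for (int n = 0; n < z[m].length; n++) {
                // Each of the K topics is equally likely for every word.
                z[m][n] = rng.nextInt(K);
            }
        }
        return z;
    }

    public static void main(String[] args) {
        int[][] documents = {{0, 1, 1}, {2, 0}};
        int[][] z = randomAssignments(documents, 2, new Random(42));
        System.out.println(java.util.Arrays.deepToString(z));
    }
}
```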
public void gibbs(int K, double alpha, double beta)
K - number of topics
alpha - symmetric prior parameter on document--topic associations
beta - symmetric prior parameter on topic--term associations
private int sampleFullConditional(int m, int n)
m - document
n - word
private void updateParams()
public double[][] getTheta()
public double[][] getPhi()
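This page does not state how getTheta() and getPhi() compute their estimates. A common choice is to smooth the counts with the symmetric Dirichlet priors. The sketch below assumes that form for a single sampler state. The thetasum, phisum and numstats fields suggest the class may instead average such values over several samples, and that averaging is not shown here. All names are hypothetical.

```java
/** Hypothetical sketch, not the jml source: point estimates from a single sampler state. */
final class EstimateSketch {
    /** theta[m][k]: estimated share of topic k in document m. */
    static double[][] theta(int[][] nd, int[] ndsum, int K, double alpha) {
        double[][] theta = new double[nd.length][K];
        for (int m = 0; m < nd.length; m++) {
            for (int k = 0; k < K; k++) {
                theta[m][k] = (nd[m][k] + alpha) / (ndsum[m] + K * alpha);
            }
        }
        return theta;
    }

    /** phi[k][t]: estimated probability of term t under topic k. */
    static double[][] phi(int[][] nw, int[] nwsum, double beta) {
        int V = nw.length;
        int K = nwsum.length;
        double[][] phi = new double[K][V];
        for (int k = 0; k < K; k++) {
            for (int t = 0; t < V; t++) {
                phi[k][t] = (nw[t][k] + beta) / (nwsum[k] + V * beta);
            }
        }
        return phi;
    }
}
```

Under this form every row of theta and phi sums to 1, which is a quick sanity check on the count arrays.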
public static double[] hist(double[] data, int fmax)
data - vector of evidence
fmax - max frequency in display
public void configure(int iterations, int burnIn, int thinInterval, int sampleLag)
iterations - total number of iterations
burnIn - number of burn-in iterations
thinInterval - interval at which statistics are updated
sampleLag - sample interval (-1 for just one sample at the end)
public void configure(LDAOptions LDAOptions)
public void run()
public void run(LDAOptions LDAOptions)
public static void run(Corpus corpus, LDAOptions LDAOptions)
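To make the interaction between burnIn and sampleLag in configure concrete, the sketch below lists the iterations at which a sample would be recorded under one plausible schedule. The exact conditions used by jml are not given on this page, so the test `i > burnIn && i % sampleLag == 0` and the handling of a sampleLag of -1 are assumptions. thinInterval is not modeled, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not the jml source: which iterations contribute a sample. */
final class ScheduleSketch {
    static List<Integer> sampleIterations(int iterations, int burnIn, int sampleLag) {
        List<Integer> taken = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            // A full sweep over all words would run here.
            if (i > burnIn && sampleLag > 0 && i % sampleLag == 0) {
                taken.add(i);
            }
        }
        // Assumed meaning of a non-positive lag: a single sample after the last sweep.
        if (sampleLag <= 0 && iterations > 0) {
            taken.add(iterations - 1);
        }
        return taken;
    }

    public static void main(String[] args) {
        System.out.println(sampleIterations(10, 4, 2));
        System.out.println(sampleIterations(10, 4, -1));
    }
}
```

Discarding the burn-in sweeps and spacing the samples reduces the influence of the random initial state and of correlation between consecutive states.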
public static java.lang.String shadeDouble(double d, double max)
d - value
max - maximum value
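The readCorpusFrom... methods above accept per-document term counts, while the sampler itself works on documents[m][n] term indices. The sketch below shows one way such counts could be expanded into that layout. It assumes the map keys are already 0-based term indices and the values are occurrence counts, and it emits terms in key order since word order does not affect the counts. The class `TermCountSketch` is hypothetical, and jml's own conversion may differ.

```java
import java.util.ArrayList;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of jml: expands term counts into word lists. */
final class TermCountSketch {
    static int[][] expand(ArrayList<TreeMap<Integer, Integer>> docTermCountArray) {
        int[][] documents = new int[docTermCountArray.size()][];
        for (int m = 0; m < documents.length; m++) {
            TreeMap<Integer, Integer> counts = docTermCountArray.get(m);

            // Document length is the sum of all term counts.
            int length = 0;
            for (int c : counts.values()) {
                length += c;
            }
            documents[m] = new int[length];

            // Repeat each term index as many times as it occurs.
            int n = 0;
            for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
                for (int c = 0; c < e.getValue(); c++) {
                    documents[m][n++] = e.getKey();
                }
            }
        }
        return documents;
    }
}
```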