jml.topics
Class LdaGibbsSampler

java.lang.Object
  extended by jml.topics.LdaGibbsSampler

public class LdaGibbsSampler
extends java.lang.Object

Gibbs sampler for estimating the best assignments of topics for words and documents in a corpus. The algorithm is introduced in Tom Griffiths' paper "Gibbs sampling in the generative model of Latent Dirichlet Allocation" (2002).

Author:
heinrich

Field Summary
(package private)  double alpha
          Dirichlet parameter (document--topic associations)
(package private)  double beta
          Dirichlet parameter (topic--term associations)
private static int BURN_IN
          burn-in period
(package private)  Corpus corpus
           
private static int dispcol
           
(package private)  int[][] documents
          document data (word lists) documents[m][n] is the term index in the vocabulary for the n-th word of the m-th document
private static int ITERATIONS
          max iterations
(package private)  int K
          number of topics
(package private)  LDAOptions LDAOptions
           
(package private) static java.text.NumberFormat lnf
           
(package private)  int[][] nd
          nd[i][j] number of words in document i assigned to topic j.
(package private)  int[] ndsum
          ndsum[i] total number of words in document i.
(package private)  int numstats
          size of statistics
(package private)  int[][] nw
          nw[i][j] number of instances of term i assigned to topic j.
(package private)  int[] nwsum
          nwsum[j] total number of words assigned to topic j.
(package private)  double[][] phisum
          cumulative statistics of phi
private static int SAMPLE_LAG
          sample lag (if -1 only one sample taken)
(package private) static java.lang.String[] shades
           
(package private)  double[][] thetasum
          cumulative statistics of theta
private static int THIN_INTERVAL
          thinning interval (interval at which statistics are updated)
(package private)  int V
          vocabulary size
(package private)  int[][] z
          topic assignments for each word.
 
Constructor Summary
LdaGibbsSampler()
           
LdaGibbsSampler(int[][] documents, int V)
          Initialize the Gibbs sampler with data.
LdaGibbsSampler(LDAOptions LDAOptions)
           
 
Method Summary
 void configure(int iterations, int burnIn, int thinInterval, int sampleLag)
          Configure the Gibbs sampler.
 void configure(LDAOptions LDAOptions)
           
 double[][] getPhi()
          Retrieve estimated topic--word associations.
 double[][] getTheta()
          Retrieve estimated document--topic associations.
 void gibbs(int K, double alpha, double beta)
          Main method: select an initial state, then repeat a large number of times: (1) select an element; (2) update it conditional on the other elements.
static double[] hist(double[] data, int fmax)
          Print table of multinomial data
 int[][] initialState(int K)
          Initialization: the sampler must start with an assignment of observations to topics. Many alternatives are possible; I chose to perform random assignments with equal probabilities.
static void main(java.lang.String[] args)
          Driver with example data.
 void readCorpusFromDocTermCountArray(java.util.ArrayList<java.util.TreeMap<java.lang.Integer,java.lang.Integer>> docTermCountArray)
          Load corpus and documents from an ArrayList<TreeMap<Integer, Integer>> instance.
 void readCorpusFromDocTermCountFile(java.lang.String docTermCountFilePath)
          Load corpus and documents from a text file located at String docTermCountFilePath.
 void readCorpusFromLDAInputFile(java.lang.String LDAInputDataFilePath)
          Load corpus and documents from an LDAInput file.
 void readCorpusFromMatrix(org.apache.commons.math.linear.RealMatrix X)
          Load corpus and documents from a RealMatrix instance.
 void run()
           
static void run(Corpus corpus, LDAOptions LDAOptions)
           
 void run(LDAOptions LDAOptions)
           
private  int sampleFullConditional(int m, int n)
          Sample a topic z_i from the full conditional distribution: p(z_i = j | z_-i, w) = (n_-i,j(w_i) + beta)/(n_-i,j(.) + W * beta) * (n_-i,j(d_i) + alpha)/(n_-i,.(d_i) + K * alpha)
static java.lang.String shadeDouble(double d, double max)
          Create a string representation whose gray value appears as an indicator of magnitude, cf. Hinton diagrams in statistics.
private  void updateParams()
          Add to the statistics the values of theta and phi for the current state.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

corpus

Corpus corpus

LDAOptions

LDAOptions LDAOptions

documents

int[][] documents
document data (word lists) documents[m][n] is the term index in the vocabulary for the n-th word of the m-th document


V

int V
vocabulary size


K

int K
number of topics


alpha

double alpha
Dirichlet parameter (document--topic associations)


beta

double beta
Dirichlet parameter (topic--term associations)


z

int[][] z
topic assignments for each word.


nw

int[][] nw
nw[i][j] number of instances of term i assigned to topic j.


nd

int[][] nd
nd[i][j] number of words in document i assigned to topic j.


nwsum

int[] nwsum
nwsum[j] total number of words assigned to topic j.


ndsum

int[] ndsum
ndsum[i] total number of words in document i.


thetasum

double[][] thetasum
cumulative statistics of theta


phisum

double[][] phisum
cumulative statistics of phi


numstats

int numstats
size of statistics


THIN_INTERVAL

private static int THIN_INTERVAL
thinning interval (interval at which statistics are updated)


BURN_IN

private static int BURN_IN
burn-in period


ITERATIONS

private static int ITERATIONS
max iterations


SAMPLE_LAG

private static int SAMPLE_LAG
sample lag (if -1 only one sample taken)


dispcol

private static int dispcol

shades

static java.lang.String[] shades

lnf

static java.text.NumberFormat lnf

Constructor Detail

LdaGibbsSampler

public LdaGibbsSampler(int[][] documents,
                       int V)
Initialize the Gibbs sampler with data.

Parameters:
documents - a 2D integer array where documents[m][n] is the term index in the vocabulary for the n-th word of the m-th document. Indices always start from 0.
V - vocabulary size

LdaGibbsSampler

public LdaGibbsSampler(LDAOptions LDAOptions)

LdaGibbsSampler

public LdaGibbsSampler()

Method Detail

main

public static void main(java.lang.String[] args)
Driver with example data.

Parameters:
args -

readCorpusFromDocTermCountArray

public void readCorpusFromDocTermCountArray(java.util.ArrayList<java.util.TreeMap<java.lang.Integer,java.lang.Integer>> docTermCountArray)
Load corpus and documents from an ArrayList<TreeMap<Integer, Integer>> instance. Each element of the ArrayList is the term-count mapping of one document.

Parameters:
docTermCountArray - an ArrayList<TreeMap<Integer, Integer>> instance; each element of the ArrayList records the term-count mapping for the corresponding document.
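
The relationship between a doc-term count mapping and the word-list form used by the documents field can be sketched as follows. This is a standalone illustration, not the library's implementation: the class DocTermCountSketch and the method toDocuments are hypothetical names, and the sketch assumes that the map keys are 0-based term indices and the values are occurrence counts.

```java
import java.util.ArrayList;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: expands per-document term counts into word lists. */
class DocTermCountSketch {

    /** Hypothetical helper; keys are assumed to be 0-based term indices. */
    static int[][] toDocuments(ArrayList<TreeMap<Integer, Integer>> docTermCountArray) {
        int[][] documents = new int[docTermCountArray.size()][];
        for (int m = 0; m < documents.length; m++) {
            TreeMap<Integer, Integer> counts = docTermCountArray.get(m);
            // The document length is the sum of all term counts.
            int length = 0;
            for (int count : counts.values()) {
                length += count;
            }
            documents[m] = new int[length];
            // Repeat each term index as many times as it occurs.
            int n = 0;
            for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
                for (int r = 0; r < entry.getValue(); r++) {
                    documents[m][n++] = entry.getKey();
                }
            }
        }
        return documents;
    }
}
```

Word order within a document is lost in the count representation, which does not matter for LDA because the model treats each document as a bag of words.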

readCorpusFromLDAInputFile

public void readCorpusFromLDAInputFile(java.lang.String LDAInputDataFilePath)
Load corpus and documents from an LDAInput file.

Parameters:
LDAInputDataFilePath - the path of the LDAInput file.

readCorpusFromDocTermCountFile

public void readCorpusFromDocTermCountFile(java.lang.String docTermCountFilePath)
Load corpus and documents from a text file located at String docTermCountFilePath.

Parameters:
docTermCountFilePath - A String specifying the location of the text file holding doc-term-count matrix data.

readCorpusFromMatrix

public void readCorpusFromMatrix(org.apache.commons.math.linear.RealMatrix X)
Load corpus and documents from a RealMatrix instance.

Parameters:
X - a matrix in which each column is the term count vector of a document; X(i, j) is the number of occurrences of the i-th vocabulary term in the j-th document

initialState

public int[][] initialState(int K)
Initialization: the sampler must start with an assignment of observations to topics. Many alternatives are possible; I chose to perform random assignments with equal probabilities.

Parameters:
K - number of topics
Returns:
assignment of topics to words
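
A minimal sketch of such a random initialization, assuming the count arrays have the layout described in the field summary (nw, nd, nwsum, ndsum). The class InitialStateSketch and its constructor signature are hypothetical; the actual method may organize this work differently.

```java
import java.util.Random;

/** Illustrative only: random initial topic assignment plus count bookkeeping. */
class InitialStateSketch {
    final int[][] z;     // z[m][n]: topic of the n-th word of document m
    final int[][] nw;    // nw[t][k]: instances of term t assigned to topic k
    final int[][] nd;    // nd[m][k]: words in document m assigned to topic k
    final int[] nwsum;   // nwsum[k]: total words assigned to topic k
    final int[] ndsum;   // ndsum[m]: total words in document m

    InitialStateSketch(int[][] documents, int V, int K, Random rng) {
        int M = documents.length;
        z = new int[M][];
        nw = new int[V][K];
        nd = new int[M][K];
        nwsum = new int[K];
        ndsum = new int[M];
        for (int m = 0; m < M; m++) {
            int N = documents[m].length;
            z[m] = new int[N];
            for (int n = 0; n < N; n++) {
                int topic = rng.nextInt(K); // uniform over the K topics
                z[m][n] = topic;
                nw[documents[m][n]][topic]++;
                nd[m][topic]++;
                nwsum[topic]++;
            }
            ndsum[m] = N;
        }
    }
}
```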

gibbs

public void gibbs(int K,
                  double alpha,
                  double beta)
Main method: select an initial state, then repeat a large number of times: (1) select an element; (2) update it conditional on the other elements. If appropriate, output a summary for each run.

Parameters:
K - number of topics
alpha - symmetric prior parameter on document--topic associations
beta - symmetric prior parameter on topic--term associations

sampleFullConditional

private int sampleFullConditional(int m,
                                  int n)
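Sample a topic z_i from the full conditional distribution: p(z_i = j | z_-i, w) = (n_-i,j(w_i) + beta)/(n_-i,j(.) + W * beta) * (n_-i,j(d_i) + alpha)/(n_-i,.(d_i) + K * alpha). All counts exclude the current word i: n_-i,j(w_i) is the number of instances of term w_i assigned to topic j, n_-i,j(.) is the total number of words assigned to topic j, n_-i,j(d_i) is the number of words in document d_i assigned to topic j, and n_-i,.(d_i) is the total number of words in document d_i. W is the vocabulary size (the field V) and K is the number of topics.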

Parameters:
m - index of the document
n - position of the word within document m
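
The formula above can be sketched as a standalone update step. This is an illustration under stated assumptions rather than the method's actual source: the class FullConditionalSketch and the static method sample are hypothetical, the count arrays are passed explicitly instead of being read from fields, and the sketch stores the new topic in z itself.

```java
import java.util.Random;

/** Illustrative only: one collapsed Gibbs update for a single word. */
class FullConditionalSketch {

    /**
     * Hypothetical standalone version: resamples the topic of the n-th word
     * of document m, updates all count arrays in place, and returns the topic.
     */
    static int sample(int m, int n, int[][] documents, int[][] z,
                      int[][] nw, int[][] nd, int[] nwsum, int[] ndsum,
                      double alpha, double beta, Random rng) {
        int V = nw.length;     // vocabulary size (W in the formula)
        int K = nwsum.length;  // number of topics
        int term = documents[m][n];
        int old = z[m][n];

        // Remove the current word from all counts (the "-i" in the formula).
        nw[term][old]--;
        nd[m][old]--;
        nwsum[old]--;
        ndsum[m]--;

        // Cumulative unnormalized probabilities over the K topics.
        double[] cumulative = new double[K];
        double total = 0.0;
        for (int k = 0; k < K; k++) {
            total += (nw[term][k] + beta) / (nwsum[k] + V * beta)
                   * (nd[m][k] + alpha) / (ndsum[m] + K * alpha);
            cumulative[k] = total;
        }

        // Draw a topic by inverting the cumulative distribution.
        double u = rng.nextDouble() * total;
        int topic = 0;
        while (topic < K - 1 && u >= cumulative[topic]) {
            topic++;
        }

        // Add the word back under its newly sampled topic.
        nw[term][topic]++;
        nd[m][topic]++;
        nwsum[topic]++;
        ndsum[m]++;
        z[m][n] = topic;
        return topic;
    }
}
```

Because the probabilities are only needed up to a constant factor, the sketch scales the uniform draw by the running total instead of normalizing the vector.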

updateParams

private void updateParams()
Add to the statistics the values of theta and phi for the current state.


getTheta

public double[][] getTheta()
Retrieve estimated document--topic associations. If sample lag > 0 then the mean value of all sampled statistics for theta[][] is taken.

Returns:
theta multinomial mixture of document topics (M x K)

getPhi

public double[][] getPhi()
Retrieve estimated topic--word associations. If sample lag > 0 then the mean value of all sampled statistics for phi[][] is taken.

Returns:
phi multinomial mixture of topic words (K x V)
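
For a single sample, theta and phi are commonly estimated from the count arrays with Dirichlet smoothing. The sketch below shows that standard estimate; it is an assumption that getTheta() and getPhi() compute exactly this, and the class PointEstimateSketch and its method names are hypothetical. When several samples are taken, the same per-sample values would be accumulated (cf. thetasum, phisum) and divided by the number of samples.

```java
/** Illustrative only: smoothed point estimates from one Gibbs state. */
class PointEstimateSketch {

    /** theta[m][k] = (nd[m][k] + alpha) / (ndsum[m] + K * alpha), an M x K matrix. */
    static double[][] theta(int[][] nd, int[] ndsum, double alpha) {
        int M = nd.length;
        int K = nd[0].length;
        double[][] theta = new double[M][K];
        for (int m = 0; m < M; m++) {
            for (int k = 0; k < K; k++) {
                theta[m][k] = (nd[m][k] + alpha) / (ndsum[m] + K * alpha);
            }
        }
        return theta;
    }

    /** phi[k][t] = (nw[t][k] + beta) / (nwsum[k] + V * beta), a K x V matrix. */
    static double[][] phi(int[][] nw, int[] nwsum, double beta) {
        int V = nw.length;
        int K = nwsum.length;
        double[][] phi = new double[K][V];
        for (int k = 0; k < K; k++) {
            for (int t = 0; t < V; t++) {
                phi[k][t] = (nw[t][k] + beta) / (nwsum[k] + V * beta);
            }
        }
        return phi;
    }
}
```

Each row of either matrix sums to 1, so a row of theta can be read as a document's topic distribution and a row of phi as a topic's term distribution.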

hist

public static double[] hist(double[] data,
                            int fmax)
Print table of multinomial data

Parameters:
data - vector of evidence
fmax - max frequency in display
Returns:
the scaled histogram bin values

configure

public void configure(int iterations,
                      int burnIn,
                      int thinInterval,
                      int sampleLag)
Configure the Gibbs sampler.

Parameters:
iterations - number of total iterations
burnIn - number of burn-in iterations
thinInterval - update statistics interval
sampleLag - sample interval (-1 for just one sample at the end)
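
One plausible reading of how iterations, burnIn, and sampleLag interact is sketched below. The rule used here (a sample is taken at every iteration index that exceeds burnIn and is a multiple of sampleLag; a non-positive sampleLag yields a single sample from the final state) is an assumption for illustration, and the class SampleScheduleSketch is hypothetical. The sketch does not model thinInterval.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: which iteration indices contribute a sample. */
class SampleScheduleSketch {

    /** Assumed rule: sample when i > burnIn and i is a multiple of sampleLag. */
    static List<Integer> sampledIterations(int iterations, int burnIn, int sampleLag) {
        List<Integer> sampled = new ArrayList<>();
        if (sampleLag <= 0) {
            // A single sample, read from the state after the last iteration.
            if (iterations > 0) {
                sampled.add(iterations - 1);
            }
            return sampled;
        }
        for (int i = 0; i < iterations; i++) {
            if (i > burnIn && i % sampleLag == 0) {
                sampled.add(i);
            }
        }
        return sampled;
    }
}
```

Under this assumed rule, 10 iterations with a burn-in of 3 and a sample lag of 2 would take samples at iterations 4, 6, and 8.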

configure

public void configure(LDAOptions LDAOptions)

run

public void run()

run

public void run(LDAOptions LDAOptions)

run

public static void run(Corpus corpus,
                       LDAOptions LDAOptions)

shadeDouble

public static java.lang.String shadeDouble(double d,
                                           double max)
Create a string representation whose gray value appears as an indicator of magnitude, cf. Hinton diagrams in statistics.

Parameters:
d - value
max - maximum value
Returns:
a string representation for a value
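
The idea of a text "gray value" can be sketched by mapping d / max onto a small ramp of increasingly dense strings. The ramp contents, the rounding, and the clamping below are assumptions for illustration; the class ShadeSketch is hypothetical, and the actual contents of the shades field are not documented here.

```java
/** Illustrative only: renders a magnitude as a density-coded string. */
class ShadeSketch {

    // Hypothetical eleven-level ramp, from empty to full.
    static final String[] SHADES = {
        "     ", ".    ", ":    ", ":.   ", "::   ", "::.  ",
        ":::  ", ":::. ", ":::: ", "::::.", ":::::"
    };

    /** Rounds d / max to the nearest tenth and clamps it to the ramp. */
    static String shade(double d, double max) {
        int level = (int) Math.floor(d * 10 / max + 0.5);
        level = Math.max(0, Math.min(10, level));
        return "[" + SHADES[level] + "]";
    }
}
```

Printing one such string per matrix entry gives a rough visual impression of a theta or phi matrix on a text console.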