- `parallelLDA` contains various implementations of multithreaded LDA
- `singleLDA` contains various implementations of single-threaded LDA
- `topwords` is a tool to explore the topics learnt by LDA/HDP
- `perplexity` is a tool to calculate perplexity on another dataset using the word|topic matrix
- `datagen` packages txt files for our program
- `preprocessing` converts from UCI or cLDA format to a simple txt file with one document per line
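- Code lives under `src`, within the respective folder
- `scripts` holds the helper scripts (`get_data.sh`, `prepare.sh`, `lda_runner.sh`)
- `data` is a placeholder folder for the input data
- `build` and `dist` folders will be created to hold the executables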
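To illustrate the kind of conversion `preprocessing` is described as doing, here is a minimal Python sketch, not the repository's own code. It assumes the standard UCI bag-of-words layout (three header lines `D`, `W`, `NNZ`, followed by `docID wordID count` triples with 1-based IDs) and assumes that the target format is space-separated words, one document per line. The function name `uci_to_lines` is hypothetical.

```python
from collections import defaultdict


def uci_to_lines(docword_lines, vocab):
    """Convert UCI bag-of-words triples to one space-separated document per line.

    docword_lines: iterable of strings in the UCI layout (header lines D, W,
    NNZ, then "docID wordID count" triples with 1-based IDs).
    vocab: list of words, where vocab[i] is the word with wordID i + 1.
    """
    rows = iter(docword_lines)
    num_docs = int(next(rows))
    next(rows)  # W (vocabulary size); not needed because vocab is given
    next(rows)  # NNZ (number of triples); not needed for the conversion

    docs = defaultdict(list)
    for row in rows:
        if not row.strip():
            continue
        doc_id, word_id, count = map(int, row.split())
        # Repeat each word `count` times. Word order inside a document is
        # lost in the bag-of-words format, so any order is equally valid.
        docs[doc_id].extend([vocab[word_id - 1]] * count)

    # Emit documents in ID order, keeping empty documents as empty lines.
    return [" ".join(docs[d]) for d in range(1, num_docs + 1)]


example = ["2", "3", "3", "1 1 2", "1 3 1", "2 2 1"]
print(uci_to_lines(example, ["apple", "banana", "cherry"]))
# prints ['apple apple cherry', 'banana']
```

For the real datasets, use `scripts/prepare.sh`; the sketch only shows the shape of the transformation.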
We will show how to run our LDA on a UCI bag-of-words dataset. Run the following commands in order:

1. `make`
2. `scripts/get_data.sh`
3. `scripts/prepare.sh data/nytimes 1` (for other datasets, replace `nytimes` with the dataset name or location)
4. `scripts/lda_runner.sh`
Inside `lda_runner.sh`, all the parameters (e.g. number of topics, hyperparameters of the LDA, number of threads) can be specified. By default the outputs are stored under `out/`. You can also specify which LDA inference algorithm to run:

1. `simpleLDA`: Plain vanilla Gibbs sampling by Griffiths04
2. `sparseLDA`: Sparse LDA of Yao09
3. `aliasLDA`: Alias LDA
4. `FTreeLDA`: F++LDA (inspired by Yu14)
5. `lightLDA`: light LDA of Yuan14
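For readers unfamiliar with the baseline, the following is a minimal Python sketch of textbook collapsed Gibbs sampling for LDA, the general technique that `simpleLDA` is described as implementing. It is not the repository's code: the function name `gibbs_lda`, its parameters, and the default hyperparameters are assumptions made for illustration. Each token's topic is resampled with probability proportional to `(n_dk + alpha) * (n_wk + beta) / (n_k + V * beta)`, with the token's own assignment removed from the counts first.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, sweeps=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word IDs."""
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]                # topic counts per document
    n_wk = [[0] * num_topics for _ in range(vocab_size)]   # topic counts per word
    n_k = [0] * num_topics                                 # total tokens per topic
    z = []                                                 # topic of every token

    # Random initialisation of the topic assignments.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(num_topics)
            z[d].append(k)
            n_dk[d][k] += 1
            n_wk[w][k] += 1
            n_k[k] += 1

    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_wk[w][k] -= 1
                n_k[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (n_dk[d][t] + alpha) * (n_wk[w][t] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the new assignment.
                z[d][i] = k
                n_dk[d][k] += 1
                n_wk[w][k] += 1
                n_k[k] += 1
    return z, n_dk, n_wk, n_k


toy_docs = [[0, 0, 1, 1], [2, 3, 3, 2], [0, 1, 2, 3]]
z, n_dk, n_wk, n_k = gibbs_lda(toy_docs, num_topics=2, vocab_size=4)
print(n_k)
```

This sketch does work linear in the number of topics for every token. The other algorithms in the list are generally motivated by reducing that per-token cost.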
The makefile has some useful features:

- `make intel`
- `make inteltogether`
- `make`
- `make clean`
Based on our evaluation, F++LDA works best in terms of both speed and perplexity on a held-out dataset. For example, on an Amazon EC2 c4.8xlarge instance we obtained more than 25 million tokens per second. Below we provide a performance comparison against various inference procedures on publicly available datasets.
| Dataset      |         V |               L |            D |       L/V |       L/D |
| ------------ | --------: | --------------: | -----------: | --------: | --------: |
| NY Times     |   101,330 |      99,542,127 |      299,753 |    982.36 |    332.08 |
| PubMed       |   141,043 |     737,869,085 |    8,200,000 |  5,231.52 |     89.98 |
| Wikipedia    |   210,218 |   1,614,349,889 |    3,731,325 |  7,679.41 |    432.65 |
Experimental datasets and their statistics. `V` denotes the vocabulary size, `L` the number of training tokens, `D` the number of documents, `L/V` the average number of occurrences of a word, and `L/D` the average length of a document.
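The evaluation above refers to perplexity on a held-out dataset. As a reminder of what that number measures, here is a minimal Python sketch of the usual definition, `exp(-log-likelihood / number of tokens)`. It is a sketch under assumptions: the `perplexity` tool is only described as using the word|topic matrix, so the per-document topic proportions `theta` passed in here, and the function signature itself, are hypothetical.

```python
import math


def perplexity(docs, theta, phi):
    """Perplexity of docs under a topic model.

    docs: lists of integer word IDs.
    theta[d][k]: probability of topic k in document d (assumed to be given).
    phi[k][w]: probability of word w under topic k (the word|topic matrix).
    """
    log_lik = 0.0
    num_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Marginal probability of the word, summed over topics.
            p_w = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p_w)
            num_tokens += 1
    return math.exp(-log_lik / num_tokens)


# A model that spreads probability uniformly over 4 words has perplexity 4.
uniform_phi = [[0.25] * 4, [0.25] * 4]
print(round(perplexity([[0, 1, 2, 3]], [[0.5, 0.5]], uniform_phi), 6))
# prints 4.0
```

Lower values mean the model assigns higher probability to the held-out tokens.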