
Single Machine implementation of LDA

Modules

  1. parallelLDA: contains various implementations of multi-threaded LDA
  2. singleLDA: contains various implementations of single-threaded LDA
  3. topwords: a tool to explore topics learnt by the LDA/HDP
  4. perplexity: a tool to calculate perplexity on another dataset using the word|topic matrix
  5. datagen: packages txt files for our program
  6. preprocessing: converts UCI or cLDA data to a simple txt file with one document per line (a rough sketch of this conversion follows the list)
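The UCI bag-of-words format mentioned above stores a corpus as a vocab file (one word per line) plus a docword file whose header lists D, W and NNZ, followed by docID wordID count triples sorted by document. As a rough sketch of the kind of conversion the preprocessing module performs, the standalone program below expands such a pair of files into one whitespace-separated document per line; the arguments and exact output layout are illustrative assumptions, not the repository's actual tool.

```cpp
// Hypothetical sketch of a UCI bag-of-words -> one-document-per-line conversion.
// Not the repository's preprocessing tool; file arguments and output layout are
// assumptions for illustration only.
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main(int argc, char** argv) {
    if (argc != 4) {
        std::cerr << "usage: " << argv[0] << " vocab.txt docword.txt out.txt\n";
        return 1;
    }
    std::ifstream vocab_in(argv[1]), docword_in(argv[2]);
    std::ofstream out(argv[3]);

    std::vector<std::string> vocab;            // wordID in docword.txt is 1-based
    for (std::string w; vocab_in >> w; ) vocab.push_back(w);

    std::size_t D, W, NNZ;
    docword_in >> D >> W >> NNZ;               // header of the UCI docword file

    std::size_t cur_doc = 1;
    std::size_t doc, word, count;
    while (docword_in >> doc >> word >> count) {
        while (cur_doc < doc) { out << '\n'; ++cur_doc; }   // new line per document
        for (std::size_t i = 0; i < count; ++i) out << vocab[word - 1] << ' ';
    }
    out << '\n';
    return 0;
}
```

The actual pipeline is driven by scripts/prepare.sh, which may also shard the converted corpus (hence the GNU split requirement below).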

Organisation

  1. All code is under src, within the respective module folders
  2. For running topic models, many template scripts are provided under scripts
  3. data is a placeholder folder where the data should be put
  4. build and dist folders will be created to hold the executables

Requirements

  1. gcc >= 5.0 or Intel® C++ Compiler 2016 for using C++14 features
  2. split >= 8.21 (part of GNU coreutils)

How to use

We will show how to run our LDA on a UCI bag-of-words dataset.

  1. First of all, compile the project:
     make
  2. Download an example dataset from the UCI repository; a script is provided for this:
     scripts/get_data.sh
  3. Prepare the data for our program:
     scripts/prepare.sh data/nytimes 1

For other datasets, replace nytimes with the dataset name or location.

  4. Run LDA!
     scripts/lda_runner.sh

Inside lda_runner.sh all the parameters, e.g. the number of topics, the hyperparameters of the LDA, the number of threads, etc., can be specified. By default the outputs are stored under out/. You can also specify which LDA inference algorithm to run:

  1. simpleLDA: plain vanilla Gibbs sampling by Griffiths04
  2. sparseLDA: Sparse LDA of Yao09
  3. aliasLDA: Alias LDA
  4. FTreeLDA: F++LDA (inspired by Yu14)
  5. lightLDA: light LDA of Yuan14
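
The samplers above differ mainly in how they draw a topic for each token from a K-dimensional discrete distribution. As a generic illustration of the data structure behind aliasLDA and lightLDA (not code from this repository), the sketch below builds a Walker/Vose alias table, which allows O(1) draws from a fixed distribution after O(K) construction; the real samplers additionally have to correct for tables going stale, e.g. with Metropolis-Hastings steps.

```cpp
// Generic sketch of Walker's alias method (Vose's construction), the O(1)
// discrete-sampling trick that alias-table LDA samplers build on. The class
// name and interface are illustrative, not taken from this repository.
#include <cstddef>
#include <random>
#include <vector>

class AliasTable {
 public:
    explicit AliasTable(const std::vector<double>& weights) {
        const std::size_t n = weights.size();
        prob_.resize(n);
        alias_.resize(n);
        double sum = 0.0;
        for (double w : weights) sum += w;       // weights assumed non-negative

        std::vector<double> scaled(n);
        std::vector<std::size_t> small, large;
        for (std::size_t i = 0; i < n; ++i) {
            scaled[i] = weights[i] * n / sum;    // mean of the scaled weights is 1
            (scaled[i] < 1.0 ? small : large).push_back(i);
        }
        while (!small.empty() && !large.empty()) {
            std::size_t s = small.back(); small.pop_back();
            std::size_t l = large.back(); large.pop_back();
            prob_[s] = scaled[s];                // bucket s keeps its own mass
            alias_[s] = l;                       // and borrows the rest from bucket l
            scaled[l] -= 1.0 - scaled[s];
            (scaled[l] < 1.0 ? small : large).push_back(l);
        }
        for (std::size_t i : small) { prob_[i] = 1.0; alias_[i] = i; }  // leftovers ~1
        for (std::size_t i : large) { prob_[i] = 1.0; alias_[i] = i; }
    }

    // Draw one index in O(1): pick a bucket uniformly, then flip a biased coin.
    template <class RNG>
    std::size_t sample(RNG& rng) const {
        std::uniform_int_distribution<std::size_t> bucket(0, prob_.size() - 1);
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        std::size_t i = bucket(rng);
        return coin(rng) < prob_[i] ? i : alias_[i];
    }

 private:
    std::vector<double> prob_;
    std::vector<std::size_t> alias_;
};
```

FTreeLDA takes a different route, keeping the topic weights in an F+ tree so that a single entry can be updated and a sample drawn in O(log K) without rebuilding anything.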

The makefile has some useful features:

  • if you have the Intel® C++ Compiler, you can instead run
     make intel
  • or, if you want to use the Intel® C++ Compiler's cross-file optimization (ipo), run
     make inteltogether
  • you can also selectively compile individual modules by specifying
     make <module-name>
  • or clean an individual module with
     make clean-<module-name>

Performance

Based on our evaluation, F++LDA works the best in terms of both speed and perplexity on a held-out dataset. For example, on an Amazon EC2 c4.8xlarge instance we obtained more than 25 million tokens per second. Below we provide a performance comparison against various inference procedures on publicly available datasets.
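
For reference, and assuming this repository follows the standard convention, held-out perplexity is defined from the per-token log-likelihood of the test corpus:

```latex
% Standard held-out perplexity: N_d is the length of test document d and
% p(w_d) is the model likelihood of its tokens.
\mathrm{perplexity}(\mathcal{D}_{\mathrm{test}})
  = \exp\!\left( -\,\frac{\sum_{d} \log p(\mathbf{w}_d)}{\sum_{d} N_d} \right)
```

Lower is better; the "log-Perplexity with time" plots track the exponent of this expression over wall-clock time.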

Datasets

| Dataset   |       V |             L |         D |      L/V |    L/D |
| --------- | ------: | ------------: | --------: | -------: | -----: |
| NY Times  | 101,330 |    99,542,127 |   299,753 |   982.36 | 332.08 |
| PubMed    | 141,043 |   737,869,085 | 8,200,000 | 5,231.52 |  89.98 |
| Wikipedia | 210,218 | 1,614,349,889 | 3,731,325 | 7,679.41 | 432.65 |

Experimental datasets and their statistics. V denotes the vocabulary size, L denotes the number of training tokens, D denotes the number of documents, L/V indicates the average number of occurrences of a word, and L/D indicates the average length of a document.

log-Perplexity with time
