PLDA: a parallel C/C++ implementation of Latent Dirichlet Allocation (LDA).

This code is licensed under the Mozilla Public License (MPL).

This code is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.

James Petterson
james.petterson@nicta.com.au
http://users.rsise.anu.edu.au/~jpetterson/


Intro:
======

This code implements a collapsed Gibbs sampler for LDA. It can be run
standalone on one machine, or in a cluster, sharing the processing task
among several CPUs. The parallel implementation is completely asynchronous
and doesn't require any external libraries: everything is done through a
shared filesystem.

I wrote this code to experiment with a simple idea for reducing the
synchronization time when running in a cluster. All the details are here:

    http://arxiv.org/0909.4603


Supported Platforms:
====================

This code has been developed and tested on Linux (Ubuntu 8.04 - Hardy) and
Mac OS X (10.5.7), but should work with minor modifications on other
platforms. The only external libraries used are the standard C and Math
libraries (libc and libm).


Compiling the C++ code:
=======================

Since there are no special external library requirements, all you have to
do is type 'make' in a shell.

Optionally, the code can be compiled with debugging information by typing
'make dbg=1'.


Executing:
==========

To run the code type

    ./plda [-k num_topics] [-v dictionary_size] [-a alpha] [-e eta]
           [-r random_generator_seed] [-n num_iterations] [-d data_file]
           [-i input_model_file] [-o output_model_file] [-s shared_dir]
           [-c cpu_number] [-C number_of_cpus] [-t sparsity_threshold]

There are a lot of options here, so it may be easier to read the examples
(below) first.

Here is a description of all parameters:

    num_topics:            number of topics of the LDA model
    dictionary_size:       number of different words in the corpus; if set
                           to zero it will be learned from the dataset;
                           note, however, that when learning the model with
                           one dataset and doing inference on another, the
                           same value must be used for both (more about
                           this later...)
    alpha:                 Dirichlet prior for the document-topic
                           distributions
    eta:                   Dirichlet prior for the term-topic distributions
    random_generator_seed: seed for the random generator; useful if you
                           need repeatability in your experiments
    num_iterations:        number of iterations of the Gibbs sampling
    data_file:             input data file (see notes about the format
                           below)
    input_model_file:      only when doing inference - reads
                           input_model_file.nkv
    output_model_file:     where to save the trained model; creates 3
                           files: output_model_file.beta,
                           output_model_file.theta and
                           output_model_file.nkv (see "Model files" below)

The last 4 parameters are used only when running on more than one CPU:

    shared_dir:            directory used to share information among all
                           CPUs; all CPUs must be able to read and write
                           there
    cpu_number:            each process must have a CPU number, from 0 to
                           number_of_cpus-1
    number_of_cpus:        number of CPUs that are sharing the task
    sparsity_threshold:    level of sparsity used when sharing information
                           among CPUs (from 0 to 1); see the paper for
                           details (link in Intro, above)


Model files:
============

At the end of execution 3 files are created (here 'model' stands for the
value of the -o option):

    model.beta:  the learned term-topic distribution - each line
                 corresponds to the term distribution of a topic.
    model.theta: the learned document-topic distribution - each line
                 corresponds to the topic distribution of a document.
    model.nkv:   the term-topic counts; this is all that is loaded when the
                 -i option is used.


Data format:
============

We use the same data format as in David Blei's code
(http://www.cs.princeton.edu.au/~blei/lda-c).
Each document is represented as a sparse vector of word counts. The data is
a file where each line is of the form:

    [M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]

where [M] is the number of unique terms in the document, and the [count]
associated with each term is how many times that term appeared in the
document. Note that [term_1] is an integer which indexes the term; it is
not a string.

In the scripts directory we provide a Ruby script (convert.rb) to convert
from the docword format, used in the "Bag of Words Data Set" in the UCI
repository (http://archive.ici.uci.edu/ml/datasets/Bag+of+Words).
Example use:

    ruby scripts/convert.rb docword.nips.txt > nips.txt

Note: if you have a relatively new Linux or Mac OS installation you should
already have Ruby. If you don't, check http://www.ruby-lang.org.


Examples:
=========

Single CPU training and testing:
--------------------------------

Let's assume we have the dataset split into two files: data_train.txt and
data_test.txt.

Note: you can split a dataset using the provided split.rb script. E.g., to
split into 90% for training and 10% for testing:

    ruby scripts/split.rb data.txt 0.9

To learn the model:

    ./plda -k 10 -a 0.1 -e 0.01 -n 1500 -d data_train.txt -o model_train

To do inference:

    ./plda -k 10 -a 0.1 -e 0.01 -n 1500 -d data_test.txt -i model_train -o model_test

Note: if the number of different terms (V) is not the same in both files,
you must set it manually with the -v option, using the same value in both
runs.

Multiple CPU training and testing:
----------------------------------

Let's assume we are running on 10 CPUs. First it is necessary to split the
training data into 10 subsets. You can use the cpu_split.rb script for that:

    ruby scripts/cpu_split.rb data_train.txt 10

Now, to learn the model you need to start 10 processes. They can be on
different machines, as long as they can all share a filesystem directory.
The first process will be:

    ./plda -k 10 -a 0.1 -e 0.01 -n 1500 -d data_train.txt.1_of_10 -o model_train_1 -s some_dir -c 0 -C 10 -t 0.1

The second one:

    ./plda -k 10 -a 0.1 -e 0.01 -n 1500 -d data_train.txt.2_of_10 -o model_train_2 -s some_dir -c 1 -C 10 -t 0.1

and so on... (sorry, no scripts provided for that)

They will all run independently, and at the end they should all have the
same shared model (.nkv file). The .theta files, however, are going to be
different, since they correspond to different documents.

To do inference you can choose any one of the CPUs (since the final model
should be the same in all of them):

    ./plda -k 10 -a 0.1 -e 0.01 -n 1500 -d data_test.txt -i model_train_1 -o model_test
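
Since no launcher is provided for the multi-CPU case, here is a minimal
sketch of how the per-process command lines could be generated. It is not
part of the distribution. The function names (plda_worker_cmd,
print_all_workers) are hypothetical, the model parameters are simply copied
from the example above, and the file naming (data_train.txt.<i>_of_<N>) is
assumed to follow the two commands shown above.

```shell
#!/bin/sh
# Sketch only: prints the plda command line for each worker process.
# Assumes the data subsets are named data_train.txt.<i>_of_<N>, as in the
# two example commands above, and that the shared directory is visible to
# every machine that runs a worker.

plda_worker_cmd() {
    # $1 = CPU number (0-based), $2 = number of CPUs, $3 = shared directory
    cpu=$1; ncpus=$2; shared=$3
    part=$((cpu + 1))   # data subsets are numbered from 1, CPUs from 0
    echo "./plda -k 10 -a 0.1 -e 0.01 -n 1500" \
         "-d data_train.txt.${part}_of_${ncpus}" \
         "-o model_train_${part} -s ${shared} -c ${cpu} -C ${ncpus} -t 0.1"
}

print_all_workers() {
    # $1 = number of CPUs, $2 = shared directory
    ncpus=$1; shared=$2
    i=0
    while [ "$i" -lt "$ncpus" ]; do
        plda_worker_cmd "$i" "$ncpus" "$shared"
        i=$((i + 1))
    done
}
```

After loading these functions into a shell, 'print_all_workers 10 some_dir'
prints one command per process. Each line can then be started on its own
machine, or in the background on a single multi-core machine. Printing the
commands rather than running them keeps the sketch independent of how jobs
are dispatched in your cluster.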