OnlineLDA_Spark

Online LDA based on Spark

This repository contains an implementation of online LDA (see http://www.cs.princeton.edu/~mdhoffma/). It was developed with Apache Spark v1.2.0. The code is shared in case it helps others.

Main interfaces:

OnlineLDA_Spark.runOnlineMode(sc: SparkContext, paths: Seq[String], vocab: Map[String, Int], K: Int, batchSize: Int) and

OnlineLDA_Spark.runBatchMode(sc: SparkContext, paths: Seq[String], vocab: Map[String, Int], K: Int, iterations: Int)

where paths are the files to be processed. See Driver.scala for usage examples.
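Both interfaces take a vocab of type Map[String, Int]. The sketch below shows one possible way to build such a map from a word list. It assumes that each word maps to a 0-based index, which this README does not confirm; `VocabSketch` and `buildVocab` are hypothetical names, and Driver.scala remains the authoritative example. The commented calls use placeholder paths and parameter values.

```scala
// Hypothetical helper: the repository's own vocabulary handling may differ.
object VocabSketch {
  // Assumption: each distinct word gets a 0-based index,
  // assigned in order of first appearance.
  def buildVocab(words: Seq[String]): Map[String, Int] =
    words.distinct.zipWithIndex.toMap

  def main(args: Array[String]): Unit = {
    val vocab = buildVocab(Seq("spark", "topic", "model", "spark"))
    println(vocab.size) // number of distinct words

    // With a SparkContext `sc` in scope, the interfaces listed above
    // could then be called along these lines (placeholder arguments):
    //   OnlineLDA_Spark.runOnlineMode(sc, Seq("data/part-0.txt"), vocab, 10, 1000)
    //   OnlineLDA_Spark.runBatchMode(sc, Seq("data/part-0.txt"), vocab, 10, 20)
  }
}
```

Duplicate words collapse to a single entry, so the map size equals the vocabulary size.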

Performance statistics on a 4-node cluster, each node with 16 cores and 30 GB of memory, without a native BLAS library installed (for larger K, a native BLAS library will probably help):

Data set from Stack Overflow post titles

Processed 8 million short articles in 15 minutes (with K = 10 and a vocabulary size of about 110K).

Full English Wikipedia

Processed 5876K documents (average length ~1000 words per document, 30 GB in total) in 2 hours and 17 minutes. Word-topic sample:

(album,score,music,song,band,seed,web,team,single,length)
(align,center,style,left,small,text,width,background,class,color)
(team,football,season,player,goals,players,goal,image,time,years)
(year,news,publisher,years,time,books,book,state,government,work)
(subdivision,population,area,image,map,display,code,footnotes,region,caption)
(talk,span,color,font,page,style,discussion,deletion,small,made)
(author,year,volume,issue,math,species,target,citations,system,publisher)
(web,publisher,film,news,work,series,book,birth,author,language) ...
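A listing like the one above can be produced by taking the highest-weighted words in each topic. The following is a minimal sketch, assuming the model exposes a topic-word weight matrix with one row per topic and one column per vocabulary index; that layout, and the names `TopWordsSketch` and `topWords`, are illustrative assumptions and not taken from the repository.

```scala
// Hypothetical sketch: the repository may store and print topics differently.
object TopWordsSketch {
  // For each topic (row), return the n words with the largest weights.
  def topWords(topicWord: Array[Array[Double]],
               vocab: Map[String, Int],
               n: Int): Seq[Seq[String]] = {
    // Invert the word -> index map so column indices map back to words.
    val indexToWord: Map[Int, String] = vocab.map { case (w, i) => (i, w) }
    topicWord.toSeq.map { row =>
      row.zipWithIndex
        .sortWith(_._1 > _._1)                  // descending by weight
        .take(n)
        .map { case (_, idx) => indexToWord(idx) }
        .toSeq
    }
  }

  def main(args: Array[String]): Unit = {
    // Toy data, made up for illustration only.
    val vocab = Map("team" -> 0, "music" -> 1, "season" -> 2, "album" -> 3)
    val topicWord = Array(
      Array(0.1, 4.0, 0.2, 5.0),
      Array(6.0, 0.1, 3.0, 0.2)
    )
    topWords(topicWord, vocab, 2).foreach(t => println(t.mkString("(", ",", ")")))
  }
}
```

With the toy data above, this prints (album,music) and (team,season), in the same tuple style as the sample.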