LexRank Summarizer

This is a Spark-based extractive summarizer built on the LexRank algorithm. It extracts a five-sentence summary (configurable with the --length option) from each document in the corpus.
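At its core, LexRank scores sentences by building a graph in which sentences are vertices, pairs with cosine similarity above a threshold are edges, and PageRank centrality determines which sentences make the summary. The following is a minimal GraphX sketch of that idea, with toy vectors and illustrative names rather than the project's actual code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object LexRankSketch {
  // Cosine similarity between sparse term-weight vectors.
  def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
    val dot = a.keySet.intersect(b.keySet).iterator.map(k => a(k) * b(k)).sum
    def norm(v: Map[String, Double]) = math.sqrt(v.values.map(x => x * x).sum)
    dot / (norm(a) * norm(b))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lexrank-sketch").setMaster("local[*]"))

    // (sentence id, toy TF-IDF vector)
    val sentences = sc.parallelize(Seq(
      (0L, Map("graph" -> 0.7, "rank" -> 0.3)),
      (1L, Map("graph" -> 0.5, "sentence" -> 0.5)),
      (2L, Map("cats" -> 1.0))
    ))

    // Connect sentence pairs above the similarity threshold (cf. --threshold).
    val threshold = 0.1
    val edges = sentences.cartesian(sentences)
      .filter { case ((i, _), (j, _)) => i < j }
      .flatMap { case ((i, a), (j, b)) =>
        if (cosine(a, b) > threshold) Seq(Edge(i, j, 1.0), Edge(j, i, 1.0)) else Seq.empty
      }

    // Run PageRank to convergence (cf. --convergence); top-ranked sentences form the summary.
    val ranks = Graph(sentences.mapValues(_ => 1.0), edges).pageRank(0.001).vertices
    ranks.sortBy(-_._2).take(2).foreach(println)

    sc.stop()
  }
}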

Boilerplate sentences are detected in two passes: across all documents in the corpus, by flagging frequently occurring sign-random-projection locality-sensitive hashing (SRP-LSH) signatures; and within each document, by estimating cosine similarity between sentences from those SRP-LSH signatures.
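As a rough sketch of the cross-document pass, signatures can be counted with a plain Spark aggregation and sentences whose signature appears in too many distinct documents dropped. The Sentence type, the 64-bit signature encoding, and the document-count cutoff are all assumptions for illustration, not the project's API:

import org.apache.spark.rdd.RDD

// Hypothetical representation: one sentence with its SRP-LSH signature
// packed into a Long (i.e. a 64-bit signature).
case class Sentence(docId: String, text: String, signature: Long)

// Drop sentences whose exact signature occurs in more than maxDocs
// distinct documents, on the theory that they are boilerplate.
def removeFrequent(sentences: RDD[Sentence], maxDocs: Int): RDD[Sentence] = {
  val docCounts = sentences
    .map(s => (s.signature, s.docId))
    .distinct()                      // count each (signature, document) pair once
    .mapValues(_ => 1L)
    .reduceByKey(_ + _)

  sentences
    .map(s => (s.signature, s))
    .join(docCounts)
    .filter { case (_, (_, count)) => count <= maxDocs }
    .map { case (_, (s, _)) => s }
}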

For an explanation of the pooling trick used in this SRP-LSH implementation, see "Online Generation of Locality Sensitive Hash Signatures" (Van Durme & Lall, 2010).
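In outline, the trick is to draw a fixed pool of random values once and index into it with a hash of each (feature, bit position) pair, rather than materializing a full vocabulary-by-signature-length matrix of random projection components. Here is a minimal sketch of that idea, including the standard SRP cosine estimate used for similarity comparisons; the pool size, signature length, and all names are illustrative assumptions:

import scala.util.Random
import scala.util.hashing.MurmurHash3

object PooledSRP {
  val signatureBits = 64
  val poolSize = 10000

  // Fixed seed, so every worker derives the same implicit projection matrix.
  val pool: Array[Double] = {
    val rng = new Random(42)
    Array.fill(poolSize)(rng.nextGaussian())
  }

  // Look up the projection component for one (feature, bit) pair by
  // hashing into the pool instead of storing an explicit matrix entry.
  private def component(feature: String, bit: Int): Double = {
    val h = MurmurHash3.stringHash(feature, bit)
    pool(((h % poolSize) + poolSize) % poolSize)
  }

  // The sign of each random projection becomes one bit of the signature.
  def signature(vector: Map[String, Double]): Array[Boolean] =
    Array.tabulate(signatureBits) { bit =>
      vector.map { case (feature, weight) => weight * component(feature, bit) }.sum >= 0.0
    }

  // Estimate cosine similarity from the Hamming distance between signatures:
  // cos(theta), where theta is approximately pi * (differing bits / total bits).
  def estimateCosine(a: Array[Boolean], b: Array[Boolean]): Double = {
    val hamming = a.zip(b).count { case (x, y) => x != y }
    math.cos(math.Pi * hamming / signatureBits)
  }
}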

Usage

Build a JAR file from the source with sbt assembly. Submit a job to Spark with:

spark-submit --class io.github.karlhigley.lexrank.Driver <path to jar file> [options]

Options:
-i PATH,  --input PATH         Relative path of input files (default: "./input")
-o PATH,  --output PATH        Relative path of output files (default: "./output")
-s VALUE, --stopwords VALUE    Number of stopwords to remove (default: 250)
-l VALUE, --length VALUE       Number of sentences to extract from each document (default: 5) 
-b VALUE, --boilerplate VALUE  Similarity cutoff for cross-document boilerplate filtering (default: 0.8)
-t VALUE, --threshold VALUE    Similarity threshold for LexRank graph construction (default: 0.1)
-c VALUE, --convergence VALUE  Convergence tolerance for the PageRank computation (default: 0.001)
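
For example, to extract three-sentence summaries from documents under ./articles with a stricter graph threshold (the paths here are illustrative):

spark-submit --class io.github.karlhigley.lexrank.Driver <path to jar file> \
  --input ./articles --output ./summaries --length 3 --threshold 0.2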

File Formats

The summarizer expects tab-separated text files with one document per line: a document identifier in the first column and the document text in the second.

Outputs use the same format, with one extracted sentence per line: the source document's identifier in the first column and the sentence text in the second.
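
For illustration, an input line might look like this (with a literal tab, shown here as <TAB>, between columns):

doc-001<TAB>The first sentence of the document. Another sentence. ...

and the corresponding output would contain one line per extracted sentence:

doc-001<TAB>The first sentence of the document.
doc-001<TAB>Another sentence.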