Build Status

Spark-RSVD

Spark-RSVD is a lib to compute approximate SVD decomposition of large sparse matrices (up to 100 million rows and columns) using an iterative algorithm for speed and efficiency.

The iterative algorithm is based on a random initial starting point, hence its name of Randomized SVD algorithm. It is described in [1]. A tree reduce algorithm is used for fast QR decomposition of tall and skinny matrices [2].

Inputs and outputs

Spark-RSVD expects a sparse matrix of the type BlockMatrix as defined here. To help creating such a matrix, a helper function (FromMatrixEntries code) is provided. It accepts an RDD of MatrixEntry, as defined by MLLib, which is a simple case class to indicate non-zero values in coordinate format.

Spark-RSVD also needs some other parameters as inputs which are explained in the parameters section below.

Spark-RSVD outputs the left singular vectors, the singular values and the right singular vectors (if needed). The singular values are stored as a Breeze DenseVector[Double]. The singular vectors are stored as a SkinnyBlockMatrix defined here.

Parameters

Spark-RSVD needs some parameters which are collected together in a case class called RSVDConfig. They are mostly self-explanatory, but here is a description of their meaning:

Sensible configuration

An important configuration part of Spark-RSVD library is the tuning how much memory per partition you should use. Using too much memory per Spark partition will likely results in tasks failing due to Out Of Memory errors, however we provide several configuration parameters to tune how the data should be distributed:

The most memory-intensive operation that occurs within Spark-RSVD computation is a distributed multiplication of a Sparse Matrix with a Dense "tall-and-skinny" matrix, so if we can ensure that this step works, the whole pipeline will likely succeed. Given the way we have chosen to distribute this operation, a given Spark partition will receive during this step the following inputs:

Let's compute the overall amount of data that a partition will receive for a given configuration. For this example we are choosing blockSize = 50000, partitionWidthInBlocks = 35, partitionHeightInBlocks = 10, embeddingDim = 100, oversample = 30 and we are assuming a density of 10-5 for the sparse matrix. This means that a partition will receive the following data during the multiplication step:

Data format

The sparse matrix given to the library should be stored in coordinate format. It expects an RDD of MatrixEntry (from Spark MLLib org.apache.spark.mllib.linalg.distributed), which stores the row and column indices as Long and the value as Double. See example below for further reference.

Example

Here is an example that will compute a 100-dimension embedding on a 200K * 200K matrix

import com.criteo.rsvd._

// Create spark context
val sc: SparkContext = new SparkContext(...)

// Create RSVD configuration
val config = RSVDConfig(
  embeddingDim = 100,
  oversample = 30,
  powerIter = 1,
  seed = 0,
  blockSize = 50000,
  partitionWidthInBlocks = 35,
  partitionHeightInBlocks = 10,
  computeLeftSingularVectors = true,
  computeRightSingularVectors = true
)

val matHeight = 200000 // 200K
val matWidth = 200000 // 200K
val numNonZeroEntries = 400000 // 400K

// Generate a sparse random matrix as an input (doesn't have to be symmetric)
val randomMatrixEntries = sc.parallelize(0 until numNonZeroEntries).map {
  idx =>
    val random = new Random(42 + idx)
    MatrixEntry(random.nextInt(matHeight), //row index
                random.nextInt(matWidth), //column index
                random.nextGaussian()) //entry value
}

val matrixToDecompose = BlockMatrix.fromMatrixEntries(randomMatrixEntries,
                                           matHeight = matHeight,
                                           matWidth = matWidth,
                                           config.blockSize,
                                           config.partitionHeightInBlocks,
                                           config.partitionWidthInBlocks)

val RsvdResults(leftSingularVectors, singularValues, rightSingularVectors) =
  RSVD.run(matrixToDecompose, config, sc)

// Print the top 100 (embeddingDim=100) singular values in decreasing order:
println(singularValues.toString())

// Fetch the left-singular vectors to driver, which will be a 200K x 100 matrix.
// This is available because we set config.computeLeftSingularVectors = true.
val leftSingularOnDriver = leftSingularVectors.get.toLocalMatrix

// Fetch the right-singular vectors to driver, which will be a 200K x 100 matrix.
// This is available because we set config.computeRightSingularVectors = true.
val rightSingularOnDriver = rightSingularVectors.get.toLocalMatrix

Installation

To use, simply add the following dependency to your gradle build file

compile "com.criteo:rsvd:1.0"

or the following dependency to your project's POM file

<dependency>
  <groupId>com.criteo</groupId>
  <artifactId>rsvd</artifactId>
  <version>1.0</version>
  <scope>compile</scope>
</dependency>

References:

[1] Halko, N., Martinsson, P. G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2), 217-288.

[2] Constantine, P. G., & Gleich, D. F. (2011, June). Tall and skinny QR factorizations in MapReduce architectures. In Proceedings of the second international workshop on MapReduce and its applications (pp. 43-50). ACM.

Authors of the initial commit (in alphabetical order):

License

This project is licensed under the Apache 2.0 license.

Copyright

Copyright © Criteo, 2018.