PySpark Cassandra

This PySpark Cassandra repository is no longer maintained. Please check this repository for Spark 2.0+ support: https://github.com/anguenot/pyspark-cassandra



PySpark Cassandra brings back the fun in working with Cassandra data in PySpark.

This module provides Python support for Apache Spark's Resilient Distributed Datasets from Apache Cassandra CQL rows, using the Cassandra Spark Connector within PySpark, both in the interactive shell and in Python programs submitted with spark-submit.

This project was initially forked from https://github.com/Parsely/pyspark-cassandra, but in order to submit it to http://spark-packages.org/, a plain old repository was created.


Compatibility

Feedback on (in-)compatibility is much appreciated.

Spark

The current version of PySpark Cassandra is successfully used with Spark versions 1.5 and 1.6. Use older versions of this package for Spark 1.2, 1.3 or 1.4.

Cassandra

PySpark Cassandra is compatible with Cassandra:

Python

PySpark Cassandra is used with Python 2.7, Python 3.3 and 3.4.

Scala

PySpark Cassandra is currently only packaged for Scala 2.10.

Using with PySpark

With Spark Packages

PySpark Cassandra is published at Spark Packages. This allows for easy usage with Spark through:

spark-submit \
    --packages TargetHolding/pyspark-cassandra:<version> \
    --conf spark.cassandra.connection.host=your,cassandra,node,names \
    yourscript.py

Without Spark Packages

spark-submit \
    --jars /path/to/pyspark-cassandra-assembly-<version>.jar \
    --driver-class-path /path/to/pyspark-cassandra-assembly-<version>.jar \
    --py-files /path/to/pyspark-cassandra-assembly-<version>.jar \
    --conf spark.cassandra.connection.host=your,cassandra,node,names \
    --master spark://spark-master:7077 \
    yourscript.py

(Note that the --driver-class-path is needed due to SPARK-5185. Also note that the assembly includes the Python source files, quite similar to a Python source distribution.)

Using with PySpark shell

Replace spark-submit with pyspark to start the interactive shell (don't provide a script as argument) and then import PySpark Cassandra. Note that when performing this import, the sc variable in pyspark is augmented with the cassandraTable(...) method.

import pyspark_cassandra
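After this import the augmented sc can read a Cassandra table directly. A minimal sketch, assuming a keyspace named "keyspace" with a table named "table" already exists:

# sc is the SparkContext created by the pyspark shell; after importing
# pyspark_cassandra it is augmented with the cassandraTable(...) method
rows = sc.cassandraTable("keyspace", "table")  # a CassandraRDD of CQL rows
print(rows.first())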

Building

For Spark Packages, PySpark Cassandra can be compiled using:

sbt compile

The package can be published locally with:

sbt spPublishLocal

The package can be published to Spark Packages with (requires authentication and authorization):

make publish

For local testing / without Spark Packages

A Java / JVM library as well as a Python library is required to use PySpark Cassandra. They can be built with:

make dist

This creates a fat JAR with the Spark Cassandra Connector, additional classes for bridging Spark and PySpark for Cassandra data, and the .py source files, at: target/scala-2.10/pyspark-cassandra-assembly-<version>.jar

API

The PySpark Cassandra API aims to stay close to the Cassandra Spark Connector API. Reading its documentation is a good place to start.

pyspark_cassandra.RowFormat

The primary representation of CQL rows in PySpark Cassandra is the ROW format. However, sc.cassandraTable(...) supports the row_format argument, which can be any of the constants defined on RowFormat.
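A minimal sketch of requesting a different representation; RowFormat.DICT is assumed here as one of the available constants, check the RowFormat class for the exact names:

from pyspark_cassandra import RowFormat

# read CQL rows as plain python dicts instead of Row objects
# (RowFormat.DICT is an assumption; consult RowFormat for the available constants)
rdd = sc.cassandraTable("keyspace", "table", row_format=RowFormat.DICT)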

Column values are mapped between CQL and Python as follows:

CQL         Python
---------   ---------------------
ascii       unicode string
bigint      long
blob        bytearray
boolean     boolean
counter     int, long
decimal     decimal
double      float
float       float
inet        str
int         int
map         dict
set         set
list        list
text        unicode string
timestamp   datetime.datetime
timeuuid    uuid.UUID
varchar     unicode string
varint      long
uuid        uuid.UUID
UDT         pyspark_cassandra.UDT

pyspark_cassandra.Row

This is the default type to which CQL rows are mapped. It is directly compatible with pyspark.sql.Row but is (correctly) mutable and provides some other improvements.
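A minimal sketch of constructing and modifying such a row, assuming keyword construction analogous to pyspark.sql.Row (the field names are just for illustration):

from pyspark_cassandra import Row

row = Row(key="x", val=1)  # keyword-constructed, like pyspark.sql.Row
row.val = 2                # but mutable, so in-place updates are allowed
print(row.key, row.val)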

pyspark_cassandra.UDT

This type is structurally identical to pyspark_cassandra.Row but serves user defined types. Mapping to custom python types (e.g. via CQLEngine) is not yet supported.
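A minimal sketch of building a UDT value to store alongside regular columns, assuming the schema defines a user defined type with street and city fields (the keyspace, table and field names are just for illustration):

from pyspark_cassandra import UDT

# UDT values are built like Row objects, from the fields of the user defined type
address = UDT(street="Main Street 1", city="Exampleville")
rdd = sc.parallelize([{"key": "x", "address": address}])
rdd.saveToCassandra("keyspace", "table")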

pyspark_cassandra.CassandraSparkContext

A CassandraSparkContext is very similar to a regular SparkContext. It is created in the same way and can be used to read files, parallelize local data, broadcast a variable, etc. See the Spark Programming Guide for more details. But it exposes one additional method: cassandraTable(...), which reads a table from Cassandra into a CassandraRDD (see the examples below).

pyspark.RDD

PySpark Cassandra supports saving arbitrary RDDs to Cassandra using saveToCassandra(...) (see the examples below).

pyspark_cassandra.CassandraRDD

A CassandraRDD is very similar to a regular RDD in pyspark. It is extended with methods such as select(...) and where(...) to narrow down the data read from Cassandra, as shown in the examples below.

pyspark_cassandra.streaming

When importing pyspark_cassandra.streaming, the method saveToCassandra(...) is made available on DStreams. Support for joining with a Cassandra table through joinWithCassandraTable(...) is also added (see the examples below).

Examples

Creating a SparkContext with Cassandra support

from pyspark import SparkConf
from pyspark_cassandra import CassandraSparkContext

conf = SparkConf() \
    .setAppName("PySpark Cassandra Test") \
    .setMaster("spark://spark-master:7077") \
    .set("spark.cassandra.connection.host", "cas-1")

sc = CassandraSparkContext(conf=conf)

Using select and where to narrow the data in an RDD and then filter, map, reduce and collect it:

sc \
    .cassandraTable("keyspace", "table") \
    .select("col-a", "col-b") \
    .where("key=?", "x") \
    .filter(lambda r: "foo" in r["col-b"]) \
    .map(lambda r: (r["col-a"], 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .collect()

Storing data in Cassandra:

from datetime import datetime, timedelta
from random import random

rdd = sc.parallelize([{
    "key": k,
    "stamp": datetime.now(),
    "val": random() * 10,
    "tags": ["a", "b", "c"],
    "options": {
        "foo": "bar",
        "baz": "qux",
    }
} for k in ["x", "y", "z"]])

rdd.saveToCassandra(
    "keyspace",
    "table",
    ttl=timedelta(hours=1),
)

Create a streaming context and convert every line to a generator of words which are saved to Cassandra. Through this example all unique words are stored in Cassandra.

The words are wrapped as a tuple so that they are in a format which can be stored. A dict or a pyspark_cassandra.Row object would have worked as well.

from pyspark.streaming import StreamingContext
from pyspark_cassandra import streaming

ssc = StreamingContext(sc, 2)

ssc \
    .socketTextStream("localhost", 9999) \
    .flatMap(lambda l: ((w,) for w in l.split())) \
    .saveToCassandra('keyspace', 'words')

ssc.start()

Joining with Cassandra:

joined = rdd \
    .joinWithCassandraTable('keyspace', 'accounts') \
    .on('id') \
    .select('e-mail', 'followers')

for left, right in joined.collect():
    ...

Or with a DStream:

joined = dstream.joinWithCassandraTable('keyspace', 'accounts', ['e-mail', 'followers'], ['id'])
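The result is again a DStream; a minimal sketch of consuming it, assuming it yields (row, joined row) pairs as in the RDD example above:

joined.pprint()  # print a sample of the joined pairs in every batch
ssc.start()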