Spark Google Analytics Library

A library for querying Google Analytics data with Apache Spark, for Spark SQL and DataFrames.

Requirements

This library requires Spark 1.4 or later.

Linking

You can link against this library in your program at the following coordinates:

Scala 2.10

groupId: com.crealytics
artifactId: spark-google-analytics_2.10
version: 1.1.2

Scala 2.11

groupId: com.crealytics
artifactId: spark-google-analytics_2.11
version: 1.1.2
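
For an sbt build, the coordinates above can be written as a single dependency line. This is a minimal sketch that assumes your project's scalaVersion is a 2.10.x or 2.11.x release, so that `%%` resolves to the matching `_2.10` or `_2.11` artifact:

```scala
// build.sbt
// Assumption: scalaVersion is set to a 2.10.x or 2.11.x release,
// so %% appends the matching _2.10 or _2.11 suffix to the artifactId.
libraryDependencies += "com.crealytics" %% "spark-google-analytics" % "1.1.2"
```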

Using with Spark shell

This package can be added to Spark using the --packages command line option. For example, to include it when starting the Spark shell:

Spark compiled with Scala 2.11

$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-google-analytics_2.11:1.1.2

Spark compiled with Scala 2.10

$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-google-analytics_2.10:1.1.2

Features

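This package lets you query Google Analytics reports as Spark DataFrames. The reader accepts authentication options plus query options such as ids, startDate, endDate, queryIndividualDays and calculatedMetrics, as used in the examples below (see the Google Analytics developer docs for details on the query parameters).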

Scala API

Spark 1.4+:

import org.apache.spark.sql.SQLContext

Option 1: Authentication with Service Account ID and P12 Key File

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
    .format("com.crealytics.google.analytics")
    .option("serviceAccountId", "<service-account-email>")
    .option("keyFileLocation", "the_key_file.p12")
    .option("ids", "ga:12345678")
    .option("startDate", "7daysAgo")
    .option("endDate", "yesterday")
    .option("queryIndividualDays", "true")
    .option("calculatedMetrics", "averageEngagement")
    .load()

// You need to select the date column when using queryIndividualDays
df.select("date", "browser", "city", "users", "calcMetric_averageEngagement").show()

OR

Option 2: Authentication with Client ID, Client Secret and Refresh Token

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
    .format("com.crealytics.google.analytics")
    .option("clientId", "XXXXXXXX-xyxyxxxxyxyxxxxxyyyx.apps.googleusercontent.com")
    .option("clientSecret", "73xxYxyxy-XXXYZZx-xZ_Z")
    .option("refreshToken", "1/ezzzxZYzxxyyXYXzyyXXYYyxxxxyyyyxxxy")
    .option("ids", "ga:12345678")
    .option("startDate", "7daysAgo")
    .option("endDate", "yesterday")
    .option("queryIndividualDays", "true")
    .option("calculatedMetrics", "averageEngagement")
    .load()

// You need to select the date column when using queryIndividualDays
df.select("date", "browser", "city", "users", "calcMetric_averageEngagement").show()
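
The two examples differ only in their credential options, while the query options are the same. One way to avoid repeating them is to build the options as plain maps and pass them to the reader in one call. The sketch below is an illustration rather than part of this library: the `GaOptions` object and its method names are hypothetical, and only the option keys are taken from the examples above. It can be pasted into the Spark shell with :paste.

```scala
// Hypothetical helper: GaOptions and its method names are not part of the library.
// Only the option keys ("ids", "serviceAccountId", ...) come from the examples above.
object GaOptions {
  // Query options shared by both authentication methods.
  def query(ids: String, startDate: String, endDate: String): Map[String, String] =
    Map("ids" -> ids, "startDate" -> startDate, "endDate" -> endDate)

  // Option 1: service account ID and P12 key file.
  def serviceAccount(serviceAccountId: String, keyFileLocation: String): Map[String, String] =
    Map("serviceAccountId" -> serviceAccountId, "keyFileLocation" -> keyFileLocation)

  // Option 2: client ID, client secret and refresh token.
  def oauth(clientId: String, clientSecret: String, refreshToken: String): Map[String, String] =
    Map("clientId" -> clientId, "clientSecret" -> clientSecret, "refreshToken" -> refreshToken)
}

// Combine one set of credentials with the query options.
// "<service-account-email>" is a placeholder, not a real account.
val opts = GaOptions.serviceAccount("<service-account-email>", "the_key_file.p12") ++
  GaOptions.query("ga:12345678", "7daysAgo", "yesterday")

// With a SQLContext in scope, as in the examples above:
// val df = sqlContext.read.format("com.crealytics.google.analytics").options(opts).load()
```

Switching to the second authentication method then only means replacing `GaOptions.serviceAccount(...)` with `GaOptions.oauth(...)`; the query options stay untouched.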

Building From Source

This library is built with SBT, which is automatically downloaded by the included shell script. To build a JAR file, run sbt/sbt package from the project root. The build configuration supports both Scala 2.10 and 2.11.