Spark SAS Data Source (sas7bdat)

A library for reading SAS data (.sas7bdat) with Spark.


Requirements:

The examples below use the spark-sas7bdat 2.1.0-s_2.11 artifact, which targets Spark 2.x and Scala 2.11.

Download:

The latest jar can be downloaded from Spark Packages (https://spark-packages.org).
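
If you build with sbt, coordinates along the following lines should work. This is a sketch: the resolver URL and coordinates are assumptions based on the package name above, so verify them against the Spark Packages listing.

// build.sbt -- hypothetical coordinates; verify against the Spark Packages listing.
resolvers += "Spark Packages Repo" at "https://repos.spark-packages.org/"
libraryDependencies += "saurfang" % "spark-sas7bdat" % "2.1.0-s_2.11"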

Features:

NOTE: this package is read-only; it does not support writing .sas7bdat files.

Docs:

Parameters:

  - forceLowercaseNames: when true, all column names are converted to lowercase.
  - inferLong: when true, numeric columns that contain only integral values are read as longs; otherwise all SAS numerics are read as doubles.

Scala API

val df = {
  spark.read
    .format("com.github.saurfang.sas.spark")
    .option("forceLowercaseNames", true)
    .option("inferLong", true)
    .load("cars.sas7bdat")
}
df.write.format("csv").option("header", "true").save("newcars.csv")
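
You can verify what the reader inferred by printing the schema; with inferLong enabled, numeric columns that contain only integral values should show up as longs rather than the default doubles:

// Inspect the inferred schema and preview a few rows.
df.printSchema()
df.show(5)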

You can also use the implicit readers:

import com.github.saurfang.sas.spark._

// DataFrameReader
val df = spark.read.sas("cars.sas7bdat")
df.write.format("csv").option("header", "true").save("newcars.csv")

// SQLContext
val df2 = sqlContext.sasFile("cars.sas7bdat")
df2.write.format("csv").option("header", "true").save("newcars.csv")

(Note: you cannot use parameters like inferLong with the implicit readers.)

Python API

df = spark.read.format("com.github.saurfang.sas.spark").load("cars.sas7bdat", forceLowercaseNames=True, inferLong=True)
df.write.csv("newcars.csv", header=True)

R API

df <- read.df("cars.sas7bdat", source = "com.github.saurfang.sas.spark", forceLowercaseNames = TRUE, inferLong = TRUE)
write.df(df, path = "newcars.csv", source = "csv", header = TRUE)

SQL API

SAS data can be queried in pure SQL by registering the data as a temporary view.

CREATE TEMPORARY VIEW cars
USING com.github.saurfang.sas.spark
OPTIONS (path "cars.sas7bdat")
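
Once registered, the view can be queried like any other table, for example from Scala:

// Count the rows in the registered view.
spark.sql("SELECT COUNT(*) FROM cars").show()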

SAS Export Runner

We include a simple SasExport Spark program that converts .sas7bdat files to .csv or .parquet:

sbt "run input.sas7bdat output.csv"
sbt "run input.sas7bdat output.parquet"

To achieve more parallelism, use the spark-submit script to run the program on a Spark cluster. If you don't have a Spark cluster, you can still run it in local mode and take advantage of multiple cores.
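
A cluster submission might look like the sketch below; the --class value and jar name here are assumptions, so check the built jar for the exact values:

# Hypothetical invocation: the --class value and jar name are assumptions.
spark-submit \
  --master spark://master:7077 \
  --packages saurfang:spark-sas7bdat:2.1.0-s_2.11 \
  --class com.github.saurfang.sas.SasExport \
  spark-sas7bdat.jar input.sas7bdat output.parquet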

Spark Shell

spark-shell --master local[4] --packages saurfang:spark-sas7bdat:2.1.0-s_2.11

Caveats

  1. spark-csv writes out null as "null" in CSV text output. If you read that file back, a string column will contain the literal string "null" rather than an actual null. The safest option is to export to parquet, where nulls are properly preserved (see the sketch below). See https://github.com/databricks/spark-csv/pull/147 for an alternative solution.
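
A minimal sketch of the parquet round trip (assuming df was read as in the examples above):

// Parquet keeps real nulls, so they are not turned into "null" strings on read-back.
df.write.mode("overwrite").parquet("newcars.parquet")
val back = spark.read.parquet("newcars.parquet")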

Related Work

Acknowledgements

This project would not be possible without parso's continued improvements and generous contributions from @mulya, @thesuperzapper, and many others.