Spark SAS Data Source (sas7bdat)

A library for reading SAS data (.sas7bdat) with Spark.


Requirements:

The examples below use the spark-sas7bdat 2.1.0-s_2.11 artifact, which targets Spark 2.x and Scala 2.11.

Download:

The latest jar can be downloaded from Spark Packages (https://spark-packages.org).
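
If you build with sbt, coordinates along the following lines should work. This is a sketch: the resolver URL and coordinates are assumptions based on the package name above, so verify them against the Spark Packages listing.

// build.sbt -- hypothetical coordinates; verify against the Spark Packages listing.
resolvers += "Spark Packages Repo" at "https://repos.spark-packages.org/"
libraryDependencies += "saurfang" % "spark-sas7bdat" % "2.1.0-s_2.11"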

Features:

NOTE: this package is read-only; it does not support writing .sas7bdat files.

Docs:

Parameters:

  - forceLowercaseNames: when true, all column names are converted to lowercase.
  - inferLong: when true, numeric columns that contain only integral values are read as longs; otherwise all SAS numerics are read as doubles.

Scala API

val df = {
  spark.read
    .format("com.github.saurfang.sas.spark")
    .option("forceLowercaseNames", true)
    .option("inferLong", true)
    .load("cars.sas7bdat")
}
df.write.format("csv").option("header", "true").save("newcars.csv")
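
You can verify what the reader inferred by printing the schema; with inferLong enabled, numeric columns that contain only integral values should show up as longs rather than the default doubles:

// Inspect the inferred schema and preview a few rows.
df.printSchema()
df.show(5)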

You can also use the implicit readers:

import com.github.saurfang.sas.spark._

// DataFrameReader
val df = spark.read.sas("cars.sas7bdat")
df.write.format("csv").option("header", "true").save("newcars.csv")

// SQLContext
val df2 = sqlContext.sasFile("cars.sas7bdat")
df2.write.format("csv").option("header", "true").save("newcars.csv")

(Note: you cannot use parameters like inferLong with the implicit readers.)

Python API

df = spark.read.format("com.github.saurfang.sas.spark").load("cars.sas7bdat", forceLowercaseNames=True, inferLong=True)
df.write.csv("newcars.csv", header=True)

R API

df <- read.df("cars.sas7bdat", source = "com.github.saurfang.sas.spark", forceLowercaseNames = TRUE, inferLong = TRUE)
write.df(df, path = "newcars.csv", source = "csv", header = TRUE)

SQL API

SAS data can be queried in pure SQL by registering the data as a temporary view.

CREATE TEMPORARY VIEW cars
USING com.github.saurfang.sas.spark
OPTIONS (path "cars.sas7bdat")
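
Once registered, the view can be queried like any other table, for example from Scala:

// Count the rows in the registered view.
spark.sql("SELECT COUNT(*) FROM cars").show()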

SAS Export Runner

We include a simple SasExport Spark program that converts .sas7bdat files to .csv or .parquet:

sbt "run input.sas7bdat output.csv"
sbt "run input.sas7bdat output.parquet"

To achieve more parallelism, use the spark-submit script to run the program on a Spark cluster. If you don't have a Spark cluster, you can still run it in local mode and take advantage of multiple cores.
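
A cluster submission might look like the sketch below; the --class value and jar name here are assumptions, so check the built jar for the exact values:

# Hypothetical invocation: the --class value and jar name are assumptions.
spark-submit \
  --master spark://master:7077 \
  --packages saurfang:spark-sas7bdat:2.1.0-s_2.11 \
  --class com.github.saurfang.sas.SasExport \
  spark-sas7bdat.jar input.sas7bdat output.parquet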

Spark Shell

spark-shell --master local[4] --packages saurfang:spark-sas7bdat:2.1.0-s_2.11

Caveats

  1. spark-csv writes out null as "null" in CSV text output. If you read that file back, a string column will contain the literal string "null" rather than an actual null. The safest option is to export to parquet, where nulls are properly preserved (see the sketch below). See https://github.com/databricks/spark-csv/pull/147 for an alternative solution.
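
A minimal sketch of the parquet round trip (assuming df was read as in the examples above):

// Parquet keeps real nulls, so they are not turned into "null" strings on read-back.
df.write.mode("overwrite").parquet("newcars.parquet")
val back = spark.read.parquet("newcars.parquet")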

Related Work

Acknowledgements

This project would not be possible without parso's continued improvements and generous contributions from @mulya, @thesuperzapper, and many others.