spark-fixedwidth

Fixed-width data source for Spark SQL and DataFrames. Based on (and uses) databricks/spark-csv.

Requirements

This library requires Spark 1.3+ and Scala 2.11+

Building

Run sbt assembly from inside the root directory to generate a JAR
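
For example, from the repository root:

sbt assembly

The JAR is written under target/scala-2.11/ (the same path used in the spark-shell example below).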

Running / Using

In the Spark Shell

./bin/spark-shell --jars <PATH_TO>/spark-fixedwidth/target/scala-2.11/spark-fixedwidth-assembly-1.0.jar

In another project

Add the JAR to your project's lib directory and sbt will include it on the classpath for you, as sketched below.
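
For example, with sbt's default unmanaged-dependency layout (the project layout below is illustrative):

your-project/
    build.sbt
    lib/
        spark-fixedwidth-assembly-1.0.jar

sbt automatically adds any JAR in lib/ to the classpath.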

Features

This package allows reading fixed-width files on a local or distributed filesystem as Spark DataFrames. When reading files, the API accepts several options (useHeader, inferSchema, mode, comment, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace), demonstrated in the examples below:

Scala API

Spark 1.4+:

See the sample fixed-width files in the repository.
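
For illustration, a row layout matching the widths used below (3, 10, 5, 4) might look like this; the values here are made up, and the repo's sample files are the authoritative reference:

1  apple     TRUE 1.25
2  grape     FALSE0.50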

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType}
import com.quartethealth.spark.fixedwidth.FixedwidthContext

val fruitSchema = StructType(Seq(
    StructField("val", IntegerType),
    StructField("name", StringType),
    StructField("avail", StringType),
    StructField("cost", DoubleType)
))

val sqlContext = new SQLContext(sc) // sc is predefined in the Spark shell
val fruitWidths = Array(3, 10, 5, 4)
val fruit_resource = "fruit_fixedwidths.txt"

val result = sqlContext.fixedFile(
    fruit_resource,
    fruitWidths,
    fruitSchema,
    useHeader = false
)
result.show() // Prints the top 20 rows in tabular format
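
The result is an ordinary DataFrame; for example, it can be queried through the standard Spark 1.x temp-table API (the table name fruit is arbitrary; column names come from fruitSchema above):

result.registerTempTable("fruit")
sqlContext.sql("SELECT name, cost FROM fruit WHERE cost > 1.0").show()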

// Example without an explicit schema, showing extra options
val fruit_resource = "fruit_w_headers_fixedwidths.txt"
val result = sqlContext.fixedFile(
    fruit_resource,
    fruitWidths,
    useHeader = true,
    inferSchema = true,
    mode = "DROPMALFORMED",
    comment = '/',
    ignoreLeadingWhiteSpace = true,
    ignoreTrailingWhiteSpace = false
)
result.collect() // Returns an array containing all Rows in this DataFrame
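
Since the schema is inferred here rather than declared, printSchema (standard DataFrame API, not specific to this package) is a quick way to verify the column names and types derived from the header row:

result.printSchema() // Prints the schema to the console in a tree format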