spark-hadoopoffice-ds


A Spark datasource for the HadoopOffice library. This Spark datasource assumes at least Spark 2.0.1 (but we recommend at least Spark 2.3.0) and Scala 2.11. Scala 2.12 is supported on Spark 2.4.0 and higher. However, the HadoopOffice library can also be used directly from Spark 1.x and/or Scala 2.10 (see the how-to section). Currently this datasource supports the Excel formats of the HadoopOffice library (the old binary .xls format and the new .xlsx format).

This datasource is available on Spark-packages.org and on Maven Central.

Find the status of the Continuous Integration service here: https://travis-ci.org/ZuInnoTe/spark-hadoopoffice-ds/

Release Notes

Find the latest release information here

Options

All options from the HadoopOffice library, such as metadata, encryption/decryption or low footprint mode, are supported.
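These library options are passed to the datasource as plain .option(...) calls. A minimal sketch follows; the key hadoopoffice.read.security.crypt.password used here for reading a password-protected workbook is an assumption, so please check the HadoopOffice documentation for the exact option keys:

val df = sparkSession.read
    .format("org.zuinnote.spark.office.excel")
    .option("read.locale.bcp47", "us")
    // assumed HadoopOffice option key for decrypting a protected file
    .option("hadoopoffice.read.security.crypt.password", "secret")
    .load("/home/user/office/input")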

Additionally, the following options exist:

The following options related to Spark exist in case you need to write rows containing primitive types. In this case, a default sheet name needs to be set:

Additionally, the following options of the standard Hadoop API are supported:

Dependency

A lot of options changed in version 1.2.0 to harmonize behavior with other Big Data platforms. Read the documentation carefully and test your application.

Scala 2.11

groupId: com.github.zuinnote

artifactId: spark-hadoopoffice-ds_2.11

version: 1.3.10

Scala 2.12

groupId: com.github.zuinnote

artifactId: spark-hadoopoffice-ds_2.12

version: 1.3.10

The Scala 2.12 version requires at least Spark 2.4.0.

Older Scala versions

Note: If you require Scala 2.10, then you cannot use this data source. However, you can use the Hadoop FileFormat of the latest HadoopOffice version directly, cf. an example for reading and writing.

Alternatively, you can use the older version 1.1.1 of this data source (not recommended; see its documentation). However, in this case you will miss features and bug fixes.
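If you choose the Hadoop FileFormat route mentioned above, a minimal Scala sketch could look like the following. The class names are taken from the HadoopOffice mapreduce API; treat the exact key/value types as assumptions and compare with the linked example:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{ArrayWritable, Text}
import org.zuinnote.hadoop.office.format.mapreduce.ExcelFileInputFormat

// configure HadoopOffice through the Hadoop configuration
val hadoopConf = new Configuration()
hadoopConf.set("hadoopoffice.read.locale.bcp47", "us")

// each record corresponds to one Excel row; the value is assumed to be an
// ArrayWritable containing the cells of that row
val excelRDD = sparkSession.sparkContext.newAPIHadoopFile(
    "/home/user/office/input",
    classOf[ExcelFileInputFormat],
    classOf[Text],
    classOf[ArrayWritable],
    hadoopConf)

println("Total number of rows in Excel: " + excelRDD.count)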

Schema

There are two different schemas that you can configure: the Excel cell schema and the simple schema.

The Excel cell schema is useful if you need detailed information about each cell, while the simple schema is useful if you only want to work with the data itself (e.g. doing calculations, filtering by date, etc.).

Excel Cell

An Excel file loaded into a DataFrame has the following schema: each row contains an array with all Excel cells of this row. For each cell, the following information is available: the formatted value (based on the locale), the comment, the formula, the address of the cell in A1 format and the name of the sheet to which the cell belongs.
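A minimal sketch to inspect this schema is shown below; the field names in the commented output are assumptions derived from the cell description above and may differ in detail:

val df = sparkSession.read
    .format("org.zuinnote.spark.office.excel")
    .option("read.locale.bcp47", "us")
    .load("/home/user/office/input")

df.printSchema()
// expected shape (field names assumed):
// root
//  |-- rows: array
//  |    |-- element: struct
//  |    |    |-- formattedValue: string
//  |    |    |-- comment: string
//  |    |    |-- formula: string
//  |    |    |-- address: string
//  |    |    |-- sheetName: string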

Develop

Reading

As you can see in the schema, the datasource reads each Excel row into an array. Each element of the array is a structure describing an Excel cell. This structure contains the formatted value (based on the locale), the comment, the formula, the address of the cell in A1 format and the name of the sheet to which the cell belongs. In Scala you can easily read Excel files using the following snippet (assuming US locale for the Excel file):

val sqlContext = sparkSession.sqlContext
val df = sqlContext.read
    .format("org.zuinnote.spark.office.excel")
    .option("read.locale.bcp47", "us")  
.load(args(0))

Find a full example here.

Another option is to infer the schema of primitive Spark SQL DataTypes automatically:

val sqlContext = sparkSession.sqlContext
val df = sqlContext.read
    .format("org.zuinnote.spark.office.excel")
    .option("read.locale.bcp47", "us").option("read.spark.simpleMode",true)  
.load(args(0))

This option can be combined with hadoopoffice.read.header.read to interpret the first row of the Excel file as the column names of the DataFrame.
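For example, a minimal sketch combining both options (assuming the first row of the Excel file contains the column names):

val df = sqlContext.read
    .format("org.zuinnote.spark.office.excel")
    .option("read.locale.bcp47", "us")
    .option("read.spark.simpleMode", true)
    // use the first Excel row as column names
    .option("hadoopoffice.read.header.read", true)
    .load("/home/user/office/input")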

Writing

There are two options for writing data to Excel files: you can write a DataFrame of primitive Spark SQL DataTypes (in this case a default sheet name needs to be set), or you can write a DataFrame that follows the Excel cell schema described above.

The second option is illustrated in the following snippet (assuming US locale for the Excel file). It creates a simple Excel document with four cells, stored in the sheet "Sheet1": A1 with value 1, A2 with value 2 and a comment, A3 with value 3, and B1 with the formula A2+A3. The resulting Excel file is stored in the directory /home/user/office/output.

import sparkSession.implicits._

// each cell: (formattedValue, comment, formula, address, sheetName)
val sRdd = sparkSession.sparkContext.parallelize(Seq(
    Seq("", "", "1", "A1", "Sheet1"),
    Seq("", "This is a comment", "2", "A2", "Sheet1"),
    Seq("", "", "3", "A3", "Sheet1"),
    Seq("", "", "A2+A3", "B1", "Sheet1"))).repartition(1)
val df = sRdd.toDF()
df.write
    .format("org.zuinnote.spark.office.excel")
    .option("write.locale.bcp47", "us")
    .save("/home/user/office/output")

Find a full example here.
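The first option, writing a DataFrame of primitive Spark SQL DataTypes, could look like this minimal sketch. The option key write.spark.defaultsheetname for setting the default sheet name is an assumption; check the options documentation for the exact name:

import sparkSession.implicits._

// a DataFrame with primitive columns
val df = Seq(("Anna", 30), ("Bob", 25)).toDF("name", "age")

df.write
    .format("org.zuinnote.spark.office.excel")
    .option("write.locale.bcp47", "us")
    // assumed option key for the sheet that primitive rows are written to
    .option("write.spark.defaultsheetname", "Sheet1")
    .save("/home/user/office/output")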

You can write with partitions as follows (as of v1.3.2). Let us assume you have an Excel file with Name, Year, Month and Day columns and you want to partition by Year, Month and Day. Then you need to use the following code:

df.toDF.write
    .partitionBy("year", "month", "day")
    .format("org.zuinnote.spark.office.excel")
    .option("write.locale.bcp47", "us")
    .save("/home/user/office/output")

This will create the following structure on HDFS (or the filesystem that is supported by Spark):

output/_SUCCESS
output/year=2018/month=1/day=1/part-00000.xlsx
output/year=2019/month=12/day=31/part-00000.xlsx
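To read such partitioned output back, the usual load call can be used; a minimal sketch, assuming Spark's standard partition discovery applies to this datasource:

val partitionedDf = sparkSession.read
    .format("org.zuinnote.spark.office.excel")
    .option("read.locale.bcp47", "us")
    .load("/home/user/office/output")

// year, month and day are reconstructed as partition columns
partitionedDf.printSchema()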

Language bindings

Scala

This example loads Excel documents from the folder "/home/user/office/input" using the Excel representation (format) and shows the total number of rows, the schema and the first 20 rows. The locale for formatting cell values is set to "us". Find a full example here.

val sqlContext = sparkSession.sqlContext
val df = sqlContext.read
    .format("org.zuinnote.spark.office.excel")
    .option("read.locale.bcp47", "us")  // example to set the locale to us
    .load("/home/user/office/input")
val totalCount = df.count
// print to screen
println("Total number of rows in Excel: " + totalCount)
df.printSchema
// print formattedValues
df.show

Java

This example loads Excel documents from the folder "/home/user/office/input" using the Excel representation (format) and shows the total number of rows, the schema and the first 20 rows. The locale for formatting cell values is set to "us".

SQLContext sqlContext = sparkSession.sqlContext();
Dataset<Row> df = sqlContext.read()
    .format("org.zuinnote.spark.office.excel")
    .option("read.locale.bcp47", "us")  // example to set the locale to us
    .load("/home/user/office/input");
long totalCount = df.count();
// print to screen
System.out.println("Total number of rows in Excel: " + totalCount);
df.printSchema();
// print formattedValues
df.show();

R

This example loads Excel documents from the folder "/home/user/office/input" using the Excel representation (format). The locale for formatting cell values is set to "us".

library(SparkR)

Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.github.zuinnote:spark-hadoopoffice-ds_2.12:1.3.10" "sparkr-shell"')
sqlContext <- sparkRSQL.init(sc)

df <- read.df(sqlContext, "/home/user/office/input", source = "org.zuinnote.spark.office.excel", "read.locale.bcp47" = "us")

Python

This example loads Excel documents from the folder "/home/user/office/input" using the Excel representation (format). The locale for formatting cell values is set to "us".

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.format('org.zuinnote.spark.office.excel').option('read.locale.bcp47', 'us').load('/home/user/office/input')

SQL

The following statement creates a table that contains Excel data in the folder /home/user/office/input. The locale for formatting cell values is set to "us".

CREATE TABLE ExcelData
USING  org.zuinnote.spark.office.excel
OPTIONS (path "/home/user/office/input", read.locale.bcp47 "us")
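The table can then be queried like any other Spark table, for example from Scala (a minimal sketch):

// query the table registered by the CREATE TABLE statement above
sparkSession.sql("SELECT * FROM ExcelData").show()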