Spark GDB Library

A library for parsing and querying an Esri File Geodatabase with Apache Spark.

This work in progress is a pure Scala, read-only implementation based on a reverse-engineered specification of the file format. Understanding the internal file structure enables partitioning for massively parallel reading. The reading API is based on the Hadoop FileSystem API, enabling the placement of the GDB in HDFS or S3 (not tested) for multi-node access. There is still a lot to be done, but this is a good start. Eventually, I will merge this project with my Ibn Battuta Project.
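For example, because the reader goes through the Hadoop FileSystem abstraction, the same gdbFile call shown later in this README for local paths should in principle also accept an S3 location once an S3 filesystem implementation (e.g. s3a) is configured on the classpath. This is untested, and the bucket name below is purely illustrative:

import com.esri.gdb._
// Assumed, untested: read the Points table of a GDB stored in S3 with 8 partitions.
sc.gdbFile("s3a://some-bucket/Test.gdb", "Points", numPartitions = 8).count()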

TODO (in no particular order)

Update (Jan 7, 2016): for Polylines and Polygons, I am mimicking the core Spark Vector class; in a way, they have almost the same semantics.
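As a rough sketch of that idea (the field names below are illustrative assumptions, not necessarily the actual com.esri.udt API), a Vector-like polyline value reduces to a bounding box plus flat coordinate arrays:

// Illustrative sketch only -- names are assumptions, not the real API.
// Like Spark's dense Vector, the geometry is carried as flat arrays of doubles.
case class PolylineSketch(
  xmin: Double, ymin: Double, xmax: Double, ymax: Double, // bounding box
  partOffsets: Array[Int],                                // start index of each path
  xyArr: Array[Double]                                    // interleaved x0,y0,x1,y1,...
)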

Building From Source

The build process for this project is based on Apache Maven:

export SPARK_LOCAL_IP=localhost
mvn install

The test data in src/test/resources/Test.gdb was generated using the ArcPy tool src/test/python/TestToolbox.pyt.

Though the coordinates of the shapes are random, the coordinate values are also stored as attributes for testing. In the case of the Points feature class, the x/y coordinate values should match the values in the X and Y attributes, enabling cross-checking during testing.
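For example, a quick cross check (a sketch using the gdbFile API shown below, assuming the X and Y attributes are doubles) could look like:

import com.esri.gdb._
import com.esri.udt.PointType

// Sketch: verify that each point's geometry matches its X/Y attributes.
sc.gdbFile("src/test/resources/Test.gdb", "Points", numPartitions = 2).foreach(row => {
  val point = row.getAs[PointType](row.fieldIndex("Shape"))
  val x = row.getAs[Double](row.fieldIndex("X"))
  val y = row.getAs[Double](row.fieldIndex("Y"))
  assert(math.abs(point.x - x) < 1e-6 && math.abs(point.y - y) < 1e-6)
})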

Using with Spark shell

$SPARK_HOME/bin/spark-shell --packages com.esri:spark-gdb:0.7
import com.esri.gdb._
import com.esri.udt._
sc.gdbFile("src/test/resources/Test.gdb", "Points", numPartitions = 2).map(row => {
  row.getAs[PointType](row.fieldIndex("Shape"))
}).foreach(println)
val df = sqlContext.read.
    format("com.esri.gdb").
    option("path", "src/test/resources/Test.gdb").
    option("name", "Points").
    load()
df.printSchema()
df.registerTempTable("points")
sqlContext.sql(s"select * from points").show()

Using UDT and UDFs

For the spatial UDTs (User Defined Types), I am following the VectorUDT implementation.
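For reference, a minimal Spark 1.x UDT in that style might look roughly like the sketch below; this is a simplified illustration for a point type, not the actual com.esri.udt implementation:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types._

// Simplified sketch of a point UDT following the VectorUDT pattern (not the real implementation).
class PointSketchUDT extends UserDefinedType[PointType] {
  // Catalyst representation: a struct of two doubles.
  override def sqlType: DataType = StructType(Seq(
    StructField("x", DoubleType, nullable = false),
    StructField("y", DoubleType, nullable = false)))

  override def serialize(obj: Any): Any = obj match {
    case PointType(x, y) => InternalRow(x, y)
  }

  override def deserialize(datum: Any): PointType = datum match {
    case row: InternalRow => PointType(row.getDouble(0), row.getDouble(1))
  }

  override def userClass: Class[PointType] = classOf[PointType]
}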

In Scala:

val df = sqlContext.read.format("com.esri.gdb")
  .option("path", path)
  .option("name", name)
  .option("numPartitions", "1")
  .load()

df.printSchema()
df.registerTempTable(name)

sqlContext.udf.register("getX", (point: PointType) => point.x)
sqlContext.udf.register("getY", (point: PointType) => point.y)
sqlContext.udf.register("plus2", (point: PointType) => PointType(point.x + 2, point.y + 2))

sqlContext.sql(s"select getX(plus2(Shape)),getX(Shape) as y from $name")
  .show(20)

In Python:

# DoubleType is part of PySpark; PointType and PointUDT are assumed to be
# provided by the spark-gdb Python bindings.
from pyspark.sql.types import DoubleType

df = sqlContext.read \
    .format("com.esri.gdb") \
    .options(path="../../test/resources/Test.gdb", name=gdb_name, numPartitions="1") \
    .load()

df.printSchema()

df.registerTempTable(gdb_name)

sqlContext.registerFunction("getX", lambda p: p.x, DoubleType())
sqlContext.registerFunction("getY", lambda p: p.y, DoubleType())
sqlContext.registerFunction("plus2", lambda p: PointType(p.x + 2, p.y + 2), PointUDT())

rows = sqlContext.sql("select plus2(Shape),X,Y from {}".format(gdb_name))
for row in rows.collect():
    print(row)

Testing In HDFS (Yet Another Excuse To Use Docker :-)

We will use Docker to bootstrap a Cloudera quickstart container instance.

I highly recommend the installation of the Docker Toolbox for a Docker quick start.

Compile the project with the quickstart profile:

mvn -Pquickstart clean package

Create a local Docker-enabled machine with 6 cores, 8 GB of RAM, and 10 GB of virtual disk:

docker-machine create\
 --driver virtualbox\
 --virtualbox-cpu-count 6\
 --virtualbox-memory 8192\
 --virtualbox-disk-size 10240\
 --virtualbox-no-vtx-check\
 quickstart

On Windows, I had to upgrade my machine instance:

docker-machine upgrade quickstart

Set up the docker environment:

eval $(docker-machine env quickstart)

Start a single-node Hadoop instance with ZooKeeper, HDFS, YARN and Spark. (This is a 4 GB download, so go grab some coffee and walk your dog; it is gonna take a while! But you only have to do it once. Ah... the beauty of Docker images :-) We expose the ports for Cloudera Manager (7180), Hue (8888) and the NameNode (50070). To facilitate moving jars and test data from the host into the container, we map the /Users host folder onto the container's /Users folder.

docker run\
 --privileged=true\
 --hostname=quickstart.cloudera\
 -v /Users:/Users\
 -p 7180:7180\
 -p 8888:8888\
 -p 50070:50070\
 -p 50075:50075\
 -p 21050:21050\
 -t -i cloudera/quickstart:latest\
 /usr/bin/docker-quickstart

Copy Test.gdb to HDFS:

hadoop fs -mkdir /data
hadoop fs -put /Users/<YOUR_PATH>/spark-gdb/src/test/resources/Test.gdb /data

Start a Spark shell:

spark-shell --jars /Users/<YOUR_PATH>/spark-gdb/target/spark-gdb-0.7.jar

Submit a Spark Context job:

import com.esri.gdb._
import com.esri.udt.PointType
sc.gdbFile("hdfs:///data/Test.gdb", "Points", numPartitions = 2).map(row => {
  row.getAs[PointType](row.fieldIndex("Shape"))
}).foreach(println)

Submit a SQL Context job:

val df = sqlContext.read.
    format("com.esri.gdb").
    option("path", "hdfs:///data/Test.gdb").
    option("name", "Lines").
    option("numPartitions", "2").
    load()
df.registerTempTable("lines")
sqlContext.sql("select * from lines").show()

Notes to self

Set the terminal type on Windows to enable cursor movement:

set term=ansi

Start a Cloudera Manager (CM) instance:

/home/cloudera/cloudera-manager --express