The cloud-integration repository provides modules to improve Apache Spark's integration with cloud infrastructures.
spark-cloud-integration
Classes and tools to make Spark work better in the cloud
Configuration: class ConfigSerDeser. Use this to get a configuration into an RDD method.
HConf to manipulate the Hadoop options in a Spark config.
FileInputStream for cloud storage, org.apache.spark.streaming.cloudera.CloudInputDStream
cloud-examples
This module holds the packaging and integration tests for Spark in the cloud, run against AWS, Azure, and OpenStack.
These are basic tests of the core functionality of I/O and streaming, and they verify that the committers work in the presence of inconsistent object storage. As well as running as unit tests, they have CLI entry points which can be used for scalable functional testing.
minimal-integration-test
This is a minimal JAR for integration tests.
Usage
spark-submit --class com.cloudera.spark.cloud.integration.Generator \
--master yarn \
--num-executors 2 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1 \
minimal-integration-test-1.0-SNAPSHOT.jar \
adl://example.azuredatalakestore.net/output/dest/1 \
2 2 15
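The invocation above hard-codes the JAR name and the destination path. When the same test is pointed at several stores, a small wrapper script may be easier to reuse. The following is only a sketch: the variable names JAR, DEST, GENERATOR_ARGS, and DRY_RUN are hypothetical and not part of this repository, and the three trailing arguments are passed through unchanged because their meaning is not documented here. By default the script only prints the command, so it can be checked without a Spark installation.

```shell
#!/bin/sh
# Sketch of a wrapper around the Usage command above.
# JAR, DEST, GENERATOR_ARGS and DRY_RUN are hypothetical names chosen for
# this example; their defaults are copied from the Usage section.
set -eu

JAR="${JAR:-minimal-integration-test-1.0-SNAPSHOT.jar}"
DEST="${DEST:-adl://example.azuredatalakestore.net/output/dest/1}"
# Trailing positional arguments, forwarded as-is to the Generator class.
GENERATOR_ARGS="${GENERATOR_ARGS:-2 2 15}"

# Build the command as positional parameters so each argument stays intact.
# GENERATOR_ARGS is deliberately unquoted so it splits into separate words.
set -- spark-submit \
  --class com.cloudera.spark.cloud.integration.Generator \
  --master yarn \
  --num-executors 2 \
  --driver-memory 512m \
  --executor-memory 512m \
  --executor-cores 1 \
  "$JAR" \
  "$DEST" \
  $GENERATOR_ARGS

SUBMIT_CMD="$*"

if [ "${DRY_RUN:-1}" = "1" ]; then
  # Dry run (the default): print the command instead of executing it.
  echo "$SUBMIT_CMD"
else
  "$@"
fi
```

Setting DRY_RUN=0 runs the command for real. Overriding DEST in the environment retargets the test at a different store without editing the script.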