Dec 28, 2016 Added date-only and date-only-iso as field types, such that only the date field is added to the document, without the _yy, _mm, _dd, _hh and _dow fields.
fields=date-only,a_date,6,yyyy-MM-dd HH:mm:ss,1970-01-01 00:00:00;\
date-only-iso,a_date_iso,6,yyyy-MM-dd HH:mm:ss,1970-01-01 00:00:00
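For illustration, here is the shape of a hypothetical output document for each type, using the field names from the sample above (the stored date format follows the configured date.pattern, and the values are illustrative):
date:      {"a_date": "1970-01-01 00:00:00", "a_date_yy": 1970, "a_date_mm": 1, "a_date_dd": 1, "a_date_hh": 0, "a_date_dow": 4}
date-only: {"a_date": "1970-01-01 00:00:00"}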
Dec 21, 2016 Added date.pattern as a property, with a default value of YYYY-MM-dd HH:mm:ss. This specifies the output date format, which has to match what is defined in the JSON mapping file.
For example:
"a_date": {
"type": "date",
"format": "YYYY-MM-dd'T'HH:mm:ss'Z'"
}
date.pattern=YYYY-MM-dd'T'HH:mm:ss'Z'
fields=date,a_date,6,yyyy-MM-dd HH:mm:ss,1970-01-01 00:00:00
The 6th field is parsed using the yyyy-MM-dd HH:mm:ss pattern and stored in ES using the YYYY-MM-dd'T'HH:mm:ss'Z' pattern.
If the input value is empty, then it will store the value 1970-01-01T00:00:00Z.
Note that what we define is the missing-input placeholder, onto which the date.pattern is applied before the resulting value is placed in the ES document.
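As a rough sketch of this parse-then-format round trip in Python (the loader itself uses Joda-Time patterns, so the pattern syntax below is only an approximation):
from datetime import datetime

# parse the CSV value using the input pattern of the field definition ...
parsed = datetime.strptime("2016-12-21 10:30:00", "%Y-%m-%d %H:%M:%S")
# ... then emit it using the equivalent of date.pattern
print(parsed.strftime("%Y-%m-%dT%H:%M:%SZ"))  # 2016-12-21T10:30:00Z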
Dec 20, 2016 Refactored the code to use field factories, eliminating the need for @transient fields. In addition, the field definition now supports a missing value as the last optional argument, such that if the parsed field is empty or has a value of 'null' or 'undefined', then the missing value is used as a placeholder value in the document (a sketch of the rule follows below). Still TODO: Implement grid and date-time.
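A minimal sketch of the missing-value rule, in Python for illustration (the project itself is Scala, so this is not the actual code):
def field_value(raw, missing):
    # empty, 'null' and 'undefined' inputs fall back to the configured placeholder
    return missing if raw in ("", "null", "undefined") else raw

assert field_value("", "1970-01-01 00:00:00") == "1970-01-01 00:00:00"
assert field_value("null", "-99") == "-99"
assert field_value("2016-12-20 08:00:00", "-99") == "2016-12-20 08:00:00"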
The following is a sample property file:
spark.app.id=CSV TO ES sample
spark.ui.enabled=false
es.nodes=${ES_PORT_9200_TCP_ADDR}
header.count=1
index.mapping=sample/geo
error.exception=false
hex.x=-20000000.0
hex.y=-20000000.0
hex.sizes=10,10;25,25;50,50;100,100;200,200;500,500
input.path=hdfs:///user/root/sample.csv
field.sep=,
fields=geo,loc,1,2;\
text,id,0;\
text,a_text,3,N/A;\
nume,a_nume,4,-99;\
real,a_real,5,-99.9;\
date,a_date,6,yyyy-MM-dd HH:mm:ss,1970-01-01 00:00:00
Aug 24, 2016 Please start using date-iso rather than date, as it stores the date values in ES in ISO 8601 format.
This project is divided into 2 sub-projects:
This sub-project is a Spark-based job to bulk load spatial and temporal point data from one or more CSV files into Elasticsearch.
The loader expands each date field into individual year, month, day, hour and day-of-week fields, and the spatial lat/lon fields are converted to geo_point fields. In addition, the lat/lon fields are expanded into x/y web mercator fields and into one or more hexagonal row/column indexes at different sizes.
The intent of these expansions is to take advantage of the Elasticsearch aggregation framework
to query and map the density of the imported documents for better insight.
In time, documents can be aggregated by year, month, day, hour and day of week properties.
In space, documents can be aggregated by their hexagonal index properties.
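For example, here are two hypothetical queries using the Elasticsearch SQL plugin described later in this document (the index/mapping and field names come from the samples below):
SELECT a_date_dow,count(a_date_dow) FROM myindex/mymapping GROUP BY a_date_dow
SELECT loc_1K,count(loc_1K) FROM myindex/mymapping GROUP BY loc_1K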
The job is configured using a command line parameter file in properties format.
The following is a sample with the necessary keys:
#
# Spark master - http://spark.apache.org/docs/latest/submitting-applications.html#master-urls
#
spark.master=local[*]
#
# Elasticsearch hosts
#
es.nodes=localhost
#
# Optional field to be used as the document identifier
#
es.mapping.id=object_id
#
# The index and mapping names
#
index.mapping=myindex/mymapping
#
# Specify input coordinate spatial reference
# For example: http://spatialreference.org/ref/epsg/wgs-84/
#
input.epsg=EPSG:4326
#
# The CSV input path
#
input.path=/a/path/to/points.csv
#
# Number of header lines to skip
#
header.count=0
#
# Throw an exception on parsing error
#
error.exception=true
#
# Field separator - default is tab
# If the value starts with 0 then it is converted to a hex number and then to a char
# (for example, 09 yields the tab character)
#
field.sep=,
#
# The hexagon grid origin in mercator values - default are (0.0,0.0)
# Set values to -20000000.0 to make the row and col hex cell values be positive
#
hex.x=0.0
hex.y=0.0
#
# Geo field x/y bounding values. Considered inside when xmin <= geo.x <= xmax and ymin <= geo.y <= ymax
#
xmin=-180.0
ymin=-90.0
xmax=180.0
ymax=90.0
#
# The hexagon indexes to generate for each geo type field.
#
# format: label1,size1[;label2,size2] ...
#
hex.sizes=\
50,50.0;\
100,100.0;\
1K,1000.0
#
# The fields to import, and how to interpret each field.
#
# format: type,property,csv_index[,additional type configuration]
#
# Note: csv_index is zero based !
#
# types:
#
# oid - optional document identifier field
# geo - the last two parameters indicate the lon and lat field indexes in the csv file.
# date-iso - the last parameter indicates how to parse the csv date field.
# grid - field name, lon index, lat index, grid size in meters
#
fields=\
oid,object_id,0;\
geo,loc,1,2;\
grid,gid,1,2,1000;\
date-iso,a_date,3,YYYY-MM-dd HH:mm:ss;\
string,a_string,4;\
int,a_int,4;\
float,a_float,5
The fields of type date-iso will be expanded to additional properties of type integer, named by suffixing the field name with _yy, _mm, _dd, _hh and _dow (for example, a_date_dow).
The date-iso fields are parsed using Joda-Time and should be mapped to the following ES property as an example:
"a_date": {
"type": "date",
"format": "date_optional_time"
}
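The expanded companion properties can then be mapped as integers following the convention above, for example:
"a_date_yy": {
    "type": "integer"
},
"a_date_dow": {
    "type": "integer"
}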
The fields of type geo will be expanded to additional properties: the field name suffixed with _xm and _ym for the web mercator x/y values, and the field name suffixed with each configured hexagon label for the hexagonal row/column indexes.
For the above sample, the following properties will be created: loc_xm, loc_ym, loc_50, loc_100 and loc_1K.
The fields of type grid will be expanded to additional properties.
For the above sample, the following properties will be created: gid_xm, gid_ym and gid_g.
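Putting it together, a hypothetical document for the sample fields above could look like this (the values are illustrative only, not computed):
{
  "object_id": "12345",
  "loc": "38.9,-77.0",
  "loc_xm": -8575000,
  "loc_ym": 4705000,
  "loc_50": "123:456",
  "loc_100": "61:228",
  "loc_1K": "6:22",
  "gid_xm": -8575000,
  "gid_ym": 4705000,
  "gid_g": "61:228"
}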
Make sure to first clone and install the following projects:
$ mvn install
To build the project to be used in the Cloudera Docker Container
mvn -P1.6.0-cdh5.7.0 clean package
To build the project to use spark-1.4.1-bin-hadoop2.6
mvn -Pspark-1.4.1-bin-hadoop2.6 clean package
This will create the Spark job spark-csv-es-[VERSION].jar in the target folder.
The following is a simple example on how to bulk load The District Of Columbia 311 Service Requests.
I downloaded the data in CSV format from 01/01/2010 to 12/31/2010.
This is not a big dataset (about 430,000 records), but it is a nice one to start with, as each record has a couple of dates, times, a latitude and a longitude value.
The file is henceforth named DC_311_2010.csv.
Download and install Elasticsearch.
Note: Before starting the server, make sure to:
- Export ES_HEAP_SIZE to set both the minimum and maximum memory to allocate.
- Set network.host in config/elasticsearch.yml. This host will be henceforth known as ${ES_HOST}.
Kopf is a web admin interface for Elasticsearch.
$ bin/plugin --install lmenezes/elasticsearch-kopf/master
$ open http://${ES_HOST}:9200/_plugin/kopf
This plugin enables the query of Elasticsearch indexes using SQL.
$ bin/plugin -u https://github.com/NLPchina/elasticsearch-sql/releases/download/1.3.4/elasticsearch-sql-1.3.4.zip --install sql
$ open http://${ES_HOST}:9200/_plugin/sql
Note: Had to restart the server for this plugin to work - not sure why!
Since the downloaded CSV file has a header to describe the values in each record,
we will use the header values to generate an initial .properties
file using the gen-properties.py
script.
$ gen-properties.py DC_311_2010.csv
The script reads the first line of the DC_311_2010.csv file and generates a sibling file named DC_311_2010.properties, where all the document properties are assumed to be of type string.
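For example, assuming the CSV header includes SERVICECODEDESCRIPTION and SERVICEDUEDATE columns, the generated file will contain entries along these lines (the column indexes here are hypothetical):
fields=\
string,servicecodedescription,5;\
string,serviceduedate,6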
Scanning a sample of the input data, we can recognize date formatted values, integers, floats and geo fields.
Check out the adjustments in the file data/DC_311_2010.properties
.
Note to myself: One of these days, have to integrate a 'smart' value recognizer, rather than doing this by hand. After all, that is what computers are for :-)
A mapping file can be generated from the .properties file using the gen-mapping.py script.
Make sure to first install the pyjavaproperties python module.
$ pip install pyjavaproperties
$ gen-mapping.py DC_311_2010.properties
Note: You might want to modify the number_of_shards and number_of_replicas values, as they are set for a development environment with values of 1 and 0 respectively.
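For reference, the relevant portion of the generated mapping file looks along these lines:
{
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    ...
}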
$ curl -XDELETE "${ES_HOST}:9200/dc";curl -XPOST "${ES_HOST}:9200/dc" -d @DC_311_2010.json
The ESToolbox.pyt toolbox contains the CreateIndexTool tool to create an ES index.
Make sure to first install the elasticsearch-py python module.
$ pip install elasticsearch
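If you are not an ArcGIS user, here is a minimal sketch of an equivalent index creation using the elasticsearch-py module (the host and file names are assumptions based on this walkthrough):
import json
from elasticsearch import Elasticsearch

es = Elasticsearch(hosts=["localhost:9200"])  # substitute your ${ES_HOST}
with open("DC_311_2010.json") as f:
    body = json.load(f)
es.indices.delete(index="dc", ignore=404)  # drop the index from a previous run, if any
es.indices.create(index="dc", body=body)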
Download and install Apache Spark, and make sure the Spark bin path is appended to the value of your PATH environment variable.
Bulk load the CSV by submitting the following spark job:
$ spark-submit spark-csv-es-[VERSION].jar DC_311_2010.properties
The data shards and replications can be seen in the kopf web dashboard:
The data can be queried using the SQL plugin:
SELECT loc, loc_xm, loc_ym FROM dc/sr311 limit 32
Query the geo locations and associated web mercator values for the first 32 documents.
SELECT servicecodedescription,count(servicecodedescription) FROM dc/sr311 where serviceduedate_dow=1 group by servicecodedescription
Count, by service code description, the service requests that are due on a Monday (dow=1).
SELECT loc_100,count(loc_100) FROM dc/sr311 where serviceduedate_dow=1 and loc_100 <> '0:0' group by loc_100
Find the population density of service requests that have a due date on a Monday at a 100 meter resolution.
The loc_100 value is a composite of the row and column of a 100 meter hexagon centroid.
The following section will describe how to convert the centroid row column values into a polygon feature to be displayed on a map.
To view the documents, we will query Elasticsearch using an ArcPy tool. The tool submits SQL queries to the SQL plugin rest endpoint (http://${ES_HOST}:9200/_sql) and the resulting documents are converted to features in an in-memory feature set. The feature set is defined as an output parameter to the tool and will be implicitly added to the map as a layer.
There are 2 tools in the ESToolbox.pyt toolbox:
This tool enables the user to define a where clause to filter the service requests on the server side, and converts the resulting hits into point features.
Note: there is a hard limit of 100,000 resulting documents in the default implementation.
This tool enables the user to define a where clause to filter the service requests on the server side and a hexagonal clustering size.
The resulting hits are converted to features with hexagon-shaped polygons for geometry and the number of service requests that fall into each hexagon's area as an attribute.
There is no spatial point-in-polygon operation in the above query. This is a pure group by SQL operation, because all the service requests have been preprocessed at load time with the hexagonal indexes that they fall into, and these indexes have been added as searchable properties in the document, enabling fast aggregation based on a user-specified index.
What is actually returned by Elasticsearch is the hexagon index in the form of a row and column. The conversion from row/column to an array of coordinates that make up the polygon shape is done on the client side at runtime (see the sketch below). Returning indexes massively reduces the network payload and makes the server do less work, because the polygon coordinate generation is offloaded to the client. This is cooperative processing between the client and the server, a term I used to use 20 years ago on SDE when I was a specialist on it!
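A hedged sketch of that client-side conversion in Python; the row/col-to-centroid math below assumes a flat-topped, odd-column-offset hex layout, and must be adjusted to match the loader's actual grid scheme and origin:
import math

def hex_polygon(row, col, size, orig_x=0.0, orig_y=0.0):
    # centroid of the hex cell in web mercator meters (flat-topped layout)
    cx = orig_x + col * size * 1.5
    cy = orig_y + (row + (0.5 if col % 2 else 0.0)) * size * math.sqrt(3.0)
    # the six vertices around the centroid, 60 degrees apart
    return [(cx + size * math.cos(math.radians(a)),
             cy + size * math.sin(math.radians(a)))
            for a in range(0, 360, 60)]

# e.g. a loc_100 value of '61:228' parses to row=61, col=228 at size=100
row, col = (int(t) for t in "61:228".split(":"))
print(hex_polygon(row, col, 100.0, -20000000.0, -20000000.0))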