geomesa-convert, found in the source distribution, is a configurable and
extensible library for converting data into SimpleFeatures.
Converters for various data formats can be configured and instantiated
using the SimpleFeatureConverters factory and a target SimpleFeatureType.
The currently available converters are:
- delimited text
- fixed width
- avro
- json
- xml
The converter allows the specification of fields extracted from the data
and transformations on those fields. The transformation syntax is very
much like awk's. Fields with names that correspond to attribute
names in the SimpleFeatureType
will be directly populated in the
result SimpleFeature. Fields that do not align with attributes in the
SimpleFeatureType
are assumed to be intermediate fields used for
deriving attributes. Fields can reference other fields by name for
building up complex attributes.
Suppose you have a SimpleFeatureType
with the following schema:
phrase:String,dtg:Date,geom:Point:srid=4326
and comma-separated data
as shown below.
first,hello,2015-01-01T00:00:00.000Z,45.0,45.0
second,world,2015-01-01T00:00:00.000Z,45.0,45.0
The first two fields should be concatenated together to form the phrase,
the third field should be parsed as a date, and the last two fields
should be formed into a Point
geometry. The following configuration
file defines an appropriate converter for taking this csv data and
transforming it into our SimpleFeatureType.
{
  type     = "delimited-text",
  format   = "CSV",
  id-field = "md5($0)",
  user-data = {
    // note: keys will be treated as strings and should not be quoted
    my.user.key = "$phrase"
  }
  fields = [
    { name = "phrase", transform = "concatenate($1, $2)" },
    { name = "lat",    transform = "$4::double" },
    { name = "lon",    transform = "$5::double" },
    { name = "dtg",    transform = "dateHourMinuteSecondMillis($3)" },
    { name = "geom",   transform = "point($lon, $lat)" }
  ]
}
The id
of the SimpleFeature
is formed from an md5 hash of the
entire record ($0
is the original data). The simple feature attributes
are created from the fields
list with appropriate transforms (note the
use of intermediate fields 'lat' and 'lon'). If desired, user data for the
feature can be set by referencing fields. This can be used for setting
Accumulo visibility constraints, among other things (see :ref:`accumulo_visibilities`).
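To make the field pipeline concrete, here is a small Python sketch that mirrors what the converter config above does to one input record. This is purely an illustration of the semantics (field references, intermediate fields, md5 id), not GeoMesa code; the `convert` function and its return shape are hypothetical.

```python
import hashlib

def convert(record):
    # Conceptual simulation of the converter config above; not GeoMesa code.
    # $0 is the raw record, $1..$n are the delimited fields.
    fields = record.split(",")
    phrase = fields[0] + fields[1]                          # concatenate($1, $2)
    lat = float(fields[3])                                  # $4::double (intermediate field)
    lon = float(fields[4])                                  # $5::double (intermediate field)
    geom = (lon, lat)                                       # point($lon, $lat)
    fid = hashlib.md5(record.encode("utf-8")).hexdigest()   # md5($0)
    return {"id": fid, "phrase": phrase, "dtg": fields[2], "geom": geom}

feature = convert("first,hello,2015-01-01T00:00:00.000Z,45.0,45.0")
```

Note how the intermediate values (lat, lon) are computed first and then referenced by the geometry, just as `$lat` and `$lon` are referenced in the config.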
Provided transformation functions are listed below.
try
stripQuotes
length
trim
capitalize
lowercase
uppercase
regexReplace
concatenate
substring
toString
now
date
dateTime
basicDate
basicDateTime
basicDateTimeNoMillis
dateHourMinuteSecondMillis
millisToDate
secsToDate
point
linestring
polygon
geometry
stringToBytes
md5
uuid
base64
::int or ::integer
::long
::float
::double
::boolean
::r
stringToInt or stringToInteger
stringToLong
stringToFloat
stringToDouble
stringToBoolean
parseList
parseMap
You can define functions using scripting languages that support JSR-223.
This is currently tested with JavaScript only, as it is natively
supported in all JREs via the Nashorn engine. To define a JavaScript
function for use in the converter framework, either put the file in
geomesa-convert-scripts
on the classpath or set the system property
geomesa.convert.scripts.path
to be a comma-separated list of paths
to load functions from. Then, any function you define in a file in one
of those paths will be available in a convert definition with a
namespace prefix. For instance, if you have defined a function such as
function hello(s) {
return "hello: " + s;
}
you can reference that function in a transform expression as
js:hello($2)
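The scripted function itself can be sanity-checked in any plain JavaScript engine before wiring it into a converter; this snippet only assumes the `hello` function defined above:

```javascript
// Identical to the scripted function above; the converter would invoke it
// as js:hello($2), here we simply call it directly.
function hello(s) {
  return "hello: " + s;
}

hello("world"); // "hello: world"
```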
Most of the basic CQL functions are available as transformations. To use
one, invoke it like a regular function, prefixed with the cql
namespace. For example, you can use the CQL buffer function to turn a
point into a polygon:
cql:buffer($1, 2.0)
For more information on the various CQL functions, see the GeoServer filter function reference.
See the Parsing JSON and Parsing Avro sections below for format-specific parsing.
Description: Execute another function - if it fails, instead use a default value
Usage: try($1, $2)
Example: try("1"::int, 0) = 1
Example: try("abcd"::int, 0) = 0
Description: Remove double quotes from a string.
Usage: stripQuotes($1)
Example: stripQuotes('fo"o') = foo
Description: Returns the length of a string.
Usage: length($1)
Example: length('foo') = 3
Description: Trim whitespace from around a string.
Usage: trim($1)
Example: trim(' foo ') = foo
Description: Capitalize a string.
Usage: capitalize($1)
Example: capitalize('foo') = Foo
Description: Lowercase a string.
Usage: lowercase($1)
Example: lowercase('FOO') = foo
Description: Uppercase a string.
Usage: uppercase($1)
Example: uppercase('foo') = FOO
Description: Replace a given pattern with a target pattern in a string.
Usage: regexReplace($regex, $replacement, $1)
Example: regexReplace('foo'::r, 'bar', 'foobar') = barbar
Description: Concatenate two strings.
Usage: concatenate($0, $1)
Example: concatenate('foo', 'bar') = foobar
Description: Return the substring of a string.
Usage: substring($1, $startIndex, $endIndex)
Example: substring('foobarbaz', 2, 5) = oba
Description: Convert another data type to a string.
Usage: toString($0)
Example: concatenate(toString(5), toString(6)) = '56'
Description: Use the current system time.
Usage: now()
Description: Custom date parser.
Usage: date($format, $1)
Example:
date('YYYY-MM-dd\'T\'HH:mm:ss.SSSSSS', '2015-01-01T00:00:00.000000')
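The pattern syntax follows Joda-Time/Java date formats. As a rough cross-check (an illustration only, not GeoMesa code), the same input string parses with the equivalent strptime pattern in Python:

```python
from datetime import datetime

# The Joda-style pattern YYYY-MM-dd'T'HH:mm:ss.SSSSSS corresponds roughly to
# the strptime pattern below; %f consumes the six fractional-second digits.
dtg = datetime.strptime("2015-01-01T00:00:00.000000", "%Y-%m-%dT%H:%M:%S.%f")
```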
Description: A strict ISO 8601 date parser for the format yyyy-MM-dd'T'HH:mm:ss.SSSZZ.
Usage: dateTime($1)
Example: dateTime('2015-01-01T00:00:00.000Z')
Description: A basic date format for yyyyMMdd.
Usage: basicDate($1)
Example: basicDate('20150101')
Description: A basic format that combines a basic date and time, for the format yyyyMMdd'T'HHmmss.SSSZ.
Usage: basicDateTime($1)
Example: basicDateTime('20150101T000000.000Z')
Description: A basic format that combines a basic date and time with no millis, for the format yyyyMMdd'T'HHmmssZ.
Usage: basicDateTimeNoMillis($1)
Example: basicDateTimeNoMillis('20150101T000000Z')
Description: Formatter for a full date and time, keeping the first 3 fractional-second digits, for the format yyyy-MM-dd'T'HH:mm:ss.SSS.
Usage: dateHourMinuteSecondMillis($1)
Example: dateHourMinuteSecondMillis('2015-01-01T00:00:00.000')
Description: Create a new date from a long representing milliseconds since January 1, 1970.
Usage: millisToDate($1)
Example: millisToDate('1449675054462'::long)
Description: Create a new date from a long representing seconds since January 1, 1970.
Usage: secsToDate($1)
Example: secsToDate(1449675054)
Description: Parse a Point geometry from lon/lat or WKT.
Usage: point($lon, $lat)
or point($wkt)
Note: ordering is important here; GeoMesa defaults to longitude first.
Example: Parsing lon/lat from JSON:
# config
{ name = "lon",  json-type = "double", path = "$.lon" }
{ name = "lat",  json-type = "double", path = "$.lat" }
{ name = "geom", transform = "point($lon, $lat)" }
# data
{ "lat": 23.9, "lon": 24.2 }
Example: Parsing lon/lat from text without creating lon/lat fields:
# config
{ name = "geom", transform = "point($2::double, $3::double)" }
# data
id,lat,lon,date
identity1,23.9,24.2,2015-02-03
Example: Parsing WKT as a point
# config
{ name = "geom", transform = "point($2)" }
# data
ID,wkt,date
1,POINT(2 3),2015-01-02
Description: Parse a linestring from a WKT string.
Usage: linestring($0)
Example: linestring('LINESTRING(102 0, 103 1, 104 0, 105 1)')
Description: Parse a polygon from a WKT string.
Usage: polygon($0)
Example: polygon('polygon((100 0, 101 0, 101 1, 100 1, 100 0))')
Description: Parse a geometry from a WKT string or GeoJSON.
Usage: geometry($0)
Example: Parsing WKT as a geometry
# config
{ name = "geom", transform = "geometry($2)" }
# data
ID,wkt,date
1,POINT(2 3),2015-01-02
Example: Parsing GeoJson geometry
# config
{ name = "geom", json-type = "geometry", path = "$.geometry" }
# data
{ id: 1, number: 123, color: "red", "geometry": {"type": "Point", "coordinates": [55, 56]} }
Description: Converts a string to a UTF-8 byte array.
Usage: stringToBytes($0)
Description: Creates an MD5 hash from a byte array.
Usage: md5($0)
Example: md5(stringToBytes('row,of,data'))
Description: Generates a random UUID.
Usage: uuid()
Description: Encodes a byte array as a base-64 string.
Usage: base64($0)
Example: base64(stringToBytes('foo'))
Description: Converts a string into an integer. Invalid values will cause the record to fail.
Example: '1'::int = 1
Description: Converts a string into a long. Invalid values will cause the record to fail.
Example: '1'::long = 1L
Description: Converts a string into a float. Invalid values will cause the record to fail.
Example: '1.0'::float = 1.0f
Description: Converts a string into a double. Invalid values will cause the record to fail.
Example: '1.0'::double = 1.0d
Description: Converts a string into a boolean. Invalid values will cause the record to fail.
Example: 'true'::boolean = true
Description: Converts a string into a Regex object.
Example: 'f.*'::r = f.*: scala.util.matching.Regex
Description: Converts a string into an integer, with a default value if conversion fails.
Usage: stringToInt($1, $2)
Example: stringToInt('1', 0) = 1
Example: stringToInt('', 0) = 0
Description: Converts a string into a long, with a default value if conversion fails.
Usage: stringToLong($1, $2)
Example: stringToLong('1', 0L) = 1L
Example: stringToLong('', 0L) = 0L
Description: Converts a string into a float, with a default value if conversion fails.
Usage: stringToFloat($1, $2)
Example: stringToFloat('1.0', 0.0f) = 1.0f
Example: stringToFloat('not a float', 0.0f) = 0.0f
Description: Converts a string into a double, with a default value if conversion fails.
Usage: stringToDouble($1, $2)
Example: stringToDouble('1.0', 0.0) = 1.0d
Example: stringToDouble(null, 0.0) = 0.0d
Description: Converts a string into a boolean, with a default value if conversion fails.
Usage: stringToBoolean($1, $2)
Example: stringToBoolean('true', false) = true
Example: stringToBoolean('55', false) = false
Description: Parse a List[T] type from a string.
If your SimpleFeatureType config contains a list or map, you can easily
configure a transform function to parse it using the parseList function,
which takes either two or three arguments:
- The primitive type of the list (int, string, double, float, boolean, etc)
- The reference to parse
- Optionally, the list delimiter (defaults to a comma)
Here's some sample CSV data:
ID,Name,Age,LastSeen,Friends,Lat,Lon
23623,Harry,20,2015-05-06,"Will, Mark, Suzan",-100.236523,23
26236,Hermione,25,2015-06-07,"Edward, Bill, Harry",40.232,-53.2356
3233,Severus,30,2015-10-23,"Tom, Riddle, Voldemort",3,-62.23
For example, an SFT may specify a field:
{ name = "friends", type = "List[String]" }
And a transform to parse the quoted CSV field:
{ name = "friends", transform = "parseList('string', $5)" }
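Conceptually, parseList splits the (already unquoted) field on the delimiter, trims each token, and converts it to the declared primitive type. A minimal Python sketch of that behavior (illustrative only, not GeoMesa code; the `parse_list` helper is hypothetical):

```python
def parse_list(conv, value, delimiter=","):
    """Split on the delimiter, trim whitespace, and convert each token."""
    if not value:
        return []
    return [conv(token.strip()) for token in value.split(delimiter)]

# Like parseList('string', $5) applied to the quoted Friends column
friends = parse_list(str, "Will, Mark, Suzan")
```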
Description: Parse a Map[T,V] type from a string.
Parsing Maps is similar. Take for example this CSV data with a quoted map field:
1,"1->a,2->b,3->c,4->d",2013-07-17,-90.368732,35.3155
2,"5->e,6->f,7->g,8->h",2013-07-17,-70.970585,42.36211
3,"9->i,10->j",2013-07-17,-97.599004,30.50901
Our field type is:
numbers:Map[Integer,String]
Then we specify a transform:
{ name = "numbers", transform = "parseMap('int -> string', $2)" }
Optionally we can also provide custom list/record and key-value delimiters for a map:
{ name = "numbers", transform = "parseMap('int -> string', $2, ',', '->')" }
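Conceptually, parseMap splits the field into entries on the list delimiter, then splits each entry on the key-value delimiter and converts each side to its declared type. A Python sketch of that behavior (illustrative only, not GeoMesa code; the `parse_map` helper is hypothetical):

```python
def parse_map(kconv, vconv, value, entry_delim=",", kv_delim="->"):
    """Split into entries, then split each entry on the key-value delimiter."""
    result = {}
    for entry in value.split(entry_delim):
        key, val = entry.split(kv_delim)
        result[kconv(key.strip())] = vconv(val.strip())
    return result

# Like parseMap('int -> string', $2) applied to the quoted map column
numbers = parse_map(int, str, "1->a,2->b,3->c,4->d")
```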
The JSON converter defines the path to a list of features as well as json-types of each field:
{
  type         = "json"
  id-field     = "$id"
  feature-path = "$.Features[*]"
  fields = [
    { name = "id",     json-type = "integer",  path = "$.id",              transform = "toString($0)" }
    { name = "number", json-type = "integer",  path = "$.number" }
    { name = "color",  json-type = "string",   path = "$.color",           transform = "trim($0)" }
    { name = "weight", json-type = "double",   path = "$.physical.weight" }
    { name = "geom",   json-type = "geometry", path = "$.geometry" }
  ]
}
Geometry objects can be represented as either WKT or GeoJSON and parsed with the same config:
Config:
{ name = "geom", json-type = "geometry", path = "$.geometry", transform = "point($0)" }
Data:
{
  DataSource: { name: "myjson" },
  Features: [
    { id: 1, number: 123, color: "red",  geometry: { "type": "Point", "coordinates": [55, 56] } },
    { id: 2, number: 456, color: "blue", geometry: "Point (101 102)" }
  ]
}
Remember to use the most general geometry type as your json-type or
SimpleFeatureType field type. Defining the type as Geometry allows for
polygons, points, and linestrings, but specifying a concrete geometry
such as Point will only allow parsing of points.
The Avro parsing library is similar to the JSON parsing library. For
this example we'll use the following avro schema in a file named
/tmp/schema.avsc:
{
  "namespace": "org.locationtech",
  "type": "record",
  "name": "CompositeMessage",
  "fields": [
    {
      "name": "content",
      "type": [
        {
          "name": "DataObj",
          "type": "record",
          "fields": [
            {
              "name": "kvmap",
              "type": {
                "type": "array",
                "items": {
                  "name": "kvpair",
                  "type": "record",
                  "fields": [
                    { "name": "k", "type": "string" },
                    { "name": "v", "type": ["string", "double", "int", "null"] }
                  ]
                }
              }
            }
          ]
        },
        {
          "name": "OtherObject",
          "type": "record",
          "fields": [ { "name": "id", "type": "int" } ]
        }
      ]
    }
  ]
}
This schema defines an avro file with a field named content, which holds
a nested object of type either DataObj or OtherObject. Using avro-tools,
we can generate some test data and view it:
$ java -jar /tmp/avro-tools-1.7.7.jar random --schema-file /tmp/schema.avsc --count 5 /tmp/avro
$ java -jar /tmp/avro-tools-1.7.7.jar tojson /tmp/avro
{"content":{"org.locationtech.DataObj":{"kvmap":[{"k":"thhxhumkykubls","v":{"double":0.8793488185997134}},{"k":"mlungpiegrlof","v":{"double":0.45718223406586045}},{"k":"mtslijkjdt","v":null}]}}}
{"content":{"org.locationtech.OtherObject":{"id":-86025408}}}
{"content":{"org.locationtech.DataObj":{"kvmap":[]}}}
{"content":{"org.locationtech.DataObj":{"kvmap":[{"k":"aeqfvfhokutpovl","v":{"string":"kykfkitoqk"}},{"k":"omoeoo","v":{"string":"f"}}]}}}
{"content":{"org.locationtech.DataObj":{"kvmap":[{"k":"jdfpnxtleoh","v":{"double":0.7748286862915655}},{"k":"bueqwtmesmeesthinscnreqamlwdxprseejpkrrljfhdkijosnogusomvmjkvbljrfjafhrbytrfayxhptfpcropkfjcgs","v":{"int":-1787843080}},{"k":"nmopnvrcjyar","v":null},{"k":"i","v":{"string":"hcslpunas"}}]}}}
Here's a more relevant sample record:
{
  "content": {
    "org.locationtech.DataObj": {
      "kvmap": [
        { "k": "lat",   "v": { "double": 45.0 } },
        { "k": "lon",   "v": { "double": 45.0 } },
        { "k": "prop3", "v": { "string": " foo " } },
        { "k": "prop4", "v": { "double": 1.0 } }
      ]
    }
  }
}
Let's say we want to convert our avro array of kvpairs into a simple feature. We notice that there are 4 attributes:
- lat
- lon
- prop3
- prop4
We can define a converter config to parse the avro:
{
  type        = "avro"
  schema-file = "/tmp/schema.avsc"
  sft         = "testsft"
  id-field    = "uuid()"
  fields = [
    { name = "tobj", transform = "avroPath($1, '/content$type=DataObj')" },
    { name = "lat",  transform = "avroPath($tobj, '/kvmap[$k=lat]/v')" },
    { name = "lon",  transform = "avroPath($tobj, '/kvmap[$k=lon]/v')" },
    { name = "geom", transform = "point($lon, $lat)" }
  ]
}
GeoMesa Convert allows users to define "avropaths" into the data, similar to a JSONPath or XPath expression. An AvroPath lets you extract fields from avro records into SFT fields.
Description: Extract values from nested Avro structures.
Usage: avroPath($ref, $pathString)
$ref - a reference object (the avro root record or a previously extracted object)
$pathString - a forward-slash-delimited path string; path elements are field names, optionally with modifiers:
  $type=<typename> - interpret the field as the named avro schema type
  [$<field>=<value>] - select array records having a field named "field" whose value equals "value"
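To make the path semantics concrete, here is a plain-Python walk over the sample record from above that mimics what the two avroPath expressions in the converter config select. This illustrates the semantics only; it is not the GeoMesa implementation.

```python
# The sample DataObj record from above, as plain Python data
# (union values flattened to their payloads for readability)
record = {
    "content": {
        "org.locationtech.DataObj": {
            "kvmap": [
                {"k": "lat",   "v": 45.0},
                {"k": "lon",   "v": 45.0},
                {"k": "prop3", "v": " foo "},
                {"k": "prop4", "v": 1.0},
            ]
        }
    }
}

# '/content$type=DataObj' - take the content field, interpreted as a DataObj
tobj = record["content"]["org.locationtech.DataObj"]

# '/kvmap[$k=lat]/v' - select the kvmap entry whose k field equals "lat", take v
lat = next(e["v"] for e in tobj["kvmap"] if e["k"] == "lat")
lon = next(e["v"] for e in tobj["kvmap"] if e["k"] == "lon")
```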
There are two ways to extend the converter library: adding new transformation functions and adding new data formats.
To add new transformation functions, create a TransformerFunctionFactory
and register it in
META-INF/services/org.locationtech.geomesa.convert.TransformerFunctionFactory.
For example, here's how to add a new transformation function that
computes a SHA-256 hash.
import com.google.common.hash.Hashing // Guava, for the SHA-256 implementation
import org.locationtech.geomesa.convert.TransformerFunctionFactory
import org.locationtech.geomesa.convert.TransformerFn
class SHAFunctionFactory extends TransformerFunctionFactory {
override def functions = Seq(sha256fn)
val sha256fn = TransformerFn("sha256") { args =>
Hashing.sha256().hashBytes(args(0).asInstanceOf[Array[Byte]])
}
}
The sha256 function can then be used in a field as shown:
fields: [
  { name = "hash", transform = "sha256(stringToBytes($0))" }
]
To add new data formats, implement the SimpleFeatureConverterFactory and
SimpleFeatureConverter interfaces and register them in META-INF/services
appropriately. See
org.locationtech.geomesa.convert.avro.Avro2SimpleFeatureConverter for an
example.
The following example can be used with GeoMesa Tools:
geomesa ingest -u <user> -p <pass> -i <instance> -z <zookeepers> -s renegades -C renegades-csv example.csv
Sample csv file, example.csv:
ID,Name,Age,LastSeen,Friends,Lat,Lon
23623,Harry,20,2015-05-06,"Will, Mark, Suzan",-100.236523,23
26236,Hermione,25,2015-06-07,"Edward, Bill, Harry",40.232,-53.2356
3233,Severus,30,2015-10-23,"Tom, Riddle, Voldemort",3,-62.23
The "renegades" SFT and "renegades-csv" converter should be specified in
the GeoMesa Tools configuration file
($GEOMESA_HOME/conf/application.conf). By default, SimpleFeatureTypes
(SFTs) are loaded from the path geomesa.sfts and converters from the
path geomesa.converters. Each converter and SFT definition is keyed by
the name that can be referenced in the converter and SFT loaders.
Use geomesa env to confirm that geomesa ingest can properly read the
updated file.
$GEOMESA_HOME/conf/application.conf:
geomesa = {
  sfts = {
    # other sfts
    # ...
    "renegades" = {
      attributes = [
        { name = "id",       type = "Integer",      index = false }
        { name = "name",     type = "String",       index = true  }
        { name = "age",      type = "Integer",      index = false }
        { name = "lastseen", type = "Date",         index = true  }
        { name = "friends",  type = "List[String]", index = true  }
        { name = "geom",     type = "Point",        index = true, srid = 4326, default = true }
      ]
    }
  }
  converters = {
    # other converters
    # ...
    "renegades-csv" = {
      type   = "delimited-text",
      format = "CSV",
      options { skip-lines = 1 },
      id-field = "toString($id)",
      fields = [
        { name = "id",       transform = "$1::int" }
        { name = "name",     transform = "$2::string" }
        { name = "age",      transform = "$3::int" }
        { name = "lastseen", transform = "date('YYYY-MM-dd', $4)" }
        { name = "friends",  transform = "parseList('string', $5)" }
        { name = "lon",      transform = "$6::double" }
        { name = "lat",      transform = "$7::double" }
        { name = "geom",     transform = "point($lon, $lat)" }
      ]
    }
  }
}
If you have defined converters or SFTs in Typesafe Config, you can place
them on the classpath or load them with a ConverterConfigProvider or
SimpleFeatureTypeProvider via Java SPI loading. By default, classpath
and URL providers are provided. Placing a Typesafe Config file named
reference.conf containing properly formatted converters and SFTs (see
the example application.conf above) in a jar file on the classpath makes
those converters and SFTs available through the public loader API:
// ConverterConfigLoader.scala
// Public API
def listConverterNames: List[String] = confs.keys.toList
def getAllConfigs: Map[String, Config] = confs
def configForName(name: String) = confs.get(name)
// SimpleFeatureTypeLoader.scala
// Public API
def listTypeNames: List[String] = sfts.map(_.getTypeName)
def sftForName(n: String): Option[SimpleFeatureType] = sfts.find(_.getTypeName == n)
The GeoMesa gm-data project contains common data formats packaged in jar files that can be placed on the classpath of your project.