Created April 9, 2012 21:26
ElasticsearchStorage() examples
--
-- If your data looks like this:
--
-- {"foo":1,"bar":1}
-- {"foo":2,"bar":2}
-- {"foo":3,"bar":3}
--
-- Then write your store call this way; nested hashes and arrays are preserved as long as each line is valid JSON
--
register wonderdog-1.0-SNAPSHOT.jar;
data = LOAD '/path/to/data' AS (json:chararray);
STORE data INTO 'es://index/obj?json=true' USING com.infochimps.elasticsearch.pig.ElasticSearchStorage();
--
-- If your data looks like this:
--
-- 1 {(foo),(bar),(baz)}
-- 2 {(foo),(bar),(baz)}
-- 3 {(foo),(bar),(baz)}
--
-- Then you need to serialize the bag first and use the TSV store mode (json=false).
-- I recommend serializing before the LOAD, or else writing a UDF to serialize the bag.
-- JsonizeBag below stands in for such a UDF; it is not part of this script and must be registered and defined before use.
--
register wonderdog-1.0-SNAPSHOT.jar;
data = LOAD '/path/to/data' AS (id:int,vals:bag{});
serialized = FOREACH data GENERATE id AS id, JsonizeBag(vals) AS (vals:chararray);
STORE serialized INTO 'es://index/obj?json=false' USING com.infochimps.elasticsearch.pig.ElasticSearchStorage();
@rjurney, take a look at the "Query Parameters" section of the documentation page. URL parameters are pulled from the location string. If the data you're storing is already serialized as JSON, then you'll want the location string to contain the 'json=true' flag. Additionally, if the serialized JSON contains a field you want to use as the id, say one called "my_id_field", then you'll want to put that in the location string as well, i.e. "id=my_id_field". Otherwise Elasticsearch will assign each new record an id.
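For illustration, a store call that combines both flags might look like the sketch below. The input path, the index and type names, and the "my_id_field" field are placeholders, and joining the parameters with `&` is an assumption based on ordinary URL query-string syntax; check the "Query Parameters" section for the exact form.

```pig
-- Sketch only: path, index/obj and my_id_field are hypothetical placeholders.
-- Assumes each input line is one JSON document that contains the id field, e.g.
--   {"my_id_field":"a1","foo":1,"bar":1}
register wonderdog-1.0-SNAPSHOT.jar;
data = LOAD '/path/to/data' AS (json:chararray);
-- json=true: records are already serialized JSON
-- id=my_id_field: use that field of each record as the document id
STORE data INTO 'es://index/obj?json=true&id=my_id_field' USING com.infochimps.elasticsearch.pig.ElasticSearchStorage();
```

Leaving out the id parameter should fall back to ids assigned by Elasticsearch, as described above.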
Looking back at the source (I haven't touched it in quite a while), it appears that you're better off NOT using the alternative method, since it treats every field as a string and doesn't handle complex types (e.g. DataBag) or nulls properly.
@kornypoet, if you like, take a look at http://github.com/Ganglion/sounder/blob/master/udf/src/main/java/sounder/pig/json/ToJson.java. You could use the same logic there to convert an arbitrary tuple to a MapWritable. Even this is silly, though (JSON to MapWritable to XContentBuilder to JSON again). Ultimately, ElasticSearchOutputFormat should be written to handle incoming JSON strings, no?