--
-- If your data looks like this:
--
-- {"foo":1,"bar":1}
-- {"foo":2,"bar":2}
-- {"foo":3,"bar":3}
--
-- Then write your store function this way; it will respect nested hashes and arrays if they are JSONed properly
--
register wonderdog-1.0-SNAPSHOT.jar;
data = LOAD '/path/to/data' AS (json:chararray);
STORE data INTO 'es://index/obj?json=true' USING com.infochimps.elasticsearch.pig.ElasticSearchStorage();
--
-- If your data looks like this:
--
-- 1 {(foo),(bar),(baz)}
-- 2 {(foo),(bar),(baz)}
-- 3 {(foo),(bar),(baz)}
--
-- Then you need to serialize the bag somehow beforehand and then use the tsv store function;
-- My recommendation would be to do this before the LOAD, or else write a UDF to serialize the bag
-- (JsonizeBag below stands in for such a UDF; you have to supply it yourself)
--
register wonderdog-1.0-SNAPSHOT.jar;
data = LOAD '/path/to/data' AS (id:int,vals:bag{});
serialized = FOREACH data GENERATE id AS id, JsonizeBag(vals) AS (vals:chararray);
STORE serialized INTO 'es://index/obj?json=false' USING com.infochimps.elasticsearch.pig.ElasticSearchStorage();
Not necessarily. This was simply an illustration of a workaround to try to fit the situation. Can you give me an example of the data model that you are trying to create in Elasticsearch?
@rjurney, take a look at the "Query Parameters" section of the documentation page. URL parameters are pulled from the location string. If the data you're storing is already serialized as JSON, then you'll want the location string to contain the 'json=true' flag. Additionally, if the serialized JSON contains a field you want to use as the id, say one called "my_id_field", then put that in the location string as well, i.e. "id=my_id_field". Otherwise Elasticsearch will assign each new record an id.
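As a rough sketch of such a location string: the index name, type name, input path, and "my_id_field" are all placeholders, and joining the two flags with "&" is an assumption based on ordinary URL query syntax rather than something checked against the Wonderdog source.

```pig
-- Sketch only: 'index', 'obj', the path and my_id_field are placeholder names.
register wonderdog-1.0-SNAPSHOT.jar;

-- Each input line is assumed to be one JSON document, e.g. {"my_id_field":"a1","foo":1}
data = LOAD '/path/to/data' AS (json:chararray);

-- json=true       : records are already serialized JSON
-- id=my_id_field  : use that field of each record as the document id
-- (combining the flags with '&' is an assumption, see above)
STORE data INTO 'es://index/obj?json=true&id=my_id_field' USING com.infochimps.elasticsearch.pig.ElasticSearchStorage();
```

Dropping the id parameter from the location string should leave id assignment to Elasticsearch, as described above.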
Looking back at the source (I haven't touched it in quite a while), it appears that you're better off NOT using the alternative method, since it treats every field as a string and doesn't handle complex types (e.g. DataBag) or nulls properly.
@kornypoet, if you like, take a look at http://github.com/Ganglion/sounder/blob/master/udf/src/main/java/sounder/pig/json/ToJson.java. You could use the same logic there to convert an arbitrary tuple to a MapWritable. Even this is silly though (JSON to MapWritable to XContentBuilder to JSON again); ultimately, ElasticSearchOutputFormat should be written to handle incoming JSON strings, no?
Just so I understand, what ElasticSearchStorage() expects is data of the format (id:int, jsonData:chararray)?