- Download and install Riak 2.0 and an appropriate JVM for your machine.
- Download the Erlang load_data.erl escript from https://raw.githubusercontent.com/basho/basho_docs/master/source/data/load_data.erl
- Download the goog.csv test data from https://github.com/basho/basho_docs/raw/master/source/data/goog.csv (both downloads can also be fetched from the command line, as shown below)
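If you prefer to fetch both files from the command line, the downloads might look like this (curl's -L flag follows redirects and -O keeps the remote file names):
curl -L -O https://raw.githubusercontent.com/basho/basho_docs/master/source/data/load_data.erl
curl -L -O https://github.com/basho/basho_docs/raw/master/source/data/goog.csv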
While the node is stopped, modify the riak.conf file: add the line
search = on
to the end of the file, or find the existing search setting and change its value to on. Once search is enabled on all of the nodes in your cluster, start them up. If the nodes fail to start with search enabled, verify that a Java Virtual Machine is installed and that the java command works as expected.
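For a single node with a default package install, enabling search and checking the JVM might look like the following (the riak.conf path is an assumption; adjust it for your platform):
riak stop
echo "search = on" >> /etc/riak/riak.conf   # or edit the existing search setting in place
java -version                               # confirm a JVM is available on the PATH
riak start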
Since the load script does not use column names that are compatible with the default schema, and since using the default schema is not recommended for a production system anyway, we are going to create a custom schema suited to the goog.csv data.
Start with the skeleton Search schema from the Search Schema documentation. Create a file in your working folder named goog.xml and insert the following content:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="schedule" version="1.5">
  <fields>
    <!-- All of these fields are required by Riak Search -->
    <field name="_yz_id" type="_yz_str" indexed="true" stored="true" multiValued="false" required="true"/>
    <field name="_yz_ed" type="_yz_str" indexed="true" stored="false" multiValued="false"/>
    <field name="_yz_pn" type="_yz_str" indexed="true" stored="false" multiValued="false"/>
    <field name="_yz_fpn" type="_yz_str" indexed="true" stored="false" multiValued="false"/>
    <field name="_yz_vtag" type="_yz_str" indexed="true" stored="false" multiValued="false"/>
    <field name="_yz_rk" type="_yz_str" indexed="true" stored="true" multiValued="false"/>
    <field name="_yz_rt" type="_yz_str" indexed="true" stored="true" multiValued="false"/>
    <field name="_yz_rb" type="_yz_str" indexed="true" stored="true" multiValued="false"/>
    <field name="_yz_err" type="_yz_str" indexed="true" stored="false" multiValued="false"/>
  </fields>
  <uniqueKey>_yz_id</uniqueKey>
  <types>
    <!-- YZ String: Used for non-analyzed fields -->
    <fieldType name="_yz_str" class="solr.StrField" sortMissingLast="true" />
  </types>
</schema>
We will need to configure some additional Solr field types to use for our fields. Inside the <types> element, add the following definitions:
<fieldtype name="date" class="solr.TrieDateField" />
<fieldtype name="integer" class="solr.TrieLongField" />
<fieldtype name="float" class="solr.TrieFloatField" />
<!-- Catch-all Field Type -->
<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
These types will be used in the field definitions we create in the next step.
The "catch-all" type is used on a dynamic field: any fields that are present in the object but not defined in the schema will match the catch-all definition and be ignored rather than causing an error.
Next, we will define the Solr fields. These field definitions tell Search what to do with incoming data: index it, store it, or ignore it. Inside the <fields> element, add the following field definitions:
<field name="date" type="date" indexed="true" stored="false" multiValued="false" />
<field name="open" type="float" indexed="true" stored="false" multiValued="false" />
<field name="high" type="float" indexed="true" stored="false" multiValued="false" />
<field name="low" type="float" indexed="true" stored="false" multiValued="false" />
<field name="close" type="float" indexed="true" stored="false" multiValued="false" />
<field name="volume" type="integer" indexed="true" stored="false" multiValued="false" />
<!-- Catch-all field -->
<dynamicField name="*" type="ignored" />
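If you have libxml2's xmllint tool available, a quick well-formedness check on the finished file can catch copy-and-paste mistakes before the schema is uploaded:
xmllint --noout goog.xml && echo "goog.xml is well-formed"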
Next, set a few environment variables so the remaining commands are easier to copy and paste, then run the following in your shell, substituting the correct values for your environment. The three PUT requests upload the custom schema, create a search index named goog that uses it, and associate that index with the goog bucket.
export RIAK_HOST="http://localhost"
export RIAK_PORT=8098
export SOLR_PORT=8093
curl -XPUT "$RIAK_HOST:$RIAK_PORT/search/schema/goog" \
-H 'Content-Type:application/xml' --data-binary @goog.xml
curl -XPUT "$RIAK_HOST:$RIAK_PORT/search/index/goog" \
-H 'Content-Type:application/json' -d '{"schema":"goog"}'
curl -XPUT "$RIAK_HOST:$RIAK_PORT/buckets/goog/props" -H'content-type:application/json' -d'{"props":{"search_index":"goog"}}'
There are a few fixes that will need to be made to the original load_data.erl file. First, if you do not have Erlang installed on your machine, you will need to change the path in the first line of the file to point to the embedded Erlang that ships with Riak.
Change
#!/usr/bin/env escript
to
#!/usr/lib64/riak/erts-5.10.3/bin/escript
if you are on CentOS or RHEL, or
#!/usr/lib/riak/erts-5.10.3/bin/escript
for Ubuntu, Debian, and FreeBSD.
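The exact erts-* directory varies with the Riak version, so it can be easier to locate the embedded escript and patch the shebang automatically. A rough sketch (the search paths are assumptions, and the in-place sed shown is the GNU form; BSD sed needs -i ''):
ESCRIPT=$(find /usr/lib64/riak /usr/lib/riak -name escript 2>/dev/null | head -n 1)   # locate the bundled escript
sed -i "1s|.*|#!$ESCRIPT|" load_data.erl                                              # rewrite the shebang line
head -n 1 load_data.erl                                                               # confirm the change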
Next, we need to make some changes to the curl command that the load script invokes. We need to change the URL so that it reflects our Riak IP address and port, and emit ISO 8601 date strings so that Riak Search can parse them. Change line 9 from
JSON = io_lib:format("{\"Date\":\"~s\",\"Open\":~s,\"High\":~s,\"Low\":~s,\"Close\":~s,\"Volume\":~s,\"Adj. Close\":~s}", Line),
to
JSON = io_lib:format("{\"date\":\"~sT00:00:00Z\",\"open\":~s,\"high\":~s,\"low\":~s,\"close\":~s,\"volume\":~s,\"adj_close\":~s}", Line),
Modify line 10 and correct the IP address and port if necessary (if you're using a devrel, for example):
Command = io_lib:format("curl -X PUT http://127.0.0.1:8091/riak/goog/~s -d '~s' -H 'content-type: application/json'", [hd(Line),JSON]),
Once those edits are made, make the script executable and run it against the test data:
chmod +x load_data.erl
./load_data.erl goog.csv
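When the script finishes, a quick way to confirm that documents were indexed is to ask the index for a match count (rows=0 returns only the count; numFound in the response should be roughly the number of rows in goog.csv):
curl "$RIAK_HOST:$RIAK_PORT/search/query/goog?q=*:*&rows=0&wt=json" | jsonpp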
In this section, we will run some of the sample queries that were provided in the "Advanced MapReduce - Bigger Data Examples" documentation against the sample dataset.
To find all of the trading days where the high reached 600 or above, open the following URL in a browser (substituting your own node's address):
http://10.0.1.19:8098/search/query/goog?wt=json&q=high:[600%20TO%20*]
or run the equivalent curl command (the -g flag turns off curl's URL globbing so the brackets are passed through literally):
curl -g "$RIAK_HOST:$RIAK_PORT/search/query/goog?wt=json&q=high:[600%20TO%20*]" | jsonpp
curl "$RIAK_HOST:$RIAK_PORT/search/query/goog?q=*:*&fq=%7B%21frange+u%3D0%7Dsub%28close%2Copen%29&wt=json" | jsonpp
or
http://10.0.1.19:8098/search/query/goog?q=*:*&fq={!frange%20u=0}sub(close,open)&wt=json
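The same pattern works for any indexed field in the schema. For instance, a couple of illustrative queries (the 50,000,000-share volume threshold is arbitrary) might look like this:
# trading days with volume of at least 50,000,000 shares
curl -g "$RIAK_HOST:$RIAK_PORT/search/query/goog?wt=json&q=volume:[50000000%20TO%20*]" | jsonpp
# the same query, sorted by date and limited to the first five matches
curl -g "$RIAK_HOST:$RIAK_PORT/search/query/goog?wt=json&q=volume:[50000000%20TO%20*]&sort=date%20asc&rows=5" | jsonpp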
Before deleting an index, note that there is no quick way to rebuild one: re-indexing existing data requires clearing the AAE trees and waiting for Active Anti-Entropy to repair them. To remove the index, first dissociate it from the bucket by pointing the bucket's search_index property at the special _dont_index_ value, then delete the index itself:
curl -XPUT "$RIAK_HOST:$RIAK_PORT/buckets/goog/props" -H'content-type:application/json' -d'{"props":{"search_index":"_dont_index_"}}'
curl -XDELETE "$RIAK_HOST:$RIAK_PORT/search/index/goog"
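A follow-up request can confirm the index is gone; after the DELETE, the index resource should return a 404:
curl -i "$RIAK_HOST:$RIAK_PORT/search/index/goog"   # expect HTTP/1.1 404 Not Found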